ML@B-Intuit Collaboration: Predicting Life

Goal: Create a model that can detect when a user is going through an important event in their life using the user's emails.


Quick Start

The files of interest in terms of models are:,, and The featurization is done under the hood and is passed in as options to

The following command will generate a ridge classifier with TF-IDF for text featurization.

python models/

Gathering Data


After generating your GYB directory, invoke to store your parsed emails onto your local MongoDB in the unlabeled email collection.


python3 eparser/ [path to folder with GYB emails]


python3 eparser/ ~/Documents/Berkeley/ML/Intuit/got-your-back-1.0/


email = {'From': "Email Sender",
         'Subject': "Email Subject",
         'Text': "Email Body Text",
         'To': "Email Recipient",
         '_id': "Datum Id"}


We split our data into 80% training and 20% testing. The data files are Python pickle files pickled with Python 2.7, so to ensure that the data loads properly, run the model with Python 2.7. When loaded, the data is a list of Python dictionaries.

Training Data: models/data/intuit_data
Testing Data: models/data/intuit_test_data

from pickle import load
with open('models/data/intuit_data', 'rb') as f:
    data = load(f)
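The 80/20 split described above can be sketched as follows. This is a minimal illustration and not the project's actual splitting code: the helper name `train_test_split`, the fixed seed, and the toy records are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy records shaped like the parsed emails (contents are made up).
emails = [{'_id': i, 'Subject': 'subject %d' % i, 'Text': 'body %d' % i}
          for i in range(10)]
train, test = train_test_split(emails)
```

With ten toy records this yields eight training and two testing dictionaries, and no record appears in both lists.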

Labeling Data


This tool allows rapid labeling of emails to generate a labeled dataset for supervised learning.




Featurizing Data

Example usage

from featurizer import featurize
data = featurize(list_of_texts, mode='tfidf')
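To make the `'tfidf'` mode more concrete, here is a rough pure-Python sketch of TF-IDF weighting. It is not the `featurizer` module's implementation: the function name `tfidf_featurize`, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_featurize(texts):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of vocabulary[j] in texts[i]."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF, so a token present in every document keeps a positive weight.
    idf = {tok: math.log((1.0 + n_docs) / (1.0 + doc_freq[tok])) + 1.0
           for tok in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = float(len(doc)) or 1.0  # avoid dividing by zero on empty texts
        rows.append([counts[tok] / total * idf[tok] for tok in vocabulary])
    return vocabulary, rows

# Made-up example texts.
vocab, rows = tfidf_featurize(["wedding invitation", "invoice due", "wedding photos"])
```

In the first text, "invitation" receives a higher weight than "wedding" because "wedding" also appears in another document.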

Word2Vec Similarity Models

Used as auxiliary features rather than as a standalone model. This model takes as input the words chosen to semantically represent the labels and outputs a vector of similarity scores between an email and each label.

from word2vec_model import featurize
feature_vector = featurize(email)
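The idea of scoring an email against label keywords can be sketched as below. This is only an illustration of the technique, not the `word2vec_model` code: the tiny `EMBEDDINGS` table stands in for a trained Word2Vec model, and `LABEL_KEYWORDS` and `similarity_features` are hypothetical names with made-up contents.

```python
import math

# Toy 3-dimensional embeddings standing in for a trained Word2Vec model (values are made up).
EMBEDDINGS = {
    'wedding': [0.9, 0.1, 0.0],
    'marriage': [0.8, 0.2, 0.1],
    'baby': [0.1, 0.9, 0.0],
    'newborn': [0.0, 0.8, 0.2],
    'invoice': [0.0, 0.1, 0.9],
}

# Hypothetical mapping from each label to the words chosen to represent it.
LABEL_KEYWORDS = {'marriage': ['wedding', 'marriage'], 'birth': ['baby', 'newborn']}

def mean_vector(words):
    """Average the embeddings of the known words; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(col) / float(len(vectors)) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_features(email_text):
    """One similarity score per label, in sorted label order."""
    email_vec = mean_vector(email_text.lower().split())
    scores = []
    for label in sorted(LABEL_KEYWORDS):
        label_vec = mean_vector(LABEL_KEYWORDS[label])
        scores.append(cosine(email_vec, label_vec) if email_vec else 0.0)
    return scores

scores = similarity_features("Our wedding is in June")  # order: ['birth', 'marriage']
```

The resulting vector can then be appended to other features, such as TF-IDF, rather than used on its own.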


Principal Component Analysis

Used to investigate the underlying structure of our featurization. We would like to know how many clusters exist intrinsically and see if they align well with our given labels.

Currently we use PCA and look at the clusters formed by the top 2 principal components. The featurizations used for this decomposition are TF-IDF and BOW. Note: to display further plots with different featurizations, close the previous plot window.


python models/
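As a rough illustration of projecting featurized emails onto the top 2 principal components, here is a sketch using NumPy's SVD. It is not the script's code: `top_two_components` is a hypothetical helper, the input matrix is made up, and the plotting step is omitted.

```python
import numpy as np

def top_two_components(X):
    """Project the rows of X onto the two directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered.dot(vt[:2].T)
    # Fraction of total variance captured by each component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return projected, explained[:2]

# Made-up feature matrix: 4 documents, 3 features.
X = [[2.0, 0.1, 0.0],
     [4.0, -0.1, 0.1],
     [6.0, 0.2, -0.1],
     [8.0, -0.2, 0.0]]
projected, explained = top_two_components(X)
```

Each row of `projected` is a 2-D point that can be scattered and colored by label to see whether the clusters line up with the labels.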

Used to segment data into 2 clusters, event and non-event. Computes accuracy of TF-IDF and BOW featurizations.

python models/
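Cluster ids carry no inherent meaning, so an accuracy figure for a 2-cluster segmentation has to account for either cluster being the "event" one. A minimal sketch of that scoring step follows. It assumes cluster ids and labels are both encoded as 0/1, and `two_cluster_accuracy` is a hypothetical helper, not the script's own function.

```python
def two_cluster_accuracy(cluster_ids, labels):
    """Accuracy of a 2-cluster assignment against binary labels,
    taking the better of the two possible cluster-to-label mappings."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    matches = sum(1 for c, y in zip(cluster_ids, labels) if c == y)
    direct = matches / float(len(labels))
    # Swapping the cluster names turns every match into a mismatch and vice versa.
    return max(direct, 1.0 - direct)

# Made-up assignment: cluster 1 mostly corresponds to label 0 here.
accuracy = two_cluster_accuracy([1, 1, 0, 0, 1], [0, 0, 1, 1, 1])
```

The same function can score the TF-IDF and BOW featurizations side by side.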

Used to segment dimension-reduced data into 2 clusters, event and non-event. The PCA version allows the clusters to be plotted. Note: to display further plots with different featurizations, close the previous plot window.

python models/

Model Generation

Random Forests

Random forests are an effective model for reducing overfitting to the training data because they diversify the models in the ensemble. We use them to try to predict life events given data.

To generate our scored random forest and confusion matrix evaluation, run:

python models/
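For readers unfamiliar with the confusion matrix evaluation, the sketch below shows how such a matrix is tallied from true and predicted labels. It is a pure-Python illustration with made-up labels and predictions, not the output or the code of the random forest script.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels and predictions for five emails.
y_true = ['event', 'event', 'none', 'none', 'none']
y_pred = ['event', 'none', 'none', 'none', 'event']
matrix = confusion_matrix(y_true, y_pred, labels=['event', 'none'])
print(matrix)  # [[1, 1], [1, 2]]
```

The diagonal holds the correct predictions, and the off-diagonal cells show which classes are confused with each other.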

Linear Classification

We attempted to use a few linear models for email classification: linear ridge classification and support vector classification. These models can be run with a specific featurization, such as bag of words (BOW) or TF-IDF.

python models/ -m [svm/linear] -f [tfidf/bow]
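The `-m` and `-f` options shown above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions: the attribute names `model` and `featurizer`, the help strings, and the default values are hypothetical, and the actual script may handle its options differently.

```python
import argparse

def build_parser():
    """Parser for the -m (model) and -f (featurization) options."""
    parser = argparse.ArgumentParser(
        description='Train a linear email classifier (illustrative sketch).')
    # Defaults are assumptions, not taken from the project.
    parser.add_argument('-m', dest='model', choices=['svm', 'linear'],
                        default='linear', help='which linear model to train')
    parser.add_argument('-f', dest='featurizer', choices=['tfidf', 'bow'],
                        default='tfidf', help='which text featurization to use')
    return parser

# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(['-m', 'svm', '-f', 'bow'])
```

Restricting each option with `choices` makes a misspelled model or featurization name fail immediately with a usage message.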