ML@B-Intuit Collaboration: Predicting Life

Goal: Create a model that can detect when a user is going through an important event in their life using the user's emails.

Dependencies

python2.7 (For NLP Packages)
python3 (For parsing raw emails)
pymongo
- pip install pymongo
gensim (only if you want to import word2vec model)
- pip install gensim
numpy
- pip install numpy
scipy
- pip install scipy
Got Your Back (G.Y.B)
seaborn
- pip install seaborn
nltk
- pip install nltk

Quick Start

The files of interest in terms of models are: random_forest_confusion_matrix.py, linear_model.py, and pca_plot.py. The featurization is done under the hood and is passed in as options to linear_model.py.

The following command will generate a ridge classifier with TF-IDF for text featurization.

python models/linear_model.py

Gathering Data

eparser.py

After generating your GYB directory, invoke eparser.py to store your parsed emails onto your local MongoDB in the unlabeled email collection.

Usage

python3 eparser/parser.py [path to folder with GYB emails]

Example

python3 eparser/parser.py ~/Documents/Berkeley/ML/Intuit/got-your-back-1.0/GYB-GMail-Backup-matthewtrepte@gmail.com/2016

Structure

email = {'From': "Email Sender",
        'Subject': "Email Subject",
        'Text': "Email Body Text",
        'To': "Email Recipient",
        '_id': "Datum Id"
        }

Data

We split our data into 20% testing and 80% training. The data files are python pickle files pickeled with Python 2.7, so to ensure that the data is loaded properly, run the model with Python 2.7. When loaded the data will be in the for of a list of python dictioraries.

Training Data: models/data/intuit_data Testing Data: models/data/intuit_test_data

from pickle import load
with open('models/data/intuit_data', 'rb') as f:
    data = load(f)

Labeling Data

labeller.py

This is a tool that allowed for the rapid labeling of emails, to generate a labeled dataset for supervised learning.

Usage

python labeller.py

Featurizing Data

Example usage

from featurizer import featurize
data = featurize(list_of_texts, mode='tfidf')

Word2Vec Similarity Models

Used as auxillary features as opposed to a standalone model. This model takes as input, the words which are chosen to semantically represent the labels and outputs a vector that represents the similarity scores of an email and each label.

from word2vec_model import featurize
feature_vector = featurize(email)

Clustering

Principle Component Analysis

pca_plot.py

Used to investigate the underlying structure of our featurization. We would like to know how many clusters exist intrinsically and see if they align well with our given labels.

Currently we are using PCA and looking at the clusters of the top 2 principle components. The featurization that this model decomposition uses is TF-IDF and BOW. Note: to display further plots with different featurizations, X out of the previous plot window.

Usage

python models/pca_plot.py

K-Means

kmeans.py

Used to segment data into 2 clusters, event and non-event. Computes accuracy of TF-IDF and BOW featurizations.

python models/kmeans.py

kmeans_pca.py

Used to segment dimension-reduced data into 2 clusters, event and non-event. PCA version allows clusters to be plotted. Note: to display further plots with different featurizations, X out of the previous plot window.

python models/kmeans_pca.py

Model Generation

Random Forests

Random forest are an effective model to prevent overfitting to the training data by diversifying the models in the ensemble. We use then to try and predict life events given data.

To generate our scored random forest and confusion matrix evaluation, run:

python models/random_forest_confusion_matrix.py

Linear Classification

We attempted to use a few linear models to do the email classification. The models we used were linear ridge classification and support vector classification. We can run these models with specific featurization, such as bag of words and tfidf.

python models/linear_model.py -m [svm/linear] -f [tfidf/bow]

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
eparser		eparser
examples		examples
images		images
instructions		instructions
legacy		legacy
models		models
.gitignore		.gitignore
README.md		README.md
download_negatives.py		download_negatives.py
knowledge_graph.py		knowledge_graph.py
labeller.py		labeller.py
score_matrix.mat		score_matrix.mat
sensitive_info.csv		sensitive_info.csv
upload_negatives.py		upload_negatives.py

rykard95/MLAB_Intuit

Folders and files

Latest commit

History

Repository files navigation

ML@B-Intuit Collaboration: Predicting Life

Dependencies

Quick Start

Gathering Data

Structure

Data

Labeling Data

Featurizing Data

Word2Vec Similarity Models

Clustering

Principle Component Analysis

K-Means

Model Generation

Random Forests

Linear Classification

About

Resources

Stars

Watchers

Forks

Languages