Skip to content

rykard95/MLAB_Intuit

Repository files navigation

ML@B-Intuit Collaboration: Predicting Life

Goal: Create a model that can detect when a user is going through an important event in their life using the user's emails.

Dependencies

Quick Start

The files of interest in terms of models are: random_forest_confusion_matrix.py, linear_model.py, and pca_plot.py. The featurization is done under the hood and is passed in as options to linear_model.py.

The following command will generate a ridge classifier with TF-IDF for text featurization.

python models/linear_model.py

Gathering Data

  • eparser.py

After generating your GYB directory, invoke eparser.py to store your parsed emails onto your local MongoDB in the unlabeled email collection.

Usage

python3 eparser/parser.py [path to folder with GYB emails]

Example

python3 eparser/parser.py ~/Documents/Berkeley/ML/Intuit/got-your-back-1.0/GYB-GMail-Backup-matthewtrepte@gmail.com/2016

Structure

email = {'From': "Email Sender",
        'Subject': "Email Subject",
        'Text': "Email Body Text",
        'To': "Email Recipient",
        '_id': "Datum Id"
        }

Data

We split our data into 20% testing and 80% training. The data files are python pickle files pickeled with Python 2.7, so to ensure that the data is loaded properly, run the model with Python 2.7. When loaded the data will be in the for of a list of python dictioraries.

Training Data: models/data/intuit_data Testing Data: models/data/intuit_test_data

from pickle import load
with open('models/data/intuit_data', 'rb') as f:
    data = load(f)

Labeling Data

  • labeller.py

This is a tool that allowed for the rapid labeling of emails, to generate a labeled dataset for supervised learning.

Usage

python labeller.py

alt text

Featurizing Data

Example usage

from featurizer import featurize
data = featurize(list_of_texts, mode='tfidf')
Word2Vec Similarity Models

Used as auxillary features as opposed to a standalone model. This model takes as input, the words which are chosen to semantically represent the labels and outputs a vector that represents the similarity scores of an email and each label.

from word2vec_model import featurize
feature_vector = featurize(email)

Clustering

Principle Component Analysis
  • pca_plot.py

Used to investigate the underlying structure of our featurization. We would like to know how many clusters exist intrinsically and see if they align well with our given labels.

Currently we are using PCA and looking at the clusters of the top 2 principle components. The featurization that this model decomposition uses is TF-IDF and BOW. Note: to display further plots with different featurizations, X out of the previous plot window.

Usage

python models/pca_plot.py
K-Means
  • kmeans.py

Used to segment data into 2 clusters, event and non-event. Computes accuracy of TF-IDF and BOW featurizations.

python models/kmeans.py
  • kmeans_pca.py

Used to segment dimension-reduced data into 2 clusters, event and non-event. PCA version allows clusters to be plotted. Note: to display further plots with different featurizations, X out of the previous plot window.

python models/kmeans_pca.py

Model Generation

Random Forests

Random forest are an effective model to prevent overfitting to the training data by diversifying the models in the ensemble. We use then to try and predict life events given data.

To generate our scored random forest and confusion matrix evaluation, run:

python models/random_forest_confusion_matrix.py
Linear Classification

We attempted to use a few linear models to do the email classification. The models we used were linear ridge classification and support vector classification. We can run these models with specific featurization, such as bag of words and tfidf.

python models/linear_model.py -m [svm/linear] -f [tfidf/bow]

About

Collaboration project between ML@B and Intuit for the Fall 2016 school semester

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages