Goal: Create a model that can detect when a user is going through an important event in their life using the user's emails.
- python2.7 (For NLP Packages)
- python3 (For parsing raw emails)
- pymongo
pip install pymongo
- gensim (only if you want to import word2vec model)
pip install gensim
- numpy
pip install numpy
- scipy
pip install scipy
- Got Your Back (G.Y.B)
- seaborn
pip install seaborn
- nltk
pip install nltk
The files of interest in terms of models are: random_forest_confusion_matrix.py, linear_model.py, and pca_plot.py. The featurization is done under the hood and is passed in as options to linear_model.py.
The following command will generate a ridge classifier with TF-IDF for text featurization.
python models/linear_model.py
- eparser.py
After generating your GYB directory, invoke eparser.py to store your parsed emails in your local MongoDB in the unlabeled email collection.
Usage
python3 eparser/parser.py [path to folder with GYB emails]
Example
python3 eparser/parser.py ~/Documents/Berkeley/ML/Intuit/got-your-back-1.0/GYB-GMail-Backup-matthewtrepte@gmail.com/2016
email = {'From': "Email Sender",
'Subject': "Email Subject",
'Text': "Email Body Text",
'To': "Email Recipient",
'_id': "Datum Id"
}
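As a rough illustration of how a raw message could be mapped onto the fields above, here is a minimal sketch that uses only the standard library `email` module. The helper name `to_document`, the sample message, and the integer id are assumptions made for this example; they are not taken from eparser, and the MongoDB insert is left out.

```python
from email import message_from_string

# A made-up plain-text message, used only for illustration.
RAW = """From: alice@example.com
To: bob@example.com
Subject: We are moving!

We just bought a house and move in next month.
"""

def to_document(raw, datum_id):
    """Map a raw RFC 822 message onto the email schema shown above."""
    msg = message_from_string(raw)
    # Assumes a simple, non-multipart message; real mail usually needs
    # msg.walk() to find the text/plain part.
    body = msg.get_payload()
    return {
        'From': msg['From'],
        'Subject': msg['Subject'],
        'Text': body.strip(),
        'To': msg['To'],
        '_id': datum_id,
    }

doc = to_document(RAW, datum_id=1)
```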
We split our data into 80% training and 20% testing. The data files are Python pickle files pickled with Python 2.7, so to ensure that the data is loaded properly, run the model with Python 2.7. When loaded, the data will be in the form of a list of Python dictionaries.
Training Data: models/data/intuit_data
Testing Data: models/data/intuit_test_data
from pickle import load
with open('models/data/intuit_data', 'rb') as f:
    data = load(f)
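The shipped data files are already split, but an 80/20 split of a list of dictionaries can be reproduced along these lines. This is a sketch only: the function name `split_80_20`, the fixed seed, and the stand-in records are assumptions, and the project may have split its data differently.

```python
import random

def split_80_20(records, seed=0):
    """Shuffle a copy of the records and cut it into 80% train, 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

# Stand-in records; the real data is the list loaded from the pickle file.
emails = [{'Text': 'email %d' % i} for i in range(10)]
train, test = split_80_20(emails)
```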
- labeller.py
This is a tool that allowed for the rapid labeling of emails, to generate a labeled dataset for supervised learning.
Usage
python labeller.py
Example usage
from featurizer import featurize
data = featurize(list_of_texts, mode='tfidf')
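For readers unfamiliar with TF-IDF, the sketch below shows the basic idea in plain Python: each word's frequency in an email is scaled down by how many emails contain that word. The function name `tfidf` and the unsmoothed formula are assumptions for illustration; featurizer.py may tokenize, smooth, or normalize differently.

```python
import math
from collections import Counter

def tfidf(texts):
    """Return (vocabulary, rows), one TF-IDF row per non-empty text."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = float(len(docs))
    # Document frequency: number of texts that contain each word.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[w] / float(len(d)) * idf[w] for w in vocab])
    return vocab, rows

vocab, rows = tfidf(["we got married", "we got a receipt"])
```

Words that appear in every text, such as "we" here, receive a weight of zero, while words unique to one text stand out.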
Used as auxiliary features rather than as a standalone model. This model takes as input the words chosen to semantically represent the labels and outputs a vector of similarity scores between an email and each label.
from word2vec_model import featurize
feature_vector = featurize(email)
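One plausible way to produce such a similarity vector is to average the word vectors of the email, average the word vectors chosen for each label, and take the cosine similarity between the two. The sketch below does this with toy embeddings. Everything in it (the 3-dimensional vectors, the label words, and the name `similarity_features`) is invented for illustration; the real model would load word2vec vectors and may score similarity differently.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    'wedding': [0.9, 0.1, 0.0],
    'marriage': [0.8, 0.2, 0.1],
    'house': [0.1, 0.9, 0.0],
    'mortgage': [0.2, 0.8, 0.1],
}
# Hypothetical words chosen to represent each label.
LABEL_WORDS = {
    'wedding': ['wedding', 'marriage'],
    'home': ['house', 'mortgage'],
}

def mean_vector(words):
    """Average the embeddings of the known words, or None if none are known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(col) / float(len(vectors)) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_features(text, labels):
    """Return one similarity score per label, in the order given."""
    email_vec = mean_vector(text.lower().split())
    scores = []
    for label in labels:
        label_vec = mean_vector(LABEL_WORDS[label])
        scores.append(cosine(email_vec, label_vec) if email_vec else 0.0)
    return scores

scores = similarity_features("our wedding and marriage plans", ['wedding', 'home'])
```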
- pca_plot.py
Used to investigate the underlying structure of our featurization. We would like to know how many clusters exist intrinsically and see if they align well with our given labels.
Currently we are using PCA and looking at the clusters of the top 2 principal components. The featurizations that this decomposition uses are TF-IDF and BOW. Note: to display further plots with different featurizations, close the previous plot window.
Usage
python models/pca_plot.py
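The projection onto the top 2 principal components can be sketched with numpy alone, as below. The function name `top_two_components` and the random stand-in feature matrix are assumptions; pca_plot.py may rely on a library implementation instead, and the plotting step is omitted here.

```python
import numpy as np

def top_two_components(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(vt[:2].T)

# Random stand-in for a TF-IDF or BOW feature matrix (20 emails, 5 features).
rng = np.random.RandomState(0)
features = rng.rand(20, 5)
points = top_two_components(features)
```

Each row of `points` is a 2-D coordinate that can be scattered and colored by label to see whether the clusters line up with the labels.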
- kmeans.py
Used to segment data into 2 clusters, event and non-event. Computes accuracy of TF-IDF and BOW featurizations.
python models/kmeans.py
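Because k-means numbers its clusters arbitrarily, scoring a 2-cluster result against event/non-event labels means checking both possible cluster-to-label mappings. The sketch below shows one way to do that; the function name `two_cluster_accuracy` and the 0/1 encoding are assumptions, and kmeans.py may compute its accuracy differently.

```python
def two_cluster_accuracy(cluster_ids, labels):
    """Accuracy of 0/1 cluster ids against 0/1 labels, up to relabeling."""
    matches = sum(1 for c, y in zip(cluster_ids, labels) if c == y)
    direct = matches / float(len(labels))
    # Swapping the two cluster ids turns every match into a mismatch.
    return max(direct, 1.0 - direct)

clusters = [1, 1, 0, 0, 1]   # hypothetical k-means output
labels = [0, 0, 1, 1, 1]     # 1 = event, 0 = non-event
accuracy = two_cluster_accuracy(clusters, labels)
```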
- kmeans_pca.py
Used to segment dimension-reduced data into 2 clusters, event and non-event. The PCA version allows the clusters to be plotted. Note: to display further plots with different featurizations, close the previous plot window.
python models/kmeans_pca.py
Random forests are an effective model for limiting overfitting to the training data by diversifying the models in the ensemble. We use them to try to predict life events from the given data.
To generate our scored random forest and confusion matrix evaluation, run:
python models/random_forest_confusion_matrix.py
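For reference, a confusion matrix counts how often each true class was predicted as each class. The sketch below builds one in plain Python; the function name, the class names, and the sample predictions are made up for illustration and are not output from the script.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Hypothetical labels and predictions.
y_true = ['event', 'event', 'none', 'none', 'none']
y_pred = ['event', 'none', 'none', 'none', 'event']
matrix = confusion_matrix(y_true, y_pred, ['event', 'none'])
```

The diagonal holds the correct predictions; off-diagonal cells show which classes are confused with each other.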
We attempted to use a few linear models to do the email classification. The models we used were linear ridge classification and support vector classification. We can run these models with a specific featurization, such as bag of words or tfidf.
python models/linear_model.py -m [svm/linear] -f [tfidf/bow]
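The two flags in the command above could be parsed with `argparse` roughly as follows. This is a sketch of the interface only, not the script's actual code. The attribute names `model` and `featurizer` are assumptions, and the defaults are inferred from the earlier note that running the script with no options produces a ridge classifier with TF-IDF.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='Train a linear email classifier.')
    parser.add_argument('-m', dest='model', choices=['svm', 'linear'],
                        default='linear', help='model type')
    parser.add_argument('-f', dest='featurizer', choices=['tfidf', 'bow'],
                        default='tfidf', help='text featurization')
    return parser

# Equivalent to: python models/linear_model.py -m svm -f bow
args = build_parser().parse_args(['-m', 'svm', '-f', 'bow'])
```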