In [2]:
"""
Katie explained in a video a problem that arose in preparing Chris and Sara's email for the author identification project; 
it had to do with a feature that was a little too powerful (effectively acting like a signature, which gives an arguably unfair 
advantage to an algorithm). You'll work through that discovery process here.

This bug was found when Katie was trying to make an overfit decision tree to use as an example in the decision tree mini-project.
A decision tree is classically an algorithm that can be easy to overfit; one of the easiest ways to get an overfit decision tree 
is to use a small training set and lots of features. 
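If a decision tree is overfit, would you expect the accuracy on a test set to be very high or pretty low? Answer: pretty low,
because an overfit tree memorizes its training points and generalizes poorly to new ones.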

A classic way to overfit an algorithm is by using lots of features and not a lot of training data. You can find the starter code 
in feature_selection/find_signature.py. Get a decision tree up and training on the training data, and print out the accuracy. How
many training points are there, according to the starter code?
"""
"""
find_signature.py
"""

#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "your_word_data.pkl" 
authors_file = "your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )

### test_size is the fraction of events assigned to the test set (0.1 = 10%;
### the remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, 
                                                                                             test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]


In [3]:
"""
What's the accuracy of the decision tree you just made? (Remember, we're setting up our decision tree to overfit -- ideally, we 
want to see the test accuracy as relatively low.)
"""
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

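### default settings, so the tree is free to grow until it fits the training set
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
### accuracy_score takes the true labels first, then the predictions
accuracy = accuracy_score(labels_test, pred)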

print 'DTree Accuracy:', accuracy

DTree Accuracy: 0.947667804323
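
The number printed above is the fraction of test emails whose author was predicted correctly. As a rough illustration of what accuracy_score computes, here is a minimal pure-Python sketch. The helper name `accuracy` and the label lists are made up for this example and are not part of the starter code.

```python
def accuracy(labels_true, labels_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    matches = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return matches / float(len(labels_true))


# Made-up labels for illustration: 0 = Sara, 1 = Chris
true_labels = [0, 0, 1, 1, 1]
predictions = [0, 1, 1, 1, 1]
print("accuracy: %.2f" % accuracy(true_labels, predictions))
```

Because the score only counts matching positions, swapping the two arguments gives the same value; passing the true labels first is still the conventional order.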


In [6]:
"""
Take your (overfit) decision tree and use the feature_importances_ attribute to get a list of the relative importance of all 
the features being used. We suggest iterating through this list (it's long, since this is text data) and only printing out the 
feature importance if it's above some threshold (say, 0.2--remember, if all words were equally important, each one would give 
an importance of far less than 0.01). What's the importance of the most important feature? What is the number of this feature?
"""
labels = clf.classes_
featimp = clf.feature_importances_
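### print every feature whose importance exceeds the outlier threshold
for k, feat in enumerate(featimp):
    if feat > 0.2:
        print 'The feature %d has an importance of: %f' % (k, feat)

### index of the single most important feature (argmax, not the last one printed)
maxfeat = featimp.argmax()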


The feature 33614 has an importance of: 0.764706
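
The thresholding step can be sketched without scikit-learn. The helper `outlier_features` and the five importance values below are hypothetical; they only mimic the shape of feature_importances_, whose entries normally sum to 1.

```python
def outlier_features(importances, threshold=0.2):
    """Return (index, importance) pairs above the threshold, largest first."""
    flagged = [(k, imp) for k, imp in enumerate(importances) if imp > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up importances for a five-word vocabulary
importances = [0.05, 0.25, 0.0, 0.6, 0.1]
for index, importance in outlier_features(importances):
    print("The feature %d has an importance of: %f" % (index, importance))

# With n equally important features, each one would score 1 / n
print("uniform baseline for 5 features: %f" % (1.0 / len(importances)))
```

Sorting the flagged pairs puts the most important feature first, so the top outlier is still easy to pick out when more than one feature clears the threshold.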


In [10]:
"""
In order to figure out what words are causing the problem, you need to go back to the TfIdf and use the feature numbers that you 
obtained in the previous part of the mini-project to get the associated words. You can return a list of all the words in the 
TfIdf by calling get_feature_names() on it; pull out the word that's causing most of the discrimination of the decision tree. 
What is it? Does it make sense as a word that's uniquely tied to either Chris Germany or Sara Shackleton, a signature of sorts?
"""
feat_names = vectorizer.get_feature_names()
print 'The word with the highest importance is: %s' % feat_names[maxfeat]

The word with the highest importance is: sshacklensf
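
Going from a feature number back to a word is just an index lookup. The sketch below uses a hypothetical four-word mapping in the shape of the vectorizer's vocabulary_ attribute (term to column index); the words and indices are invented for illustration.

```python
# Hypothetical term -> column index mapping, shaped like vectorizer.vocabulary_
vocabulary = {"meeting": 1, "contract": 0, "sshacklensf": 3, "pipeline": 2}

# Invert it so that a column index can be looked up directly
index_to_word = {index: word for word, index in vocabulary.items()}

# The same information as a list ordered by column index
feature_names = [index_to_word[i] for i in range(len(index_to_word))]

max_index = 3  # hypothetical index of the most important feature
print("The word with the highest importance is: %s" % feature_names[max_index])
```

In scikit-learn 1.0 and later the equivalent list comes from get_feature_names_out(); get_feature_names() was removed in version 1.2.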


In [21]:
"""
This word seems like an outlier in a certain sense, so let's remove it and refit. Go back to text_learning/vectorize_text.py, and 
remove this word from the emails using the same method you used to remove "sara", "chris", etc. Rerun vectorize_text.py, and 
once that finishes, rerun find_signature.py.
"""
"""
parse_out_email_text.py
"""
#!/usr/bin/python

from nltk.stem.snowball import SnowballStemmer
import string

def parseOutText(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText(f)
        
    """
    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()

    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        ### remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        ### project part 2: the unstemmed assignment below is commented out,
        ### since the stemming loop now builds the result instead
        # words = text_string

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words
        stemmer = SnowballStemmer("english")
        words = ''
        for word in text_string.strip().split():
            stem = stemmer.stem(word)
            words += stem+' '
    return words


"""
vectorize_text.py
"""
#!/usr/bin/python

import os
import pickle
import re
import sys

sys.path.append( "../ud120-projects/" )
#from parse_out_email_text import parseOutText

"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""


from_sara  = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")

from_data = []
word_data = []

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        path = os.path.join('../ud120-projects/', path[:-1])
        #print path
        email = open(path, "r")

        ### use parseOutText to extract the text from the opened email
        text = parseOutText(email)

        ### use str.replace() to remove any instances of the signature words:
        ### the (stemmed) author names plus the outliers flagged by find_signature.py
        users = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]
        for user in users:
            text = text.replace(user, '')

        ### append the text to word_data
        word_data.append(text)

        ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
        if name == 'sara':
            from_data.append(0)
        elif name == 'chris':
            from_data.append(1)

        email.close()

print "emails processed"
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data.pkl", "w") )
pickle.dump( from_data, open("your_email_authors.pkl", "w") )

emails processed
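
One thing to keep in mind about the str.replace() loop in vectorize_text.py is that it removes substrings, so it also cuts a signature word out of any longer word that contains it. The sketch below contrasts that behaviour with a whole-word removal based on a regular expression. Both helper names and the sample sentence are made up, and the whole-word variant is an alternative shown for comparison, not the method the mini-project asks for.

```python
import re


def remove_substrings(text, words):
    """Same approach as the str.replace() loop: drop every occurrence,
    even when it sits inside a longer word."""
    for word in words:
        text = text.replace(word, "")
    return text


def remove_whole_words(text, words):
    """Alternative: drop a word only when it stands alone between word boundaries."""
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(w) for w in words))
    return pattern.sub("", text)


sample = "chris sent the christma card sshacklensf"
signature_words = ["chris", "sshacklensf"]
print(repr(remove_substrings(sample, signature_words)))
print(repr(remove_whole_words(sample, signature_words)))
```

The substring version turns "christma" into "tma", while the whole-word version leaves it alone. Switching approaches would change the pickled word data, so the feature numbers and importances reported in this notebook would not necessarily be reproduced.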


In [22]:
"""
Rerun vectorize_text.py, and once that finishes, rerun find_signature.py. Any other outliers pop up? What word is it? Seem like 
a signature-type word? (Define an outlier as a feature with importance > 0.2, as before).
"""
"""
find_signature.py
"""
#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)

### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "your_word_data.pkl" 
authors_file = "your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )

### test_size is the fraction of events assigned to the test set (0.1 = 10%;
### the remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, 
                                                                                             test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

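### default settings, so the tree is free to grow until it fits the training set
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
### accuracy_score takes the true labels first, then the predictions
accuracy = accuracy_score(labels_test, pred)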

print 'DTree Accuracy:', accuracy

labels = clf.classes_
featimp = clf.feature_importances_
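### print every feature whose importance exceeds the outlier threshold
for k, feat in enumerate(featimp):
    if feat > 0.2:
        print 'The feature %d has an importance of: %f' % (k, feat)

### index of the single most important feature (argmax, not the last one printed)
maxfeat = featimp.argmax()
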
feat_names = vectorizer.get_feature_names()
print 'The word with the highest importance is: %s' % feat_names[maxfeat]

DTree Accuracy: 0.816837315131
The feature 21323 has an importance of: 0.363636
The word with the highest importance is: houectect
