Combines all essays from okc dataframe into one long essay with markdown removed.
Saves result to new .csv
Performs several different Tf-idf vectorization and stemming on the long essay.

This code should be able to be adapted to run preprocess each of the shorter essays and save tf-idf versions.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup    
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
okc = pd.read_csv('../Assets/A/train.csv')

In [3]:
# Remove redundant index column
okc = okc.drop('Unnamed: 0', axis=1)

In [4]:
# Create list of all columns that are essays
essay_list = [('essay%i') %i for i in range(10)]

In [5]:
# Replace empty essays with ' '

okc.ix[:,essay_list] = okc.ix[:,essay_list].replace(np.nan,'', regex=True)

In [6]:
def essay_to_words( raw_essay ):
    
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    text = BeautifulSoup(raw_essay, 'lxml').get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z0-9 ]", " ", text) 
    #
    # 3. Convert to lower case
    words = letters_only.lower()                            

    return words

In [7]:
# Write new column to df that contains all essays
okc['essays'] = (okc.essay0 + ' ' + okc.essay1 + ' ' + okc.essay2 + ' ' + okc.essay3 + ' ' + okc.essay4 + ' ' 
              + okc.essay5 + ' ' + okc.essay6 + ' ' + okc.essay7 + ' ' + okc.essay8 + ' ' + okc.essay9)
okc['essays'] = okc.essays.apply(essay_to_words)

In [8]:
# Preprocess essay short essays
for subject in essay_list:
    okc[subject] = okc[subject].apply(essay_to_words)

  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP cl

In [9]:
okc.shape

(53951, 32)

In [10]:
okc.to_csv('../Assets/A/one_long_essay.csv')

In [11]:
len(okc.education.value_counts())

32

## Apply Porter Stemmer to OKC['essays']

In [12]:
from nltk.stem.porter import *

In [13]:
stemmer = PorterStemmer()

In [14]:
def stem(essay):
    stems = [stemmer.stem(word) for word in essay.lower().split()]
    return ' '.join(stems)

In [15]:
okc['stemmed_essays'] = okc['essays'].apply(stem)

In [16]:
okc.to_csv('../Assets/A/stemmed_essays.csv')

## Tfidf Vectorize Stemmed Essays
#### Save top performing vectorizer (I think!  This could stand to be evaluated later)
Vectorizing is slow, as is building the model.  But the best model I have for predicting sex to date came from this vectorizer.  Use as default for now?

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range = (1, 1), encoding='utf-8', stop_words = 'english', binary = False, max_features = 2000)
top_ngrams = vectorizer.fit_transform(okc['essays'])

# Save dataframe with feature names and tf-idf scores for each user
df  = pd.DataFrame(top_ngrams.todense(), columns=vectorizer.get_feature_names())

df.to_csv('../Assets/A/top_2000_words_nomax_stemmed.csv')

KeyboardInterrupt: 

## Do Truncated SVD on Top Ngrams

In [18]:
df = pd.read_csv('../Assets/A/top_2000_words_nomax_stemmed.csv')

In [19]:
# http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data

svd = TruncatedSVD(n_components=100, random_state=42)
essay_svd = svd.fit_transform(df)

In [20]:
essay_svd.shape

(57809, 100)

In [21]:
essay_svd_df = pd.DataFrame(essay_svd)

In [22]:
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())

[  9.99999997e-01   1.95927900e-13   1.19045840e-13   9.91905880e-14
   9.57447442e-14   9.14094746e-14   8.93792811e-14   8.80001224e-14
   8.64392766e-14   8.52646941e-14   8.21052201e-14   8.17178908e-14
   7.91041986e-14   7.80085986e-14   7.78410989e-14   7.71049506e-14
   7.66750582e-14   7.54482574e-14   7.44560570e-14   7.36827419e-14
   7.20000450e-14   7.15778112e-14   7.13943985e-14   6.98770778e-14
   6.86670633e-14   6.80636892e-14   6.75334947e-14   6.70425768e-14
   6.58596979e-14   6.54314038e-14   6.52688992e-14   6.42402532e-14
   6.39039137e-14   6.27288015e-14   6.21575577e-14   6.18632394e-14
   6.15484096e-14   6.14388490e-14   6.08286898e-14   6.00431302e-14
   5.95679354e-14   5.92107423e-14   5.88536951e-14   5.82645255e-14
   5.78359153e-14   5.73604469e-14   5.69530614e-14   5.62316971e-14
   5.53844132e-14   5.49062150e-14   5.43686197e-14   5.39290058e-14
   5.35020992e-14   5.26848955e-14   5.21804810e-14   5.17714260e-14
   5.10051074e-14   5.07588372e-14

Sklearn recommends 100 components of truncated SVD for LSA
In this case, the first feature explains 99.9999997% of the variance.
WTF?

In [23]:
essay_svd_df.to_csv('../Assets/A/long_essay_SVD.csv')