Combines all essays from okc dataframe into one long essay with markdown removed.
Saves result to new .csv
Performs several different Tf-idf vectorization and stemming on the long essay.

This code should be able to be adapted to run preprocess each of the shorter essays and save tf-idf versions.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup    
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
okc = pd.read_csv('../Assets/A/train.csv')

In [3]:
# Remove redundant index column
okc = okc.drop('Unnamed: 0', axis=1)

In [4]:
# Create list of all columns that are essays
essay_list = [('essay%i') %i for i in range(10)]

In [5]:
# Replace empty essays with ' '

okc.ix[:,essay_list] = okc.ix[:,essay_list].replace(np.nan,'', regex=True)

In [6]:
def essay_to_words( raw_essay ):
    
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    text = BeautifulSoup(raw_essay, 'lxml').get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z0-9 ]", " ", text) 
    #
    # 3. Convert to lower case
    words = letters_only.lower()                            

    return words

In [7]:
# Write new column to df that contains all essays
okc['essays'] = (okc.essay0 + ' ' + okc.essay1 + ' ' + okc.essay2 + ' ' + okc.essay3 + ' ' + okc.essay4 + ' ' 
              + okc.essay5 + ' ' + okc.essay6 + ' ' + okc.essay7 + ' ' + okc.essay8 + ' ' + okc.essay9)
okc['essays'] = okc.essays.apply(essay_to_words)

In [8]:
# Preprocess essay short essays
for subject in essay_list:
    okc[subject] = okc[subject].apply(essay_to_words)

  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP cl

In [9]:
okc.shape

(53951, 32)

In [10]:
okc.to_csv('../Assets/A/one_long_essay.csv')

In [11]:
len(okc.education.value_counts())

32

## Apply Porter Stemmer to OKC['essays']

In [12]:
from nltk.stem.porter import *

In [13]:
stemmer = PorterStemmer()

In [14]:
def stem(essay):
    stems = [stemmer.stem(word) for word in essay.lower().split()]
    return ' '.join(stems)

In [15]:
okc['stemmed_essays'] = okc['essays'].apply(stem)

In [16]:
okc.to_csv('../Assets/A/stemmed_essays.csv')

## Tfidf Vectorize Stemmed Essays
#### Save top performing vectorizer (I think!  This could stand to be evaluated later)
Vectorizing is slow, as is building the model.  But the best model I have for predicting sex to date came from this vectorizer.  Use as default for now?

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range = (1, 2), encoding='utf-8', stop_words = 'english', binary = False, max_features = 2000)
top_ngrams = vectorizer.fit_transform(okc['essays'])

# Save dataframe with feature names and tf-idf scores for each user
df  = pd.DataFrame(top_ngrams.todense(), columns=vectorizer.get_feature_names())

df.to_csv('../Assets/A/top_2000_ngrams_nomax_stemmed.csv')

## Do Truncated SVD on Top Ngrams

In [18]:
# http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data

svd = TruncatedSVD(n_components=100, random_state=42)
essay_svd = svd.fit_transform(df)

In [19]:
essay_svd.shape

(53951, 100)

In [20]:
essay_svd_df = pd.DataFrame(essay_svd)

In [21]:
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())

[ 0.01298211  0.00823619  0.00723189  0.00614003  0.00519215  0.00478322
  0.003961    0.00387938  0.00375349  0.00349132  0.00335581  0.0032838
  0.00317054  0.00308899  0.00299966  0.00296526  0.00289378  0.0026742
  0.00264556  0.00259329  0.0024986   0.00248231  0.00243665  0.00240046
  0.00237489  0.0023301   0.00228093  0.00225145  0.0021911   0.00216639
  0.00212451  0.00209353  0.00205047  0.00203596  0.00201471  0.00196218
  0.00195136  0.00192217  0.00190914  0.00188377  0.00185887  0.00184226
  0.00183621  0.00181765  0.00180909  0.00177425  0.00175183  0.00173592
  0.00171739  0.00167146  0.00166378  0.00165274  0.00163639  0.00162453
  0.00159918  0.00158439  0.00157591  0.00156791  0.00154525  0.00154421
  0.00151826  0.00150861  0.00149678  0.00148457  0.00147404  0.00146694
  0.00145332  0.00144285  0.00143762  0.00142862  0.00142263  0.00141326
  0.00141209  0.00138217  0.00137853  0.00137482  0.00136464  0.00135783
  0.00134484  0.00133568  0.00132448  0.00131798  0.0

Sklearn recommends 100 components of truncated SVD for LSA
In this case, the first feature explains 99.9999997% of the variance.
WTF?

In [22]:
essay_svd_df.to_csv('../Assets/A/long_essay_SVD.csv')