# W207 Final Project Submission
### Ross Boberg, Sarah Neff, Sam Zaiss

This notebook documents our exploration for the <a href="http://www.kaggle.com/c/random-acts-of-pizza">Random Acts of Pizza</a> kaggle competition as part of the W207 Machine Learning course for UC Berkeley's MIDS program. We document the individual areas of exploration that we completed for this project, followed by the larger model that pulled these explorations together.

### Problem Description
The goal of the Random Acts of Pizza kaggle projects is to translate requests for pizza on the Reddit group "Random Acts of Pizza" in to predictions of whether or not they those requests are fulfilled. The data includes the text request, split in to the title and body of requests for pizza, as well as metadata about the request.

These metadata include:
<ul>
<li>time of request (UTC and local)
<li>numeric data about user activity:
    <ul>
    <li> 'requester_account_age_in_days_at_request'
    <li> 'requester_days_since_first_post_on_raop_at_request'
    <li> 'requester_number_of_comments_at_request'
    <li> 'requester_number_of_comments_in_raop_at_request'
    <li> 'requester_number_of_posts_at_request'
    <li> 'requester_number_of_posts_on_raop_at_request'
    <li> 'requester_number_of_subreddits_at_request'
    <li> 'requester_upvotes_minus_downvotes_at_request'
    <li> 'requester_upvotes_plus_downvotes_at_request'
    </ul>
<li>subreddit groups of the user</li>
<li>Reddit user id</li>
<li>request id to identify the request in submission</li>
</ul>

There are 4040 samples of requests in the exposed data, 994 of which were successful.
Theare are 1631 samples of unlabeled test data, that Kaggle will test our predictions on

<a id="top"></a>
#### Table of Contents
<ol>
<li><a href="#part1">Data Import and Base Methods</a></li>
<li><a href="#part2">Activity Features</a></li>
<li><a href="#part3">Text Bag of Words</a>
<ul>
<li>Simple</li>
<li>L1 Feature Regularization</li>
<li>Time</li><br/>
</ul>
</li>
<li><a href="#part4">Time Features</a></li>
<li><a href="#part5">Interesting Words &amp; Category Tags</a></li>
<li><a href="#part6">Request Quality</a></li>
<li><a href="#part7">Text Summary Features</a>
</li>
<li><a href="#part8">Location Features</a></li>
<li><a href="#part9">Parts of Speech</a></li>
<li><a href="#part10">Subreddits</a></li>
<li><a href="#part11">Final, Composite Model</a></li>
<br/>
<li><a href="#part12">Notes on Error Analysis</a></li>
<li><a href="#part13">Appendix - additional goodness</a></li>
</ol>

<a id="part1"></a>
## 1. Data Import and Base Methods

In [6]:
import json
import csv
import numpy as np
import random as rand
import pandas as pd
import scipy as scipy
import datetime as dt
import time
from bs4 import BeautifulSoup
from urllib import urlopen
import re

%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.mixture import GMM
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.decomposition import RandomizedPCA
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from sklearn.grid_search import GridSearchCV

#useful for text processing
from nltk import word_tokenize

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
### Helper methods that will be used often in this notebook
from datautil import load_json_file
from datautil import make_submission_csv
from datautil import name2index
from mlutil import balance_samples
from mlutil import oversample_kfold
from mlutil import test_kfolds
from mlutil import print_scores

from tokenizers import LemmaTokenizer
from tokenizers import SnowballStemTokenizer
from tokenizers import PorterStemTokenizer
from tokenizers import PuncTokenizer
from tokenizers import SpaceTokenizer
from tokenizers import PTBTokenizer



Sets up the problem. We set the random seed to 207 (the class listing #!) so our results are replicable. We also shuffle our training data to make sure there are no issues of ordering that confuse the learning algorithms (e.g. all the postiive examples at the beginning). The training data has some variables that the submission data does not, so we make sure to ignore those because they will be useless for prediction.

We also set up the K-folds we will use for cross validation in the rest of the notebook. We declare it now to allow us to get consistent results on our experiments. We use 10 fold validation (train our model on 90% of our data and testing it on the other 10%, on each non overlapping 10% of the data)

In [8]:
from datautil import load_raop_data

### Set up the training and test data to work with throughout the notebook:
rseed = 207
np.random.seed(rseed)

all_train_df, all_train_labels, submit_df = load_raop_data()

# useful for sklearn scoring
roc_scorer = make_scorer(roc_auc_score)
n_all = all_train_df.shape[0]

# set up kFolds to be used in the rest of the project
kf = KFold(n_all, n_folds = 10, random_state=rseed)

y = all_train_labels
kf_over = oversample_kfold(kf, y)

<a id="part2"></a>
## 2. Activity Features

In [10]:
from transformers import ExtractColumnsTransformer
from transformers import ExtractActivities

This model for testing the activities demonstrates the importance of accounting for unbalanced classes in this problem.
With no adjustment, the model has zero value.

We adjust via two methods, either the class_weight parameter if the estimator has it, which weights errors of the less frequent class higher ot make sure the model actually tries to predict them, or by oversampling the minority class in the training set. The function that adjusts the k-folds to do that is at the beginning of the notebook: oversample_kfold

In [11]:
### Explore models using the activity features only.
#Activities = ExtractActivities(all_train_df)

# The main concern here is weighting classes appropriately, so we do an investigation of different
# approaches and see how well the resulting model performs on 10 folds of the training data.

print 'Equal Class Weights'
#pipe = Pipeline([('activity',ExtractActivitiesOld()), ('scale', StandardScaler()),('svc', SVC(random_state=rseed))])
#print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))
pipe = Pipeline([('activity',ExtractActivities()), ('scale', StandardScaler()),('svc', SVC(random_state=rseed))])
print_scores(cross_val_score(pipe, all_train_df, all_train_labels, cv=kf, scoring=roc_scorer))



print '\nReweighted Classes'
#wt_pipe = Pipeline([('activity',ExtractActivitiesOld()), ('scale', StandardScaler()),('svc', SVC(random_state=rseed, class_weight='auto'))])
#print_scores(cross_val_score(wt_pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))
wt_pipe = Pipeline([('activity',ExtractActivities()), ('scale', StandardScaler()),('svc', SVC(random_state=rseed, class_weight='auto'))])
print_scores(cross_val_score(wt_pipe, all_train_df, all_train_labels, cv=kf, scoring=roc_scorer))


print '\nRebalanced Sample'
rebal_pipe = Pipeline([('activity',ExtractActivities()), ('scale', StandardScaler()),('svc',SVC(random_state=rseed))])

# oversample training data in kfolds - function defined above
# necessary for some estimators that don't have class weight parameters
kf_over = oversample_kfold(kf, all_train_labels)
#print_scores(cross_val_score(rebal_pipe, all_train_df.values, all_train_labels, cv=kf_over, scoring=roc_scorer))
print_scores(cross_val_score(rebal_pipe, all_train_df, all_train_labels, cv=kf_over, scoring=roc_scorer))


Equal Class Weights
N: 10, Mean: 0.508836, Median: 0.509778, SD: 0.007667

Reweighted Classes
N: 10, Mean: 0.558866, Median: 0.556987, SD: 0.021346

Rebalanced Sample
N: 10, Mean: 0.555097, Median: 0.563723, SD: 0.024857


### Results Table += Activity Features

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 10 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>
<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

<a id="part3"></a>
## 3. Bag of Words

###3a. Simple

In [12]:
hasattr(np.array([1,2]), '__iter__')

True

In [13]:
### Define quick classes that we can use to isolate the title and body columns in our data.
from transformers import ExtractBody, ExtractTitle, ExtractAllText, ExtractUser
from transformers import ConcatStringTransformer, DesparseTransformer, TokenizeTransformer
from transformers import TwitterPrep, WordvecTransformer, AverageWordvec, MaxPool, MinPool
from transformers import PrepAndVectorize

In [15]:

#vectorize = Pipeline([('prep', TwitterPrep()),('tknzr', TokenizeTransformer(word_tokenize, rejoin_angle=True)),('wordvec', WordvecTransformer())])
body_vecs = Pipeline([('body', ExtractBody()), ('vec', PrepAndVectorize(d=50))]).fit_transform(X=all_train_df,y=1)
title_vecs = Pipeline([('title', ExtractTitle()), ('vec', PrepAndVectorize(d=50))]).fit_transform(X=all_train_df,y=1)

In [35]:
from nnutil import open_costs, cost_iter_summary, cost_iter_compare
from nnutil import save_experiment, list_experiments, ppdf
list_experiments(results=False)

*********
2deepnn_20160504.bin
---
2016-05-04_22:09:03
---
Experiment comparing a deeper NN (nnmx2 has a 2 pre-pooled layers and
2 post-pool layers) to a shallower one (1 layer each). The deeper NN did not add
value. The best learning rates for the deeper NN were .005 and .01, and it did better
for higher hidden dimensions (100 was better than 50). Dropout improved the new NN
a bit.
*********
*********
alpha_comp_hdim100_dropp0.5_20160426.bin
---
2016-04-26_21:14:10
---
Experiment comparing five alphas between 1e-4 to 1e-1
where hdim is 100 and drop prob is 50%. 10 kfolds, 50 epochs.
*********
*********
anneal_alpha_test_hdim100_dropp50_20160426.bin
---
2016-04-26_21:32:41
---
Experiment comparing 4 annealing alpha schedules to the best
static alpha from a previous experiment where hdim is 100 and drop prob is 50%.
10 kfolds, 50 epochs. The best annealing strategy (starting alpha = 0.005) peaks
around 61.2% median ROC after 120k iterations. This is a small improvement on 
a static alph

In [50]:
likes = ['.*']

results = cost_iter_compare(likes=likes, verbose=False)#.sort(['count', 'median'])
ppdf(results)

      count  \
0    0        
1    40000    
2    80000    
3    120000   
4    160000   
5    181800   
6    0        
7    40000    
8    80000    
9    120000   
10   160000   
11   181800   
12   0        
13   40000    
14   80000    
15   120000   
16   160000   
17   181800   
18   0        
19   40000    
20   80000    
21   120000   
22   160000   
23   181800   
24   0        
25   40000    
26   80000    
27   120000   
28   160000   
29   181800   
30   0        
31   40000    
32   80000    
33   120000   
34   160000   
35   181800   
36   0        
37   40000    
38   80000    
39   120000   
40   160000   
41   181800   
42   0        
43   40000    
44   80000    
45   120000   
46   160000   
47   181800   
48   0        
49   40000    
50   80000    
51   120000   
52   160000   
53   181800   
54   0        
55   40000    
56   80000    
57   120000   
58   160000   
59   181800   
60   0        
61   40000    
62   80000    
63   120000   
64   160000   
65   18180

In [49]:
import regex
fields = []
for field_str in results.id:
    x = regex.match(r'(_*([^=]+)=([^_]+)($|_))+',field_str)
    fields += [{i[0]:i[1] for i in zip(x.captures(2), x.captures(3))}]
df = pd.DataFrame(fields, results.index)
drop_all_same = True
if drop_all_same:
    for col in df.columns.values:
        if len(pd.unique(df[col])) == 1:
            df = df.drop(col,1)

pd.set_option('display.max_rows', len(df))
pd.concat((results,df), axis=1)

Unnamed: 0,count,id,mean,median,n,std,alpha,annealevery,drop_p,hdim,...,timestamp,wdim,alpha.1,annealevery.1,drop_p.1,hdim.1,model,rho,timestamp.1,wdim.1
0,0,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=100_odim=2_timestamp=20160508130706,0.474289,0.473089,10,0.030941,0.005,0,0.0,100,...,20160508130706,50,0.005,0,0.0,100,CNN1,0.0001,20160508130706,50
1,40000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=100_odim=2_timestamp=20160508130706,0.590682,0.594108,10,0.023406,0.005,0,0.0,100,...,20160508130706,50,0.005,0,0.0,100,CNN1,0.0001,20160508130706,50
2,80000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=100_odim=2_timestamp=20160508130706,0.594351,0.597104,10,0.023872,0.005,0,0.0,100,...,20160508130706,50,0.005,0,0.0,100,CNN1,0.0001,20160508130706,50
3,120000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=100_odim=2_timestamp=20160508130706,0.593933,0.591766,10,0.020833,0.005,0,0.0,100,...,20160508130706,50,0.005,0,0.0,100,CNN1,0.0001,20160508130706,50
4,160000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=100_odim=2_timestamp=20160508130706,0.557688,0.549756,10,0.027903,0.005,0,0.0,100,...,20160508130706,50,0.005,0,0.0,100,CNN1,0.0001,20160508130706,50
5,181800,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=100_odim=2_timestamp=20160508130706,0.553809,0.544198,10,0.024262,0.005,0,0.0,100,...,20160508130706,50,0.005,0,0.0,100,CNN1,0.0001,20160508130706,50
6,0,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=50_odim=2_timestamp=20160508130707,0.513923,0.522168,10,0.036075,0.005,0,0.0,50,...,20160508130707,50,0.005,0,0.0,50,CNN1,0.0001,20160508130707,50
7,40000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=50_odim=2_timestamp=20160508130707,0.586888,0.591852,10,0.023437,0.005,0,0.0,50,...,20160508130707,50,0.005,0,0.0,50,CNN1,0.0001,20160508130707,50
8,80000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=50_odim=2_timestamp=20160508130707,0.592071,0.595818,10,0.0211,0.005,0,0.0,50,...,20160508130707,50,0.005,0,0.0,50,CNN1,0.0001,20160508130707,50
9,120000,drop_p=0.0_rseed=207_printevery=40000.0_annealevery=0_epochs=50_minibatch=0_context=1_wdim=50_rho=0.0001_alpha=0.005_model=CNN1_hdim=50_odim=2_timestamp=20160508130707,0.588914,0.58542,10,0.022585,0.005,0,0.0,50,...,20160508130707,50,0.005,0,0.0,50,CNN1,0.0001,20160508130707,50


In [106]:
df

Unnamed: 0,alpha,annealevery,drop_p,hdim,model,rho
0,0.005,0,0.0,100,CNN1,0.0001
1,0.005,0,0.5,100,CNN1,1e-05
2,0.01,0,0.5,100,CNN1,1e-05
3,0.005,10,0.5,100,CNN1,1e-05
4,0.01,10,0.5,100,CNN1,1e-05
5,0.005,0,0.5,50,CNN2,1e-05
6,0.005,0,0.5,50,CNN1,1e-05
7,0.005,10,0.5,50,CNN1,1e-05
8,0.01,10,0.5,50,CNN1,1e-05
9,0.005,0,0.5,50,CNN2,1e-08


In [None]:
likes = []
likes += ['(?=model=nnmx.*wdim=50_hdim=100.*).*alpha=0\.005.*rho=1e-05.*dropp=0\.5_rseed=207.*mb=False.*alphaiter=default.*devlen=404.*']
likes += ['(?=model=nnmx2.*hdim=50.*alpha=0\.005.*rho=0\.0001).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=50.*alpha=0\.01.*rho=0\.001).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=50.*alpha=0\.001.*rho=0\.001).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=50.*alpha=0\.005.*rho=1e-05).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=50.*alpha=0\.005.*rho=1e-06).*dropp=0\.5_rseed=209']

likes += ['(?=model=nnmx2.*hdim=100.*alpha=0\.005.*rho=0\.0001).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=100.*alpha=0\.01.*rho=0\.001).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=100.*alpha=0\.001.*rho=0\.001).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=100.*alpha=0\.005.*rho=1e-05).*dropp=0\.5_rseed=209']
likes += ['(?=model=nnmx2.*hdim=100.*alpha=0\.005.*rho=1e-06).*dropp=0\.5_rseed=209']

likes += ['(?=.*dropp=0\.0.*)model=nnmx2.*hdim=100.*alpha=0\.005.*rho=0\.0001.*_rseed=209']
likes += ['(?=.*dropp=0\.0.*)model=nnmx2.*hdim=100.*alpha=0\.01.*rho=0\.001.*_rseed=209']
likes += ['(?=.*dropp=0\.0.*)model=nnmx2.*hdim=100.*alpha=0\.001.*rho=0\.001.*_rseed=209']
likes += ['(?=.*dropp=0\.0.*)(?=model=nnmx2.*hdim=100.*alpha=0\.005.*rho=1e-05).*rseed=209']
likes += ['(?=.*dropp=0\.0.*)(?=model=nnmx2.*hdim=100.*alpha=0\.005.*rho=1e-06).*rseed=209']

results = cost_iter_compare(likes=likes, verbose=False).sort(['count', 'median'])


#save_experiment(results.sort(['count','median']), '2deepnn',
#                """Experiment comparing a deeper NN (nnmx2 has a 2 pre-pooled layers and
#2 post-pool layers) to a shallower one (1 layer each). The deeper NN did not add
#value. The best learning rates for the deeper NN were .005 and .01, and it did better
#for higher hidden dimensions (100 was better than 50). Dropout improved the new NN
#a bit.""")

In [None]:
results[results['count']==120000]

In [None]:
likes = []
likes += ['(?=.*wdim=50_hdim=100.*).*alpha=0\.005.*rho=1e-05.*dropp=0\.5_rseed=207.*mb=False.*alphaiter=default.*devlen=404.*']
likes += ['(?=.*rho=1e-05.*).*wdim=100_hdim=100.*']
likes += ['(?=.*rho=0\.0001.*).*wdim=100_hdim=100.*']
likes += ['(?=.*rho=0\.001.*).*wdim=100_hdim=100.*']
likes += ['(?=.*rho=1e-05.*).*wdim=100_hdim=200.*']
likes += ['(?=.*rho=0\.0001.*).*wdim=100_hdim=200.*']
likes += ['(?=.*rho=0\.001.*).*wdim=100_hdim=200.*']

results = cost_iter_compare(likes=likes, verbose=False).sort(['count', 'median'])

# save_experiment(results.sort(['count','median']), 'rho_test_for_wdim_100',
#                """Experiment to test improvement in results for larger word vectors with
# higher regularization constant rho. The best results for wdim=100 did not beat
# the prevailing model with wdim=50, but improved a bit on the wdim=100 results with
# the regularization that worked for the smaller models of rho=1e-05""")

In [None]:

likes = []
likes += ['(?=.*alpha=0\.005.*alphaiter=default).*hdim=100.*dropp=0\.5_rseed=207.*devlen=404.*']
likes += ['(?=.*alpha=0\.005.*alphaiter=anneal18180).*hdim=100.*dropp=0\.5_rseed=207.*devlen=404.*']
likes += ['(?=.*alpha=0\.01.*alphaiter=anneal18180).*hdim=100.*dropp=0\.5_rseed=207.*devlen=404.*']
likes += ['(?=.*alpha=0\.005.*alphaiter=anneal36360).*hdim=100.*dropp=0\.5_rseed=207.*devlen=404.*']
likes += ['(?=.*alpha=0\.005.*alphaiter=anneal72720).*hdim=100.*dropp=0\.5_rseed=207.*devlen=404.*']

results = cost_iter_compare(likes=likes, verbose=False)
results.sort(['count','median'])

#save_experiment(results.sort(['count','median']), 'anneal_alpha_test_hdim100_dropp50',
#                """Experiment comparing 4 annealing alpha schedules to the best
#static alpha from a previous experiment where hdim is 100 and drop prob is 50%.
#10 kfolds, 50 epochs. The best annealing strategy (starting alpha = 0.005) peaks
#around 61.2% median ROC after 120k iterations. This is a small improvement on 
#a static alpha which peaked at 60.5% after 80k iterations. The best part of the annealing
#strategies is that they overfit less (which make sense because they use a smaller
#learning rate as time goes on)""")

In [None]:
list_experiments(results=True)

In [None]:
MaxPool().fit_transform(X=body_vecs, y=1)

In [None]:
MaxPool().fit_transform(X=title_vecs, y=1)

In [None]:
lsvc = LinearSVC(class_weight='auto', C = 2, random_state=rseed)
etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')
gbc = GradientBoostingClassifier(n_estimators = 200,
                            learning_rate=0.01,
                           max_depth = 3,
                           min_samples_split=10,
                           random_state = rseed)

In [None]:
title_avg = AverageWordvec().fit_transform(title_vecs, y=1)
body_avg = AverageWordvec().fit_transform(body_vecs, y=1)
title_max = MaxPool().fit_transform(title_vecs, y=1)
body_max = MaxPool().fit_transform(body_vecs, y=1)
title_min = MinPool().fit_transform(title_vecs, y=1)
body_min = MinPool().fit_transform(body_vecs, y=1)

In [None]:
np.concatenate((title_avg, body_avg),axis=1).shape

In [None]:
print_scores(cross_val_score(Pipeline([('wvec', AverageWordvec()),('lsvc', lsvc)]), title_vecs, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('wvec', AverageWordvec()),('lsvc', lsvc)]), body_vecs, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('lsvc', lsvc)]), np.concatenate((title_avg, body_avg),axis=1), all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('wvec', AverageWordvec()),('etc', etc)]), title_vecs, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('wvec', AverageWordvec()),('etc', etc)]), body_vecs, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('etc', etc)]), np.concatenate((title_avg, body_avg),axis=1), all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('wvec', MaxPool()),('etc', etc)]), body_vecs, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('wvec',FeatureUnion([('max', MaxPool()), ('min', MinPool())])),('etc', etc)]), body_vecs, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
print_scores(cross_val_score(Pipeline([('etc', etc)]), np.concatenate((title_max, body_max, title_min, body_min),axis=1), all_train_labels, cv=kf, scoring=roc_scorer))

# TODO / Notes
<ul>
<li> DONE Look at success of title + body - not a major determinant of success or not
<li> DONE Try word vec vars with L1 regularization - did not add much value over body max + min with ETC
<li> DONE Implement a nonlinear layer over each word vector then max pool the results of that layer then pass to softmax / LR layer - Ran on One Train/Test split and there seems to be significant improvement peaking at ROC AUC of 0.619 around 50 - 70 epochs.
<li> DONE Script to save experiments. Goal - recover inputs to experiments and results. Execution - save final NN, any training inputs, and occasional results on train and test.
<li> DONE Try NN on all k-folds - not an improvement over bag of words - best median kfold was about a ROC AUC of 0.58 after about 40k training steps
<li> DONE: add regularization
<li> DONE: train and test kfolds with regularizations - regularizaiton with rho = 1e-3 is a slight improvement, leading to mean/median ROC AUC of about 0.584 after 40k training steps instead of 0.58
<li> DONE: add dropout - after adding dropout and 200 dimension hidden layer, I got a median ROC AUC of 60 (after 40k examples), but the 0th iteration ROC AUC was 57 which is confusing - maybe just a really good intial weight configuration?
<li> DONE: try 200 dim hidden layer without drop out and 100 dim hidden layer with drop out for comparison - 100 dim hidden layer with drop out seemed to be best
<li> DONE: try changing the random seed and re running hdim 200 with drop out to see if results differ - indeed results were worse. with the original random seed (207) roc auc plateaued around 60 after 40k-80k examples. with another random seed (414) roc auc plateaued around 58.2 after 40k-80k examples. I think the problem is random weight initialization. Another symptom of this problem is that iteration 0 scores for train set differed in a big way between random seeds - suggsets to me that random weights were too high after increasing number of dimensions. Trying with new random weight scheme (fan in / fan out adjusted instead of just a random number).
<li> DONE: random weight adjusted by fan in / fan out - dev set results seems to fluctuate less given differnt random seed after changing this, results slightly more consistent. seed 207 still plateaus around 60 between 40k-80k. same for seed 414.
<li> DONE: minibatch training - No minibatch (mb=1) was definitely best, and progressively larger minibatches were worse. It's possible that other choices might make minibatches more attractive, for examplie different alpha or equal-weighted class samples, might leave this for later exploration.
<li> DONE: try different learning rates and annealed learning. I investigated different learning rats and annealed learning. The best static rates are between 0.005 and 0.001. Annealed learning works best starting at 0.005 and annealing every 10 epochs or so.
<li> DONE: try bigger word vectors - did not add any value. higher regularization improved results a bit for bigger word vectors. did not try different learning rates.
<li> TODO: add more convolution (bi grams etc)
<li> DONE: deeper layers before max pooling
<li> TODO: dropout softmax layer too
<li> DONE: deeper layers after max pooling
<li> DONE: IMplement CNN
<li> TODO: Test CNN
<li> TODO: Improve testing so that all kfolds are saved in a single folder which gets around the painful regex I currently do. Folder name should be the opts that were called and have the time at the end. This makes it easy to group things that were called together
<li> TODO: Implement RCNN in Lai Et.Al 2015
<li> NOTE: Overfitting is not a problem when there are two hidden layers before max pooling. The train set ROC AUC seems to max out around 60, whether I use drop out or not. If I drop one of those hidden layers it becomes a problem again (especially if no dropout), even if I keep an extra hidden layer after max pooling. 

In [34]:
X = np.array([np.random.normal(i,0.25,4) for i in range(5)])
wdim = X.shape[1]
for i in range(2):
    X = np.hstack((X,
                        np.vstack((np.zeros((i+1,wdim)), X[:-(i+1),:wdim])),
                        np.vstack((X[(i+1):,:wdim], np.zeros((i+1,wdim))))
                        ))
X

array([[-0.3574533 , -0.01407516, -0.38890564, -0.2832947 ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.13978047,  0.80192316,
         1.22392316,  1.40596724,  0.        ,  0.        ,  0.        ,
         0.        ,  1.9315785 ,  2.13177116,  1.68812958,  1.41823789],
       [ 1.13978047,  0.80192316,  1.22392316,  1.40596724, -0.3574533 ,
        -0.01407516, -0.38890564, -0.2832947 ,  1.9315785 ,  2.13177116,
         1.68812958,  1.41823789,  0.        ,  0.        ,  0.        ,
         0.        ,  2.98767746,  3.09027405,  3.23420206,  3.36251272],
       [ 1.9315785 ,  2.13177116,  1.68812958,  1.41823789,  1.13978047,
         0.80192316,  1.22392316,  1.40596724,  2.98767746,  3.09027405,
         3.23420206,  3.36251272, -0.3574533 , -0.01407516, -0.38890564,
        -0.2832947 ,  3.40379251,  4.02647582,  3.6753537 ,  4.21780272],
       [ 2.98767746,  3.09027405,  3.23420206,  3.36251272,  1.9315785 ,
         2.13177116,  1.68812958,  1.41823789,  

In [21]:
'_'.join(['{k}={v}'.format(k=k,v=v) for k,v in {'a':1,'b':2}.iteritems()])

'a=1_b=2'

In [None]:
tokenvec('mum',topdir,dirdepth)

In [None]:
### Reusable method for quick BOW investigations:
def simple_text(do_all=True, do_count=False,do_tfidf=False, do_titles=False, do_bodies=False, do_both=False, lowercase=False, tokenizer=None, stop_words=None):

    # Notes
    # results slightly better w/ lowercase = False (when unigrams only)
    # bigrams added no value on unigrams

    tv = TfidfVectorizer(ngram_range=(1,1),lowercase=lowercase, tokenizer=tokenizer, stop_words=stop_words)
    cv = CountVectorizer(ngram_range=(1,1),lowercase=lowercase, tokenizer=tokenizer, stop_words=stop_words)
    lsvc = LinearSVC(class_weight='auto', C = 2, random_state=rseed)
    
    body_cv = Pipeline([('body',ExtractBody()),('cv', cv)])
    body_tv = Pipeline([('body',ExtractBody()),('tv', tv)])
    
    title_cv = Pipeline([('title',ExtractTitle()),('cv', cv)])
    title_tv = Pipeline([('title',ExtractTitle()),('tv', tv)])

    if do_titles or do_all:
        if do_count or do_all:
            # Count Vectorizer Titles
            print '\nCount Vectorizer on Titles'
            
            pipe = Pipeline([('tranform',title_cv),('model',lsvc)])
            print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

        if do_tfidf or do_all:
            # TFIDF Vectorizer TItles
            print '\nTFIDF Vectorizer on Titles'
            
            pipe = Pipeline([('tranform',title_tv),('model',lsvc)])
            print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


    if do_bodies or do_all:
        if do_count or do_all:
            # Count Vectorizer Bodies
            print '\nCount Vectorizer on Bodies'
            
            pipe = Pipeline([('tranform',body_cv),('model',lsvc)])
            print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


        if do_tfidf or do_all:
            # TFIDF Vectorizer Bodies
            print '\nTFIDF Vectorizer on Bodies'
            
            pipe = Pipeline([('tranform',body_tv),('model',lsvc)])
            print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

        
    if do_both or do_all:
        if do_count or do_all:

            # Count Vectorizer Titles and Bodies
            print '\nCount Vectorizer on Titles and Bodies'
            
            pipe = Pipeline([
                ('features',FeatureUnion([
                    ('tranform_title',title_cv),
                    ('tranform_body',body_cv)
                ])),
                ('model',lsvc)])
            print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

            
        if do_tfidf or do_all:
            # TFIDF Vectorizer Titles and Bodies
            print '\nTFIDF Vectorizer on Titles and Bodies'
            
            pipe = Pipeline([
                ('features',FeatureUnion([
                    ('tranform_title',title_tv),
                    ('tranform_body',body_tv)
                ])),
                ('model',lsvc)])
            print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))
            
            

Our text model counts how many times each word in a vocabulary (learned from the text) is present in each request. We adjust that term frequency by inverse document frequency (how often that term occurs throughout the document) to overweight uncommon words, which often have more explanatory power.

We also play around with a few different tokenizers to see which get the best results. These tokenizers turn lots of text (like pizza requests) in to a series of words (tokens). The tokens can be "stemmed" which adjusts the word so that, for example, different tenses of the same verb have the same representation.

In [None]:
### Experimentation with different tokenizers

print "Examination of best vectorizer:"

print '\nDefault Vectorizer:'
simple_text(do_all = False, do_both=True, do_tfidf=True, do_count=True)

print '=================='

print 'Examination of best tokenizer'

print "\nSnowball Stem Tokenizer:"
simple_text(do_all=False, do_both=True, do_tfidf=True, tokenizer=SnowballStemTokenizer())

print "\Lemma Tokenizer:"
simple_text(do_all=False, do_both=True, do_tfidf=True, tokenizer=LemmaTokenizer())

### Results Table += Simple Bag of Words

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 10 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>

</table>

<a href="#top">Return to Table of Contents</a>

### 3b. L1 Feature Regularization

In [None]:
### Reusable class to process important terms
class LinearWeightFeatureThreshold(TransformerMixin):
    def __init__(
        self,
        model = LinearSVC(class_weight='auto', loss='squared_hinge', penalty='l1', dual=False, random_state=rseed),
        return_dense = True, #dense or sparse matrix
        C = 1, # C for L1
        threshold = 0.01, # threshold to keep
        verbose = 1 #tell how many features were kept
        ):
        self.model = model
        self.return_dense = return_dense
        self.C = C
        self.threshold = threshold
        self.verbose = verbose
    
    def fit(self, X, y):
        model = self.model
        threshold = self.threshold
        verbose = self.verbose
        C = self.C
        
        model.set_params(C=C)
        
        model.fit(X, y)
        
        # check which coefficients to keep
        coef = model.coef_
        sig_coef = (np.abs(coef) > threshold)[0]
        n_coef = np.sum(sig_coef)
        
        
        if verbose > 0:
            print 'kept %d/%d features' % (n_coef, coef.shape[1])
        
        # so we never return an empty vector if C was too low
        if n_coef == 0:
            sig_coef[0] = 1
        
        # save the significant coefficients
        self.sig_coef_  = sig_coef
        return self
    
    def transform(self, X, **transform_params):
        sig_coef = self.sig_coef_
        return_dense = self.return_dense
        
        X_new = X[:,sig_coef]
        
        if return_dense and (type(X_new) != type(np.array(1))):
            X_new = X_new.toarray()
            
        return X_new
    
    # methods needed to make this grid searchable
    def get_params(self, deep=True):
        # suppose this estimator has parameters "alpha" and "recursive"
        return {'C':self.C, 'threshold':self.threshold}
    
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

The text models above result in lots of variables, many of which will be completely useless. This noise can overwhelm the models, so we want to find ways to "regularize" or reduce the number of variables to only significant ones. We do this by "L1" regularization. L1 regularizaiton uses a model that calculates linear errors, which happens to result in a sparse number of variables actually used. We take the variables that were used in this sparser model, then feed them in to a model that uses squared errors, to get better results

We found L1 regularization with a Linear SVC to be an effective method for reducing the number features, especially coming out of term frequency matrices. We chose C=0.15 ('regularization' term that controls how sparse the variables in the model are) via grid search (by trying lots of different values and seeing what does best), and it tends to return about 40 features on our sample size

In [None]:
### Try ExtraTreesClassifier for the BOW models:

# C=0.15 arrived at via grid search, but took a long time so not included here.
l1 = LinearWeightFeatureThreshold(C=0.15)
tv = TfidfVectorizer(tokenizer=SnowballStemTokenizer())
lsvc = LinearSVC(class_weight='auto', C = 2, random_state=rseed)

etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

pipe_lsvc = Pipeline([('extract', ExtractBody()), ('tv',tv), ('features',l1), ('clf',lsvc)])
pipe_etc = Pipeline([('extract', ExtractBody()), ('tv',tv), ('features',l1), ('clf',etc)])


print '\nL1 Feature Reduction on Bodies w/ LSVC'
print_scores(cross_val_score(pipe_lsvc, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print '\nL1 Feature Reduction on Bodies w/ ETC'
tv = TfidfVectorizer(tokenizer=SnowballStemTokenizer())
lsvc = LinearSVC(class_weight='auto', C = 2, random_state=rseed)

etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

print_scores(cross_val_score(pipe_etc, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


### Results Table += Bag of Words w/ Feature Reduction

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 10 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>

</table>

<a href="#top">Return to Table of Contents</a>

<a id="part4"></a>
## 4. Time Features

In [None]:
### Reusable class for time features
### Reusable class for time features
DATE_TIME_COLUMN_DEFAULT = np.where(all_train_df.columns == 'unix_timestamp_of_request')[0][0]

class TimeTransformer(TransformerMixin):
    
    def __init__(self, date_time_column=DATE_TIME_COLUMN_DEFAULT, do_second=True, do_minute=True, do_hour=True, do_dow=True, do_day=True, do_month=True):
        self.date_time_column = date_time_column
        self.do_second = do_second
        self.do_minute = do_minute
        self.do_hour = do_hour
        self.do_dow = do_dow
        self.do_day = do_day
        self.do_month = do_month
        
    def fit(self, X, y, **fit_params):
        return self
    
    def extract_from_date_time_(self, dt, do_second, do_minute, do_hour, do_dow, do_day, do_month):
        extract = []
        if do_second:
            extract.append(dt.second)
        
        if do_minute:
            extract.append(dt.minute)
            
        if do_hour:
            extract.append(dt.hour)
            
        if do_dow:
            extract.append(dt.weekday())
            
        if do_day:
            extract.append(dt.day)
            
        if do_month:
            extract.append(dt.month)
            
        return extract
    
    def transform(self, X, **transform_params):
        date_time_column = self.date_time_column
        do_second = self.do_second
        do_minute = self.do_minute
        do_hour = self.do_hour
        do_dow = self.do_dow
        do_day = self.do_day
        do_month = self.do_month
        extract_from_date_time = self.extract_from_date_time_
        
        features = np.array([
            extract_from_date_time(dt.datetime.fromtimestamp(timei),
                                   do_second=do_second,
                                   do_minute=do_minute,
                                   do_hour=do_hour,
                                   do_dow=do_dow,
                                   do_day=do_day,
                                   do_month=do_month) for timei in X[:,date_time_column]
        ])
        
        return features
    
    def get_params(self, deep=True):
        return {'do_second':self.do_second,
                'do_minute':self.do_minute,
                'do_hour':self.do_hour,
                'do_dow':self.do_dow,
                'do_day':self.do_day,
                'do_month':self.do_month}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self


### Inspect Time Variables

We create several features derived from the date & time, visualize them here, and see how they perfrom in a simple model on their own. We checked each time transformation against logistic regression (a linear model) and a tree ensemble (a nonlinear model), because a linear method might not capture all the information in time.

Decision trees (a bunch of consecutive binary splits of the data based on variable values) can be a useful way to explore models where features my be nonlinear as may be the case with time features. For example, hour 23.5 (late at night) and 0.5 (so early in the morning it's still late at night) may be treated similarly by a linear model, but a decision tree, can create a couple splites and capture it easily (hour > 23 and hour < 1).

In [None]:

lr = LogisticRegression(random_state=rseed, class_weight='auto', fit_intercept=True)
etc = ExtraTreesClassifier(n_estimators=200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

In [None]:
### Visualizations to inspect time variables
# Exploring time features, it looks like requests are not as succesful at late nights /
# early mornings or on Mondays / Fridays...
# Though that could be because there's more requests on those days.

tt = TimeTransformer(do_minute=False, do_day=False, do_second=False, do_hour=False, do_dow=False, do_month=True)
print 'Logistic Regression:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',lr)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'Extra Trees:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',etc)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

# look at month success
fig = plt.figure(figsize=(8,5))
month = tt.transform(all_train_df.values).flatten()
month_pos = month[all_train_labels]
month_neg = month[np.logical_not(all_train_labels)]
pd.Series(month_pos).hist(bins=12, alpha=0.2, normed=True, label='Winner Values Greater than All')
pd.Series(month_neg).hist(bins=12, alpha=0.2, normed=True, label='Winner Values Less than All')
plt.title("RAOP Month of Message Submissions \n")
plt.xlabel("Month")
plt.ylabel("Frequency")
plt.rcParams['legend.fontsize'] = 10
plt.legend(loc='best')
plt.show()

In [None]:
tt = TimeTransformer(do_minute=False, do_day=True, do_second=False, do_hour=False, do_dow=False, do_month=False)
print 'Logistic Regression:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',lr)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'Extra Trees:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',etc)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))



# look at day success
fig = plt.figure(figsize=(8,5))
day = tt.transform(all_train_df.values).flatten()
day_pos = day[all_train_labels]
day_neg = day[np.logical_not(all_train_labels)]
pd.Series(day_pos).hist(bins=31, alpha=0.2, normed=True, label='Winner Values Greater than All')
pd.Series(day_neg).hist(bins=31, alpha=0.2, normed=True, label='Winner Values Less than All')
plt.title("RAOP Day Message Submissions \n")
plt.xlabel("Day")
plt.ylabel("Frequency")
plt.rcParams['legend.fontsize'] = 10
plt.legend(loc='best')
plt.show()

In [None]:
tt = TimeTransformer(do_minute=False, do_day=False, do_second=False, do_hour=False, do_dow=True, do_month=False)
print 'Logistic Regression:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',lr)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'Extra Trees:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',etc)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


# look at day of week success
fig = plt.figure(figsize=(8,5))
dow = tt.transform(all_train_df.values).flatten()
dow_pos = dow[all_train_labels]
dow_neg = dow[np.logical_not(all_train_labels)]
pd.Series(dow_pos).hist(bins=7, alpha=0.2, normed=True, label='Winner Values Greater than All')
pd.Series(dow_neg).hist(bins=7, alpha=0.2, normed=True, label='Winner Values Less than All')
plt.title("RAOP Day of Week Message Submissions \n")
plt.xlabel("Day of Week")
plt.ylabel("Frequency")
plt.rcParams['legend.fontsize'] = 10
plt.legend(loc='best')
plt.show()

In [None]:
tt = TimeTransformer(do_minute=False, do_day=False, do_second=False, do_hour=True, do_dow=False, do_month=False)
print 'Logistic Regression:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',lr)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'Extra Trees:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',etc)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


# look at hourly success
fig = plt.figure(figsize=(8,5))
hour = tt.transform(all_train_df.values).flatten()
hour_pos = hour[all_train_labels]
hour_neg = hour[np.logical_not(all_train_labels)]
pd.Series(hour_pos).hist(bins=24, alpha=0.2, normed=True, label='Winner Values Greater than All')
pd.Series(hour_neg).hist(bins=24, alpha=0.2, normed=True, label='Winner Values Less than All')
plt.title("RAOP Hour of Message Submissions \n")
plt.xlabel("Hour")
plt.ylabel("Frequency")
plt.rcParams['legend.fontsize'] = 10
plt.legend(loc='best')
plt.show()

In [None]:
tt = TimeTransformer(do_minute=True, do_day=False, do_second=False, do_hour=False, do_dow=False, do_month=False)
print 'Logistic Regression:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',lr)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'Extra Trees:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',etc)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))



# look at minute success
fig = plt.figure(figsize=(8,5))
minute = tt.transform(all_train_df.values).flatten()
minute_pos = minute[all_train_labels]
minute_neg = minute[np.logical_not(all_train_labels)]
pd.Series(minute_pos).hist(bins=60, alpha=0.2, normed=True, label='Winner Values Greater than All')
pd.Series(minute_neg).hist(bins=60, alpha=0.2, normed=True, label='Winner Values Less than All')
plt.title("RAOP Minute of Message Submissions \n")
plt.xlabel("Minute")
plt.ylabel("Frequency")
plt.rcParams['legend.fontsize'] = 10
plt.legend(loc='best')
plt.show()

In [None]:
tt = TimeTransformer(do_minute=False, do_day=False, do_second=True, do_hour=False, do_dow=False, do_month=False)
print 'Logistic Regression:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',lr)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'Extra Trees:'
print_scores(cross_val_score(Pipeline([('tt',tt),('model',etc)]), all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


# look at second success
fig = plt.figure(figsize=(8,5))
second = tt.transform(all_train_df.values).flatten()
second_pos = second[all_train_labels]
second_neg = second[np.logical_not(all_train_labels)]
pd.Series(second_pos).hist(bins=60, alpha=0.2, normed=True, label='Winner Values Greater than All')
pd.Series(second_neg).hist(bins=60, alpha=0.2, normed=True, label='Winner Values Less than All')
plt.title("RAOP Second of Message Submissions \n")
plt.xlabel("Second")
plt.ylabel("Frequency")
plt.rcParams['legend.fontsize'] = 10
plt.legend(loc='best')
plt.show()

### Putting Time Variables Together

Months, day of the month, and hour of the day all appear to be at least somewhat effective predictors.
It's worth noting that the hour of the message only adds value with the nonlinear classifier.
This suggests our final classifier should be nonlinear if it includes this feature.

In [None]:
### Try a couple classifiers for time features to find a good choice.
# Turns out time features don't perform well by themselves.

etc = ExtraTreesClassifier(n_estimators=200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

# only do month, day, hour
tt = TimeTransformer(do_minute=False, do_day=True, do_second=True, do_hour=False, do_dow=False, do_month=True)
# the day was borderline, so try one without
tt_noday = TimeTransformer(do_minute=False, do_day=False, do_second=True, do_hour=False, do_dow=False, do_month=True)

print '\nExtra Tree Ensemble Month, Day, Hour'

etc_pipe = Pipeline([
    ('time',tt),
    ('model',etc)
    ])

print_scores(cross_val_score(etc_pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print '\nExtra Tree Ensemble Month, Hour'

etc_pipe = Pipeline([
    ('time',tt_noday),
    ('model',etc)
    ])

print_scores(cross_val_score(etc_pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))


### Results Table += Time Features

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 5 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>

</table>

<a href="#top">Return to Table of Contents</a>

<a id="part5"></a>
## 5. Interesting Words & Category Tags

In [None]:
### Reusable class for interesting words
TITLE_COLUMN = np.where(all_train_df.columns == 'request_title')[0][0]
BODY_COLUMN = np.where(all_train_df.columns == 'request_text_edit_aware')[0][0]

# Useful method for getting length of text:
def lenArray(text, no_zero = True):
    lens = np.array([[float(len(x.encode('utf-8'))) for x in text]]).T
    lens[lens==0]=1
    return lens

class InterestingWordsTransformer(TransformerMixin):
    def __init__(self, title_col = TITLE_COLUMN, body_col=BODY_COLUMN, do_title=True, do_body=True, do_tags=True, do_words=True):
        self.do_title = do_title
        self.do_body = do_body
        self.do_tags = do_tags
        self.do_words = do_words
        self.title_col = title_col
        self.body_col = body_col
        
        # dictionary of keys = tags and values = word to find
        self.keywords = {
            'sad_food': ['hungry', 'starving', 'no food', 'grocer', 'eaten', 'hunger', 'ramen', 'empty', 'fridge', 'refrig'],
            'money': ['broke', 'paid', 'money', 'unemployed', 'lost', 'job', 'bill', 'wage', 'work', 'payday', 'paycheck', 'funds', 'cash', 'bank', 'laid off', 'poor', 'payroll'],
            'sad': ['worst', 'awful', 'sick', 'problem', 'catch a break', 'cheer', 'hospital', 'bad', 'shitty', 'stress', 'luck', ':(', 'rough', 'tough', 'battle', 'reasons', 'losing'],
            'military': ['military', 'veteran', 'soldier', 'army', 'navy', 'marine', 'air force', 'iraq', 'afghanis'],
            'happy': ['celebrate', 'birthday', 'party', 'new year', 'bday', 'engage', 'annivers', 'surprise', 'loves', 'best'],
            'nice': ['please', 'help', 'thank', ':)', 'helping', 'aid', 'exchange', 'spare', ':D',':-)'],
            'honest': ['sob story', 'honest', 'just want', 'just because'],
            'parent': ['family', 'kids', 'parent', 'mom', 'mommy', 'mother', 'dad', 'father', 'baby', 'boy', 'girl'],
            'relationship': ['husband', 'wife', 'girlfriend', 'boyfriend', 'fianc', 'roommate', 'married'],
            'test': ['study', 'test', 'final', 'midterm','student'],
            'time': ['yesterday', 'lately', 'never', 'during', 'sunday', 'constantly']
        }
    
    def find_tag_words(self, keywords, text):
        word_dict = {}
        tag_dict = {}

        for tag, words in keywords.iteritems():

            tag_count = None

            for word in words:
                # check for the word in the text
                has_word = np.array([(1 if word in t else 0) for t in text])
                word_dict[word] = has_word
                
                # count the words with the tag
                if tag_count is None:
                    tag_count = has_word
                else:
                    tag_count = tag_count +  has_word

            tag_dict[tag] = tag_count

        return (tag_dict, word_dict)
    
    # manually create keywords with categories
    
    def transform(self, X, **transform_params):
        do_title = self.do_title
        do_tags = self.do_tags
        do_words = self.do_words
        do_body = self.do_body
        keywords = self.keywords
        find_tag_words = self.find_tag_words
        body_col = self.body_col
        title_col = self.title_col
        
        features = []
        feature_names = []

        # find keywords and tags
        if do_title and not do_body:
            title_unicode = np.array([x.lower() for x in X[:,title_col]])
            title_tag_dict, title_word_dict = find_tag_words(keywords, title_unicode)
            
            # normalize appearence of important words by character length of text
            # because longer requests should have more hits
            lens = lenArray(X[:,body_col])
            
            # add frequency of tags
            if do_tags:
                features.append(pd.DataFrame(title_tag_dict).values/lens)
                feature_names.append('title_tags')
                
            # add frequency of words
            if do_words:
                features.append(pd.DataFrame(title_word_dict).values/lens)
                feature_names.append('title_words')

        if do_body and not do_title:
            body_unicode = np.array([x.lower() for x in X[:,body_col]])
            body_tag_dict, body_word_dict = find_tag_words(keywords, body_unicode)
            
            # normalize appearence of important words by character length of text
            # because longer requests should have more hits
            lens = lenArray(X[:,body_col])
            
            # add frequency of tags
            if do_tags:
                features.append(pd.DataFrame(body_tag_dict).values/lens)
                feature_names.append('body_tags')
            
            # add frequency of words
            if do_words:
                features.append(pd.DataFrame(body_word_dict).values/lens)
                feature_names.append('body_words')
                
        if do_body and do_title:
            body_unicode = np.array([x.lower() for x in ConcatStringTransformer().transform(X[:,[body_col,title_col]])])
            body_tag_dict, body_word_dict = find_tag_words(keywords, body_unicode)
            
            # normalize appearence of important words by character length of text
            # because longer requests should have more hits
            lens = lenArray(X[:,body_col])
            
            # add frequency of tags
            if do_tags:
                features.append(pd.DataFrame(body_tag_dict).values/lens)
                feature_names.append('body_tags')
            
            # add frequency of words
            if do_words:
                features.append(pd.DataFrame(body_word_dict).values/lens)
                feature_names.append('body_words')

        return np.hstack(tuple(features))
    
    def fit(self, X, y, **fit_params):
        #do nothing
        return self
    
    def get_params(self, deep=True):
        return {'do_words': self.do_words, 'do_tags':self.do_tags, 'do_body':self.do_body, 'do_title':self.do_title}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self

This section finds words that we thought were "interesting" (potentially explanatory) in request text. It also puts each word in to a category, because a request where someone bemoans their lack of money may include the workd "broke" or "lost job", but maybe only one of those too. Since the lack of money could be driving the response, we want to combine those features together. The interesting words and their categories are defined above

In [None]:
### Explore different models focusing on interesting words in tags and text:

etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=10,
                           random_state=rseed,
                           class_weight='auto')

lsvc = Pipeline([('scale', StandardScaler()), ('clf', LinearSVC(class_weight='auto', random_state=rseed))])

#lsvc_pca = Pipeline([
#    ('scale', StandardScaler()),
#    ('pca', RandomizedPCA(n_components=3,random_state=rseed)),
#    ('clf', LinearSVC(class_weight='auto',random_state=rseed))
#])


models = {'Extra Trees':etc, 'Linear SVC':lsvc}

print '\n##############'
print 'Body & Title Tags'
trans = InterestingWordsTransformer(do_words=False)

for name, model in models.iteritems():
    print '\n%s' % name
    kfi = kf_over if (name=='Gradient Boosting') else kf
    pipe = Pipeline([('trans',trans),('model',model)])
    print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kfi, scoring=roc_scorer))
    
    


print '\n##############'
print 'Body & Title Words'
trans = InterestingWordsTransformer(do_tags=False)

for name, model in models.iteritems():
    print '\n%s' % name
    kfi = kf_over if (name=='Gradient Boosting') else kf
    pipe = Pipeline([('trans',trans),('model',model)])
    print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kfi, scoring=roc_scorer))

    


print '\n##############'
print 'Body & Title Words & Tags'
trans = InterestingWordsTransformer()

for name, model in models.iteritems():
    print '\n%s' % name
    kfi = kf_over if (name=='Gradient Boosting') else kf
    pipe = Pipeline([('trans',trans),('model',model)])
    print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kfi, scoring=roc_scorer))

    

### Results Table += Interesting Words

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 10 k-fold in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>
<tr>
<td>Extra Trees on Interesting Word Tags in Request</td>
<td>0.5509</td>
<td>0.5582</td>
<td>0.0363</td>
</tr>

</table>

<a href="#top">Return to Table of Contents</a>

<a id="part6"></a>
## 6. Request Quality

In [None]:
from nltk.corpus import brown
from nltk.tokenize.regexp import RegexpTokenizer
from sklearn.feature_extraction import text as sklearn_text
brown_words = np.unique(np.array(brown.words()))
brown_words = np.unique(np.array([x.lower() for x in brown_words]))
brown_word2tag = {word.lower(): tag for word, tag in brown.tagged_words()}
brown_tags = set([tag for word, tag in brown.tagged_words()])

In [None]:
DEFAULT_WORD2TAG = brown_word2tag
DEFAULT_TAG_SET = brown_tags
DEFAULT_WORD_SET = set(brown_words)
DEFAULT_STOP_WORDS = set(['request'])
DEFAULT_TOKENIZER = RegexpTokenizer(r'[\s\.\,\:\-\;\(\)\[\]\{\}\!\?]+',gaps=True)

# This calcualtes how many of the words are in the Brown corpus,
# the idea is that this may capture more well written requests
class InCorpusTransformer(TransformerMixin):
    def __init__(self, word_set=DEFAULT_WORD_SET, tokenizer=DEFAULT_TOKENIZER, stop_words=DEFAULT_STOP_WORDS, normalize=True):
        self.word_set = word_set
        self.tokenizer = tokenizer
        self.stop_words = stop_words
        self.tokenizer = tokenizer
        self.normalize = normalize
    
    def count_tokens(self, tokens):
        if len(tokens) == 0:
            return 2
        else:
            return sum(np.array([token.lower() in self.word_set for token in tokens]))/float(len(tokens))
    
    def tokenize_and_count(self, text):
        tokens = [x.lower() for x in self.tokenizer.tokenize(text) if x.lower() not in self.stop_words]
        return self.count_tokens(tokens)
    
    def process_vector(self, texts):
        return np.array([[self.tokenize_and_count(text) for text in texts]]).T
        
    def transform(self, X, **transform_params):
        if len(X.shape) == 1:
            lens = lenArray(X)
            if self.normalize:
                return self.process_vector(X)/lens
            else:
                return self.process_vector(X)
        else:
            features = []
            for col in range(X.shape[1]):
                lens = lenArray(X[:,col])
                if self.normalize:
                    features.append(self.process_vector(X[:,col])/lens)
                else:
                    features.append(self.process_vector(X[:,col]))
            return np.hstack(tuple(features))
        
    def fit(self, X, y, **fit_params):
        #do nothing
        return self
    
    def get_params(self, deep=True):
        return {'normalize':self.normalize}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self




This feature transformer calculates whether words in the request are also present in the Brown University Standard Corpus of Present Day American English. We noticed that "well written" requests tended to perform better, and thought that perhaps requests that used mroe "standard english" words may be perceived as more well written. Additionally, the Brown corpus tags words by part of speech, and we used this information to create another feautre set below. 

It's also worth noting that via analysis of our errors we noted that this feature overly favored long requests. This made us realize we should adjust by the length of the request which led to better results

In [None]:
incorpus = Pipeline([('all_text', ExtractAllText()),('concat', ConcatStringTransformer()),('in',InCorpusTransformer())])
incorpus_raw = Pipeline([('all_text', ExtractAllText()),('concat', ConcatStringTransformer()),('in',InCorpusTransformer(normalize=False))])
etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

lsvc = Pipeline([('scale', StandardScaler()), ('clf', LinearSVC(class_weight='auto', random_state=rseed))])

print 'LSVC on unnormalized counts'
pipe = Pipeline([('incorpus', incorpus_raw), ('model', lsvc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'LSVC on normalized counts'

pipe = Pipeline([('incorpus', incorpus), ('model', lsvc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'ETC on unnormalized counts'
pipe = Pipeline([('incorpus', incorpus_raw), ('model', etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print 'ETC on normalized counts'

pipe = Pipeline([('incoprus', incorpus), ('model', etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))



### Results Table += Spelling Mistakes

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 10 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>
<tr>
<td>Extra Trees on Interesting Word Tags in Request</td>
<td>0.5509</td>
<td>0.5582</td>
<td>0.0363</td>
</tr>
<tr>
<td>Request Quality (LSVC)</td>
<td>0.5706</td>
<td>0.5800</td>
<td>0.0227</td>
</tr>
<tr>
<td>Request Quality (ETC)</td>
<td>0.5639</td>
<td>0.5695</td>
<td>0.0248</td>
</tr>

</table>

<a href="#top">Return to Table of Contents</a>

<a id="part7"></a>
## 7. Text Summary Features: Text Length

In [None]:
### Reusable class for text length:

TITLE_COLUMN = np.where(all_train_df.columns == 'request_title')[0][0]
BODY_COLUMN = np.where(all_train_df.columns == 'request_text_edit_aware')[0][0]

class TextSummaryTransformer(TransformerMixin):
    def __init__(self, title_col=TITLE_COLUMN, body_col=BODY_COLUMN, do_title=True, do_body=True):
        self.do_title = do_title
        self.do_body = do_body
        self.title_col = title_col
        self.body_col = body_col

    def transform(self, X, **transform_params):
        do_title = self.do_title
        do_body = self.do_body
        title_col = self.title_col
        body_col = self.body_col
        
        features = []
        
        if do_title:
            title_unicode = X[:, title_col]
            title_len = np.array([[len(x.encode('utf-8')) for x in title_unicode]]).T
            features.append(title_len)
            
        if do_body:
            body_unicode = X[:, body_col]
            body_len = np.array([[len(x.encode('utf-8')) for x in body_unicode]]).T
            features.append(body_len)
        
        return np.hstack(tuple(features))
        
    def fit(self, X, y, **fit_params):
        #do nothing
        return self 
    
    def get_params(self, deep=True):
        # suppose this estimator has parameters "alpha" and "recursive"
        return {}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self
    

These features simply check how long the request title and body were, since respondents may be more or less likely to grant long eloquent requests or short concise requests

In [None]:
etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

lsvc = Pipeline([('scale', StandardScaler()), ('clf', LinearSVC(class_weight='auto', random_state=rseed))])


print '\nExtra Trees Classifier on title and body length'
pipe = Pipeline([('text_summary', TextSummaryTransformer()), ('etc', etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

print '\Linear SVC on title and body length'
pipe = Pipeline([('text_summary', TextSummaryTransformer()), ('lsvc', lsvc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

### Results Table += Text Summary Features

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 5 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>
<tr>
<td>Extra Trees on Interesting Word Tags in Request</td>
<td>0.5509</td>
<td>0.5582</td>
<td>0.0363</td>
</tr>
<tr>
<td>Request Quality (LSVC)</td>
<td>0.5706</td>
<td>0.5800</td>
<td>0.0227</td>
</tr>
<tr>
<td>Request Quality (ETC)</td>
<td>0.5639</td>
<td>0.5695</td>
<td>0.0248</td>
</tr>
<tr>
<td>Extra Trees on Title and Body Length</td>
<td>0.5787</td>
<td>0.5813</td>
<td>0.0243</td>
</tr>

</table>

<a href="#top">Return to Table of Contents</a>

<a id="part8"></a>
## 8. Location Features

In [None]:
### Collect location name metadata

MANUAL_GEOS = [
{'loc':'nyc', 'g1':'ny', 'g2':'us'},
{'loc':'sf', 'g1':'ca', 'g2':'us'},
{'loc':'uk', 'g1':'uk', 'g2':'non_us'},
{'loc':'australia', 'g1':'aus', 'g2':'non_us'},
{'loc':'canada', 'g1':'can', 'g2':'non_us'},
{'loc':'ottawa', 'g1':'can', 'g2':'non_us'},
{'loc':'toronto', 'g1':'can', 'g2':'non_us'},
{'loc':'vancouver', 'g1':'can', 'g2':'non_us'},
{'loc':'montreal', 'g1':'can', 'g2':'non_us'}
]



def make_geo(other_geos=MANUAL_GEOS, filter_loc=[]):

    from bs4 import BeautifulSoup
    from urllib import urlopen
    import re
    
    ######################
    # Scrape wikipedia list of us cities
    
    # TODO save local
    webpage = urlopen('http://en.wikipedia.org/wiki/List_of_United_States_cities_by_population')
    
    # parse webpage to find the table
    soup=BeautifulSoup(webpage, "html.parser")
    table = soup.find('table', {'class' : 'wikitable sortable'})
    
    # stroe the first 200 US cities
    us_cities = []
    rows = table.findAll('tr')
    for row in rows[1:200]:
        cells = row.findAll('td')

        output = []

        for i, cell in enumerate(cells):
            if i < 4:
                text = cell.text.strip().lower()
                if i == 0:
                    text = int(text)
                if i == 1 or i == 2:
                    text = re.sub(r"\[.*\]|'",'',text)
                if i == 3:
                    text = int(re.sub(r',','',text))
                output.append(text)
        us_cities.append(output)

    us_cities = pd.DataFrame(np.array(us_cities),columns=['rank','city','state','pop'])
    
    ###########################
    # tuple list of state abbreviations
    
    state_abr_raw = [("Alabama","AL"),("Alaska","AK"),("Arizona","AZ"),
                     ("Arkansas","AR"),("California","CA"),("Colorado","CO"),
                     ("Connecticut","CT"),("Delaware","DE"),("District of Columbia","DC"),
                     ("Florida","FL"),("Georgia","GA"),("Hawaii","HI"),
                     ("Idaho","ID"),("Illinois","IL"),("Indiana","IN"),
                     ("Iowa","IA"),("Kansas","KS"),("Kentucky","KY"),
                     ("Louisiana","LA"),("Maine","ME"),("Montana","MT"),
                     ("Nebraska","NE"),("Nevada","NV"),("New Hampshire","NH"),
                     ("New Jersey","NJ"),("New Mexico","NM"),("New York","NY"),
                     ("North Carolina","NC"),("North Dakota","ND"),("Ohio","OH"),
                     ("Oklahoma","OK"),("Oregon","OR"),("Maryland","MD"),
                     ("Massachusetts","MA"),("Michigan","MI"),("Minnesota","MN"),
                     ("Mississippi","MS"),("Missouri","MO"),("Pennsylvania","PA"),
                     ("Rhode Island","RI"),("South Carolina","SC"),("South Dakota","SD"),
                     ("Tennessee","TN"),("Texas","TX"),("Utah","UT"),
                     ("Vermont","VT"),("Virginia","VA"),("Washington","WA"),
                     ("West Virginia","WV"),("Wisconsin","WI"),("Wyoming","WY")]
    
    ############################
    # manupulate state abreviations
    state_abr = []
    for st, abr in state_abr_raw:
        state_abr.append([st.lower(), abr.lower()])
    state_abr = pd.DataFrame(np.array(state_abr), columns = ['state','abr'])
    
    #############################
    # make US Geos
    us_city_state = pd.merge(us_cities,state_abr)
    
    # US geos
    usgeo = us_city_state.loc[:,['city','abr']]
    usgeo.columns = ['loc','g1']
    usgeo = pd.concat([usgeo,pd.DataFrame({'loc':state_abr.abr,'g1':state_abr.abr})])
    usgeo = pd.concat([usgeo,pd.DataFrame({'loc':state_abr.state,'g1':state_abr.abr})])
    usgeo['g2'] = 'us'
    
    
    geo = pd.concat([usgeo, pd.DataFrame(other_geos)])
    
    # get rid of auto generated locations with confusiong names
    #geo = geo[[not x in filter_loc for x in geo['loc']]]
    
    return geo

geo = make_geo()



In [None]:
# reusable class for finding location names and aggregating them for different metadata

DEFAULT_TOKENIZER = RegexpTokenizer(r'[\s\.\,\:\-\;\(\)\[\]\{\}\!\?]+',gaps=True)
FILTER_DEFAULT = ['in', 'hi', 'me', 'ok', 'HI', 'OK', 'or']
DEFAULT_GEO = geo

class GeoTransformer(TransformerMixin):
    
    def __init__(self, geo=DEFAULT_GEO, level=2, tokenizer=DEFAULT_TOKENIZER, stop_words=FILTER_DEFAULT, total_only=True, normalize=True):
        self.tokenizer = tokenizer
        self.stop_words = stop_words
        self.geo = geo
        self.level = level
        self.normalize = normalize
        self.total_only = total_only
        
    def find_words(self, words, texts, g1=None, g2=None, lower=True):
        word_dict = {}
        do_g1 = not g1 is None
        do_g2 = not g2 is None
        
        if do_g1:
            g1_dict = {}
            
        if do_g2:
            g2_dict = {}
        
        #token_list = [[x.lower() for x in self.tokenizer.tokenize(text) if x not in self.stop_words] for text in texts]
    
        i = 0
        for word in words:
            if not word in self.stop_words:
                regex = re.compile('\\b('+word+')\\b')
                has_word = np.array([(1 if regex.search(text.lower()) else 0) for text in texts])
                word_dict[word] = has_word
                if do_g1:
                    g1i = g1.iloc[i]
                    if g1i in g1_dict:
                        g1_dict[g1i] += has_word
                    else:
                        g1_dict[g1i] = has_word
                        
                if do_g2:
                    g2i = g2.iloc[i]
                    if g2i in g2_dict:
                        g2_dict[g2i] += has_word
                    else:
                        g2_dict[g2i] = has_word
                        
            i += 1

        return (word_dict, g1_dict, g2_dict)
    
    def transform(self, X, **transform_params):
        geo = self.geo
        find_words = self.find_words
        level = self.level
        normalize = self.normalize
        
        words = geo['loc']
        g1 = geo['g1']
        g2 = geo['g2']
        
        
        features = []
        
        
        if len(X.shape) > 1:
            cols = X.shape[1]
            for i in cols:
                locs, g1s, g2s = find_words(words, X[:,i], g1s, g2s)
                
                if level == 0:
                    df = pd.DataFrame(locs)
                elif level == 1:
                    df = pd.DataFrame(g1s)
                elif level > 1:
                    df = pd.DataFrame(g2s)
                    
                #df = pd.DataFrame(locss)
                features.append(df.values)
        else:
            lens = lenArray(X)
            
            locs, g1s, g2s = find_words(words, X, g1, g2)
            
            if level == 0:
                df = pd.DataFrame(locs)
            elif level == 1:
                df = pd.DataFrame(g1s)
            elif level > 1:
                df = pd.DataFrame(g2s)
            
            #df = pd.DataFrame(locss)
            if normalize:
                features.append(df.values/lens)
            else:
                features.append(df.values)
            #print df.values/lens
        
        ret_array = np.hstack(tuple(features))
        
        total = np.reshape(np.sum(ret_array,1),(ret_array.shape[0],1))
        
        if self.total_only:
            self.feature_names_ = np.array([u'Total'])
            
            return total
        else:
            self.feature_names_ = np.hstack((df.columns.values,u'Total'))

            return np.hstack((ret_array, total))

        
    
    def fit(self, X, y, **fit_params):
        #do nothing
        return self
    
    def get_params(self, deep=True):
        # suppose this estimator has parameters "alpha" and "recursive"
        return {'level':self.level, 'normalize':self.normalize}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self

    

Looking through our errors, we noticed that we were missing alot of successful requests that included location names. We also noticed on the Reddit page that you have to include a location for your request to be granted (duh). So we scraped a list of US city names and their associated states. We also created a list of us states and their abbreviations. Finally, we added a few locations of our own. This transformer, looks for matches of these location names and aggregates them, usually based on state (based on country for international locations)


UNFORTUNATELY, it didn't actually add much value... but thought we'd include it here anyway, cuz it was a lot of work and could be interesting for others to use in other problems or improve here.

In [None]:
geo_trans = GeoTransformer(geo, level = 1, total_only = False, normalize = False)
geo_title = geo_trans.transform(ExtractTitle().transform(all_train_df.values))
geo_body = geo_trans.transform(ExtractBody().transform(all_train_df.values))


print 'Individual occurences in title:'
print zip(geo_trans.feature_names_, np.sum(geo_title,0))


print 'Individual occurences in body:'
print zip(geo_trans.feature_names_, np.sum(geo_body,0))

In [None]:
etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state = rseed,
                           class_weight='auto')

level = 1

title = Pipeline([('text',ExtractTitle()),('geo',GeoTransformer(geo, level=level))])
body = Pipeline([('text',ExtractBody()),('geo',GeoTransformer(geo, level=level))])
all_text = Pipeline([('text',ExtractAllText()),('combine',ConcatStringTransformer()),('geo',GeoTransformer(geo,level=level))])

pipe = Pipeline([('features',body),('model',etc)])

print 'Body:'
pipe = Pipeline([('features',body),('model',etc)])
scores = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer)
print_scores(scores)

print 'Title:'
pipe = Pipeline([('features',title),('model',etc)])
scores = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer)
print_scores(scores)

print 'On Concat Title/Body:'
pipe = Pipeline([('features',all_text),('model',etc)])
scores = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer)
print_scores(scores)

### Results Table += Text Summary Features

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 5 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>
<tr>
<td>Extra Trees on Interesting Word Tags in Request</td>
<td>0.5509</td>
<td>0.5582</td>
<td>0.0363</td>
</tr>
<tr>
<td>Request Quality (LSVC)</td>
<td>0.5706</td>
<td>0.5800</td>
<td>0.0227</td>
</tr>
<tr>
<td>Request Quality (ETC)</td>
<td>0.5639</td>
<td>0.5695</td>
<td>0.0248</td>
</tr>
<tr>
<td>Extra Trees on Title and Body Length</td>
<td>0.5787</td>
<td>0.5813</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees of Location Features on Title and Body Length</td>
<td>0.5207</td>
<td>0.5157</td>
<td>0.0252</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

<a id="part9"></a>
## 9. Parts of Speech

In [None]:

# For Part of Speech Tagging
class Word2TagTransformer(TransformerMixin):
    def __init__(self, tag_set=DEFAULT_TAG_SET, word_set=DEFAULT_WORD_SET, word2tag=DEFAULT_WORD2TAG, tokenizer=DEFAULT_TOKENIZER, stop_words=DEFAULT_STOP_WORDS):
        self.tag_set = tag_set
        self.word_set = word_set
        self.word2tag = word2tag
        self.tokenizer = tokenizer
        self.stop_words = stop_words
        self.tokenizer = tokenizer
        self.tags_dict = {tag: 0 for tag in tag_set}
    
    def tag_tokens(self, tokens):
        tag_tokens = []
        
        for token in tokens:
            token = token.lower()
            if (token in self.word_set) and (token not in self.stop_words):
                tag_tokens.append(self.word2tag[token])
        
        return ' '.join(tag_tokens)
        
    
    def tokenize_and_tag(self, text):
        tokens = [x.lower() for x in self.tokenizer.tokenize(text) if x.lower() not in self.stop_words]
        return self.tag_tokens(tokens)
    
    def process_vector(self, texts):
        return np.array([[self.tokenize_and_tag(text) for text in texts]]).T
        
    def transform(self, X, **transform_params):
        if len(X.shape) == 1:
            return self.process_vector(X).flatten()
        else:
            features = []
            for col in range(X.shape[1]):
                features.append(self.process_vector(X[:,col]))
            return np.hstack(tuple(features))
        
    def fit(self, X, y, **fit_params):
        #do nothing
        return self
    
    def get_params(self, deep=True):
        return {}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self


This section of feature transformers replaces words with their parts of speech as tagged in the Brown corpus. I then counts the term frequency on the parts of speech tags. Our thinking was again that higher quality or lower quality requests may use different semantic constructs and that this analysis would extract that information.

In [None]:
tv_space = TfidfVectorizer(ngram_range=(1,1),lowercase=True, token_pattern=u'[^\s-]')
all_text = Pipeline([('all_text', ExtractAllText()),('concat', ConcatStringTransformer())])
etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')


lsvc = LinearSVC(class_weight='auto', random_state=rseed)

pipe_nol1 = Pipeline([
    ('text',all_text),
    ('word2tag', Word2TagTransformer()),
    ('tv',tv_space)
])

print '\nLinear SVC on Brown corpus word tags:'
pipe = Pipeline([('process', pipe_nol1),('model',lsvc)])
scores = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer, verbose=1)
print_scores(scores)

print '\nExtra Trees on Brown corpus word tags:'
pipe = Pipeline([('process', pipe_nol1),('model',etc)])
scores = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer, verbose=1)
print_scores(scores)


<a id="part10"></a>
## 10 Subreddits

In [None]:
# reusable class that transforms the list of subreddits for each user in to space seperated string for use by 
# the tfidf vectorizer
SUBREDDITS_COLUMN = np.where(all_train_df.columns == 'requester_subreddits_at_request')[0][0]

class SubredditTransformer(TransformerMixin):
   
    def __init__(self, column = SUBREDDITS_COLUMN):
        self.column = column
   
    def fit(self, X, y, **fit_params):
        return self
   
    def transform(self, X, **transform_params):
        return np.array([' '.join(x) for x in X[:,self.column]])
   
    def get_params(self, deep=True):
        return {}
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self



This feature creator makes a term frequency matrix out of the list of subreddits that each requesting user contributes to. We decided to use a tree ensemble to test this method, because there may be interactive elements (presence of one subreddit and another). We also use L1 feature regularization as described in the text section above to reduce the feature set. We chose C such that we didn't get too many features (only 20) but didn't lose much explanatory power vs using lots more.


In [None]:
etc = ExtraTreesClassifier(n_estimators=200,
                            max_depth=4,
                            min_samples_split=15,
                            random_state = rseed,
                            class_weight='auto')
l1 = LinearWeightFeatureThreshold(C=.15)
tv_space = TfidfVectorizer(token_pattern = u'[^\s]+', min_df=10)
pipe = Pipeline([('sub',SubredditTransformer()), ('tv', tv_space), ('l1', l1), ('model',etc)])

In [None]:
scores = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer, verbose=1)
print_scores(scores)

### Results Table += Subreddits + Parts of Speech

The following table documents our results so far. The mean and median scores come from taking the averge of the ROC-AUC scores from 10 k-folds in the specified model.

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>
<tr>
<td>Extra Trees on Interesting Word Tags in Request</td>
<td>0.5509</td>
<td>0.5582</td>
<td>0.0363</td>
</tr>
<tr>
<td>Request Quality (LSVC)</td>
<td>0.5706</td>
<td>0.5800</td>
<td>0.0227</td>
</tr>
<tr>
<td>Request Quality (ETC)</td>
<td>0.5639</td>
<td>0.5695</td>
<td>0.0248</td>
</tr>
<tr>
<td>Extra Trees on Title and Body Length</td>
<td>0.5787</td>
<td>0.5813</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees of Location Features on Title and Body Length</td>
<td>0.5207</td>
<td>0.5157</td>
<td>0.0252</td>
</tr>
<tr>
<td>Extra Trees of Parts of Speech</td>
<td>0.5530</td>
<td>0.5513</td>
<td>0.0167</td>
</tr>
<tr>
<td>Extra Trees of Subreddits</td>
<td>0.5586</td>
<td>0.5581</td>
<td>0.0262</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

<a id="part11"></a>
## 11. Final, Composite Model

We took two approaches to combining these features sets.
In one, we checked which feature sets were the best and started with those. We then iteratively add additional feature sets to see whether or not they improve performance. If they don't add anything or subtract value, we remove them and keep going.
In the other, we don't choose at all and leave it to the model to decide.

Secondly, we try three different kinds of ensembled decision trees that get roughly similar results. Decision trees (a bunch of consecutive binary splits of the data based on variable values) can be a useful way to explore models where features my be nonlinear. Repeating a previous example, for the time features, hour 23.5 (late at night) and 0.5 (so early in the morning it's still late at night) may be treated similarly by a linear model, but a decision tree, can create a couple splites and capture it easily (hour > 23 and hour < 1).

The "ensemble" part of "tree ensemble" means we constructing a ton of different decision trees, then average the prediction of each tree to inform our final prediction. This reduces the overfitting that can occur in a single decision tree.

In [None]:
feats = {}
all_feats={}

In [None]:

lsvc = LinearSVC(class_weight='auto', random_state=rseed)
etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')
rfc = RandomForestClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

Start with Extra Trees on BOW w/ L1 Feature Reduction

In [None]:
### Try ExtraTreesClassifier for the BOW models:

l1_bow = LinearWeightFeatureThreshold(C=0.15)
tv_bow = TfidfVectorizer(tokenizer=SnowballStemTokenizer())

all_feats['bow_l1'] = Pipeline([('extract', ExtractBody()), ('tv',tv_bow), ('features',l1_bow)])
feats['bow_l1'] = all_feats['bow_l1']


In [None]:
pipe = Pipeline([('featues', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Add Tile and Body Length

In [None]:
all_feats['length'] = TextSummaryTransformer()
feats['length'] = all_feats['length']

In [None]:
pipe = Pipeline([('featues', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

In [None]:
del feats['length']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>+Body & Title Length (NO ADDITION)</td>
<td>0.5864</td>
<td>0.5865</td>
<td>0.0211</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Try adding request quality to the model

In [None]:
incorpus = Pipeline([('all_text', ExtractAllText()),('concat', ConcatStringTransformer()),('in',InCorpusTransformer())])
all_feats['incorpus']=incorpus
feats['incorpus']=incorpus

In [None]:
pipe = Pipeline([('featues', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

Keep it

In [None]:
#del feats['incorpus']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>+Request Quality (words from brown corpus)</td>
<td>0.5903</td>
<td>0.5992</td>
<td>0.0274</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Add subreddit analysis

In [None]:
l1 = LinearWeightFeatureThreshold(C=.15)
tv_space = TfidfVectorizer(token_pattern = u'[^\s]+', min_df=10)

sub = Pipeline([('sub',SubredditTransformer()), ('tv', tv_space), ('l1', l1)])
feats['sub'] = sub
all_feats['sub'] = sub

In [None]:
pipe = Pipeline([('featues', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

It didn't help at all so will leave it out for now

In [None]:
del feats['sub']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.5903</td>
<td>0.5992</td>
<td>0.0274</td>
</tr>
<tr>
<td>+Subreddit Analysis</td>
<td>0.5913</td>
<td>0.5932</td>
<td>0.0309</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Add activity features

In [None]:
acts = ExtractActivities()
all_feats['activities'] = acts
feats['activities'] = acts

In [None]:
pipe = Pipeline([('featues', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

Keep it

In [None]:
#del feats['activities']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.5903</td>
<td>0.5992</td>
<td>0.0273</td>
</tr>
<tr>
<td>+Activities</td>
<td>0.5964</td>
<td>0.5882</td>
<td>0.0227</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Add parts of speech tagging

In [None]:
tv_space = TfidfVectorizer(ngram_range=(1,1),lowercase=True, token_pattern=u'[^\s-]')
all_text = Pipeline([('all_text', ExtractAllText()),('concat', ConcatStringTransformer())])

pos_tags = Pipeline([
    ('text',all_text),
    ('word2tag', Word2TagTransformer()),
    ('tv',tv_space),
    ('desparse', DesparseTransformer())
])

all_feats['pos_tags'] = pos_tags
feats['pos_tags'] = pos_tags

In [None]:
pipe = Pipeline([('featues', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

Keep it!

In [None]:
#del feats['pos_tags']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.5964</td>
<td>0.5882</td>
<td>0.0227</td>
</tr>
<tr>
<td>+Part of Speech Tags from Brown Corpus</td>
<td>0.6024</td>
<td>0.6116</td>
<td>0.0198</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Interesting word tags

In [None]:
interesting = InterestingWordsTransformer(do_words=False)
all_feats['interesting'] = interesting
feats['interesting'] = interesting

In [None]:
pipe = Pipeline([('features', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

Worse, toss it

In [None]:
del feats['interesting']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.6024</td>
<td>0.6116</td>
<td>0.0198</td>
</tr>
<tr>
<td>+InterestingWords</td>
<td>0.5958</td>
<td>0.5916</td>
<td>0.0147</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Try adding time featues (month and hour)

In [None]:
times = TimeTransformer(do_day=False, do_dow=False, do_hour=True,do_minute=False,do_second=False,do_month=True)
feats['times'] = times
all_feats['times'] = times

In [None]:
pipe = Pipeline([('features', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

Slightly worse

In [None]:
del feats['times']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.6024</td>
<td>0.6116</td>
<td>0.0198</td>
</tr>
<tr>
<td>+Times</td>
<td>0.5981</td>
<td>0.6056</td>
<td>0.0225</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Try adding subreddit features

In [None]:
l1 = LinearWeightFeatureThreshold(C=.15)
tv_space = TfidfVectorizer(token_pattern = u'[^\s]+', min_df=10)

sub = Pipeline([('sub',SubredditTransformer()), ('tv', tv_space), ('l1', l1)])
feats['sub'] = sub
all_feats['sub'] = sub

In [None]:
pipe = Pipeline([('features', FeatureUnion(feats.items())),('model',etc)])
print_scores(cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer))

Worse, toss it

In [None]:
del feats['sub']

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Prevailing Model</td>
<td>0.6024</td>
<td>0.6116</td>
<td>0.0198</td>
</tr>
<tr>
<td>+Subreddit</td>
<td>0.6009</td>
<td>0.6079</td>
<td>0.0280</td>
</tr>
</table>

<a href="#top">Return to Table of Contents</a>

Try all features with Extra Trees regularization then RandomForest prediction

Random Forest w/ All Features

In [None]:
pipe = Pipeline([('featues', FeatureUnion(all_feats.items())),('model',rfc)])
rfc_all = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer)
print_scores(rfc_all)

Extra Trees Classifier With All

In [None]:
pipe = Pipeline([('featues', FeatureUnion(all_feats.items())),('model',etc)])
etc_all = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf, scoring=roc_scorer)
print_scores(etc_all)

Gradient Boosting with All

In [None]:
gbc = GradientBoostingClassifier(n_estimators = 200,
                            learning_rate=0.01,
                           max_depth = 3,
                           min_samples_split=10,
                           random_state = rseed)
pipe = Pipeline([('featues', FeatureUnion(all_feats.items())),('model',gbc)])
# note we use the oversampled k fold kf_over
gbc = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf_over, scoring=roc_scorer)
print_scores(gbc)


## Aggregated Models
<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Best From Manual Stepwise Selection</td>
<td>0.6024</td>
<td>0.6116</td>
<td>0.0198</td>
</tr>

<tr>
<td>All Features Gradient Boosting</td>
<td>0.6114</td>
<td>0.6070</td>
<td>0.0325</td>
</tr>
<tr>
<td>All Features Extra Trees</td>
<td>0.6035</td>
<td>0.6052</td>
<td>0.0255</td>
</tr>
<tr>
<td>All Features Random Forests</td>
<td>0.6093</td>
<td>0.6026</td>
<td>0.0295</td>
</tr>
</table>

## Feature Models

<table>
<tr>
<th>Method</th>
<th>Mean ROC-AUC</th>
<th>Median ROC-AUC</th>
<th>Standard Deviation</th>
</tr>

<tr>
<td>Activity Features with Reweighted Classes</td>
<td>0.5589</td>
<td>0.5570</td>
<td>0.0213</td>
</tr>
<tr>
<td>Simple BOW, Titles + Bodies, Snowball Stem Tokenizer</td>
<td>0.5615</td>
<td>0.5604</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees BOW w/ L1 Feature Reduction Bodies</td>
<td>0.5870</td>
<td>0.5920</td>
<td>0.0271</td>
</tr>
<tr>
<td>Extra Trees on Month, Hour</td>
<td>0.5406</td>
<td>0.5440</td>
<td>0.0232</td>
</tr>
<tr>
<td>Extra Trees on Interesting Word Tags in Request</td>
<td>0.5509</td>
<td>0.5582</td>
<td>0.0363</td>
</tr>
<tr>
<td>Request Quality (LSVC)</td>
<td>0.5706</td>
<td>0.5800</td>
<td>0.0227</td>
</tr>
<tr>
<td>Request Quality (ETC)</td>
<td>0.5639</td>
<td>0.5695</td>
<td>0.0248</td>
</tr>
<tr>
<td>Extra Trees on Title and Body Length</td>
<td>0.5787</td>
<td>0.5813</td>
<td>0.0243</td>
</tr>
<tr>
<td>Extra Trees of Location Features on Title and Body Length</td>
<td>0.5207</td>
<td>0.5157</td>
<td>0.0252</td>
</tr>
<tr>
<td>Extra Trees of Parts of Speech</td>
<td>0.5530</td>
<td>0.5513</td>
<td>0.0167</td>
</tr>
<tr>
<td>Extra Trees of Subreddits</td>
<td>0.5586</td>
<td>0.5581</td>
<td>0.0262</td>
</tr>
</table>


<a href="#top">Return to Table of Contents</a>

<a id="part12"></a>
## 12. Notes On Error Analysis

We built tools for error analysis and a few of the features above came out of that analysis.
For example, we noticed that the intresting words and incorpus counting features massively preferred long requests. This led to us dividing the counts by the length of requests, sort if in the fashion of TFIDF calculations, which led to better results for those features.

We also noded that a lot of the succesful requests we failed to identify had location names in them and looking on the Reddit group noticed that location was a requirement for fulfillment (which, logistically, is obvious in hindsight). This inpsired us to create our geographic word identificaiton feature.

In [None]:

tv = TfidfVectorizer(tokenizer=SnowballStemTokenizer())
l1 = LinearWeightFeatureThreshold(C=0.3)

etc = ExtraTreesClassifier(n_estimators = 200,
                           max_depth = 4,
                           min_samples_split=15,
                           random_state=rseed,
                           class_weight='auto')

pipe = Pipeline([
    ('body', Pipeline([('extract', ExtractBody()),
         ('tv',tv),
         ('features',l1)
         ])),
    ('model', etc)
])  
                    

# Number of errors to examine per fold. Keep the number negative to find the biggest error cases.
# Keep in mind that this will print 10x this many error cases.

TITLE_COLUMN = np.where(all_train_df.columns == 'request_title')[0][0]
BODY_COLUMN = np.where(all_train_df.columns == 'request_text_edit_aware')[0][0]


num_errors_per_fold = -1

check_kf = kf

for train_index, test_index in check_kf:
    X_train, X_test = all_train_df.values[train_index], all_train_df.values[test_index]
    y_train, y_test = all_train_labels[train_index], all_train_labels[test_index]
    
    print X_train.shape
    print X_test.shape
    
    pipe.fit(X_train, y_train)
    cl_probs = pipe.predict_proba(X_test)

    # Loop through this fold of test data and determine the R ratio for each one:
    ratios = []
    for i in range(0, cl_probs.shape[0]):
        ratios.append(cl_probs[i].max() / cl_probs[i][y_test[i]])

    # Find the 3 largest ratios and print them as error cases to examine:
    ratios = np.asarray(ratios)
    heavy_ratios = ratios.argsort()[num_errors_per_fold:][::-1]
    for r in heavy_ratios:
        print "== We guessed %s for this, but it was actually %s. ==" % (np.argmax(cl_probs[r]),
                                                                         y_test[r])
        print "\n", X_test[r, TITLE_COLUMN]
        print "\n", X_test[r, BODY_COLUMN]
        print "\n==========\n"

        

<a id="part13"></a>
## 13. Appendix

### The Thinker

We used this function to easily explore our text and generate ideas

In [None]:
def TheThinker(n):
    bodies = ExtractBody().transform(all_train_df)
    titles = ExtractTitle().transform(all_train_df)
    username = ExtractUser().transform(all_train_df)
    y = all_train_labels

    choices = np.random.choice(np.arange(n_all),n)
    for i in choices:
        print '###########################'
        if y[i]:
            print 'SUCCESS'
        else:
            print 'FAILURE'

        print 'User:', username[i]

        print 'Title:'
        print titles[i]

        print 'Body:'
        print bodies[i]


In [None]:
TheThinker(2)

### String Theory

We used this transformer to create simple models that look for the occurence of strings in text. This was useful for testing hunches about significance when exploring our data set and doing error analysis.

In [None]:
class CheckWordsTransformer(TransformerMixin):
    def __init__(self, words=[]):
        self.words = words
        
    def find_words(self, words, text, lower=True):
        word_dict = {}

        for word in words:
            if lower:
                has_word = np.array([(1 if word in t.lower() else 0) for t in text])
            else:
                has_word = np.array([(1 if word in t else 0) for t in text])
            word_dict[word] = has_word

        return word_dict
    
    # manually create keywords with categories
    
    def transform(self, X, **transform_params):
        words = self.words
        find_words = self.find_words
        
        features = []
        
        if len(X.shape) > 1:
            cols = X.shape[1]
            for i in cols:
                lens = lenArray(X[:,i])
                features.append(pd.DataFrame(find_words(words, X[:,i])).values/lens)
        else:
            lens = lenArray(X)
            features.append(pd.DataFrame(find_words(words, X)).values/lens)
            
        return np.hstack(tuple(features))
    
    def fit(self, X, y, **fit_params):
        #do nothing
        return self
    
    def get_params(self, deep=True):
        # suppose this estimator has parameters "alpha" and "recursive"
        return {'words': self.words}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            self.setattr(parameter, value)
        return self

    

For example, the string 'thank' is pretty explanatory... is it really a magic word?!

In [None]:
strings = ['thank']
pipe = Pipeline([('text',all_text),('strings',CheckWordsTransformer(words=strings)),('model',etc)])
strings_cv = cross_val_score(pipe, all_train_df.values, all_train_labels, cv=kf_over, scoring=roc_scorer)
print_scores(strings_cv)