### Random Acts of Pizza - Baseline Model & EDA
## Authors: Ben Arnoldy, Mary Boardman, Zach Merritt, and Kevin Gifford
#### Kaggle Competition Description:
In machine learning, it is often said there are no free lunches. How wrong we were.

This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

"I'll write a poem, sing a song, do a dance, play an instrument, whatever! I just want a pizza," says one hopeful poster. What about making an algorithm?

Kaggle is hosting this competition for the machine learning community to use for fun and practice. This data was collected and graciously shared by Althoff et al. (Buy them a pizza -- data collection is a thankless and tedious job!) We encourage participants to explore their accompanying paper and ask that you cite the following reference in any publications that result from your work:

Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests, Proceedings of ICWSM, 2014.
_______________________________________________________________________________________________

## Notebook Title: EDA & Model Baseline
#### Purpose: Load the 'Random Acts of Pizza' train and test data. Conduct an exploratory data analysis to gain an understanding of the data. Create a baseline Logisitic Regression model using non-text (numeric) fields only. 

## I. Load Data and Modules, Process Data

### A. Load Data and Modules

In [68]:
import json
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib.colors import ListedColormap
from matplotlib.colors import LogNorm
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from subprocess import check_output
#from wordcloud import WordCloud, STOPWORDS

#ML
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
from sklearn.mixture import GaussianMixture

from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import sparse_random_matrix

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

from sklearn.pipeline import make_pipeline

%matplotlib inline
plt.style.use('bmh')

In [69]:
#1. Train Data
with open('../data/train.json') as fin:
    trainjson = json.load(fin)
train = pd.io.json.json_normalize(trainjson)
#2. Test Data
with open('../data/test.json') as fin:
    testjson = json.load(fin)
test = pd.io.json.json_normalize(testjson)

print("Train Shape:", train.shape)
print("Test Shape:", test.shape)

Train Shape: (4040, 32)
Test Shape: (1631, 17)


### B1. Find any missing values

In [70]:
train.isnull().sum()

giver_username_if_known                                    0
number_of_downvotes_of_request_at_retrieval                0
number_of_upvotes_of_request_at_retrieval                  0
post_was_edited                                            0
request_id                                                 0
request_number_of_comments_at_retrieval                    0
request_text                                               0
request_text_edit_aware                                    0
request_title                                              0
requester_account_age_in_days_at_request                   0
requester_account_age_in_days_at_retrieval                 0
requester_days_since_first_post_on_raop_at_request         0
requester_days_since_first_post_on_raop_at_retrieval       0
requester_number_of_comments_at_request                    0
requester_number_of_comments_at_retrieval                  0
requester_number_of_comments_in_raop_at_request            0
requester_number_of_comm

No missing data except in column "requester_user_flair." We see in the next section that this isn't a column in the test data, so we may just elect to not use it to train the model.

### B2. Identify common columns between test and train

In [71]:
print("Common columns in train and test:")
print(train.columns[train.columns.isin(test.columns)])
print("----")
print("Columns in train but NOT test:")
print(train.columns[~train.columns.isin(test.columns)])

Common columns in train and test:
Index(['giver_username_if_known', 'request_id', 'request_text_edit_aware',
       'request_title', 'requester_account_age_in_days_at_request',
       'requester_days_since_first_post_on_raop_at_request',
       'requester_number_of_comments_at_request',
       'requester_number_of_comments_in_raop_at_request',
       'requester_number_of_posts_at_request',
       'requester_number_of_posts_on_raop_at_request',
       'requester_number_of_subreddits_at_request',
       'requester_subreddits_at_request',
       'requester_upvotes_minus_downvotes_at_request',
       'requester_upvotes_plus_downvotes_at_request', 'requester_username',
       'unix_timestamp_of_request', 'unix_timestamp_of_request_utc'],
      dtype='object')
----
Columns in train but NOT test:
Index(['number_of_downvotes_of_request_at_retrieval',
       'number_of_upvotes_of_request_at_retrieval', 'post_was_edited',
       'request_number_of_comments_at_retrieval', 'request_text',
       '

As can be seen above, there is a series of columns in the training data only. These columns reflect data about the post (e.g., the #of upvotes) at the time this Reddit data was retrieved. We use certain supervised and unsupversied techniques to derive value from this data (even though that information is not provided on the data set we will be predicting).

### C. Create training data, labels, and special 'in training only' data

In [72]:
train_labels_master = train[['requester_received_pizza']]
train_data_master = train[test.columns & train.columns]
train_only_data_master = train[train.columns[~train.columns.isin(test.columns)]].drop(['requester_received_pizza'], axis = 1)

In [73]:
print(train.shape, train_data_master.shape)

(4040, 32) (4040, 17)


### D. Set column types and profile

In [74]:
train_data_master = train_data_master.assign(
    unix_timestamp_of_request = pd.to_datetime(
        train_data_master.unix_timestamp_of_request, unit = "s"),
    unix_timestamp_of_request_utc = pd.to_datetime(
        train_data_master.unix_timestamp_of_request_utc, unit = "s"))

In [75]:
train_data_master.describe()
train_data_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4040 entries, 0 to 4039
Data columns (total 17 columns):
giver_username_if_known                               4040 non-null object
request_id                                            4040 non-null object
request_text_edit_aware                               4040 non-null object
request_title                                         4040 non-null object
requester_account_age_in_days_at_request              4040 non-null float64
requester_days_since_first_post_on_raop_at_request    4040 non-null float64
requester_number_of_comments_at_request               4040 non-null int64
requester_number_of_comments_in_raop_at_request       4040 non-null int64
requester_number_of_posts_at_request                  4040 non-null int64
requester_number_of_posts_on_raop_at_request          4040 non-null int64
requester_number_of_subreddits_at_request             4040 non-null int64
requester_subreddits_at_request                       4040 non-null obj

## V. Ngrams

### Isolate Text Column

In [79]:
# Isolate the text column for the training and dev dataframes
x_train, x_test, y_train, y_test = train_test_split(
   train_data_master,
   train_labels_master.values.ravel(), test_size=0.29, random_state=0)

# Isolate the text column for the training and dev dataframes
import re
from sklearn.grid_search import GridSearchCV

x_train_text = x_train.request_text_edit_aware
x_test_text = x_test.request_text_edit_aware

### Run text cleaning preprocessing on the training dataset to remove a lot of the junk.

# Define custom text preprocessor
def text_cleaner(s):
    # Establish a compiled regex that finds words shorter than 3 characters
    shortword = re.compile(r'\W*\b\w{1,3}\b')
    
    # Convert all text to lowercase
    text = s.lower()
    
    # Remove newlines and punctuation marks
    text = re.sub(r'\n', ' ', text)
    text = re.sub('[,?]',' ',text)
    text = re.sub('\. ',' ',text)
    text = re.sub(' \.',' ',text)
    text = re.sub('\.{2,}',' ',text)
    text = re.sub(r'/',' ',text)
    text = re.sub('-','',text)
    text = re.sub('"','',text)
    text = re.sub('[<>()]',' ',text)

    # Convert sequences of numbers to zero
    text = re.sub('\d+', '0', text)
    
    # Remove short words
    text = shortword.sub('', text)
    
    # Remove extra whitespace
    text = re.sub(' +',' ',text)

    return text

# Set up count vectorizer to use custom preprocessor
# Using bigrams in the vectorizer gains about a percentage point of accuracy, but it appears that using
# trigrams or larger n-grams doesn't provide any further gains. 
vectotron = CountVectorizer(preprocessor=text_cleaner, analyzer='word',ngram_range=(2,2)) 
x_train_vect = vectotron.fit_transform(x_train_text)
x_test_vect = vectotron.transform(x_test_text)
# print(vectotron.vocabulary_)


# Fit a Bernoulli Naive Bayes model using the vectorized text and use GridSearch to optimize params
model_TextNB = BernoulliNB()
alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 20.0, 50.0, 100.0]}
BernNB_clf = GridSearchCV(model_TextNB,param_grid=alphas)
BernNB_clf.fit(x_train_vect,y_train)
print('Optimized score for BernoulliNB (alpha=',BernNB_clf.best_params_['alpha'],'): ',BernNB_clf.best_score_,'\n',sep='')
alpha_optimal = BernNB_clf.best_params_['alpha']

# Predict and check accuracy
model_TextNB = BernoulliNB(alpha=alpha_optimal)
model_TextNB.fit(x_train_vect,y_train)
predict_NB = model_TextNB.predict(x_test_vect)
test_accNB = metrics.accuracy_score(y_test, predict_NB)
print(test_accNB)

score_NB = model_TextNB.score(x_test_vect, y_test)
print(score_NB)


  'setting alpha = %.1e' % _ALPHA_MIN)
  'setting alpha = %.1e' % _ALPHA_MIN)
  'setting alpha = %.1e' % _ALPHA_MIN)


Optimized score for BernoulliNB (alpha=10.0): 0.75139470013947

0.76023890785
0.76023890785


### Try TfidVectorizer

In [80]:
# Try a Tfid vectorizer instead of counts, still using bigrams
vectimus_prime = TfidfVectorizer(preprocessor=text_cleaner, analyzer='word',ngram_range=(2,2)) 
x_train_vect = vectimus_prime.fit_transform(x_train_text)
x_test_vect = vectimus_prime.transform(x_test_text)

# Try Logistic Regression instead of Naive Bayes
model_logR = LogisticRegression()
model_logR.fit(x_train_vect,y_train)
predict_logR = model_logR.predict(x_test_vect)
test_acclogR = metrics.accuracy_score(y_test,predict_logR)
print(test_acclogR)

score_logR = model_logR.score(x_test_vect, y_test)
print(score_logR)

# Accuracy score is exactly the same as the BernoulliNB model. That seems weird, right?

0.758532423208
0.758532423208


### Try reducing vocab

In [81]:
# Try reducing the vocabulary to eliminate meaningless words.

vectorsaurus = CountVectorizer(preprocessor=text_cleaner, analyzer='word', ngram_range=(2,2)) 
x_train_vect = vectorsaurus.fit_transform(x_train_text)
x_test_vect = vectorsaurus.transform(x_test_text)

# Determine weights with Logistic Regression and L1 regularization
model_logR2 = LogisticRegression(penalty='l1')
model_logR2_fit = model_logR2.fit(x_train_vect, y_train)

# Create list of vocabulary words and their associated weights, then filter out everything with weight of zero.
word_weights = dict(zip(vectorsaurus.vocabulary_.keys(),model_logR2_fit.coef_[0]))
word_weights = dict((k, v) for k, v in word_weights.items() if v != 0)

# Create new vocabulary without zero-weight features
# new_vocab = { key: vectorsaurus.vocabulary_[key] for key in word_weights.keys() }
new_vocab = list(word_weights.keys())

# Re-run the vectorization, and run the models again using the new data
vectorsaurus_rex = CountVectorizer(preprocessor=text_cleaner, vocabulary=new_vocab)
x_train_vect = vectorsaurus_rex.fit_transform(x_train_text)
x_test_vect = vectorsaurus_rex.transform(x_test_text)

model_logR3 = LogisticRegression()
model_logR3_fit = model_logR3.fit(x_train_vect, y_train)
predict_logR3 = model_logR3_fit.predict(x_test_vect)
test_acclogR3 = metrics.accuracy_score(y_test,predict_logR3)

print(test_acclogR3)

score_logR3 = model_logR3_fit.score(x_test_vect, y_test)
print(score_logR3)
# Ok, this is getting weird.


0.76023890785
0.76023890785
