### Random Acts of Pizza - Baseline Model & EDA
## Authors: Ben Arnoldy, Mary Boardman, Zach Merritt, and Kevin Gifford
#### Kaggle Competition Description:
In machine learning, it is often said there are no free lunches. How wrong we were.

This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

"I'll write a poem, sing a song, do a dance, play an instrument, whatever! I just want a pizza," says one hopeful poster. What about making an algorithm?

Kaggle is hosting this competition for the machine learning community to use for fun and practice. This data was collected and graciously shared by Althoff et al. (Buy them a pizza -- data collection is a thankless and tedious job!) We encourage participants to explore their accompanying paper and ask that you cite the following reference in any publications that result from your work:

Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests, Proceedings of ICWSM, 2014.
_______________________________________________________________________________________________

## Notebook Title: EDA & Model Baseline
#### Purpose: Load the 'Random Acts of Pizza' train and test data. Conduct an exploratory data analysis to gain an understanding of the data. Create a baseline Logisitic Regression model using non-text (numeric) fields only. 

## I. Load Data and Modules, Process Data

### A. Load Data and Modules

In [3]:
import json
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib.colors import ListedColormap
from matplotlib.colors import LogNorm
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from subprocess import check_output
#from wordcloud import WordCloud, STOPWORDS

#ML
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
from sklearn.mixture import GaussianMixture

from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import sparse_random_matrix

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

from sklearn.pipeline import make_pipeline

%matplotlib inline
plt.style.use('bmh')

In [4]:
#1. Train Data
with open('../data/train.json') as fin:
    trainjson = json.load(fin)
train = pd.io.json.json_normalize(trainjson)
#2. Test Data
with open('../data/test.json') as fin:
    testjson = json.load(fin)
test = pd.io.json.json_normalize(testjson)

print("Train Shape:", train.shape)
print("Test Shape:", test.shape)

Train Shape: (4040, 32)
Test Shape: (1631, 17)


### B1. Find any missing values

In [5]:
train.isnull().sum()

giver_username_if_known                                    0
number_of_downvotes_of_request_at_retrieval                0
number_of_upvotes_of_request_at_retrieval                  0
post_was_edited                                            0
request_id                                                 0
request_number_of_comments_at_retrieval                    0
request_text                                               0
request_text_edit_aware                                    0
request_title                                              0
requester_account_age_in_days_at_request                   0
requester_account_age_in_days_at_retrieval                 0
requester_days_since_first_post_on_raop_at_request         0
requester_days_since_first_post_on_raop_at_retrieval       0
requester_number_of_comments_at_request                    0
requester_number_of_comments_at_retrieval                  0
requester_number_of_comments_in_raop_at_request            0
requester_number_of_comm

No missing data except in column "requester_user_flair." We see in the next section that this isn't a column in the test data, so we may just elect to not use it to train the model.

### B2. Identify common columns between test and train

In [6]:
print("Common columns in train and test:")
print(train.columns[train.columns.isin(test.columns)])
print("----")
print("Columns in train but NOT test:")
print(train.columns[~train.columns.isin(test.columns)])

Common columns in train and test:
Index(['giver_username_if_known', 'request_id', 'request_text_edit_aware',
       'request_title', 'requester_account_age_in_days_at_request',
       'requester_days_since_first_post_on_raop_at_request',
       'requester_number_of_comments_at_request',
       'requester_number_of_comments_in_raop_at_request',
       'requester_number_of_posts_at_request',
       'requester_number_of_posts_on_raop_at_request',
       'requester_number_of_subreddits_at_request',
       'requester_subreddits_at_request',
       'requester_upvotes_minus_downvotes_at_request',
       'requester_upvotes_plus_downvotes_at_request', 'requester_username',
       'unix_timestamp_of_request', 'unix_timestamp_of_request_utc'],
      dtype='object')
----
Columns in train but NOT test:
Index(['number_of_downvotes_of_request_at_retrieval',
       'number_of_upvotes_of_request_at_retrieval', 'post_was_edited',
       'request_number_of_comments_at_retrieval', 'request_text',
       '

As can be seen above, there is a series of columns in the training data only. These columns reflect data about the post (e.g., the #of upvotes) at the time this Reddit data was retrieved. We use certain supervised and unsupversied techniques to derive value from this data (even though that information is not provided on the data set we will be predicting).

### C. Create training data, labels, and special 'in training only' data

In [7]:
train_labels_master = train[['requester_received_pizza']]
train_data_master = train[test.columns & train.columns]
train_only_data_master = train[train.columns[~train.columns.isin(test.columns)]].drop(['requester_received_pizza'], axis = 1)

In [8]:
print(train.shape, train_data_master.shape)

(4040, 32) (4040, 17)


### D. Set column types and profile

In [9]:
train_data_master = train_data_master.assign(
    unix_timestamp_of_request = pd.to_datetime(
        train_data_master.unix_timestamp_of_request, unit = "s"),
    unix_timestamp_of_request_utc = pd.to_datetime(
        train_data_master.unix_timestamp_of_request_utc, unit = "s"))

In [10]:
train_data_master.describe()
train_data_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4040 entries, 0 to 4039
Data columns (total 17 columns):
giver_username_if_known                               4040 non-null object
request_id                                            4040 non-null object
request_text_edit_aware                               4040 non-null object
request_title                                         4040 non-null object
requester_account_age_in_days_at_request              4040 non-null float64
requester_days_since_first_post_on_raop_at_request    4040 non-null float64
requester_number_of_comments_at_request               4040 non-null int64
requester_number_of_comments_in_raop_at_request       4040 non-null int64
requester_number_of_posts_at_request                  4040 non-null int64
requester_number_of_posts_on_raop_at_request          4040 non-null int64
requester_number_of_subreddits_at_request             4040 non-null int64
requester_subreddits_at_request                       4040 non-null obj

## V. Dimensionality Reduction

### CountVectorizer

In [11]:
# I believe we cannot just vectorize the train data because the text-based data
# is held within one specific feature called request_test_edit_aware.
# Below I repurposed some of Zach's code to vectorize

#Create Sparse matrix of words
count_vect = CountVectorizer()

#Split train and test data
x_train_text, x_test_text, y_train, y_test = train_test_split(
    train_data_master['request_text_edit_aware'], 
    train_labels_master.values.ravel(), test_size=0.29, random_state=0)
# x_train, x_test, y_train, y_test = train_test_split(
#    train_data_master,
#    train_labels_master.values.ravel(), test_size=0.29, random_state=0)
x_train = count_vect.fit_transform(x_train_text)
x_test = count_vect.transform(x_test_text)
cv_feature_names = count_vect.get_feature_names()

### Apply SVD (in lieu of PCA)

In [12]:
# Since PCA does not work for a sparse matrix, we are using Truncated SVD instead for dimensionality reduction
for top_comps in range (1, 15):
    svd = TruncatedSVD(n_components = top_comps)
    svd.fit(x_train)
    print(svd.explained_variance_ratio_)

[ 0.25966847]
[ 0.25966847  0.02954686]
[ 0.25966847  0.02954676  0.02266197]
[ 0.25966847  0.02954691  0.02266269  0.02118754]
[ 0.25966847  0.02954692  0.02266293  0.02118774  0.01776063]
[ 0.25966847  0.0295469   0.02266255  0.02118853  0.01776144  0.01537296]
[ 0.25966847  0.02954693  0.02266229  0.02118824  0.01776452  0.01537306
  0.01309128]
[ 0.25966847  0.02954693  0.02266291  0.02118755  0.01776222  0.01537325
  0.01309176  0.01216625]
[ 0.25966847  0.02954692  0.02266274  0.02118772  0.01776315  0.01537217
  0.0130813   0.01218914  0.0110454 ]
[ 0.25966847  0.02954693  0.0226628   0.02118786  0.01776331  0.01537374
  0.01309167  0.01218854  0.01104436  0.01060167]
[ 0.25966847  0.02954693  0.02266268  0.02118768  0.01776264  0.01537374
  0.01309247  0.0121846   0.01104337  0.01060253  0.0101695 ]
[ 0.25966847  0.02954693  0.02266252  0.0211875   0.01776295  0.01537366
  0.01309259  0.0121883   0.01103768  0.0106001   0.01016479  0.00949482]
[ 0.25966847  0.02954692  0.022662

In [None]:
# The output of vectorizer shows a major dropoff in explained variance after the 1st component
# This suggests that projecting down to 1 component might be best.

### Try various hyperparameters with GMM on text field, try to classify 

In [14]:
def train_gmm(svd_comps = 2, gmm_comps = 4, cov_type = 'full', sparse = True):
    
    # project data to few dimensions using SVD if sparse, PCA if dense
    if (sparse): svd_test = TruncatedSVD(n_components=svd_comps)
    else: svd_test = PCA(n_components=svd_comps)
    svd_test_ft = svd_test.fit_transform(x_test)
    if (sparse): svd_train = TruncatedSVD(n_components=svd_comps)
    else: svd_train = PCA(n_components=svd_comps)
    svd_train_ft = svd_train.fit_transform(x_train)

    # create two sets of svd data, one that's positively labeled, one that's negatively labeled
    pos_svd = svd_train_ft[y_train == 1]
    neg_svd = svd_train_ft[y_train == 0]
    
    # fit a GMM for pos and neg datasets
    gmm_pos = GaussianMixture(n_components=gmm_comps, covariance_type=cov_type) 
    gmm_fit_pos = gmm_pos.fit(pos_svd)
    gmm_neg = GaussianMixture(n_components=gmm_comps, covariance_type=cov_type) 
    gmm_fit_neg = gmm_neg.fit(neg_svd)

    prediction = np.ndarray(shape=y_test.shape)
    
    for sample in range(svd_test_ft.shape[0]):
        pos_score = gmm_fit_pos.score(svd_test_ft[sample].reshape(1,-1))
        neg_score = gmm_fit_neg.score(svd_test_ft[sample].reshape(1,-1))
        # make pick
        if (pos_score >= neg_score): prediction[sample] = 1
        else: prediction[sample] = 0 

    # calculate accuracy
    accuracy = metrics.accuracy_score(y_test, prediction)
    # f1 = metrics.f1_score(y_test, prediction)
    print("Accuracy score for", svd_comps,"SVD comps,",gmm_comps,"GMM comps,","Covariance type",cov_type,"=",accuracy)

    return(accuracy)

def gmm_trials(sparse = True, max_params = 50):

    def calc_limits(max_params = 50):
        # this function helps estaish the svd and gmm settings to stay within max_params

        valid_configs = [] # list of tuples (num_svd_components, num_gmm_components)

        for cov_type in ['spherical', 'diag', 'tied', 'full']:
            for svd_comp in range(1,20): 
                for gmm_comp in range(1,20): 
                    if ((svd_comp + svd_comp) * gmm_comp) * 2 <= max_params:
                        valid_configs.append((svd_comp, gmm_comp, cov_type))

        return(valid_configs)

    configs = calc_limits(max_params) # get the valid configurations

    top_accuracy_val = 0 # keeps track of top accuracy value
    top_accuracy_config = () # keeps track of config that creates top accuracy value
    
    for config in configs:
        accuracy = train_gmm(config[0], config[1], config[2])
        if accuracy > top_accuracy_val:
            top_accuracy_val = accuracy
            top_accuracy_config = config

    print("*********************************************************************************")
    print("The best accuracy score is", top_accuracy_val)
    print("To get it, set SVD comps to", top_accuracy_config[0], ", GMM comps to", top_accuracy_config[1],", and covariance type to", top_accuracy_config[2])

    
#Create Sparse matrix of words
count_vect = CountVectorizer()

#Split train and test data
x_train_text, x_test_text, y_train, y_test = train_test_split(
    train_data_master['request_text_edit_aware'], 
    train_labels_master.values.ravel(), test_size=0.29, random_state=0)
x_train = count_vect.fit_transform(x_train_text)
x_test = count_vect.transform(x_test_text)

gmm_trials(sparse = True)

Accuracy score for 1 SVD comps, 1 GMM comps, Covariance type spherical = 0.652730375427
Accuracy score for 1 SVD comps, 2 GMM comps, Covariance type spherical = 0.581911262799
Accuracy score for 1 SVD comps, 3 GMM comps, Covariance type spherical = 0.570819112628
Accuracy score for 1 SVD comps, 4 GMM comps, Covariance type spherical = 0.523890784983
Accuracy score for 1 SVD comps, 5 GMM comps, Covariance type spherical = 0.584470989761
Accuracy score for 1 SVD comps, 6 GMM comps, Covariance type spherical = 0.575938566553
Accuracy score for 1 SVD comps, 7 GMM comps, Covariance type spherical = 0.587030716724
Accuracy score for 1 SVD comps, 8 GMM comps, Covariance type spherical = 0.555460750853
Accuracy score for 1 SVD comps, 9 GMM comps, Covariance type spherical = 0.574232081911
Accuracy score for 1 SVD comps, 10 GMM comps, Covariance type spherical = 0.579351535836
Accuracy score for 1 SVD comps, 11 GMM comps, Covariance type spherical = 0.578498293515
Accuracy score for 1 SVD comps

KeyboardInterrupt: 

Results: The best accuracy score doesn't beat the 0.68 naive bayes baseline, so dimensionality reduction on the vectorized text field doesn't help. 

### Next: Dimensionality reduction on all features

In [None]:
#Try using ALL features, even those not in the test data. 

#Normalize all fields (numeric)
min_max_scaler = preprocessing.MinMaxScaler()
mn_mx_scaler_allfeats = min_max_scaler.fit_transform(
    train.select_dtypes(include = ['float64', 'int64','datetime64[ns]']).apply(pd.to_numeric).values)
mn_mx_scaler_commonfeats = min_max_scaler.fit_transform(
    train_data_master.select_dtypes(include = ['float64', 'int64','datetime64[ns]']).apply(pd.to_numeric).values)

#Split train and test data
x_train, x_test = mn_mx_scaler_allfeats[:3232], mn_mx_scaler_commonfeats[3232:] 
y_train, y_test = train_labels_master[:3232].values.ravel(), train_labels_master[3232:].values.ravel()



In [None]:
gmm_trials(sparse = False, max_params = 50)

Results: A 0.74 accuracy level is less than what was achieved with logistic regression baseline of 0.75. 

In [None]:
#Try using just the features in common with train and test. 

#Normalize all fields (numeric)
min_max_scaler = preprocessing.MinMaxScaler()
mn_mx_scaler = min_max_scaler.fit_transform(
    train.select_dtypes(include = ['float64', 'int64','datetime64[ns]']).apply(pd.to_numeric).values)

#Split train and test data
x_train, x_test = mn_mx_scaler[:3232], mn_mx_scaler[3232:] 
y_train, y_test = train_labels_master[:3232].values.ravel(), train_labels_master[3232:].values.ravel()


In [None]:
gmm_trials(sparse = False, max_params = 50) 

Result: a 0.73 accuracy score is less than the logistic regression baseline. 

### Reduce vectorized word field, plug it into logistic regression with other features

In [33]:
#Create Sparse matrix of words
count_vect = CountVectorizer()
x_train = train_data_master[:3232]['request_text_edit_aware']
x_test = train_data_master[3232:]['request_text_edit_aware'] 
train_features = count_vect.fit_transform(x_train)
test_features = count_vect.transform(x_test)

#Reduce the vectorized word feature to small number of numerical components
svd_comps = 110 # 110 after some trial and error, this was the best
svd = TruncatedSVD(n_components=svd_comps)
svd_ft_tr = svd.fit_transform(train_features)
svd_ft_dv = svd.fit_transform(test_features)
svd_ft = np.concatenate((svd_ft_tr, svd_ft_dv), axis=0)

# add the newly reduced text field feature to the train and test data
train_data_SVD = train_data_master.copy(deep = True)
for comp in range(svd_comps):
    train_data_SVD['SVD'+str(comp)] = svd_ft[:,comp]

#Normalize all fields (numeric)
min_max_scaler = preprocessing.MinMaxScaler()
mn_mx_scaler = min_max_scaler.fit_transform(train_data_SVD.select_dtypes(include = ['float64', 'int64','datetime64[ns]']).apply(pd.to_numeric).values)

#Split train and test data
x_train, x_test = mn_mx_scaler[:3232], mn_mx_scaler[3232:] 
y_train, y_test = train_labels_master[:3232].values.ravel(), train_labels_master[3232:].values.ravel()


In [None]:
# do logistic regression on our new train data

def model_report(title, y_test, predictions):

    """
    Output: Classification report, confusion matrix, and ROC curve
    """
    print(title)
    print("---------")
    print(classification_report(y_test, predictions))

    cm = metrics.confusion_matrix(y_test, predictions)
    plt.figure(figsize=(3,3))
    sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
    plt.ylabel('Actual label');
    plt.xlabel('Predicted label');
    all_sample_title = 'Accuracy: {0}'.format(round(metrics.accuracy_score(y_test, predictions),2))
    plt.title(all_sample_title, size = 15)
    plt.show()
    
    fpr, tpr, threshold = metrics.roc_curve(y_test, predictions)
    roc_auc = metrics.auc(fpr, tpr)

    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    
#Train Model
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()
logisticRegr.fit(x_train, y_train)
predictions = logisticRegr.predict(x_test)

#Output model report
model_report("Logistic Regression (using only common numeric fields)",y_test, predictions)

In [34]:
#Create Sparse matrix of words
count_vect = TfidfVectorizer()
x_train = train_data_master[:3232]['request_text_edit_aware']
x_test = train_data_master[3232:]['request_text_edit_aware'] 
train_features = count_vect.fit_transform(x_train)
test_features = count_vect.transform(x_test)

#Reduce the vectorized word feature to small number of numerical components
svd_comps = 25 # 25 after some trial and error, this was the best
svd = TruncatedSVD(n_components=svd_comps)
svd_ft_tr = svd.fit_transform(train_features)
svd_ft_dv = svd.fit_transform(dev_features)
svd_ft = np.concatenate((svd_ft_tr, svd_ft_dv), axis=0)

# add the newly reduced text field feature to the train and test data
train_data_SVD = train_data_master.copy(deep = True)
for comp in range(svd_comps):
    train_data_SVD['SVD'+str(comp)] = svd_ft[:,comp]

#Normalize all fields (numeric)
min_max_scaler = preprocessing.MinMaxScaler()
mn_mx_scaler = min_max_scaler.fit_transform(train_data_SVD.select_dtypes(include = ['float64', 'int64','datetime64[ns]']).apply(pd.to_numeric).values)

#Split train and test data
x_train, x_test = mn_mx_scaler[:3232], mn_mx_scaler[3232:] 
y_train, y_test = train_labels_master[:3232].values.ravel(), train_labels_master[3232:].values.ravel()


NameError: name 'dev_features' is not defined

In [35]:
# do logistic regression on our new train data

def model_report(title, y_test, predictions):

    """
    Output: Classification report, confusion matrix, and ROC curve
    """
    print(title)
    print("---------")
    print(classification_report(y_test, predictions))

    cm = metrics.confusion_matrix(y_test, predictions)
    plt.figure(figsize=(3,3))
    sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
    plt.ylabel('Actual label');
    plt.xlabel('Predicted label');
    all_sample_title = 'Accuracy: {0}'.format(round(metrics.accuracy_score(y_test, predictions),2))
    plt.title(all_sample_title, size = 15)
    plt.show()
    
    fpr, tpr, threshold = metrics.roc_curve(y_test, predictions)
    roc_auc = metrics.auc(fpr, tpr)

    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    
#Train Model
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression(C=.01)
logisticRegr.fit(x_train, y_train)
predictions = logisticRegr.predict(x_test)

#Output model report
model_report("Logistic Regression (using only common numeric fields)",y_test, predictions)

ValueError: could not convert string to float: "We're completely broke and craving pizza. I thought I was super smart and ordered Domino's gift cards on Walmart.com with my Paypal - they aren't going to be e-delivered until tomorrow! Payday is Friday, and we don't have any money for groceries until then... but I will reciprocate! I will also send a homemade candle if someone wants... if you're kind enough to feed hubs and I, let me know if you'd like one, and I'll let you know the scents that we have available. \n\nThank you!"

### Results
* Reducing the vectorized word feature got us to 0.68 accuracy - the exact same accuracy that Naive Bayes got us. 
* Reducing all the features (even those not in the test set) got us to 0.74 accuracy - a bit less than the 0.75 accuracy from logistic regression on the features. 
* Reducing just the features that are shared by train and test sets got us to 0.73 accuracy - even further from the 0.75 accuracy of logistic regression.
* Vectorizing the text field, reducing it through SVD to 110 components, adding it to the train data, and running logistic regression again did improve accuracy by small margins.
* Tfidf vectorizing the text field before doing the above did not improve the accuracy.