# Random Acts of Pizza - Baseline Model & EDA
## Authors: Ben Arnoldy, Mary Boardman, Zach Merritt, and Kevin Gifford
#### Kaggle Competition Description:
In machine learning, it is often said there are no free lunches. How wrong we were.

This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

"I'll write a poem, sing a song, do a dance, play an instrument, whatever! I just want a pizza," says one hopeful poster. What about making an algorithm?

Kaggle is hosting this competition for the machine learning community to use for fun and practice. This data was collected and graciously shared by Althoff et al. (Buy them a pizza -- data collection is a thankless and tedious job!) We encourage participants to explore their accompanying paper and ask that you cite the following reference in any publications that result from your work:

Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests, Proceedings of ICWSM, 2014.
_______________________________________________________________________________________________

## Notebook Title: EDA & Model Baseline
#### Purpose: Load the 'Random Acts of Pizza' train and test data. Conduct an exploratory data analysis to gain an understanding of the data. Create a baseline Logisitic Regression model using non-text (numeric) fields only. 

## I. Load Data and Modules, Process Data

### A. Load Data and Modules

In [50]:
import json
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib.colors import ListedColormap
from matplotlib.colors import LogNorm
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from subprocess import check_output
from wordcloud import WordCloud, STOPWORDS

#ML
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
from sklearn.mixture import GaussianMixture

from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import sparse_random_matrix

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

from sklearn.pipeline import make_pipeline

%matplotlib inline
plt.style.use('bmh')

In [3]:
#1. Train Data
with open('../data/train.json') as fin:
    trainjson = json.load(fin)
train = pd.io.json.json_normalize(trainjson)
#2. Test Data
with open('../data/test.json') as fin:
    testjson = json.load(fin)
test = pd.io.json.json_normalize(testjson)

print("Train Shape:", train.shape)
print("Test Shape:", test.shape)

Train Shape: (4040, 32)
Test Shape: (1631, 17)


### B1. Find any missing values

In [4]:
train.isnull().sum()

giver_username_if_known                                    0
number_of_downvotes_of_request_at_retrieval                0
number_of_upvotes_of_request_at_retrieval                  0
post_was_edited                                            0
request_id                                                 0
request_number_of_comments_at_retrieval                    0
request_text                                               0
request_text_edit_aware                                    0
request_title                                              0
requester_account_age_in_days_at_request                   0
requester_account_age_in_days_at_retrieval                 0
requester_days_since_first_post_on_raop_at_request         0
requester_days_since_first_post_on_raop_at_retrieval       0
requester_number_of_comments_at_request                    0
requester_number_of_comments_at_retrieval                  0
requester_number_of_comments_in_raop_at_request            0
requester_number_of_comm

No missing data except in column "requester_user_flair." We see in the next section that this isn't a column in the test data, so we may just elect to not use it to train the model.

### B2. Identify common columns between test and train

In [5]:
print("Common columns in train and test:")
print(train.columns[train.columns.isin(test.columns)])
print("----")
print("Columns in train but NOT test:")
print(train.columns[~train.columns.isin(test.columns)])

Common columns in train and test:
Index(['giver_username_if_known', 'request_id', 'request_text_edit_aware',
       'request_title', 'requester_account_age_in_days_at_request',
       'requester_days_since_first_post_on_raop_at_request',
       'requester_number_of_comments_at_request',
       'requester_number_of_comments_in_raop_at_request',
       'requester_number_of_posts_at_request',
       'requester_number_of_posts_on_raop_at_request',
       'requester_number_of_subreddits_at_request',
       'requester_subreddits_at_request',
       'requester_upvotes_minus_downvotes_at_request',
       'requester_upvotes_plus_downvotes_at_request', 'requester_username',
       'unix_timestamp_of_request', 'unix_timestamp_of_request_utc'],
      dtype='object')
----
Columns in train but NOT test:
Index(['number_of_downvotes_of_request_at_retrieval',
       'number_of_upvotes_of_request_at_retrieval', 'post_was_edited',
       'request_number_of_comments_at_retrieval', 'request_text',
       '

As can be seen above, there is a series of columns in the training data only. These columns reflect data about the post (e.g., the #of upvotes) at the time this Reddit data was retrieved. We use certain supervised and unsupversied techniques to derive value from this data (even though that information is not provided on the data set we will be predicting).

### C. Create training data, labels, and special 'in training only' data

In [6]:
train_labels_master = train[['requester_received_pizza']]
train_data_master = train[test.columns & train.columns]
train_only_data_master = train[train.columns[~train.columns.isin(test.columns)]].drop(['requester_received_pizza'], axis = 1)

### D. Set column types and profile

In [7]:
train_data_master = train_data_master.assign(
    unix_timestamp_of_request = pd.to_datetime(
        train_data_master.unix_timestamp_of_request, unit = "s"),
    unix_timestamp_of_request_utc = pd.to_datetime(
        train_data_master.unix_timestamp_of_request_utc, unit = "s"))

In [8]:
train_data_master.describe()
train_data_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4040 entries, 0 to 4039
Data columns (total 17 columns):
giver_username_if_known                               4040 non-null object
request_id                                            4040 non-null object
request_text_edit_aware                               4040 non-null object
request_title                                         4040 non-null object
requester_account_age_in_days_at_request              4040 non-null float64
requester_days_since_first_post_on_raop_at_request    4040 non-null float64
requester_number_of_comments_at_request               4040 non-null int64
requester_number_of_comments_in_raop_at_request       4040 non-null int64
requester_number_of_posts_at_request                  4040 non-null int64
requester_number_of_posts_on_raop_at_request          4040 non-null int64
requester_number_of_subreddits_at_request             4040 non-null int64
requester_subreddits_at_request                       4040 non-null obj

## V. Dimensionality Reduction

In [53]:
# Use a CountVectorizer to vectorize the data, since it is text-based
vectorizer = CountVectorizer()
vtrain = vectorizer.fit_transform(train)
vtest = vectorizer.fit_transform(test)

# Since PCA does not work for a sparse matrix, we are using Truncated SVD instead for dimensionality reduction
for i in range (1, 32):
    svd = TruncatedSVD(n_components = i)
    svd.fit(vtrain)
    print(svd.explained_variance_ratio_)

[ 0.03224117]
[ 0.03224219  0.0304133 ]
[ 0.02992198  0.02807986  0.0305983 ]
[ 0.03197376  0.03217411  0.03178523  0.03225802]
[ 0.02752933  0.0317864   0.03224705  0.03097403  0.03101596]
[ 0.03075156  0.03211839  0.02910555  0.03221067  0.0269372   0.02987501]
[ 0.03209292  0.02900304  0.03221527  0.03225631  0.03177776  0.03186617
  0.02776758]
[ 0.03145666  0.02913705  0.03225681  0.03031721  0.03130617  0.03223785
  0.03093214  0.03198467]
[ 0.03222603  0.03003651  0.03131154  0.03225538  0.03224564  0.03195873
  0.02689188  0.03225771  0.03194264]
[ 0.03186377  0.03219792  0.02789387  0.03130725  0.02995907  0.03222589
  0.03225795  0.03220399  0.0310018   0.03223042]
[ 0.03225764  0.0298621   0.03223184  0.03214671  0.03198844  0.03196338
  0.03189147  0.03205516  0.03146474  0.03101119  0.03209509]
[ 0.02891589  0.03180612  0.03186347  0.03225229  0.0320773   0.03219261
  0.03157722  0.03208945  0.03208882  0.0318493   0.03104503  0.03201184]
[ 0.03124322  0.03215426  0.031341

### A. Creating Dimension-Reduced Data Sets with Number of Components 2-6

In [34]:
# Instead of a for-loop, we created separate data sets, using separate variables for each. 

svd_2 = TruncatedSVD(n_components = 2)
vtrain_2 = svd_2.fit(vtrain)
print(svd_2.explained_variance_ratio_)

svd_3 = TruncatedSVD(n_components = 3)
vtrain_3 = svd_3.fit(vtrain)
print(svd_3.explained_variance_ratio_)

svd_4 = TruncatedSVD(n_components = 4)
vtrain_4 = svd_4.fit(vtrain)
print(svd_4.explained_variance_ratio_)

svd_5 = TruncatedSVD(n_components = 5)
vtrain_5 = svd_5.fit(vtrain)
print(svd_5.explained_variance_ratio_)

svd_6 = TruncatedSVD(n_components = 6)
vtrain_6 = svd_6.fit(vtrain)
print(svd_6.explained_variance_ratio_)

[ 0.0322521   0.03203267]
[ 0.03038149  0.03224394  0.03212615]
[ 0.03216073  0.03181132  0.0322574   0.03195261]
[ 0.02992446  0.0320203   0.03225629  0.03210184  0.03220781]
[ 0.02920177  0.02920185  0.03142868  0.03000636  0.03091608  0.03206158]


### B. K-Means

In [99]:
raw_score = []
for i in range(1,21):
    kmeans_raw = KMeans(n_clusters=i, init='k-means++', max_iter=100, n_init=1, verbose=False)
    kmeans_raw.fit(vtrain)
    y_hat_kmeans_raw = kmeans_raw.predict(vtrain)
    centers = kmeans_raw.cluster_centers_
#     print('The KMeans score for the raw data, using',i,'clusters is:',kmeans_raw.score(vtrain[y_hat_kmeans_raw]))
    raw_score.append(kmeans_raw.score(vtrain[y_hat_kmeans_raw]))
print(raw_score)

svd_2 = TruncatedSVD(n_components = 2)
normalizer = Normalizer(copy=False)
lsa_2 = make_pipeline(svd_2, normalizer)
vtrain_lsa2 = lsa_2.fit_transform(vtrain)

score_2 = []
for i in range(1,21):
    kmeans2 = KMeans(n_clusters=i, init='k-means++', max_iter=100, n_init=1, verbose=False)
    kmeans2.fit(vtrain_lsa2)
    y_hat_kmeans2 = kmeans2.predict(vtrain_lsa2)
    centers = kmeans2.cluster_centers_
#     print('The KMeans score for the truncated, 2 component data, using',i,'clusters is:',
#           kmeans2.score(vtrain_lsa2[y_hat_kmeans]))
    score_2.append(kmeans2.score(vtrain_lsa2[y_hat_kmeans2]))
print(score_2)

svd_3 = TruncatedSVD(n_components = 3)
normalizer = Normalizer(copy=False)
lsa_3 = make_pipeline(svd_3, normalizer)
vtrain_lsa3 = lsa_3.fit_transform(vtrain)

score_3 = []
for i in range(1,21):
    kmeans3 = KMeans(n_clusters=i, init='k-means++', max_iter=100, n_init=1, verbose=False)
    kmeans3.fit(vtrain_lsa3)
    y_hat_kmeans3 = kmeans3.predict(vtrain_lsa3)
    centers = kmeans3.cluster_centers_
#     print('The KMeans score for the truncated, 3 component data, using',i,'clusters is:',
#           kmeans3.score(vtrain_lsa3[y_hat_kmeans3]))
    score_3.append(kmeans3.score(vtrain_lsa3[y_hat_kmeans3]))
print(score_3)

svd_4 = TruncatedSVD(n_components = 4)
normalizer = Normalizer(copy=False)
lsa_4 = make_pipeline(svd_4, normalizer)
vtrain_lsa4 = lsa_4.fit_transform(vtrain)

score_4 = []
for i in range(1,21):
    kmeans4 = KMeans(n_clusters=i, init='k-means++', max_iter=100, n_init=1, verbose=False)
    kmeans4.fit(vtrain_lsa4)
    y_hat_kmeans4 = kmeans4.predict(vtrain_lsa4)
    centers = kmeans4.cluster_centers_
#     print('The KMeans score for the truncated, 4 component data, using',i,'clusters is:',
#           kmeans4.score(vtrain_lsa4[y_hat_kmeans4]))
    score_4.append(kmeans4.score(vtrain_lsa4[y_hat_kmeans4]))
print(score_4)

svd_5 = TruncatedSVD(n_components = 5)
normalizer = Normalizer(copy=False)
lsa_5 = make_pipeline(svd_5, normalizer)
vtrain_lsa5 = lsa_5.fit_transform(vtrain)

score_5 = []
for i in range(1,21):
    kmeans5 = KMeans(n_clusters=i, init='k-means++', max_iter=100, n_init=1, verbose=False)
    kmeans5.fit(vtrain_lsa5)
    y_hat_kmeans5 = kmeans5.predict(vtrain_lsa5)
    centers = kmeans5.cluster_centers_
#     print('The KMeans score for the truncated, 5 component data, using',i,'clusters is:',
#           kmeans5.score(vtrain_lsa5[y_hat_kmeans5]))
    score_5.append(kmeans5.score(vtrain_lsa5[y_hat_kmeans5]))
print(score_5)

svd_6 = TruncatedSVD(n_components = 6)
normalizer = Normalizer(copy=False)
lsa_6 = make_pipeline(svd_6, normalizer)
vtrain_lsa6 = lsa_6.fit_transform(vtrain)

score_6 = []
for i in range(1,21):
    kmeans6 = KMeans(n_clusters=i, init='k-means++', max_iter=100, n_init=1, verbose=False)
    kmeans6.fit(vtrain_lsa6)
    y_hat_kmeans6 = kmeans6.predict(vtrain_lsa6)
    centers = kmeans6.cluster_centers_
#     print('The KMeans score for the truncated, 6 component data, using',i,'clusters is:',
#           kmeans5.score(vtrain_lsa6[y_hat_kmeans6]))
    score_6.append(kmeans6.score(vtrain_lsa6[y_hat_kmeans6]))
print(score_6)

# # The plot below wasn't doing much for me
# fig = plt.figure(0,figsize=(32,32))
# ax = fig.add_subplot(4,4,2)

# # Draw the plots
# ax.scatter(vtrain_lsa2[:, 0], vtrain_lsa2[:, 1], s=1, c=y_hat_kmeans)
# ax.scatter(centers[:, 0], centers[:, 1], c='red', s=50, alpha=0.6)
# ax.grid()

[-31.0, -30.967741935483893, -30.93333333333331, -30.896551724137915, -30.85714285714287, -29.85185185185184, -28.846153846153825, -4.8, -6.708333333333333, -27.7391304347826, -6.681818181818183, -28.571428571428573, -7.6000000000000005, -7.578947368421053, -10.38888888888889, -6.588235294117648, -23.4375, -18.666666666666668, -18.57142857142857, -8.307692307692308]
[-31.381506843869321, -14.997176071877483, -9.3432309570669858, -4.26587106570455, -3.160970212564111, -2.2801263787698316, -1.1013204986730025, -0.78767549045146401, -0.67964914781019481, -0.83807499302375899, -0.43500721155880928, -0.40902992260446691, -0.20883457038090492, -0.22936579346328256, -0.33493696726466982, -0.10936210689692027, -0.097280104990433269, -0.052811559813683573, -0.040251136497941653, -0.036245767758011582]
[-36.545820849825247, -18.391363441918415, -10.892638928755609, -15.076596184369798, -16.307152193077577, -10.498511174481742, -9.0304917349591989, -5.1594608451421138, -6.1096643494572636, -3.309

### C. Gaussian Mixture Model

In [114]:
# The model using the raw data fit the train data, but wouldn't predict the test data, as it wasn't the right shape. 
# I think this goes back to the different dimensionality between the two data sets. 
# vtraind = vtrain.toarray()
# vtestd = vtest.toarray()
# gmm = GaussianMixture()
# gmm.fit(vtraind)
# gmm.predict(vtestd)

svd_2 = TruncatedSVD(n_components = 2)
normalizer = Normalizer(copy=False)
lsa_2 = make_pipeline(svd_2, normalizer)
vtrain_lsa2 = lsa_2.fit_transform(vtrain)

svd_2_test = TruncatedSVD(n_components = 2)
normalizer = Normalizer(copy=False)
lsa_2_test = make_pipeline(svd_2_test, normalizer)
vtest_lsa2 = lsa_2_test.fit_transform(vtest)

gmm2 = GaussianMixture()
gmm2.fit(vtrain_lsa2)
gmm2.score_samples(vtest_lsa2)

array([-2.13235994, -2.3239353 , -1.9670322 , -2.13366984, -2.16196193,
       -2.08809671, -2.09155992, -2.32646581, -2.10368873, -2.08441638,
       -1.94689032, -2.29990774, -2.05141319, -2.3432188 , -2.34865185,
       -2.21702992, -2.19896781])

In [None]:
# Separate out the train data with positive labels, fit the GMM model
p2pos = p2_train[train_labels == 1]
pos = GaussianMixture(n_components=4, covariance_type='full')
pos.fit(p2pos)

# Separate out the train data with negative labels, fit the GMM model    
p2neg = p2_train[train_labels != 1]    
neg = GaussianMixture(n_components=4, covariance_type='full')
neg.fit(p2neg)

# Fit the test data to both models
wlprob_pos = pos.score_samples(p2_test)
wlprob_neg = neg.score_samples(p2_test)

# Create an array that picks the winner from each prediction 
pred_test_labels = np.array(wlprob_pos > wlprob_neg).astype(int)

# Calculate the accuracy to 6 decimal places, print it out
accuracy = np.round(np.mean(pred_test_labels == test_labels), 6)   
print('The accuracy of this model is:', accuracy)