## Predicting Spooky Authors: Part 1
### Author: Blake Conrad

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df_train = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")

## Bag of words
Text preprocessing, tokenizing, and stopword filtering are handled by a high-level scikit-learn component (`CountVectorizer`) that builds a dictionary of features and transforms documents into feature vectors.

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.
To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.
Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.
This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
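
To make the two definitions concrete, here is a minimal pure-Python sketch on a made-up three-sentence corpus. It uses the plain textbook formulas (tf = count / document length, idf = log(N / df)); the corpus and function names are my own illustration. Note that scikit-learn's `TfidfTransformer` by default uses a smoothed idf, `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = [
    "the raven sat on the door",
    "the monster fled the village",
    "the door opened",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def term_frequencies(tokens):
    """Occurrences of each word divided by the document length."""
    counts = Counter(tokens)
    return {word: c / len(tokens) for word, c in counts.items()}

def inverse_document_frequency(word):
    """Textbook idf, log(N / df); zero for a word found in every document."""
    return math.log(n_docs / doc_freq[word])

def tf_idf(tokens):
    """Term frequency times inverse document frequency for one document."""
    return {w: tf * inverse_document_frequency(w)
            for w, tf in term_frequencies(tokens).items()}

tf_first = term_frequencies(tokenized[0])
tfidf_first = tf_idf(tokenized[0])
print(tf_first)
print(tfidf_first)
```

The word "the" appears in every document, so its weight drops to zero, while "raven" (one document) outweighs "door" (two documents) even though both occur once in the first sentence.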

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def get_bag_of_words(X):
    """Return the tf-idf matrix for X.

    X is an iterable of sentences: a list of strings or a pandas Series, e.g.
    ["This is the first sentence, yes.",
     "Now you're getting the idea, aren't you?",
     ...]
    """
    # Raw occurrence counts, one column per vocabulary word
    count_vect = CountVectorizer()
    X_counts = count_vect.fit_transform(X)

    # Normalized term frequencies only (use_idf=False); shown for illustration, not returned
    tf_transformer = TfidfTransformer(use_idf=False).fit(X_counts)
    X_tf = tf_transformer.transform(X_counts)

    # Term frequencies reweighted by inverse document frequency
    tfidf_transformer = TfidfTransformer()
    X_tfidf = tfidf_transformer.fit_transform(X_counts)

    print("Bag of words created!")
    return X_tfidf

## Split training sets
The vocabulary is built from whatever text the vectorizer sees. If we built the term frequencies and tf-idf matrices separately for the training and testing splits, the two vocabularies would differ and the matrices would have mismatched columns.

To avoid any dimensionality mismatch, we build one universal vocabulary from the training and testing text together, but train the model only on the rows that belong to the training points. Likewise, when testing, we use only the testing rows.

This is loosely managed by `pd.concat([df1, df2, ..., dfn])`, which stacks rows like `rbind(df1, df2)` in R. With `X_train` placed first in the concatenation, the training rows are `0:len(X_train)`. For the test matrix we put `X_test` first, so the testing rows are `0:len(X_test)`. We use this logic again later when building a predictor and doing cross validation, so it is good to understand the little trick now.
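
A common alternative to the concatenation trick, sketched here on toy sentences of my own, is to fit the vectorizer on the training text only and then call `transform` on the test text. The test matrix automatically gets the training vocabulary's columns, and words that appear only in the test text are dropped. `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` in one step. This is not what the notebook does below; it is only a sketch of the other option.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for illustration only
train_texts = ["the raven sat on the door", "the monster fled the village"]
test_texts = ["a raven and a monster met at the castle"]

vectorizer = TfidfVectorizer()
X_train_demo = vectorizer.fit_transform(train_texts)  # learns vocabulary and idf
X_test_demo = vectorizer.transform(test_texts)        # reuses them, learns nothing new

# Both matrices share the same columns, so a model fit on one can score the other
print(X_train_demo.shape, X_test_demo.shape)
```

One design difference: here the idf weights never see the test text, whereas building the vocabulary on train and test together lets test documents influence the weights.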

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train["text"],
                                                    df_train['author'],
                                                    test_size=0.33, random_state=42)

X_train_tfidf = get_bag_of_words(pd.concat([X_train, X_test]))
X_test_tfidf = get_bag_of_words(pd.concat([X_test, X_train]))

## Train a Naive Bayes Model to predict Authors

`MultinomialNB` implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
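
As a rough illustration of how these probabilities are estimated, the sketch below computes the smoothed relative frequencies $\hat{\theta}_{yi} = (N_{yi} + \alpha) / (N_y + \alpha n)$ by hand, where $N_{yi}$ is the count of word $i$ in class $y$, $N_y$ is the total word count of the class, and $\alpha$ is the smoothing parameter (the `alpha` argument of `MultinomialNB`). The toy samples and labels are placeholders of my own.

```python
from collections import Counter

# Toy labelled samples; labels and words are placeholders for illustration
samples = [
    ("EAP", ["raven", "door", "raven"]),
    ("MWS", ["monster", "village"]),
    ("EAP", ["door", "night"]),
]
vocabulary = sorted({w for _, words in samples for w in words})
alpha = 1.0  # Laplace smoothing

def smoothed_theta(label):
    """Estimate P(word | label) with additive smoothing."""
    counts = Counter(w for y, words in samples if y == label for w in words)
    total = sum(counts.values())
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

theta_eap = smoothed_theta("EAP")
theta_mws = smoothed_theta("MWS")
print(theta_eap)
print(theta_mws)
```

Thanks to the smoothing, a word never seen with a class (such as "monster" for the first label) still gets a small non-zero probability instead of zeroing out the whole product.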

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Fit on our term frequency inverse document frequency
clf = MultinomialNB().fit(X_train_tfidf[:len(X_train)], y_train)

# Predict on the held-out test rows
y_pred = clf.predict(X_test_tfidf[:len(X_test)])
print("Top 5 Predictions on X_test: ", y_pred[:5])



## Performance
Now we are interested in seeing how well we actually predicted. I import some metric functions, then build a little helper function to get a quick glance at how well we did on some standard performance measures. I typically like to use `accuracy` as a benchmark; `log_loss` (cross-entropy) and others are used depending on the type of problem being dealt with.
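
Since the submission table further down holds one probability per author, it helps to see what a probability-based metric measures. Here is a minimal pure-Python sketch of multiclass log loss; the function name, class labels, and numbers are my own illustration. In practice `sklearn.metrics.log_loss` does the same job when given the output of `predict_proba` rather than hard labels.

```python
import math

def multiclass_log_loss(y_true, proba, classes, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        p = row[classes.index(label)]
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

classes = ["A", "B", "C"]
y_true_demo = ["A", "C"]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

loss_confident = multiclass_log_loss(y_true_demo, confident, classes)
loss_uniform = multiclass_log_loss(y_true_demo, uniform, classes)
print(loss_confident, loss_uniform)
```

Guessing uniformly over three classes scores ln(3), so any useful model should come in below that.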

## Report

Not too bad: we got 79% accuracy on our first shot. Let's take a look at building a `Pipeline`, `Parameter Grid Optimization`, and `Cross Validation` to see what our best model looks like. Additionally, we can start to look at other models (i.e., new pipelines) for `SVM`, `Random Forest`, `Adaptive Boosting`, and `Ensemble`.

In [None]:
# Metrics used in the report; log_loss needs class probabilities, not labels
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, log_loss,
                             precision_score, recall_score)

def generate_results(y_true, y_pred):
    """Print standard classification metrics for predicted labels."""
    # Score against the y_true argument rather than the global y_test
    print('Accuracy:\n', accuracy_score(y_true, y_pred))
    print('F1 score:\n', f1_score(y_true, y_pred, average='macro'))
    print('Recall:\n', recall_score(y_true, y_pred, average='macro'))
    print('Precision:\n', precision_score(y_true, y_pred, average='macro'))
    print('Classification report:\n', classification_report(y_true, y_pred))
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    # log_loss would need predict_proba output rather than labels:
    #print('log loss:\n', log_loss(y_true, y_proba))

generate_results(y_test, y_pred)

print(classification_report(y_test, y_pred, target_names=clf.classes_))

## Actually do a Kaggle Prediction
Again, we build the vocabulary on the training and testing data together, this time using the full `train.csv` and `test.csv`.

In [None]:
# Separate names so the hold-out split (X_train_tfidf, y_train, ...) stays intact
X_train_kaggle = get_bag_of_words(pd.concat([df_train["text"], df_test["text"]]))
y_train_kaggle = df_train["author"]
X_test_kaggle = get_bag_of_words(pd.concat([df_test["text"], df_train["text"]]))

clf = MultinomialNB().fit(X_train_kaggle[:len(df_train)], y_train_kaggle)
y_proba = clf.predict_proba(X_test_kaggle[:len(df_test)])

# One probability column per author, in the order of clf.classes_
results = pd.DataFrame(y_proba, columns=clf.classes_)
results.insert(0, "id", df_test["id"].values)

# For the results, I need a table like the following
#
# id | P(author1) | P(author2) | P(author3)
#
results.head()


## Consider K-Fold Cross Validation

In [None]:
from sklearn.model_selection import KFold

# Create 10 folds to test our data on
kf = KFold(n_splits=10)
scores=[]
for train_index, test_index in kf.split(df_train):
    
    # Foldi Train/Test Data
    #print("TRAIN:", train_index, "TEST:", test_index)
    
    X_traini, X_testi = df_train.loc[train_index,"text"], df_train.loc[test_index,"text"]
    y_traini, y_testi = df_train.loc[train_index,"author"], df_train.loc[test_index,"author"]
    
    # Foldi Train/Test Bag of words
    X_train_tfidfi = get_bag_of_words(pd.concat([X_traini, X_testi]))
    X_test_tfidfi = get_bag_of_words(pd.concat([X_testi, X_traini]))
    
    # Foldi Model
    clfi = MultinomialNB().fit(X_train_tfidfi[:len(X_traini)], y_traini)

    # Test Foldi Model on Foldi held out data
    y_predi = clfi.predict(X_test_tfidfi[:len(X_testi)])
    
    # Append results, iterate
    scores.append(accuracy_score(y_testi, y_predi))
print("Accuracy Scores After 10-Fold Cross Validation:")
print(scores)
print("Average Accuracy After 10 Folds:")
print(np.mean(scores))

## SVM
So let's take a look at SVM. Some important things to note:
<ul>
<li>The hyperparameter `C=1.0` in our model is at its default. This is fine, but `C` controls how strongly training errors are penalized: a smaller `C` gives the margin around the hyperplane more wiggle room and tolerates some mispredicted training points, which may be helpful for the <em>real world situations</em> we will encounter (i.e., it avoids overfitting).</li>
<li>The parameter `kernel='rbf'` in our model is at its default. The model could be more predictive if we mapped into higher dimensions with different kernel functions. Also, `degree=3` is the default, but it only sets the degree of the polynomial kernel (`kernel='poly'`) and is ignored by the other kernels.</li>

</ul>
<p> It turns out SVM predicts very poorly in this case! So it has been removed :) </p>

## Parameter Tuning
We will use grid search for this, which suits both the SVM model we just built and our Naive Bayes model. For starters, let's do a grid search for the best `alpha` value for the Naive Bayes model, then for the best `C` value for the SVM.

Such an exhaustive search can be expensive. If we have multiple CPU cores at our disposal, we can tell the grid searcher to try the parameter combinations in parallel with the `n_jobs` parameter. With `n_jobs=-1`, grid search detects how many cores are installed and uses them all.

## Pipeline objects
Each pipeline object acts as an sklearn estimator (i.e., `pipelineObject.fit(X, y)`), and the coolest part is that you can pass one to a `GridSearchCV` object, which then searches for the best parameters of every step in the pipe. Pretty cool, eh?

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

nb_parameters = {#'vect__ngram_range': [(1, 1), (1, 2),(1,3)],
                 #'tfidf__use_idf': (True, False),
                 'alpha': (1.0, 0.01, 0.001, 0.0001)}

svm_parameters = {#'vect__ngram_range': [(1, 1), (1, 2),(1,3)],
                 #'tfidf__use_idf': (True, False),
                 'C': (1, 10, 100, 1000),
                 'kernel': ('linear', 'rbf'),
                 'gamma': (0.0001, 0.001, 0.01, 0.1)}
xgb_parameters = {#'vect__ngram_range': [(1, 1), (1, 2),(1,3)],
                 #'tfidf__use_idf': (True, False),
                 'n_estimators':[1000,1500,2000],
                 'max_depth':[3],
                 'subsample':[0.5],
                 'learning_rate':[0.01, 0.02, 0.03, 0.04, 0.05],
                 'min_samples_leaf': [1],
                 'random_state': [3]}


from sklearn import ensemble
from sklearn.svm import SVC

def getXGBPipe():
    clf_xgb_pipe = Pipeline([('vect', CountVectorizer()), 
                              ('tfidf', TfidfTransformer()),
                              ('clf', ensemble.GradientBoostingClassifier())])
    return clf_xgb_pipe

def getNaiveBayesPipe():
    clf_nb_pipe = Pipeline([('vect', CountVectorizer()), 
                              ('tfidf', TfidfTransformer()),
                              ('clf', MultinomialNB())])
    return clf_nb_pipe


def getSVMPipe():
    clf_svm_pipe = Pipeline([('vect', CountVectorizer()), 
                              ('tfidf', TfidfTransformer()),
                              ('clf', SVC())])
    return clf_svm_pipe
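
The parameter dictionaries above use bare names such as `'alpha'`, which works when `GridSearchCV` wraps the classifier directly. When it wraps a whole pipeline, each name has to be prefixed with the step name and a double underscore, as in the commented-out `vect__ngram_range` entries. Here is a minimal sketch on toy sentences and labels of my own; `cv=2` and the small grid are only there to keep it fast.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
texts = [
    "the raven tapped at my chamber door",
    "a raven sat upon the chamber door",
    "the raven spoke at midnight",
    "midnight came and the raven tapped",
    "the monster fled across the ice",
    "the creature and the monster wept",
    "my creature fled the village",
    "the village feared the monster",
]
labels = ["A"] * 4 + ["B"] * 4

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

# "<step name>__<parameter>" routes each value to the right pipeline step
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1.0, 0.1, 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)  # raw text goes in; the pipe vectorizes inside each fold
print(search.best_params_)
```

Because the vectorizer is refit inside every fold, the held-out fold never contributes to the vocabulary or the idf weights.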




## Grid Search for Naive Bayes Parameters

## Predict on the optimal estimator

In [None]:
gs_clf_nb = GridSearchCV(MultinomialNB(), nb_parameters, n_jobs=-1, cv=10)
gs_clf_nb_fit = gs_clf_nb.fit(X_train_tfidf[:len(X_train)], y_train)
best_nb_clf_fit = gs_clf_nb_fit.best_estimator_
print("The best alpha: ", best_nb_clf_fit.alpha)
# The metrics need hard labels, so use predict rather than predict_proba
y_pred = best_nb_clf_fit.predict(X_test_tfidf[:len(X_test)])
generate_results(y_test, y_pred)


## Submission for the best NB

In [None]:
# Separate names so the hold-out split (X_train_tfidf, y_train, ...) stays intact
X_train_kaggle = get_bag_of_words(pd.concat([df_train["text"], df_test["text"]]))
y_train_kaggle = df_train["author"]
X_test_kaggle = get_bag_of_words(pd.concat([df_test["text"], df_train["text"]]))

clf = best_nb_clf_fit.fit(X_train_kaggle[:len(df_train)], y_train_kaggle)
y_proba = clf.predict_proba(X_test_kaggle[:len(df_test)])

# For the results, I need a table like the following
#
# id | P(author1) | P(author2) | P(author3)
#
results = pd.DataFrame(y_proba, columns=clf.classes_)
results.insert(0, "id", df_test["id"].values)
# index=False keeps the row index out of the submission file
results.to_csv("11102017_2_bestNB.csv", index=False)

## What if we could do more with our NB, like support it with other features?
Consider the following: each author, as a unique person, will have a unique language and vocabulary, and will choose their words in their own way.
If this postulate is true, then the least frequent words could be predictive in revealing who is who by their words alone.
Does K-means reveal any secrets about 3 unique groups? Let's try!
Do another 50 boolean columns, one per least frequent word (marking whether we saw it in the sentence or not), help identify these spooky authors? Let's try!

## Kmeans | k=3

Create the k-means cluster labels for the training and testing matrices.

In [None]:
from sklearn.cluster import KMeans

# Fit the clusters on the training rows only, then assign the test rows to those
# same clusters; two independent fits would number their clusters inconsistently
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train_tfidf[:len(X_train)])
kmeans_column_train = kmeans.labels_
kmeans_column_test = kmeans.predict(X_test_tfidf[:len(X_test)])

# Cast into column vectors that can be appended to the data matrices
kmeans_column_train = kmeans_column_train.reshape(-1, 1)
kmeans_column_test = kmeans_column_test.reshape(-1, 1)

print("Kmeans columns created!")

## Feature Engineering | K-means with K=3

Let's append the k-means result columns to our actual data matrices `X_train_tfidf` and `X_test_tfidf`, then calculate some metrics and see how our best estimator above (Naive Bayes) responds.

So it looks pretty promising; let's submit it again and see what we get.

In [None]:
# Append the columns
X_train_raw = X_train_tfidf[:len(X_train)].toarray()
X_test_raw = X_test_tfidf[:len(X_test)].toarray()

# np.asarray keeps everything a plain ndarray, which scikit-learn expects
X_train_new = np.hstack((X_train_raw, np.asarray(kmeans_column_train)))
X_test_new = np.hstack((X_test_raw, np.asarray(kmeans_column_test)))

clf = best_nb_clf_fit.fit(X_train_new, y_train)
# The metrics need hard labels, so use predict rather than predict_proba
y_pred = clf.predict(X_test_new)

generate_results(y_test, y_pred)

## Feature Engineering | PCA + Kmeans + Naive Bayes

In [None]:

# 1. PCA on X_train_tfidf and X_test_tfidf
from sklearn.decomposition import PCA
pca = PCA().fit(X_train_tfidf[:len(X_train)].toarray())
pca

# 2. Visualize to pick the top K components that maximize variance

# 3. Kmeans on our top K columns

# 4. with PCA + Kmeans run NB on it

# 5. Predict on NB and report out how well we did



## Submit NB with Kmeans Support Column

This did not actually improve the score! The clusters may not be finding the 3 authors as I suspected to begin with ... 

In [None]:
from sklearn.cluster import KMeans

# Separate names so the hold-out split (X_train_tfidf, y_train, ...) stays intact
X_train_kaggle = get_bag_of_words(pd.concat([df_train["text"], df_test["text"]]))
y_train_kaggle = df_train["author"]
X_test_kaggle = get_bag_of_words(pd.concat([df_test["text"], df_train["text"]]))
print("Data matrix created!")

# Fit the clusters on the training rows, then assign test rows to the same clusters
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train_kaggle[:len(df_train)])
kmeans_column_train = kmeans.labels_.reshape(-1, 1)
kmeans_column_test = kmeans.predict(X_test_kaggle[:len(df_test)]).reshape(-1, 1)

print("Kmeans columns created!")


X_train_raw = X_train_kaggle[:len(df_train)].toarray()
X_train_new = np.hstack((X_train_raw, kmeans_column_train))
# Slice by len(df_test), not len(X_test): these are the Kaggle test rows
X_test_raw = X_test_kaggle[:len(df_test)].toarray()
X_test_new = np.hstack((X_test_raw, kmeans_column_test))
print("Column appended!")

clf = best_nb_clf_fit.fit(X_train_new, y_train_kaggle)
print("Estimator built!")
y_proba = clf.predict_proba(X_test_new)
print("Predictions cast!")


# For the results, I need a table like the following
#
# id | P(author1) | P(author2) | P(author3)
#
results = pd.DataFrame(y_proba, columns=clf.classes_)
results.insert(0, "id", df_test["id"].values)
results.to_csv("11102017_2_kmeans.csv", index=False)
print("Written out!")

## Repeat for XGB

This takes quite a long time to complete, so I will hold off for now during my experimentation stage :) Note that "XGB" here means scikit-learn's `GradientBoostingClassifier`, not the XGBoost library.

In [None]:
gs_clf_xgb = GridSearchCV(ensemble.GradientBoostingClassifier(), xgb_parameters, n_jobs=-1, cv=3)
gs_clf_xgb_fit = gs_clf_xgb.fit(X_train_tfidf[:len(X_train)], y_train)
best_xgb_clf_fit = gs_clf_xgb_fit.best_estimator_
print("The best xgb: ", best_xgb_clf_fit)
# The metrics need hard labels, so use predict rather than predict_proba
y_pred = best_xgb_clf_fit.predict(X_test_tfidf[:len(X_test)])
generate_results(y_test, y_pred)

## Run a baseline XGB model

This run takes a while, but do it next time!!

In [None]:
# Default hyperparameters, no grid search
baseline_xgb_clf = ensemble.GradientBoostingClassifier().fit(X_train_tfidf[:len(X_train)].toarray(),
                                                             y_train)
print("Classifier fit!")
# Predict with the classifier fitted above; the metrics need hard labels
y_pred = baseline_xgb_clf.predict(X_test_tfidf[:len(X_test)].toarray())
print("Classifier predictions cast!")
generate_results(y_test, y_pred)

## Feature Engineering | Naive Bayes + XGB

Well, so far we have a pretty promising Naive Bayes model predicting quite well. K-means didn't help us out any, but maybe XGB will. My next considerations are the following:

<ul> 
<li> Consider how well XGB does independently. </li>
<li> Consider an average between the XGB result set and the NB result set, test to see how predictive it is.</li>

</ul>
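
The averaging idea in the second bullet can be sketched in a few lines of plain Python. The function name, the weight parameter, and the probability values below are all made up for illustration; the one real requirement is that both models list their classes in the same order (for sklearn classifiers, the order of `clf.classes_`).

```python
def average_probabilities(proba_a, proba_b, weight_a=0.5):
    """Blend two probability tables row by row and renormalize each row."""
    blended = []
    for row_a, row_b in zip(proba_a, proba_b):
        row = [weight_a * a + (1 - weight_a) * b for a, b in zip(row_a, row_b)]
        total = sum(row)
        blended.append([p / total for p in row])
    return blended

# Invented predict_proba-style outputs for two sentences and three authors
nb_proba = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
gb_proba = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]

blended = average_probabilities(nb_proba, gb_proba)
print(blended)
```

Whether the blend beats either model alone would still need to be checked on the hold-out split.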

## Feature Engineering | Least Frequent Words

Building on the assumption I made above that individuals think and rationalize uniquely (i.e., behavioral economics), their individual syntax should be unique as well, so two individuals are unlikely to choose the exact same phrase. The question I want to pose is: are words that appear rarely in the vocabulary better able to identify who wrote them? Some things to consider for this postulate:

<ul> 
<li> What is the count of unique words that fits an appropriate threshold? </li>
<li> Should they be boolean values, each appended as its own column (i.e., did this uncommon word appear in this sentence? For M new words, M new columns)?</li>

</ul>
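
One possible shape for these boolean columns, sketched in plain Python on toy sentences of my own: count in how many sentences each word appears, keep the M words with the lowest count, and add one true/false column per kept word. The function name and the value of M are hypothetical, and the right threshold is exactly the open question from the first bullet.

```python
from collections import Counter

# Toy sentences, invented for illustration only
sentences = [
    "the raven sat on the door",
    "the monster fled the village",
    "the door opened on the village",
]
tokenized = [s.lower().split() for s in sentences]

# Document frequency: number of sentences containing each word
doc_freq = Counter(w for tokens in tokenized for w in set(tokens))

def rare_word_columns(tokenized_docs, doc_freq, m):
    """Pick the m least frequent words and build one boolean column per word."""
    rare = sorted(doc_freq, key=lambda w: (doc_freq[w], w))[:m]
    rows = [[w in set(tokens) for w in rare] for tokens in tokenized_docs]
    return rare, rows

rare_words, rows = rare_word_columns(tokenized, doc_freq, m=3)
print(rare_words)
print(rows)
```

A word seen in only one training sentence cannot appear in many test sentences, so in practice a minimum count may be needed as well as a maximum.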

## Try a XGB Model to see performance gain
## Try other relevant models to see performance gain
## Try ensemble combinations to see performance gain
## Feature engineering to see performance gain

## To be continued ..