## Background

We take **19,759 passages of text** and attempt to classify them according to the **3 authors** that wrote them. The goal is to minimise **log loss**.
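To make the metric concrete, here is a minimal sketch of multi-class log loss in plain Python. The function name, the clipping value `eps` and the toy predictions are assumptions for illustration, not the competition's exact scoring code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability given to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so a confident wrong answer gives a large but finite penalty
        p = min(max(row[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Three passages, true authors encoded as 0, 1, 2 (hypothetical values)
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(multiclass_log_loss(y_true, confident))  # about 0.105
print(multiclass_log_loss(y_true, uniform))    # about 1.099, i.e. ln(3)
```

Guessing all three authors with equal probability scores ln(3), so any useful model needs to come in below roughly 1.1.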

In the real world, accuracy is important, but there are other factors in play too, namely:
* Speed
* Simplicity
* Transparency 
* Resources
* Ethics

Our goal is to quickly produce a model in a few lines of code that runs in a few seconds and achieves a top-half leaderboard position (at time of writing).

Fortunately, [abhishek](https://www.kaggle.com/abhishek) has created a [wonderful kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook) that goes through pretty much every model we would want to consider. This really is a fantastic resource and thank you so much for sharing it. Let's take a look at the performance of abhishek's models.

## Results

![](https://stanwarkaggle.files.wordpress.com/2017/11/model_comparison.jpg)

I'm using an [MSI Phantom Pro 6RE](https://www.msi.com/Laptop/GS43VR-6RE-Phantom-Pro.html) with a [GTX1060](https://www.scan.co.uk/shop/computer-hardware/virtual-reality/nvidia-geforce-gtx-1060-6gb) graphics card. It's worth noting that abhishek intentionally didn't spend long on parameter tuning. There are some obvious caveats that I'll mention in a minute, but let's get stuck into the results first.

The blue bars show the log loss scores (which are also written in white at the end of the bars). The orange bars show how long the models took to run. The parameters are as per abhishek's tutorial (no tuning), but I've included a list at the base of this kernel too.

There are a lot of models here. To make life a bit easier, let's focus on the top 5. These have noticeably lower log loss scores than all the other models.

Taking the top 5 in reverse order: 

5th - **Bidirectional LSTM with Glove**. This takes over 7,000 seconds to run! (nearly two hours) <br>
4th - **LSTM with Glove**. This comes in at around 2,800 seconds (roughly 47 minutes) <br>
3rd - **Count Vectorizer with Naive Bayes**. This simple model takes a second to run! <br>
2nd - **GRU with Glove**. This is the slowest model of the lot, coming in at over 8,300 seconds (well over two hours) <br>
1st - **Ensemble**. This achieved a small improvement in log loss. It used some of the simpler models and therefore took a few seconds to run. <br>

There have been some interesting discussions on some unpleasant [bias in word embeddings](http://ruder.io/word-embeddings-2017/index.html#bias). The consequences are not much of an issue here, but in real-world use cases this is something to consider.

I love neural networks as much as the next person, and while they can work a treat on image data, they aren't really doing that much more than Naive Bayes here. Therefore, taking accuracy, speed, simplicity, bias, etc. into consideration, I like the look of the third-placed model (sending word counts into a Naive Bayes).

Let's have a go at running that model. Fortunately abhishek has done that for us, so I'll mostly just be grabbing bits of his code.

## Caveats

The results here are not necessarily fair or replicable elsewhere. They depend heavily on factors such as:
* Nature of data
* Parameter choices
* Features

Nonetheless, it is still interesting to make a crude comparison between such a wide range of methods to spot one that meets our criteria well.

### Preparation

First we import the packages.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB

Next we read in the data.

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
sample = pd.read_csv('../input/sample_submission.csv')
train.head(3)

And we want the labels in the right format. They are placed into an array taking values of 0, 1 or 2 (indicating the 3 authors).

In [None]:
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author.values)
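
As a rough illustration of what the label encoding step produces, the sketch below mimics it in plain Python: the distinct labels are sorted and numbered from 0. The sample `authors` list and the variable names are made up for the example.

```python
# Hypothetical sample of the author column
authors = ["EAP", "HPL", "EAP", "MWS", "HPL"]

# Sort the distinct labels and number them from 0, as LabelEncoder does
classes = sorted(set(authors))
to_int = {name: i for i, name in enumerate(classes)}
y_demo = [to_int[a] for a in authors]

print(classes)  # ['EAP', 'HPL', 'MWS']
print(y_demo)   # [0, 1, 0, 2, 1]
```

The sorted order matters later on, because the columns of the predicted probabilities follow the same order.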

## Preparation

We create a matrix of counts. Each row represents a passage of text. Each column holds the count of a word, bigram or trigram found in the text (stopwords are removed first).

In [None]:
ctv = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), stop_words='english')

A reminder that this is what a count vector looks like. [Credit goes here](https://medium.com/deep-math-machine-learning-ai/chapter-9-1-nlp-word-vectors-d51bff9628c1)
![](https://cdn-images-1.medium.com/max/800/1*YXy_Txtmtttw85Vv05JdRQ.jpeg)
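
To see where the columns of the count matrix come from, here is a simplified plain-Python sketch of counting word n-grams for a single passage. It is not scikit-learn's implementation: the stop word list is a tiny made-up one, `\w+` stands in for the equivalent `\w{1,}`, and the sentence is invented for the example.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list; scikit-learn's 'english' list is much longer
STOP_WORDS = {"the", "a", "of", "in", "and", "was"}

def ngram_counts(text, n_min=1, n_max=3):
    """Count word n-grams after lowercasing and dropping stop words."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("The raven sat in the dark, dark room")
print(counts["dark"])            # 2
print(counts["dark dark"])       # 1
print(counts["raven sat dark"])  # 1
```

Because stop words are dropped before the n-grams are built, a trigram such as "raven sat dark" can span words that were not adjacent in the original sentence.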

## Model

We will fit a Naive Bayes model on the counts. We already know this is effective, so we will fit on all of the training data, predict on the test data and submit to the leaderboard. <br> <br>
Let's not do this blind. We want to know more about this simple but effective method. Let's start by heading to the [Naive Bayes Wikipedia page](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

### Naive Bayes Wikipedia

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

Naive Bayes has been studied extensively since the 1950s. It was introduced under a different name into the text retrieval community in the early 1960s,[1]:488 and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods including support vector machines.[2] It also finds application in automatic medical diagnosis.[3]

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression,[1]:718 which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.

In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes and independence Bayes.[4] All these names reference the use of Bayes' theorem in the classifier's decision rule, but naive Bayes is not (necessarily) a Bayesian method.[1][4]

Okay, so we've got a rough idea, but let's try and get our heads around this a bit better. Who do we go to when we want to learn about NLP? It has to be the Lionel Messi of NLP - Dan Jurafsky!
<img src="http://ichef.bbci.co.uk/onesport/cps/480/mcs/media/images/65764000/jpg/_65764217_messi-getty.jpg" alt="Drawing" style="width: 300px;"/>
<img src="https://web.stanford.edu/~jurafsky/danfall13.jpg" alt="Drawing" style="width: 200px;"/>

And how will we learn most effectively? A worked example... <br>
<img src="https://i.ytimg.com/vi/pc36aYTP44o/hqdefault.jpg" alt="Drawing" style="width: 400px;"/>
<br> Head over to [YouTube](https://www.youtube.com/watch?v=pc36aYTP44o) to watch this one.
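
For readers who prefer code to video, below is a minimal sketch of multinomial Naive Bayes with add-one smoothing in plain Python. The toy corpus, class names `c` and `j`, and function names are assumptions for illustration (not necessarily the example used in the video), and this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns log priors, log likelihoods, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Smoothed denominator: words seen in class c plus alpha per vocabulary entry
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_proba(tokens, log_prior, log_lik, vocab):
    """Posterior over classes; words outside the vocabulary are ignored."""
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    norm = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / norm for c, s in scores.items()}

train_docs = [
    ("chinese beijing chinese".split(), "c"),
    ("chinese chinese shanghai".split(), "c"),
    ("chinese macao".split(), "c"),
    ("tokyo japan chinese".split(), "j"),
]
model = train_nb(train_docs)
probs = predict_proba("chinese chinese chinese tokyo japan".split(), *model)
print(probs)  # roughly 0.69 for class 'c' and 0.31 for class 'j'
```

Training is just counting, which is why the model fits so quickly. Here `alpha=1.0` plays the same role as the `alpha` argument passed to `MultinomialNB`: it stops a word that never appeared with a class from forcing that class's probability to zero.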

We are now ready to run the model. It takes about a second and is just a few lines of code.

In [None]:
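# Learn the vocabulary from the training text and build its count matrix
xtrain_ctv_all = ctv.fit_transform(train.text.values)
# Reuse the same vocabulary for the test text
xtest_ctv_all = ctv.transform(test.text.values)
# alpha=1.0 is add-one (Laplace) smoothing
clf = MultinomialNB(alpha=1.0)
clf.fit(xtrain_ctv_all, y)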

## Submit

In [None]:
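# Class probabilities come out in the order of lbl_enc.classes_ (EAP, HPL, MWS)
sub = pd.DataFrame(clf.predict_proba(xtest_ctv_all), columns=["EAP", "HPL", "MWS"])
sub["id"] = test.id
# Move the id column to the front to match the sample submission
cols = sub.columns.tolist()
sub = sub[cols[-1:] + cols[:-1]]
# index=False stops pandas writing the row index as an extra column
sub.to_csv("simple_spooky_sub.csv", index=False)
sub.head()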

![](https://i.imgur.com/XJyemeI.jpg)

## Conclusion

* A simple Naive Bayes model can perform almost as well as complex and time-consuming neural-network-based methods.

* Getting to the top half of the leaderboard with a few lines of code within a few seconds is achievable.

* To improve the score with minimal extra resources, sensible next steps would be adding simple features, tuning parameters and trying simple ensemble methods.
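
As one illustration of a simple ensemble, the sketch below averages the class probabilities from two models and renormalises each row. This is only one possible approach and not necessarily the ensemble used in abhishek's kernel; the function name and the prediction values are made up.

```python
def average_probs(*model_probs):
    """Average per-row class probabilities from several models and renormalise."""
    n_models = len(model_probs)
    blended = []
    for rows in zip(*model_probs):
        # rows holds one prediction row per model for the same passage
        avg = [sum(vals) / n_models for vals in zip(*rows)]
        total = sum(avg)
        blended.append([p / total for p in avg])
    return blended

# Hypothetical predictions for two passages from two models (columns: EAP, HPL, MWS)
nb_probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
lr_probs = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]

blended = average_probs(nb_probs, lr_probs)
print(blended)  # roughly [[0.7, 0.2, 0.1], [0.3, 0.45, 0.25]]
```

Averaging tends to soften overconfident predictions, which can help under log loss because confident mistakes are penalised heavily.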

## Challenge!!!

Can you produce a better score with fewer than 20 lines of code that run in under 10 seconds? Please share if you can. It would be great to see.

## Resources

[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

[Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)

[GloVe embeddings](https://nlp.stanford.edu/projects/glove/)

## Parameters


**TF-IDF Logistic Regression**	min doc freq=3, ngram range 1-3, stopwords=nltk English <br> <br>
**Count Vectorizer Logistic Regression**	min doc freq=3, ngram range 1-3, stopwords=nltk English <br> <br>
**TF-IDF Vectorizer Multinomial Naive Bayes** Default <br> <br>
**Count Vectorizer Multinomial Naive Bayes** Default <br> <br>
**SVM on SVD of TFIDF**	Default (Probability Estimates enabled) <br> <br>
**TFIDF Boosted Trees**	200 boosted trees, max depth=7, subsample=0.8, colsample by tree=0.8, learning rate=0.1 <br> <br>
**Boosted Trees on Counts**	200 boosted trees, max depth=7, subsample=0.8, colsample by tree=0.8, learning rate=0.1 <br> <br>
**Boosted Trees on TFIDF (with SVD)**	200 boosted trees, max depth=7, subsample=0.8, colsample by tree=0.8, learning rate=0.1 <br> <br>
**Boosted Trees on Counts (with SVD)**	200 boosted trees, max depth=7, subsample=0.8, colsample by tree=0.8, learning rate=0.1 <br> <br>
**Default Boosted Trees on Glove Embeddings** Defaults <br> <br>
**Boosted Trees on Glove Embeddings**	200 boosted trees, max depth=7, subsample=0.8, colsample by tree=0.8, learning rate=0.1 <br> <br>
**3 layer sequential NN on Glove (5 epochs)**	300 dense layer - 0.2 dropout - 300 dense layer - 0.3 dropout - batch normalisation - softmax to 3 <br> <br>
**LSTM with Glove**	300 embedding - 0.3 spatial dropout - 300 LSTM (with 0.3 dropout and 0.3 recurrent dropout) - 1024 dense - 0.8 dropout - 1024 dense - 0.8 dropout - softmax to 3 <br> <br>
**Bidirectional LSTM with Glove**	300 embedding - 0.3 spatial dropout - 300 bidirectional LSTM (with 0.3 dropout and 0.3 recurrent dropout) - 1024 dense - 0.8 dropout - 1024 dense - 0.8 dropout - softmax to 3 <br> <br>
**GRU with Glove**	300 embedding - 0.3 spatial dropout - 300 GRU (with 0.3 dropout and 0.3 recurrent dropout) - 300 GRU (with 0.3 dropout and 0.3 recurrent dropout) - 1024 dense - 0.8 dropout - 1024 dense - 0.8 dropout - softmax to 3 <br>
