Last updated: 31.10.2017
# Spooky NLP problem
Recently I started looking into ML algorithms for doing NLP tasks, mostly information retrieval, topic modelling and content classification. Therefore, I find this competition very interesting. Especially the objective - estimate the author from a snippet of text - is quite appealing because this focuses more on _style_ (choice of words, sentence structure etc) than on the actual content. Something quite different from the problems I am usually working on.
Ok, enough talking. Let's get started. In this kernel I will touch on the following topics (with a varying level of detail):
- label encoding
- text representation
- sentence and word tokenization
- stopword removal
- stemming and lemmatizing
- cross validation

While doing this, I also try to give a brief introduction to the following python packages. This might be interesting for people who want to get started on doing NLP in python.
- nltk
- textblob
- pandas
- seaborn

These lists are evolving and will grow with future updates of this notebook.

Enjoy reading and please comment! Feedback is very much appreciated!

Contents
1. [Explorative Data Analysis](#Explorative-Data-Analysis)
    1. [Required packages](#Required-packages)
    1. [The training data](#The-training-data)
    1. [Target label encoding](#Target-label-encoding)
    1. [Training set balance](#Training-set-balance)
    1. [Sentences](#Sentences)
    1. [Vocabulary](#Vocabulary)
    1. [Special characters](#Special-characters)
1. [Reference model](#Reference-model)

## Explorative Data Analysis

### Required packages
Import the most commonly used python packages:
- [matplotlib.pyplot](http://matplotlib.org/api/pyplot_api.html) for basic plotting and figure styling
- [numpy](http://www.numpy.org/)  is a powerful library for handling N-dimensional arrays, performing linear algebra operations, vectorizing transformations, random number generation and more.  I am going to use only very basic functionality of this powerful package in this kernel.
- [pandas](http://pandas.pydata.org/) is the de-facto standard for handling datasets in the python data science community. It is so feature-rich that it is hard to describe in a sentence. If I had to, I would probably go with: _"A python library for handling tabular/labelled data supporting the usual SQL-like operations (grouping, aggregating, transforming, NA handling...), add some fancy time series functionality, plus some easy-but-still-appealing visualisations and a superb documentation."_
- [seaborn](http://seaborn.pydata.org/) is a great visualisation library built on top of `matplotlib` with a concise API geared towards statistical plots.
- [sklearn](http://scikit-learn.org/stable/index.html) I can't write a short intro without mentioning `scikit-learn`. In my opinion, this is the most commonly used python package for doing ML. It comes with heaps of standard supervised classification and regression algorithms as well as unsupervised clustering algorithms. In addition, it provides very neat preprocessing functionality and has many evaluation metrics defined. I will make heavy use of this package in this kernel. However, for specialized tasks you often find better suited/more performant libraries (e.g. xgboost or keras to mention two other common suspects on kaggle).
- [nltk](http://www.nltk.org/) I would say _the_ standard library for doing Natural Language Processing in education and research. It is very modular and comes with a lot of algorithms and configuration options which leads to a rather steep learning curve.
- [textblob](https://textblob.readthedocs.io/en/dev/) is built on-top of `nltk` and provides a more easily-accessible interface. If you don't need highly optimized performance, this might be your best starting point.

For a more detailed discussion of python NLP packages, I recommend reading this 
[article](https://elitedatascience.com/python-nlp-libraries).


In [None]:
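import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn as sk
# import the submodules explicitly so that e.g. sk.preprocessing is guaranteed to be available
import sklearn.feature_extraction.text
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing
import nltk
from textblob import TextBlob as TB, Word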

### The training data
Let's load the training data:

In [None]:
df = pd.read_csv("../input/train.csv")
print('loaded %d samples with %d features' %(df.shape))

We can inspect the head and tail of our training set using `pandas.DataFrame`'s methods [head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) and [tail()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html).

In [None]:
print(df.head())
print(df.tail())

As expected, we see that there is an ID which we need for the submission, a text snippet and the author.

### Target label encoding
The author is the target label we want to predict in this competition. However, not all algorithms handle non-numeric data well. It is easy to construct a one-to-one mapping for author -> code and then use the numeric code as label. Because this task is quite common, there is a convenience class in the `sklearn` package doing exactly this: [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Because we are going to need it later anyway, we can add a new column with these numeric labels.

In [None]:
df['label'] = sk.preprocessing.LabelEncoder().fit_transform(df.author)
df.head()

We can quickly check that each author is mapped to exactly one label by grouping by the `author` column and counting how often different labels appear.

In [None]:
df.groupby('author').label.value_counts()

As expected we see exactly one label per author. Selecting on `label == 0` is equal to selecting on `author == 'EAP'` and similarly for the other authors.
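
To see what the encoder does in isolation, here is a minimal sketch on a hypothetical list of author acronyms (the list `authors` is made up for illustration and is not read from the training data). `LabelEncoder` stores the distinct labels in sorted order and uses the position in that list as the numeric code.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical toy labels, only for illustration
authors = ['MWS', 'EAP', 'HPL', 'EAP']

encoder = LabelEncoder()
codes = encoder.fit_transform(authors)

# the classes are stored in sorted order; the code is the position in that list
print(list(encoder.classes_))
print(list(codes))

# codes can be mapped back to the original labels
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Keeping the fitted encoder around (instead of discarding it after `fit_transform`) is handy if you want to translate predictions back into author acronyms later.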

### Training set balance
Let's see how many samples we have per author. In general, the performance of supervised learning algorithms can be impacted negatively if you have a very imbalanced distribution of target labels.

To check the balance of the training set we group the samples by author and count how many texts we find in each group. Then we plot this as a bar chart. Thanks to the slim API of `pandas` this is a one-liner:

In [None]:
df.groupby('author').text.count().plot.bar(title = 'training set balance');

Things don't look too bad. We have more text snippets for Edgar while almost equally many for HP and Mary. The difference is not dramatic, but we will keep this in mind for later.

Another way of looking at this is: What does having a balanced training set mean? Having the same number of text snippets? Text snippets may vary in length, so the above comparison might not be the most useful. Two short text snippets may carry less information than one really long one. Instead of looking at the raw number of training samples per author, one could compare different metrics. The most obvious to me are:
- total number of sentences
- total number of words
- total number of characters

The last one is easy to get: just take the string length of each text (`len(text)`), while for the other two one needs to split the texts into sentences/words. At first glance, these seem to be trivial tasks (e.g. split on punctuation or white spaces respectively). However, it might be difficult to implement all the corner cases (e.g. punctuation inside quotations, apostrophes etc). Luckily we can employ the tokenizers of the `nltk` package which handle all these cases:
- `nltk.sent_tokenize(text)` will split the text into a list of sentences. Optionally, you can pass a `language` option to use special punctuation characters (e.g. Spanish). The default is English.
- `nltk.word_tokenize(sentence)` will split the sentence into a list of words.

Feel free to check the documentation of `nltk.tokenize` for other tokenization options.
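
To illustrate why the naive approach falls short, here is a small sketch using only the standard library. The sample sentence and the regular expression are made up for illustration; they are not what `nltk` does internally.

```python
import re

# hypothetical example text with an abbreviation and a quoted question
text = 'Mr. Usher arrived late. "Why now?" he asked.'

# naive sentence splitting: break after '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r'(?<=[.!?])\s+', text)
# naive word splitting: break on whitespace only
naive_words = text.split()

print(naive_sentences)
print(naive_words)
```

The abbreviation `Mr.` is cut off as a "sentence" of its own, and punctuation and quotation marks stay glued to the neighbouring words. These are the kind of corner cases a dedicated tokenizer is meant to take care of.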

In [None]:
# calculate different measures of the quantity of information
df['n_sentences'] = df.text.transform(lambda x: len(nltk.sent_tokenize(x)))
df['n_words'] = df.text.transform(lambda x: len(nltk.word_tokenize(x)))
df['text_len'] = df.text.transform(lambda x: len(x))

Having calculated the numbers above per text snippet, we need to group by author again and `sum()` them up (instead of counting as we did previously). Then we compare the _amount of information_ per author again using bar plots. The key here is to pass the `subplots = True` option to the plotting function. Otherwise, all bars are drawn in one figure, which is not optimal in this case due to the very different y-scales of the individual quantities. For better readability, I split the one-line call over several lines and added some inline comments.

In [None]:
# first group by author
(df.groupby('author')
 # select the columns we are interested in (note the list inside the []-operator)
 [['n_sentences','n_words','text_len']]
 # we calculate the sum for each column within each author group
 .sum()
 # finally, plot as bar chart in different figures
 .plot.bar(subplots = True, layout = (1,3), figsize = (18,6)));

Starting from the left, the total number of sentences per author gives roughly the same picture as already obtained from the number of text snippets (looking at the y-scale we can guess that most text snippets actually correspond to one sentence -> we can check this in a minute). Looking at the total number of words, the difference is still clearly visible but already reduced (EAP has about 30% more words than HPL). Considering the raw text length, the difference is even less pronounced (EAP has about 20% more characters in total than HPL).

Even though the initial picture did not change significantly, I think it was still worth checking. In the end, we are still doing exploration ;-)

### Sentences
Ok, let's quickly check the assumption that the text snippets are made out of individual sentences. We already calculated the number of sentences for each text snippet. So let's just look at the value counts of `n_sentences`.

In [None]:
print(df.n_sentences.value_counts())
df.n_sentences.plot.hist(log = True)
print('out of %d text snippets %d contain more than one sentence' %
      (len(df), (df.n_sentences > 1).sum()))

Our assumption seems to be true for most of the samples. Nevertheless, there are 419 texts (about 2%) which can be split into multiple sentences, up to nine. Let's have a look at this peculiar text snippet.

In [None]:
print(df[df.n_sentences == 9].text.iloc[0])

Some serious bargaining happening here :-)

### Vocabulary
If you are still with me, we should dare the next step and have a look at the vocabulary. The objective sounds simple: _"How often does each author use which word?"_. Yet, we need to do some gymnastics to get this information nicely displayed. But I will walk you through this daunting section step by step.

1. Determine the vocabulary (all words appearing in any of the texts).
1. Count the number of occurrences of each word (= _term_ in `sklearn` speak) for each document (yields the term-document matrix).
1. Sum the term counts over all documents belonging to one author.
1. Get the most common words per author and plot them.

Steps 1. and 2. can be done using the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `sklearn.feature_extraction.text`. This class has many options for steering the text preprocessing, tokenization, limiting the vocabulary etc. Here we can use the default settings which will produce word counts per document in a way you would naively expect. Step 1. (determining the vocabulary) is done by calling the `fit` method, while calculating the term counts per document can be done by calling `transform` on the already fitted `CountVectorizer`. If the input to both methods is the same, you can combine those two steps and call `fit_transform` directly.

In [None]:
# initialize count vectorizer
cv = sk.feature_extraction.text.CountVectorizer()
# learn vocabulary and calculate term-document frequencies in one go
X = cv.fit_transform(df.text)
print('learned a vocabulary of size %d' % len(cv.vocabulary_))
print('first 5 terms in vocabulary (ordered alphabetically):')
# note: scikit-learn >= 1.0 offers get_feature_names_out() as the replacement for this method
print(cv.get_feature_names()[:5])
print('shape of term-document matrix is [n_samples, vocabulary_size]: ', X.shape)
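
The split into `fit` and `transform` is easier to see on a tiny example. The following sketch uses a hypothetical two-document corpus (not the competition data) and reads the column order from the `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical mini corpus, only for illustration
train_docs = ['the raven sat', 'the raven spoke and the raven left']
new_docs = ['the cat sat']

toy_cv = CountVectorizer()
toy_cv.fit(train_docs)                      # step 1: learn the vocabulary
toy_counts = toy_cv.transform(train_docs)   # step 2: count terms per document

# vocabulary_ maps each term to its column index (columns are sorted alphabetically)
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(terms)
print(toy_counts.toarray())

# words that were not seen during fit ('cat') are ignored by transform
print(toy_cv.transform(new_docs).toarray())
```

Note that the vocabulary is fixed after `fit`: applying `transform` to new texts always yields the same number of columns, which is what a downstream classifier needs.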

As a next step, we want to accumulate the word counts for all documents belonging to one author. That is, we want to sum over all rows (in `numpy` speak along `axis = 0`) but only include rows for a given author. Here we are going to use a trick: We will binarize the author labels in a one-vs-all fashion using [label_binarize](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.label_binarize.html) from `sklearn.preprocessing`. As input we give the list of author labels and as a second argument the set of possible class labels (i.e. the three different author acronyms).

In [None]:
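# pass the class labels by keyword (newer scikit-learn versions no longer accept them positionally)
Y = sk.preprocessing.label_binarize(df.author, classes = df.author.unique())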
print('shape of Y: ',Y.shape)
print('The first 5 rows of Y:')
print(Y[:5,:])
print('class labels: ', df.author.unique())

You can see that the resulting matrix has shape `[n_samples, n_classes]` where `n_classes` is the number of different class labels (3 different authors in our case). Every row contains exactly one 1 and otherwise only 0. A 1 in the first column means that this sample belongs to author `EAP`, a 1 in the second column corresponds to `HPL` and in the third column to `MWS`. If you remember a bit of linear algebra, you will figure out that summing the word counts per author can now be written as a simple matrix product `Y.T * X`. Because `X` is a `scipy` sparse matrix, `*` denotes matrix multiplication here and not element-wise multiplication.

In [None]:
counts = Y.T * X
print('shape of counts is [n_classes, vocabulary_size]: ', counts.shape)
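
If the matrix trick looks like magic, the following sketch checks it on hypothetical toy numbers (4 documents, 3 terms, 2 authors; all values are made up). With dense `numpy` arrays `*` would multiply element-wise, so the sketch uses the `@` operator for the matrix product.

```python
import numpy as np

# hypothetical toy data: 4 documents, 3 vocabulary terms, 2 authors
X_toy = np.array([[1, 0, 2],
                  [0, 1, 1],
                  [3, 0, 0],
                  [0, 2, 1]])
# one-hot author indicator: documents 0 and 2 belong to author A, 1 and 3 to author B
Y_toy = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [0, 1]])

# matrix product: entry [a, t] sums the counts of term t over all documents of author a
counts_toy = Y_toy.T @ X_toy
print(counts_toy)

# the same result obtained by explicitly selecting the rows of each author
explicit = np.vstack([X_toy[Y_toy[:, a] == 1].sum(axis=0) for a in range(2)])
print(explicit)
```

Each row of `Y_toy.T` acts as a mask that switches the documents of one author on and all others off, which is why the product equals the per-author sums.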

Now we have the full information about how often each word was used by each author. We put this back into a `pandas.DataFrame` to benefit from their great sorting and visualisation functionality. 

In [None]:
count_df = pd.DataFrame(data = counts.T,
                        columns = df.author.unique(),
                        index = cv.get_feature_names())
count_df.head(10)

We can also make a bar plot showing the counts of the most frequent words easily using this dataframe.

In [None]:
_, (ax1, ax2, ax3) = plt.subplots(ncols = 3, figsize = (16,6))
topn = 10
# take the topn most frequent words and reverse them so that the most frequent one is drawn on top
count_df.EAP.sort_values(ascending = False).head(topn).iloc[::-1].plot.barh(title = 'EAP', ax = ax1)
count_df.HPL.sort_values(ascending = False).head(topn).iloc[::-1].plot.barh(title = 'HPL', ax = ax2)
count_df.MWS.sort_values(ascending = False).head(topn).iloc[::-1].plot.barh(title = 'MWS', ax = ax3);

Well, this looks discouraging. None of the words appearing in the plots above carries much information. These are so-called stopwords. Luckily, `nltk` comes with predefined stopword lists. One must be careful using those since the definition of a stopword depends heavily on the specific task. For now we go with the default list and see what we get.

In [None]:
english_sw = nltk.corpus.stopwords.words('english')
print('loaded %d stopwords for english' % len(english_sw))

In [None]:
# drop all rows whose word is contained in the stopword list
count_df_no_sw = count_df[~count_df.index.isin(english_sw)]
_, (ax1, ax2, ax3) = plt.subplots(ncols = 3, figsize = (16,6))
topn = 10
count_df_no_sw.EAP.sort_values(ascending = False).head(topn).iloc[::-1].plot.barh(title = 'EAP', ax = ax1)
count_df_no_sw.HPL.sort_values(ascending = False).head(topn).iloc[::-1].plot.barh(title = 'HPL', ax = ax2)
count_df_no_sw.MWS.sort_values(ascending = False).head(topn).iloc[::-1].plot.barh(title = 'MWS', ax = ax3);
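
The filtering step itself is just boolean indexing on the index of the dataframe. Here is a minimal sketch with hypothetical counts and a hypothetical, tiny stopword list (neither is taken from the data or from `nltk`).

```python
import pandas as pd

# hypothetical word counts and a hypothetical, tiny stopword list
toy_count_df = pd.DataFrame({'EAP': [120, 95, 7], 'HPL': [110, 80, 9]},
                            index=['the', 'of', 'house'])
toy_stopwords = ['the', 'of', 'and']

# isin() flags the rows whose word is a stopword, ~ inverts the mask
toy_no_sw = toy_count_df[~toy_count_df.index.isin(toy_stopwords)]
print(toy_no_sw)
```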

### Special characters
So far we have been looking at words and their occurrences. Maybe the authors also have a preference for certain special characters? We can do the same analysis as above on individual characters instead of words by passing the option `analyzer = 'char'` to the `CountVectorizer`.

In [None]:
cv2 = sk.feature_extraction.text.CountVectorizer(analyzer = 'char')
X2 = cv2.fit_transform(df.text)
char_counts = pd.DataFrame(data = (X2.T * Y),
                           columns = df.author.unique(),
                           index = cv2.get_feature_names())

Now we can inspect how often an author is using which character.

In [None]:
char_counts

Apart from the fact that EAP and HPL are using some French and Spanish special characters, we can't learn much from the table. We should normalize it by the total number of characters per author (which will give us an estimate for the conditional probability $P(character|author)$).

In [None]:
char_counts /= char_counts.sum()
char_counts.head(10)
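
Why does a plain division do the right thing here? `DataFrame.sum()` returns one total per column (= author) and the division is aligned on the column labels. A minimal sketch with hypothetical counts:

```python
import pandas as pd

# hypothetical character counts for two authors
toy_chars = pd.DataFrame({'EAP': [6, 3, 1], 'HPL': [2, 2, 4]},
                         index=['a', 'b', ';'])

# sum() yields the total per column; the division is applied column by column
toy_probs = toy_chars / toy_chars.sum()
print(toy_probs)

# every column now sums to one, as expected for a probability distribution
print(toy_probs.sum())
```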

In order to plot this data easily with `seaborn` we need to convert it to _long-format_.

In [None]:
char_counts = (char_counts.stack()
               .reset_index()
               .rename(columns = {'level_0': 'char',
                                  'level_1': 'author',
                                  0: 'probability'
                                 }))

In [None]:
char_counts.head(10)
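
The conversion from wide to long format can be followed more easily on a small table. The sketch below uses hypothetical probabilities; `wide` and `long_df` are names chosen for this illustration only.

```python
import pandas as pd

# hypothetical wide table: one row per character, one column per author
wide = pd.DataFrame({'EAP': [0.6, 0.4], 'HPL': [0.3, 0.7]}, index=['a', 'b'])

# stack() moves the column labels into a second index level,
# reset_index() turns both index levels into ordinary columns
long_df = (wide.stack()
           .reset_index()
           .rename(columns={'level_0': 'char', 'level_1': 'author', 0: 'probability'}))
print(long_df)
```

In long format every row is one observation (character, author, probability), which is the layout `seaborn` expects when you assign columns to `x`, `y` and `col`.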

After this modification we can easily plot the conditional probabilities for each character given an author. For the sake of simplicity I restricted the set of plotted characters to those which have a probability higher than 1 in 100,000 (which basically removes the characters with accents). The absolute values on the y-scale are not that important at the moment. Look for characters where the probabilities differ between the authors.

In [None]:
# note: seaborn >= 0.9 renames factorplot to catplot and the size argument to height
sns.factorplot(col = 'char', x = 'author', y = 'probability',
               data = char_counts[char_counts.probability > 1e-5], kind = 'bar', col_wrap = 4, size = 3, log = True);

In [None]:
def text2POS(text):
    '''Replaces every word of the text by its part-of-speech tag (space separated).'''
    tags = TB(text).tags
    return ' '.join([t for _,t in tags])

In [None]:
df['tags'] = df.text.transform(text2POS)

In [None]:
c = sk.feature_extraction.text.CountVectorizer(tokenizer = lambda x: x.split())
X3 = c.fit_transform(df.tags)
a = pd.DataFrame(data = X3.T * Y, columns = df.author.unique(), index = c.get_feature_names())

In [None]:
X3.todense()[0,:]

In [None]:
df.tags.iloc[0]

In [None]:
b = a / a.sum()

In [None]:
b = b.stack().reset_index().rename(columns = {'level_0': 'POS', 'level_1': 'author', 0: 'probability'})

In [None]:
a.div(a.max(axis = 1), axis = 'index').min(axis = 1)

In [None]:
stop_tags = a.index[a.div(a.max(axis = 1), axis = 'index').min(axis = 1) > 0.9]
stop_tags

In [None]:
sns.factorplot(col = 'POS', x = 'author', y = 'probability',
               data = b, kind = 'bar', col_wrap = 4, size = 3, log = True);

While most of the letters show very similar probabilities for each of the three classes, there are some significant differences for some special characters like quotation marks, semi-colons, double-colons and question marks. That is something to keep in mind for the feature engineering and selection we are going to do later.

## Reference model
Before diving further into feature engineering, I would like to build a simple model which can serve as a reference to judge further improvements. A very simple yet powerful approach to text classification is the [_Bag-of-words_](https://en.wikipedia.org/wiki/Bag-of-words_model) representation in conjunction with the [naive Bayes model](https://en.wikipedia.org/wiki/Naive_Bayes_classifier). The bag of words model represents a text by a list of words (or tokens) together with their frequency within the text (= absolute counts). It does not store any information about the grammar or the initial word order.
The naive Bayes model assumes that all input features are independent and that one can therefore write the posterior probability distribution as a product of individual conditional probability densities divided by some normalization factor. Despite the crude assumptions (words in texts are usually not independent), this approach can lead to quite satisfying classification results. Due to its historical importance and its simplicity, we shall use it as the reference model.
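
Written out, the statement above reads as follows. The notation is introduced here for illustration only: $w_1, \dots, w_n$ denote the tokens of a text and $a$ runs over all authors.

$$
P(\text{author} \mid w_1, \dots, w_n) = \frac{P(\text{author}) \, \prod_{i=1}^{n} P(w_i \mid \text{author})}{\sum_{a} P(a) \, \prod_{i=1}^{n} P(w_i \mid a)}
$$

The denominator is the normalization factor; it is the same for every author and therefore does not change which author gets the highest probability.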

First we need to define the input features for our model. Here I am going to use the word counts and the character counts as discussed above, plus counts of lemmatized tokens (the `lemma` column created in the lemmatization cells further down), with the modification that I require some minimal threshold. Tokens which appear less often are not considered and just dropped. 

The `sklearn` package comes with a handy tool to _stack_ and process several features - so-called [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)s and [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)s. You can think of a `FeatureUnion` as concatenating feature matrices next to each other (along `axis = 1`) and a pipeline as an object which calls `fit_transform` on its input and passes the output to the next stage in the pipeline. The last stage might be an estimator in which case it is only fitted (and no transformation is applied). The whole pipeline can then be used with the same semantics which are provided by the last step (e.g. as classifier, as regressor or as transformer).

In [None]:
from sklearn.pipeline import make_pipeline, make_union
from sklearn.naive_bayes import MultinomialNB

reference = make_pipeline(
    make_union(
        make_pipeline(
            sk.preprocessing.FunctionTransformer(func = lambda df: df.text, validate = False),
            make_union(
                sk.feature_extraction.text.CountVectorizer(min_df = 5),
                sk.feature_extraction.text.CountVectorizer(analyzer = 'char', min_df = 10),
            )
        ),
        make_pipeline(
             # requires the 'lemma' column, which is created in the lemmatization cell further down
             sk.preprocessing.FunctionTransformer(func = lambda df: df.lemma, validate = False),
             sk.feature_extraction.text.CountVectorizer(tokenizer = lambda x: x.split(),
                                                        #stop_words = ['cc','nns'],
                                                        min_df = 5
                                                       )
        )
    ),
    MultinomialNB(fit_prior = False)
)
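
To get a feeling for what a `FeatureUnion` produces, the following sketch combines a word-level and a character-level vectorizer on a hypothetical two-document corpus and compares the shapes. It is a simplified illustration, not the reference model itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_union

# hypothetical mini corpus, only for illustration
docs = ['the raven sat', 'a cat sat down']

word_cv = CountVectorizer()                  # word counts
char_cv = CountVectorizer(analyzer='char')   # character counts
union = make_union(word_cv, char_cv)

n_word = word_cv.fit_transform(docs).shape[1]
n_char = char_cv.fit_transform(docs).shape[1]
combined = union.fit_transform(docs)

# one row per document; the columns of both vectorizers are placed side by side
print(n_word, n_char, combined.shape)
```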

Now we have our pipeline defined which can be used as a classifier. We will use [cross validation](https://en.wikipedia.org/wiki/Cross-validation) to train and evaluate it on random splits of the training sample.

Cross-validation is an important concept. If you fit your model too closely to your training data, you will start relying on specific features or artifacts which are only present in the training set. This is what people mean when they talk about _overfitting_: your model does not generalize well to unseen data. In order to detect this, you can train your model only on a subset of the available training set and evaluate it on the remaining samples. For a more detailed discussion on this topic, check [sklearn's user guide](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
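
How the splitting works can be seen in a small sketch with ten hypothetical samples and plain `KFold`. As far as I know, `sklearn` uses a stratified variant (which keeps the class proportions in each fold) when you pass an integer `cv` together with a classifier, but the principle is the same.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten hypothetical samples, identified by their index
samples = np.arange(10)

folds = list(KFold(n_splits=5).split(samples))
for train_idx, test_idx in folds:
    # every sample is used for evaluation in exactly one of the folds
    print('train:', train_idx, 'test:', test_idx)
```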

The evaluation metric is set to the logarithmic loss (which is used in the evaluation of this competition). In order to use this metric, we have to wrap it into a `scorer` object specifying that less is better and that we need probability predictions for each class instead of simple class predictions... I know ... technicalities. Ah wait, one more thing: the cross-validation framework of `sklearn` always tries to maximise the target function. That's why the returned scores are negative (a log-loss can't be negative). Just flip the sign and you get the actual scores.

In [None]:
# wrap the metric in a scorer function
# note: recent scikit-learn releases replace needs_proba with response_method = 'predict_proba'
score_func = sk.metrics.make_scorer(sk.metrics.log_loss,
                                    greater_is_better = False,
                                    needs_proba = True)

# run the K-fold cross-validation with K = 5
scores = sk.model_selection.cross_validate(reference,
                                           df,
                                           df.label,
                                           cv = 5,
                                           scoring = score_func,
                                           return_train_score = True)
# output the performance
train_scores = scores['train_score']
test_scores = scores['test_score']
print('log-loss score of your model = %.2f +/- %.2f (training: %.2f +/- %.2f)' % (-np.mean(test_scores), np.std(test_scores, ddof = 1),-np.mean(train_scores), np.std(train_scores, ddof = 1)))
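
If you wonder what the log-loss actually measures: it is minus the average logarithm of the probability the model assigned to the true class. The sketch below computes it by hand for three hypothetical predictions (all numbers are made up) and compares the result with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# hypothetical true labels and predicted class probabilities (3 samples, 3 classes)
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# pick the probability of the true class in each row, take the log, average, flip the sign
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))
from_sklearn = log_loss(y_true, proba, labels=[0, 1, 2])
print(manual, from_sklearn)
```

A perfect classifier (probability 1 for the true class) would reach a log-loss of 0, while confident wrong predictions are punished heavily.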

Congratulations, you have trained a text classifier. The performance is not great, but remember that we got this result almost out-of-the-box without much feature engineering and tuning.

For now I will take a break and continue this evening with feature engineering. Things I have in mind include
- imputing missing punctuation
- stemming and lemmatizing
- sentiment

So stay tuned and leave a comment if you liked it!

In [None]:
from nltk.corpus import wordnet

def _penn_to_wordnet(tag):
    '''Converts a Penn Treebank POS tag into a WordNet POS tag (None if there is no match).'''
    if tag in ("NN", "NNS", "NNP", "NNPS"):
        return wordnet.NOUN
    if tag in ("JJ", "JJR", "JJS"):
        return wordnet.ADJ
    if tag in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
        return wordnet.VERB
    if tag in ("RB", "RBR", "RBS"):
        return wordnet.ADV
    return None

def lemmatize(text):
    '''Lemmatizes every word of the text using its POS tag and joins the lemmas with spaces.'''
    b = TB(text)
    return ' '.join([Word(w).lemmatize(_penn_to_wordnet(t)) for w,t in b.tags])

In [None]:
df['lemma'] = df.text.transform(lemmatize)

In [None]:
c.fit_transform(df.lemma)
print(len(c.vocabulary_))

In [None]:
csw = sk.feature_extraction.text.CountVectorizer(vocabulary = english_sw)
X = csw.fit_transform(df.text)
sw_df = pd.DataFrame(data = X.T * Y, columns = df.author.unique(), index = csw.get_feature_names())

In [None]:
a = pd.DataFrame(data = X.todense(), columns = csw.get_feature_names(), index = df.index)

In [None]:
a['len'] = df.n_words

In [None]:
d = a.div(df.n_words, axis = 'index') * 100

In [None]:
d['author'] = df.author

In [None]:
sns.pairplot(data = d, vars = ['but', 'then', 'and', 'or'], hue = 'author');

In [None]:
d.columns

In [None]:
b = TB(df.text.iloc[2])
print(b)
b.noun_phrases

In [None]:
b

In [None]:
b.sentiment

In [None]:
df['polarity'] = df.text.transform(lambda x: TB(x).sentiment.polarity)

In [None]:
df['subjectivity'] = df.text.transform(lambda x: TB(x).sentiment.subjectivity)

In [None]:
sns.violinplot(x = 'author', y = 'subjectivity', data = df, inner = 'quartile')