# Python NLP Tutorial

The dataset I'll be using is from the Kaggle "Spooky Authors" competition -- given a writing sample, can we predict which author penned it? Our options are Edgar Allan Poe, H. P. Lovecraft, and Mary Shelley (labeled `EAP`, `HPL`, and `MWS` in the data).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import (cross_val_score, GridSearchCV,
                                     train_test_split, KFold)
kf = KFold(n_splits=4, shuffle=True)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [18]:
data = pd.read_csv('train.csv')

data = data.sample(frac=1.0) # shuffle
# Shuffling isn't strictly necessary here, since KFold(shuffle=True)
# already shuffles before splitting, but it's good to be cautious!
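
If you want the shuffle to come out the same on every run, one option is to pass `random_state` to `sample`. This is only a sketch: the tiny `demo` frame and the seed value 42 are placeholders of mine, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real train.csv
demo = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4'],
    'text': ['a b', 'c d', 'e f', 'g h'],
    'author': ['EAP', 'HPL', 'MWS', 'EAP'],
})

# random_state pins the shuffle, so both calls return the same order
shuffled_a = demo.sample(frac=1.0, random_state=42)
shuffled_b = demo.sample(frac=1.0, random_state=42)

print(shuffled_a.index.tolist() == shuffled_b.index.tolist())
```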

In [65]:
data.head(10)

|  | id | text | author |
|---|---|---|---|
| 15980 | id22852 | His careless extravagance, which made him squa... | MWS |
| 4722 | id00476 | and was he not consequently damned? | EAP |
| 5011 | id03406 | To a sensitive mind there is always more or le... | EAP |
| 102 | id03072 | I vaow afur Gawd, I dun't know what he wants n... | HPL |
| 6491 | id19412 | Such as he had now become, such as was his ter... | MWS |
| 12528 | id09587 | Trever seemed dazed in the confusion, and shra... | HPL |
| 14417 | id21015 | "Hem ahem rather civil that, I should say" sai... | EAP |
| 7717 | id15308 | Wherefore should I see her? | MWS |
| 14289 | id27422 | The Greeks wept for joy when they beheld the M... | MWS |
| 8723 | id10438 | All the time she could command she spent in so... | MWS |


In [66]:
data['author'].value_counts()

EAP    7900
MWS    6044
HPL    5635
Name: author, dtype: int64

# Stemming

Stemming is the process of reducing words down to their roots. It's commonly taught as part of NLP, but in my experience it typically doesn't help with a model's performance.

In fact, it's been so long since I've stemmed anything that I forget the most efficient way to code it. The example below works fine, but I remember learning it a different way.

In [19]:
from nltk.stem.snowball import SnowballStemmer

# There are several different stemmers,
# but Snowball tends to do the best job.
stemmer = SnowballStemmer(language='english')

In [20]:
stemmed = []

for sample in data['text']:
    # Stem each word and then rejoin them all to a string
    st = " ".join([stemmer.stem(word) for word in sample.split()])
    stemmed.append(st)
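
The same stemming step can also be written as a small helper applied with `Series.apply`. This is a sketch that assumes NLTK is installed, and it may or may not be the "different way" mentioned above. The `stem_text` helper and the two-row sample are my own illustrations.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

def stem_text(text):
    """Stem every whitespace-separated token and rejoin into one string."""
    # split() breaks on whitespace only, so punctuation stays attached to words
    return " ".join(stemmer.stem(word) for word in text.split())

# Hypothetical two-row sample standing in for data['text']
texts = pd.Series(["running coffins", "He walked quickly"])

stemmed_series = texts.apply(stem_text)
print(stemmed_series.tolist())
```

On the real data this would be `data['text'].apply(stem_text)`, which returns a Series aligned with `data['author']`.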

I'm not going to be very conscientious about training and testing splits, since that's not really the purpose of this notebook. For teaching purposes, simple cross validation is fine IMO.

In [21]:
x = np.array(stemmed)
y = data['author']

# Pipelines

Pipelines usually aren't necessary for ML tasks -- but when they are, they're absolutely essential. They allow you to run your data through multiple steps before your model makes a prediction.

The main time I use them is for NLP tasks like this. I'll walk you through the process:

```
from sklearn.pipeline import Pipeline

pl = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB())
])
```

You'll then use this just like you would any other machine learning model (e.g., `LogisticRegression()`).

There are at least 2 ways to create a pipeline (the other is `make_pipeline`, which names the steps for you), and this is the more verbose (but flexible) way. You pass a **list of tuples** to the `Pipeline` class.

- The 2nd item in each tuple is a step in your machine learning model.

- The 1st item is what you want to call it.
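
For comparison, here is a short sketch of both construction styles side by side. `make_pipeline` is scikit-learn's shortcut that generates the step names from the class names; the variable names `named` and `auto` are mine.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Explicit names: you choose them
named = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB()),
])

# Automatic names: the lowercased class name of each step
auto = make_pipeline(TfidfVectorizer(), MultinomialNB())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]

print(named_step_names)
print(auto_step_names)
```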

"What you want to call it?" I know that sounds weird, but you'll see why that's useful in a moment.

Let's continue on by seeing how a dummy classifier does in terms of predicting accurately.

In [22]:
from sklearn.dummy import DummyClassifier

dum = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', DummyClassifier())
])

# Here's what I mean by treating your pipeline like any other model
cv = cross_val_score(dum, x, y, cv=kf, scoring='accuracy')

print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

Mean score: 0.34041551160391936
Std Dev:    0.0031701578217426434


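Not very good: about 34% accuracy is roughly what you'd expect from guessing, since each author represents about a third of the data. (This output comes from a `DummyClassifier` that guesses in proportion to the class frequencies. scikit-learn 0.24 and later default to always predicting the most common class instead, so your number may differ.)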
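
As a sanity check, both baseline accuracies can be worked out from the class counts printed earlier. This is a back-of-the-envelope sketch, not part of the original notebook: guessing in proportion to class frequency should land near 0.34, and always guessing the most common author near 0.40.

```python
import numpy as np

# Class counts from data['author'].value_counts() above
counts = np.array([7900, 6044, 5635])   # EAP, MWS, HPL
p = counts / counts.sum()

# Guessing at random in proportion to class frequency:
# class i is the truth with probability p_i and is guessed with probability p_i
stratified_acc = (p ** 2).sum()

# Always guessing the most common class (Poe)
majority_acc = p.max()

print(round(stratified_acc, 4), round(majority_acc, 4))
```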

`MultinomialNB()` is the classifier you'll most often use for NLP. It's fast and does a great job. Other classifiers, like logistic regression or decision trees, work too, but try them out and you'll see why I usually don't bother.

Pay close attention to how I set up my parameter grid below. For each parameter in your pipeline, you refer to it by the name you gave the step, followed by 2 underscores, followed by the parameter you're testing (e.g. `tfidf__min_df`). Examine the code and it'll make sense.
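
Here is a small sketch of that naming rule on its own, outside of a grid search. The pipeline mirrors the one used in this notebook; `pl_demo`, `grids`, and the parameter values are illustrative choices of mine. It also shows how a list of dictionaries is treated as several separate grids.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import ParameterGrid

pl_demo = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB()),
])

# Every tunable parameter is listed as <step name>__<parameter name>
print('tfidf__min_df' in pl_demo.get_params())
print('classify__alpha' in pl_demo.get_params())

# The same names work with set_params
pl_demo.set_params(tfidf__min_df=2, classify__alpha=0.1)

# A list of dicts is searched as separate grids (values are illustrative)
grids = [
    {'tfidf__min_df': [2, 3], 'classify__alpha': [0.1, 0.2]},  # 2 x 2 combinations
    {'tfidf__norm': ['l1', 'l2']},                              # 2 combinations
]
n_candidates = len(ParameterGrid(grids))
print(n_candidates)
```

Counting candidates this way before launching `GridSearchCV` gives a quick estimate of the cost: each candidate is fit once per fold.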

In [23]:
pl = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB())
])

param_grid = [
    {
        'tfidf__max_df': np.arange(.01,.10,.01),
        'tfidf__min_df': [2,3,4],
        'tfidf__ngram_range': [(1,2)],
        'tfidf__norm': ['l1', 'l2'],
        'classify__alpha': [.01, .1, .2],
    },
]
# If you're curious why I enclosed my grid dictionary in a list:
# param_grid accepts either a single dict or a list of dicts.
# With a list, each dict is searched as its own grid, which helps
# in certain advanced use cases where you have multiple grids.

grid = GridSearchCV(pl, cv=kf, param_grid=param_grid, scoring='accuracy')
grid.fit(x, y)

model_nb = grid.best_estimator_
print(model_nb)
cv = cross_val_score(model_nb, x, y, cv=kf, scoring='accuracy')

print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.04, max_features=None, min_df=4,
        ngram_range=(1, 2), norm='l1', preprocessor=None, smooth_idf=True,
...        vocabulary=None)), ('classify', MultinomialNB(alpha=0.2, class_prior=None, fit_prior=True))])
Mean score: 0.4025746959129042
Std Dev:    0.006219236493607164


Notice that that's our accuracy score with stemmed words. It's actually not all that good! About 0.40 is what you'd get by always guessing Poe, who wrote 7,900 of the 19,579 samples. Let's try again with our original text.

In [24]:
x = data['text']

pl = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB())
])

param_grid = [
    {
        'tfidf__max_df': [.03,.04,.05],
        'tfidf__min_df': [4],
        'tfidf__ngram_range': [(1,2)],
        'tfidf__norm': ['l1'],
        'classify__alpha': [.2],
    },
]

grid = GridSearchCV(pl, cv=kf, param_grid=param_grid, scoring='accuracy')
grid.fit(x, y)

model_nb = grid.best_estimator_
print(model_nb)
cv = cross_val_score(model_nb, x, y, cv=kf, scoring='accuracy')

print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.03, max_features=None, min_df=4,
        ngram_range=(1, 2), norm='l1', preprocessor=None, smooth_idf=True,
...        vocabulary=None)), ('classify', MultinomialNB(alpha=0.2, class_prior=None, fit_prior=True))])
Mean score: 0.6775116535934644
Std Dev:    0.002423282058608898


MUCH better, right? If you have the time and motivation, you could improve it even further, but let me show you one more trick I like to do with NLP.

In [68]:
model_nb.fit(x, y);

In [43]:
from random import sample

# Again, here's why you named each step in your pipeline.
# .vocabulary_ is actually a dictionary that maps each term the
# vectorizer kept (after min_df/max_df filtering) to a column index.
tfidf_words = model_nb.named_steps['tfidf'].vocabulary_

# Let's print a few random words!
s = sample(list(tfidf_words), 5)
print(s)

['upstream', 'cord', 'blitzen', 'coffins', 'mizzen']
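
To make the shape of `vocabulary_` concrete, here is a toy sketch. The three-document corpus and the `min_df=2` setting are my own illustration. Terms that fail the frequency filter never make it into the dictionary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = ["the raven", "the black cat", "the raven and the cat"]

vec = TfidfVectorizer(min_df=2)   # keep terms found in at least 2 documents
vec.fit(docs)

vocab = vec.vocabulary_           # dict: term -> column index
terms_by_column = sorted(vocab, key=vocab.get)
print(terms_by_column)
print('black' in vocab)           # appears in only one document, so it is dropped
```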


In [71]:
scores = {}

# For each word in the vocabulary
for word in tfidf_words.keys():
    
    # Predict the class probabilities for that word.
    # predict_proba expects a list of documents, so the word goes
    # in a one-item list, and it returns one row per document,
    # so [0] pulls out the row for our single word.
    pred = model_nb.predict_proba([word])[0]
    
    # Add that word and its predictions to our new "scores" dict
    scores[word] = pred
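
Since `predict_proba` accepts a whole list of documents, the loop above can likely be replaced by a single call, which avoids re-running the pipeline once per word. A sketch on toy data follows; `toy_x`, `toy_y`, and `toy_model` are stand-ins of mine for the real `x`, `y`, and `model_nb`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real x and y
toy_x = ["the raven spoke", "the balloon rose",
         "the ancient city", "the ancient horror"]
toy_y = ["EAP", "EAP", "HPL", "HPL"]

toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB()),
]).fit(toy_x, toy_y)

words = list(toy_model.named_steps['tfidf'].vocabulary_)

# One call for the whole vocabulary: one row of probabilities per word
probs = toy_model.predict_proba(words)
batch_scores = dict(zip(words, probs))

# Same numbers as predicting one word at a time
loop_scores = {w: toy_model.predict_proba([w])[0] for w in words}
same = all(np.allclose(batch_scores[w], loop_scores[w]) for w in words)
print(same)
```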

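That `model_nb.predict_proba([word])[0]` is not at all what you'd expect, right? In a perfectly logical world, the code would shorten to `model_nb.predict_proba(word)`. But scikit-learn models predict on batches: they take a list of documents and return one row of probabilities per document, so a single word goes in as a one-item list and comes out as the first row.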

I'm pointing this out as a tip to people who are still newish to Python in general. **When you're not sure what kinds of outputs and datatypes you're dealing with, debug them by printing them out.** Then look at whether you need to do additional indexing, conversion, etc.

Sure, there are more advanced debugging methods out there, but the print function will solve 99% of your problems.
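
Here is what that print-debugging might look like for `predict_proba`. This is a sketch on toy data of my own (`toy_model` stands in for `model_nb`). It prints the type and shape of the output, then shows what happens when a bare string is passed instead of a list.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data
toy_x = ["the raven spoke", "the balloon rose",
         "the ancient city", "the ancient horror"]
toy_y = ["EAP", "EAP", "HPL", "HPL"]

toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB()),
]).fit(toy_x, toy_y)

out = toy_model.predict_proba(["raven"])
print(type(out), out.shape)   # a 2-D array: one row per document
print(out[0].shape)           # the single row: one value per class

# A bare string is rejected by the vectorizer instead of being
# treated as one document
try:
    toy_model.predict_proba("raven")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```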

In [72]:
# I hate this syntax, but I think it's the fastest
# way to convert a dict to a dataframe
scores = pd.DataFrame([scores]).T

scores.head()

|  | 0 |
|---|---|
| aaem | [0.40349353899586293, 0.2878083661065427, 0.30... |
| ab | [0.40349353899586293, 0.2878083661065427, 0.30... |
| aback | [0.40349353899586293, 0.2878083661065427, 0.30... |
| abaft | [0.40349353899586293, 0.2878083661065427, 0.30... |
| abandon | [0.5266669572934571, 0.25686285964094774, 0.21... |


Yikes! That's not useful at all!

In [73]:
# This converts a column of lists into multiple columns
# in effect "expanding" it
# Again, really weird syntax but super useful

scores = scores[0].apply(pd.Series)

scores.sample(10)

|  | 0 | 1 | 2 |
|---|---|---|---|
| hesitates | 0.403494 | 0.287808 | 0.308698 |
| perpetrate | 0.403494 | 0.287808 | 0.308698 |
| beautify | 0.403494 | 0.287808 | 0.308698 |
| throb | 0.403494 | 0.287808 | 0.308698 |
| physiology | 0.403494 | 0.287808 | 0.308698 |
| smoky | 0.403494 | 0.287808 | 0.308698 |
| ailments | 0.403494 | 0.287808 | 0.308698 |
| whispers | 0.252173 | 0.454014 | 0.293813 |
| diotima | 0.330113 | 0.283243 | 0.386644 |
| perticcler | 0.403494 | 0.287808 | 0.308698 |
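
Many rows here share the same three values: 0.403494, 0.287808, 0.308698. Those match each author's share of the training samples (7900, 5635, and 6044 out of 19579), which is what `MultinomialNB` returns when a document contributes no known features. That suggests the fitted model had no weight for those words, though the notebook doesn't show why. A toy sketch of the effect, with data and names of my own:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: two EAP samples, one HPL sample
toy_x = ["the raven spoke", "the balloon rose", "the ancient city"]
toy_y = ["EAP", "EAP", "HPL"]

toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB()),
]).fit(toy_x, toy_y)

# "zzzz" never appears in the training text, so its feature vector is all zeros
unseen = toy_model.predict_proba(["zzzz"])[0]

# Class priors: the share of training samples in each class
nb = toy_model.named_steps['classify']
priors = nb.class_count_ / nb.class_count_.sum()

print(unseen)
print(priors)
```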


In [59]:
# I had to look up the order and figure it out manually.
# (It matches the classifier's classes_ attribute, which
# scikit-learn keeps in sorted order: EAP, HPL, MWS.)

scores.columns = ['Poe', 'Lovecraft', 'Shelley']
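
One way to avoid looking up the column order by hand is to read it from the fitted classifier's `classes_` attribute, which lists the labels in the same order as the `predict_proba` columns. The sketch below also builds the DataFrame directly from the 2-D probability array, skipping the dict and the `apply(pd.Series)` step. The toy data, `toy_model`, and the `names` mapping are my own illustrations.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data with one sample per author
toy_x = ["the raven spoke", "the ancient city", "the wretched creature"]
toy_y = ["EAP", "HPL", "MWS"]

toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classify', MultinomialNB()),
]).fit(toy_x, toy_y)

words = list(toy_model.named_steps['tfidf'].vocabulary_)
probs = toy_model.predict_proba(words)

# classes_ holds the labels in the same order as the predict_proba columns
labels = toy_model.named_steps['classify'].classes_

# Hypothetical mapping from label codes to readable column names
names = {'EAP': 'Poe', 'HPL': 'Lovecraft', 'MWS': 'Shelley'}

toy_scores = pd.DataFrame(probs, index=words,
                          columns=[names[c] for c in labels])
print(toy_scores.head())
```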

In [60]:
# Poe words
scores.sort_values('Poe', ascending=False).head(10)

|  | Poe | Lovecraft | Shelley |
|---|---|---|---|
| dupin | 0.914451 | 0.041665 | 0.043884 |
| marie | 0.883941 | 0.056525 | 0.059534 |
| madame | 0.879644 | 0.042088 | 0.078267 |
| jupiter | 0.878962 | 0.05895 | 0.062088 |
| balloon | 0.876592 | 0.054143 | 0.069265 |
| automaton | 0.860367 | 0.061508 | 0.078125 |
| diddle | 0.857114 | 0.06959 | 0.073296 |
| monsieur | 0.856356 | 0.069959 | 0.073684 |
| diddler | 0.856338 | 0.069968 | 0.073694 |
| chess | 0.845633 | 0.07995 | 0.074417 |


In [61]:
# Lovecraft words
scores.sort_values('Lovecraft', ascending=False).head(10)

|  | Poe | Lovecraft | Shelley |
|---|---|---|---|
| gilman | 0.058872 | 0.890681 | 0.050447 |
| innsmouth | 0.067213 | 0.875192 | 0.057595 |
| arkham | 0.075709 | 0.859416 | 0.064875 |
| whateley | 0.085385 | 0.841449 | 0.073166 |
| later | 0.128023 | 0.822405 | 0.049571 |
| jermyn | 0.111046 | 0.812859 | 0.076095 |
| armitage | 0.10196 | 0.810672 | 0.087368 |
| despite | 0.099299 | 0.809283 | 0.091418 |
| folk | 0.106462 | 0.802311 | 0.091227 |
| aout | 0.107131 | 0.801068 | 0.0918 |


In [62]:
# Shelley words
scores.sort_values('Shelley', ascending=False).head(10)

|  | Poe | Lovecraft | Shelley |
|---|---|---|---|
| raymond | 0.019843 | 0.022816 | 0.95734 |
| perdita | 0.027295 | 0.022207 | 0.950498 |
| adrian | 0.032765 | 0.026657 | 0.940578 |
| idris | 0.042039 | 0.034202 | 0.923758 |
| windsor | 0.056153 | 0.045685 | 0.898162 |
| elizabeth | 0.054179 | 0.052539 | 0.893282 |
| misery | 0.08672 | 0.039718 | 0.873562 |
| sister | 0.074425 | 0.05894 | 0.866635 |
| justine | 0.07526 | 0.06123 | 0.86351 |
| miserable | 0.09944 | 0.043885 | 0.856676 |
