## Text Classification with NLTK and Scikit-learn
##### By Ruben Seoane, all credit to nlpforhackers.io
Based on: https://nlpforhackers.io/text-classification/

Text classification is the task of assigning a label to a piece of text so that it falls into one of a set of categories (for example, legal vs. medical).

We will use the 20 Newsgroups corpus, which scikit-learn can download for us.

In [2]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')

print(len(news.data))

18846


In [3]:
print(len(news.target_names))
print(news.target_names)

20
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [5]:
for text, num_label in zip(news.data[:10], news.target[:10]):
    print('[%s]:\t\t "%s ..."' % (news.target_names[num_label], text[:100].split('\n')[0]))

[rec.sport.hockey]:		 "From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu> ..."
[comp.sys.ibm.pc.hardware]:		 "From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson) ..."
[talk.politics.mideast]:		 "From: hilmi-er@dsv.su.se (Hilmi Eren) ..."
[comp.sys.ibm.pc.hardware]:		 "From: guyd@austin.ibm.com (Guy Dawson) ..."
[comp.sys.mac.hardware]:		 "From: Alexander Samuel McDiarmid <am2o+@andrew.cmu.edu> ..."
[sci.electronics]:		 "From: tell@cs.unc.edu (Stephen Tell) ..."
[comp.sys.mac.hardware]:		 "From: lpa8921@tamuts.tamu.edu (Louis Paul Adams) ..."
[rec.sport.hockey]:		 "From: dchhabra@stpl.ists.ca (Deepak Chhabra) ..."
[rec.sport.hockey]:		 "From: dchhabra@stpl.ists.ca (Deepak Chhabra) ..."
[talk.religion.misc]:		 "From: arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) ..."


So we have to classify 18846 documents into 20 classes.

Let's build a simple helper that trains a classifier on 80% of the data and evaluates it against the remaining 20%, held out as a test set:

In [12]:
from sklearn.model_selection import train_test_split as tts

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=33)
    
    classifier.fit(X_train, y_train)
    print('Accuracy: %s' % classifier.score(X_test, y_test))
    return classifier

We will experiment with the **_Multinomial Naive Bayes_** classifier. First, we need to transform the text into feature vectors, which we do here with a TF-IDF vectorizer placed in front of the classifier in a pipeline:

In [11]:
from sklearn.naive_bayes import MultinomialNB as MNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF

trial1 = Pipeline([
    ('vectorizer', TFIDF()),
    ('classifier', MNB()),
])

train(trial1, news.data, news.target)

Accuracy: 0.8527851458885941


Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...      vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
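
To make the vectorizer step less of a black box, here is a minimal pure-Python sketch of a TF-IDF weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches the `smooth_idf` and `norm='l2'` settings visible in the output above. The whitespace tokenizer, the `tfidf` function and the toy documents are simplifications introduced for illustration, not what the pipeline actually runs.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    # Simplification: lowercase and split on whitespace only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Scale the vector to unit length so long documents do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical toy corpus, not taken from the newsgroups data.
toy_docs = [
    "the puck hit the goalie",
    "the disk drive failed",
    "the goalie saved the puck",
]
vectors = tfidf(toy_docs)
print(vectors[1])
```

In the second toy document, "the" occurs in every document and so gets a lower weight than "disk", which occurs in only one. That down-weighting of ubiquitous terms is the property the classifier benefits from.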

Let's improve the accuracy. The first step is to ignore words that carry little meaning, known as **stopwords**:

In [14]:
# The stop-word list must be downloaded once with: nltk.download('stopwords')
from nltk.corpus import stopwords

trial2 = Pipeline([
    ('vectorizer', TFIDF(stop_words=stopwords.words('english'))),
    ('classifier', MNB()),
])

train(trial2, news.data, news.target)

Accuracy: 0.8830238726790451


Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...      vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
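
As a rough picture of what the stop-word filter does to a document before it is vectorized, consider the toy sketch below. The seven-word `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical stand-ins; NLTK's English list is much longer, and the vectorizer applies the filter internally.

```python
# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

tokens = remove_stop_words("The goalie is the last line of defence")
print(tokens)  # ['goalie', 'last', 'line', 'defence']
```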

Removing stopwords raised accuracy by about 3 percentage points (from 0.853 to 0.883). Next, let's adjust the **alpha** (smoothing) parameter of the Naive Bayes classifier by setting it to a lower value:

In [15]:
trial3 = Pipeline([
    ('vectorizer', TFIDF(stop_words=stopwords.words('english'))),
    ('classifier', MNB(alpha=0.05)),
])

train(trial3, news.data, news.target)

Accuracy: 0.9103448275862069


Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...     vocabulary=None)), ('classifier', MultinomialNB(alpha=0.05, class_prior=None, fit_prior=True))])
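
To get a feel for why a smaller alpha can matter, here is a small sketch of additive smoothing, the kind of estimate that alpha controls in a multinomial Naive Bayes model. The function name and all the counts are hypothetical, chosen only to show the effect. In the real pipeline the "counts" are TF-IDF weights rather than integers.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (class_total + alpha * vocab_size)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical numbers: one class with 1000 term occurrences, 10000-term vocabulary.
CLASS_TOTAL = 1000
VOCAB_SIZE = 10000

ratios = {}
for alpha in (1.0, 0.05):
    seen = smoothed_likelihood(20, CLASS_TOTAL, VOCAB_SIZE, alpha)   # term seen 20 times
    unseen = smoothed_likelihood(0, CLASS_TOTAL, VOCAB_SIZE, alpha)  # term never seen
    ratios[alpha] = seen / unseen
    print(f"alpha={alpha}: seen={seen:.6f} unseen={unseen:.6f} ratio={ratios[alpha]:.0f}")
```

With a large vocabulary and alpha=1.0, the smoothing mass swamps the observed counts, so a term seen 20 times is only about 21 times more likely than one never seen. With alpha=0.05 the same term is about 401 times more likely. Letting the observed evidence count for more is one plausible reason the lower alpha helps on this dataset.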

Let's now ignore words that appear in fewer than 5 documents of the collection (`min_df=5`):

In [17]:
trial4 = Pipeline([
    ('vectorizer', TFIDF(stop_words=stopwords.words('english'), min_df=5)),
    ('classifier', MNB(alpha=0.05)),
])

train(trial4, news.data, news.target)

Accuracy: 0.906631299734748


Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=5,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...     vocabulary=None)), ('classifier', MultinomialNB(alpha=0.05, class_prior=None, fit_prior=True))])
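
Note that `min_df` counts documents, not total occurrences. The sketch below illustrates the difference on a toy corpus. The `filter_by_min_df` helper and the documents are made up for illustration, and the whitespace tokenizer is a simplification of what the vectorizer does.

```python
from collections import Counter

def filter_by_min_df(docs, min_df):
    """Return the sorted terms that occur in at least min_df documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # set(tokens) makes each document count a term at most once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return sorted(term for term, n_docs in df.items() if n_docs >= min_df)

# Hypothetical corpus: "goal" occurs 5 times but in a single document.
docs = [
    "goal goal goal goal goal team",
    "team wins",
    "team loses",
]
kept = filter_by_min_df(docs, min_df=2)
print(kept)  # ['team']
```

Here "goal" is dropped despite its five occurrences, because it appears in only one document, while "team" survives because it appears in all three.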

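Dropping rare terms doesn't improve our model (accuracy falls slightly, from 0.910 to 0.907), so let's go back to the default `min_df` and instead use the NLTK tokenizer to split the text into words and then perform stemming.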

In [19]:
import string
from nltk.stem import PorterStemmer as PS
from nltk import word_tokenize

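# Create the stemmer once instead of on every call to the tokenizer.
stemmer = PS()

def stemming_tokenizer(text):
    # word_tokenize needs the NLTK 'punkt' tokenizer data to be downloaded.
    return [stemmer.stem(w) for w in word_tokenize(text)]

# Note: the stop words are matched against the stemmed tokens,
# so stop words whose stem differs from the listed form are not filtered out.
trial5 = Pipeline([
    ('vectorizer', TFIDF(tokenizer=stemming_tokenizer,
                         stop_words=stopwords.words('english') + list(string.punctuation))),
    ('classifier', MNB(alpha=0.05)),
])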

train(trial5, news.data, news.target)

Accuracy: 0.9140583554376658


Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...rue, vocabulary=None)), ('classifier', MultinomialNB(alpha=0.05, class_prior=None, fit_prior=True))])

This small improvement (from 0.910 to 0.914) came at a large cost in speed.
That's why building a model requires trial and error. In many cases higher accuracy means lower speed, so we have to weigh the amount of data we need to process against whether our application is time-sensitive or needs higher accuracy.
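
To put a number on the speed side of that trade-off, one option is to time the fit and score steps together. The sketch below is an assumption about how this could be done, not part of the original notebook. `timed_train` and `MajorityClassifier` are hypothetical names, and the stand-in classifier only exists so that the example runs without downloading any data.

```python
import time

def timed_train(classifier, X_train, y_train, X_test, y_test):
    """Fit and score a classifier, returning (accuracy, elapsed_seconds)."""
    start = time.perf_counter()
    classifier.fit(X_train, y_train)
    accuracy = classifier.score(X_test, y_test)
    elapsed = time.perf_counter() - start
    return accuracy, elapsed

class MajorityClassifier:
    """Stand-in exposing the same fit/score interface as the pipelines above."""

    def fit(self, X, y):
        # Remember the most frequent training label.
        self.label_ = max(set(y), key=y.count)
        return self

    def score(self, X, y):
        # Fraction of test labels equal to the remembered label.
        return sum(1 for label in y if label == self.label_) / len(y)

acc, secs = timed_train(MajorityClassifier(), ["a", "b", "c"], [0, 0, 1], ["d", "e"], [0, 1])
print(f"accuracy={acc:.2f} time={secs:.4f}s")
```

Any of the pipelines from the trials could be passed in place of the stand-in, since they expose the same `fit` and `score` methods, which makes it easy to compare accuracy and run time side by side.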