# Text Classification using scikit-learn and NLTK
Document (or text) classification is one of the most important and typical tasks in supervised machine learning (ML). Assigning categories to documents, which can be web pages, library books, media articles, galleries, etc., has many applications, e.g., spam filtering, email routing, and sentiment analysis. In this class, we will learn how to do text classification using Python, scikit-learn, and a little bit of NLTK.

## 1.  Loading the data set
The data set we will be using for this example is the famous “20 Newsgroups” data set. About the data, from the original website:

*The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.*

This data set is built into scikit-learn: the loader below downloads and caches it on first use, so we don’t need to download it manually.

In [2]:
#Loading the data set - training data.
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

*Note: Above, we are only loading the **training** data. We will load the test data separately later in the example.*

You can check the target names (categories) and some data files with the following commands.

In [3]:
# You can check the target names (categories) and some data files with the following commands.
twenty_train.target_names #prints all the categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
print("\n".join(twenty_train.data[0].split("\n")[:3])) #prints first 3 lines of the first data file

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


In [11]:
print(twenty_train.target[0],twenty_train.target_names[twenty_train.target[0]])

7 rec.autos


## 2. Extracting features from text files
Text files are really just ordered sequences of words. In order to run machine learning algorithms, we need to convert the text files into numerical feature vectors. We will be using the **bag of words** model for our example. Briefly, we segment each text file into words (for English, by splitting on whitespace), assign each unique word an integer id, and count the number of times each word occurs in each document. **Each unique word in our dictionary will correspond to a feature (descriptive feature).**
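As a rough illustration of the bag of words idea, here is a minimal pure-Python sketch. The two toy documents and the names `docs`, `vocabulary` and `to_vector` are made up for this example and are not part of the 20 Newsgroups data or of scikit-learn.

```python
from collections import Counter

# Toy corpus (hypothetical example documents, not from 20 Newsgroups).
docs = ["the car is fast", "the car hit the wall"]

# Step 1: segment each document into words by splitting on whitespace.
tokenized = [doc.split() for doc in docs]

# Step 2: assign each unique word an integer id (sorted for reproducibility).
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 3: count how often each vocabulary word occurs in a document.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one word id, so both documents end up with vectors of the same length regardless of how many words they contain.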

Scikit-learn has a high-level component, ‘CountVectorizer’, which will create feature vectors for us. More about it <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">here</a>.

In [9]:
# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

Here, by calling ‘**count_vect.fit_transform(twenty_train.data)**’, we learn the vocabulary dictionary and get back a document-term matrix of shape [n_samples, n_features]: 11,314 documents by 130,107 unique terms.
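To see what the document-term matrix looks like on a scale that fits on screen, the sketch below runs `CountVectorizer` on two made-up sentences. The names `toy_docs`, `toy_vect` and `toy_counts` are hypothetical and chosen so they do not overwrite the variables used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only.
toy_docs = ["the car is fast", "the car hit the wall"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse document-term matrix

print(toy_counts.shape)  # (n_samples, n_features)
# vocabulary_ maps each term to its column index in the matrix.
print(sorted(toy_vect.vocabulary_.items(), key=lambda item: item[1]))
# toarray() is fine for a toy matrix; avoid it on the full data set.
print(toy_counts.toarray())
```

The matrix is stored in sparse form because most words do not occur in most documents.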

**TF**: Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use frequencies (**TF - Term Frequencies**), i.e., #count(word) / #total words in each document.

**TF-IDF**: Finally, we can also reduce the weight of very common words (the, is, an, etc.), which occur in almost all documents. This is called **TF-IDF, i.e., Term Frequency times Inverse Document Frequency**.

We can achieve both using the code below:

In [10]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
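To make the two weighting ideas concrete, here is a small hand computation in pure Python. It is a sketch of the textbook definitions (TF as count divided by document length, IDF as the log of number of documents over document frequency), with made-up documents and hypothetical helper names. Note that `TfidfTransformer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from the ones printed here.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents.
toy_docs = [
    ["the", "car", "is", "fast"],
    ["the", "car", "hit", "the", "wall"],
    ["the", "goalie", "saved", "the", "puck"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tf(word, doc):
    # Term frequency: occurrences of the word divided by document length.
    return doc.count(word) / len(doc)

def idf(word):
    # Textbook (unsmoothed) inverse document frequency.
    return math.log(n_docs / doc_freq[word])

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

print(tf("the", toy_docs[1]))      # "the" is 2 of the 5 words
print(idf("the"))                  # "the" occurs in every document
print(tfidf("wall", toy_docs[1]))  # "wall" occurs in one document only
```

A word that appears in every document gets an IDF of zero under this definition, which is how common words lose their weight.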

## 3. Running ML algorithms
There are various algorithms which can be used for text classification. We will start with the simplest one, ‘<a href="http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes">Naive Bayes (NB)</a>’ (**don’t think it is too naive! 😃**).

You can easily build an NB classifier in scikit-learn using the two lines of code below (note: there are many variants of NB, but discussing them is out of scope here):

In [23]:
# Machine Learning
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


This will train the NB classifier on the training data we provided.
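Once the classifier is trained, new documents must go through the same vectorizer and transformer, using `transform` rather than `fit_transform` so that the vocabulary and IDF weights learned on the training data are reused. The sketch below shows this on a made-up four-document corpus; the labels and all `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: label 0 = cars, label 1 = hockey.
toy_train = [
    "the engine of my car is loud",
    "new tires for the car",
    "the goalie saved the puck",
    "hockey players skate on ice",
]
toy_labels = [0, 0, 1, 1]

toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_toy = toy_tfidf.fit_transform(toy_vect.fit_transform(toy_train))
toy_clf = MultinomialNB().fit(X_toy, toy_labels)

# New documents: only transform(), never fit_transform(), so the columns
# line up with the features the classifier was trained on.
new_docs = ["my car needs new tires", "puck on ice"]
X_new = toy_tfidf.transform(toy_vect.transform(new_docs))
toy_pred = toy_clf.predict(X_new)
print(toy_pred)
```

Words that were never seen during training (here "needs") are simply ignored by `transform`.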

**Building a pipeline**: We can write less code and do all of the above, by building a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipeline</a> as follows:

In [22]:
# The names ‘vect’ , ‘tfidf’ and ‘clf’ are arbitrary but will be used later.
# We will be using the 'text_clf' going forward.
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
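A fitted pipeline accepts raw strings and pushes them through each step in order. The sketch below, on made-up data with hypothetical `toy_` names, checks that calling `predict` on the pipeline gives the same result as passing the documents through the fitted steps by hand via `named_steps`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up data: label 0 = cars, label 1 = hockey.
toy_train = [
    "the engine of my car is loud",
    "new tires for the car",
    "the goalie saved the puck",
    "hockey players skate on ice",
]
toy_labels = [0, 0, 1, 1]
toy_test = ["my car needs new tires", "puck on ice"]

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
toy_pipe.fit(toy_train, toy_labels)

# What the pipeline does at prediction time, spelled out step by step.
steps = toy_pipe.named_steps
counts = steps['vect'].transform(toy_test)
weights = steps['tfidf'].transform(counts)
manual_pred = steps['clf'].predict(weights)

pipe_pred = toy_pipe.predict(toy_test)
print(np.array_equal(manual_pred, pipe_pred))
```

The step names used here ('vect', 'tfidf', 'clf') are the handles by which individual steps and their parameters can be addressed later.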

In [27]:
# Performance of NB Classifier
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

The accuracy we get is ~77.39%, which is not bad for a start and for a naive classifier.
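The expression `np.mean(predicted == target)` is the fraction of correct predictions, which is what scikit-learn's `accuracy_score` computes. The sketch below checks this on made-up label arrays and also prints a confusion matrix, which shows which classes get mixed up. The arrays `toy_true` and `toy_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for eight documents in three classes.
toy_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
toy_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(toy_pred == toy_true)
library_accuracy = accuracy_score(toy_true, toy_pred)
print(manual_accuracy, library_accuracy)

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)
```

Off-diagonal entries of the matrix count the mistakes, so they show where a classifier struggles even when the overall accuracy looks acceptable.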

**Support Vector Machines (SVM)**: Let’s try a different algorithm, SVM, and see if we can get better performance. Here we use `SGDClassifier` with `loss='hinge'`, which trains a linear SVM by stochastic gradient descent. More about SVMs <a href="http://scikit-learn.org/stable/modules/svm.html">here</a>.

In [29]:
# Training Support Vector Machines - SVM and calculating its performance

from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42))])

text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)

0.82381837493361654

The accuracy we get is ~82.38% (a little better!).

## 4. Grid Search

Almost all classifiers have various parameters which can be tuned to obtain optimal performance. Scikit-learn provides an extremely useful tool for this, ‘GridSearchCV’.

Here, we are creating a dictionary of parameters for which we would like to do performance tuning. Each parameter name starts with the name of the pipeline step it belongs to (remember the arbitrary names we gave), followed by two underscores and the parameter name. E.g., with `vect__ngram_range` we ask the search to try both unigrams only and unigrams plus bigrams (i.e., single words, or single words and pairs of consecutive words) and to choose whichever is optimal.

In [30]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
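The double-underscore naming is not specific to grid search: it is how any pipeline addresses the parameters of its steps. The sketch below uses a hypothetical `toy_pipe` to set two parameters through `set_params`, and then shows what `ngram_range=(1, 2)` does to the vocabulary of a made-up sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# '<step name>__<parameter name>' addresses a parameter of one pipeline step.
toy_pipe.set_params(vect__ngram_range=(1, 2), clf__alpha=1e-2)
print(toy_pipe.named_steps['vect'].ngram_range)
print(toy_pipe.named_steps['clf'].alpha)

# With ngram_range=(1, 2), both single words and word pairs become features.
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(["the quick fox"])
bigram_terms = sorted(bigram_vect.vocabulary_)
print(bigram_terms)
```

The grid search does the same thing internally: for each combination in the dictionary it sets the parameters on a copy of the pipeline and evaluates it.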

Next, we create an instance of the grid search by passing the classifier, `parameters`, and `n_jobs=-1`, which tells it to use all available cores of the user's machine.

This might take a few minutes to run, depending on the machine configuration.

In [32]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

Lastly, to see the best mean score and the params, run the following code:

In [37]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.906752695775
{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


The best mean cross-validated accuracy is now ~90.7% for the NB classifier (not so naive anymore!) and the corresponding parameters are `{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}`. Note that `best_score_` is averaged over cross-validation folds of the training data, so it is not directly comparable to the test-set accuracy reported earlier.

Similarly, we get an improved score of ~89.79% for the SVM classifier with the code below. **Note: You can further optimize the SVM classifier by tuning other parameters. This is left up to you to explore more.**

In [38]:
# Similarly doing grid search for SVM
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False),'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)

0.89791408874
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
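The scores reported by a grid search are cross-validation averages computed on the training data. To compare a tuned model with the earlier test-set accuracies, one option is to evaluate the refitted best estimator on held-out data. The sketch below does this on a made-up eight-document corpus; the data, the reduced parameter grid, `cv=2` and all `toy_` names are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up data: label 0 = cars, label 1 = hockey.
toy_train = [
    "the engine of my car is loud", "new tires for the car",
    "the car needs an oil change", "my car has a flat tire",
    "the goalie saved the puck", "hockey players skate on ice",
    "the puck slid across the ice", "the hockey team won the game",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_test = ["car engine oil", "hockey puck on ice"]
toy_test_labels = [0, 1]

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
toy_grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': (1e-2, 1e-3)}

# cv=2 because the toy corpus is tiny; real data would use more folds.
toy_gs = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_gs.fit(toy_train, toy_labels)

print(toy_gs.best_score_)   # mean cross-validated accuracy (training data)
print(toy_gs.best_params_)

# By default the best configuration is refitted on all training data,
# so the search object can score held-out documents directly.
toy_test_score = toy_gs.score(toy_test, toy_test_labels)
print(toy_test_score)
```

Applied to this notebook, the same pattern would be `gs_clf.score(twenty_test.data, twenty_test.target)`, which yields a number comparable to the accuracies computed with `np.mean` above.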


## Exercise 1. Removing stop words
In most text classification problems, stop words (the, then, etc.) are not useful. Let’s see if removing them increases the accuracy. Update the code of the NB classifier for which we don't tune the parameters, by creating a CountVectorizer object that uses the English stop words made available by NLTK. Then, run the remaining steps as before. Does the accuracy increase?

Try the same for SVM and also while doing grid search.
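As a hint on the mechanism only, the sketch below shows how `CountVectorizer` drops words passed through its `stop_words` parameter. The four-word list `toy_stop_words` is a deliberately tiny, made-up stand-in. For the exercise itself, NLTK's English list is available as `nltk.corpus.stopwords.words('english')` once the corpus has been fetched with `nltk.download('stopwords')`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the exercise asks for NLTK's English list instead.
toy_stop_words = ["the", "is", "of", "my"]

toy_docs = ["the engine of my car is loud"]

with_stops = CountVectorizer().fit(toy_docs)
without_stops = CountVectorizer(stop_words=toy_stop_words).fit(toy_docs)

kept_all = sorted(with_stops.vocabulary_)
kept_filtered = sorted(without_stops.vocabulary_)
print(kept_all)
print(kept_filtered)
```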

## Exercise 2. The fit_prior parameter in MultinomialNB
Set the `fit_prior` parameter of MultinomialNB to `False`, so that a uniform prior is used instead of class priors learned from the data.
Do this both without and with the grid search and print the accuracy.
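To see what a uniform prior means in practice, the sketch below fits MultinomialNB on a small, imbalanced, made-up count matrix and compares the class priors with and without `fit_prior`. The arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Imbalanced toy data: three samples of class 0, one sample of class 1.
X_toy = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y_toy = np.array([0, 0, 0, 1])

learned = MultinomialNB(fit_prior=True).fit(X_toy, y_toy)
uniform = MultinomialNB(fit_prior=False).fit(X_toy, y_toy)

# class_log_prior_ holds log probabilities; exponentiate to read them.
learned_prior = np.exp(learned.class_log_prior_)
uniform_prior = np.exp(uniform.class_log_prior_)
print(learned_prior)  # follows the class frequencies in y_toy
print(uniform_prior)  # the same probability for every class
```

Since the 20 Newsgroups classes are nearly evenly sized, it is worth checking how much the accuracy actually changes.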