# Natural Language Processing Lab

In this lab we will further explore the text-processing capabilities of scikit-learn and NLTK. We will use the 20 Newsgroups dataset, which can be fetched through scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Let's inspect them.

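1. What data type is `data_train`?
> `sklearn.datasets.base.Bunch`
- Is it like a list, like a dictionary, or something else?
> Like a dictionary: a `Bunch` is a `dict` subclass whose keys can also be read as attributes (e.g. `data_train.data`)
- How many data points does it contain?
> 2034 training documents (the test set has 1353)
- Inspect the first data point. What does it look like?
> A blurb of text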

In [8]:
print type(data_train)
print len(data_train.data)
print len(data_test.data)
data_train.data[0]

<class 'sklearn.datasets.base.Bunch'>
2034
1353


u"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"
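
To make the dictionary-like behaviour concrete, here is a minimal sketch on a hand-made `Bunch`, so nothing has to be downloaded. The toy posts and labels are invented for illustration, and the import path `sklearn.utils.Bunch` is an assumption about a newer scikit-learn than the one used above, which reports the class as `sklearn.datasets.base.Bunch`.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for data_train: two made-up posts with labels.
toy = Bunch(data=["first post", "second post"],
            target=[0, 1],
            target_names=["comp.graphics", "sci.space"])

keys = sorted(toy.keys())               # dictionary-style access to the field names
same_object = toy["data"] is toy.data   # key lookup and attribute lookup agree
n_points = len(toy.data)                # number of data points

print(keys)
print(n_points)
```

The same three checks (`keys()`, `["data"]` versus `.data`, and `len(...)`) should apply unchanged to `data_train`.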

## 2. Bag of Words model

Let's train a model on the features produced by a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data
- how big is the feature dictionary
- repeat eliminating english stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- what are the 20 words that are most common in the whole corpus?
- what are the 20 most common words in each of the 4 classes?
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
- try the following 3 modifications:
    - restrict the max_features
    - change max_df and min_df
    - use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
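- for each of the above print a confusion matrix and investigate what gets mixed up
> Answer: not surprisingly, shrinking the feature space to 20 or 54 words costs accuracy (0.495 and 0.589, against 0.745 for the full vocabulary). Pruning with max_df and min_df more than halves the vocabulary (12144 features) while leaving accuracy essentially unchanged (0.748).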
- print out the number of features for each model
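
The warning about not re-fitting the vectorizer on the test set can be illustrated with a small sketch. The three toy documents are made up; the point is only that `transform` reuses the vocabulary learned by `fit_transform`, so train and test matrices share the same columns and unseen words are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, used only to illustrate the fit/transform split.
train_docs = ["nasa space launch", "god bible"]
test_docs = ["space shuttle launch"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)   # learns the vocabulary from the training set
X_test = vec.transform(test_docs)         # reuses that vocabulary, no re-fitting

same_width = X_train.shape[1] == X_test.shape[1]
unseen_dropped = "shuttle" not in vec.vocabulary_   # never seen during fit

print(X_train.shape)
print(X_test.toarray())
```

Calling `fit_transform` on the test documents instead would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.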

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer, TfidfTransformer

In [14]:
# Initialize CountVectorizer, fit and get features
v = CountVectorizer()
v.fit(data_train.data)
len(v.get_feature_names())

26879

In [13]:
# With English Stop words
v = CountVectorizer(stop_words='english')
v.fit(data_train.data)
len(v.get_feature_names())

26576
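
The dictionary shrinks only slightly (26879 to 26576 terms, a difference of 303) because a stop-word list is short even though its words are very frequent. A minimal sketch of the same comparison on three invented sentences, reading the vocabulary size from the `vocabulary_` attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences; only the comparison between the two vectorizers matters.
docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and a dog"]

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words='english').fit(docs)

# Terms that are in the plain vocabulary but were filtered as stop words.
removed = sorted(set(plain.vocabulary_) - set(no_stop.vocabulary_))

print(len(plain.vocabulary_))
print(len(no_stop.vocabulary_))
print(removed)
```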

In [15]:
v_train = v.transform(data_train.data)

In [19]:
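# Top 20 words: max_features keeps the 20 terms with the highest corpus-wide counts
top_20 = CountVectorizer(stop_words='english', max_features=20)
top_20.fit(data_train.data)
# vocabulary_ maps each kept term to its column index (alphabetical, not a frequency rank)
top_20.vocabulary_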

{u'data': 0,
 u'does': 1,
 u'don': 2,
 u'edu': 3,
 u'god': 4,
 u'good': 5,
 u'graphics': 6,
 u'image': 7,
 u'jesus': 8,
 u'just': 9,
 u'know': 10,
 u'like': 11,
 u'nasa': 12,
 u'people': 13,
 u'say': 14,
 u'space': 15,
 u'think': 16,
 u'time': 17,
 u'use': 18,
 u'way': 19}
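
The dictionary above lists the 20 kept terms alphabetically, so it does not show which of them is the most frequent. One way to get an actual ranking, sketched here on an invented three-document corpus, is to sum the columns of the document-term matrix and sort them. The variable names are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: "space" occurs 4 times, "nasa" and "launch" twice, "orbit" once.
docs = ["space space nasa", "nasa launch", "space orbit launch space"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Total count of every term over the whole corpus (one value per column).
counts = np.asarray(X.sum(axis=0)).ravel()

# Invert the term -> column mapping so columns can be turned back into words.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# Sort columns by decreasing count; mergesort is stable, so ties keep column order.
order = np.argsort(-counts, kind="mergesort")
ranked = [(index_to_word[int(i)], int(counts[i])) for i in order]

print(ranked)
```

Slicing `ranked[:20]` would give the 20 most common words together with their counts.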

In [28]:
df = pd.DataFrame()
df['class'] = data_train.target
df['data'] = data_train.data
df.head()

   class                                               data
0      1  Hi,\n\nI've noticed that if you only save a mo...
1      3  \n\nSeems to be, barring evidence to the contr...
2      2  \n >In article <1993Apr19.020359.26996@sq.sq.c...
3      0  I have a request for those who would like to s...
4      2  AW&ST had a brief blurb on a Manned Lunar Exp...


In [35]:
# 20 most common words in each category 
for i in range(0,4):
    top_20.fit_transform(df[df['class'] == i]['data'])
    print i,'\n', top_20.get_feature_names()
    print

0 
[u'argument', u'atheism', u'atheists', u'believe', u'bible', u'does', u'don', u'god', u'jesus', u'just', u'know', u'like', u'people', u'religion', u'said', u'say', u'think', u'time', u'true', u'way']

1 
[u'available', u'color', u'data', u'does', u'edu', u'file', u'files', u'format', u'ftp', u'gif', u'graphics', u'image', u'images', u'jpeg', u'know', u'like', u'program', u'pub', u'software', u'use']

2 
[u'data', u'don', u'earth', u'just', u'launch', u'like', u'lunar', u'mission', u'moon', u'nasa', u'new', u'orbit', u'people', u'program', u'satellite', u'shuttle', u'space', u'time', u'use', u'year']

3 
[u'believe', u'bible', u'christian', u'did', u'does', u'don', u'god', u'good', u'jesus', u'just', u'know', u'life', u'like', u'people', u'point', u'said', u'say', u'think', u'time', u'way']



In [99]:
# Logistic regression on the features produced by a vectorizer.
# Pass the vectorizer to the function below.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def log_reg(vect):
    X_train = vect.fit_transform(data_train.data)
    y_train = data_train.target
    X_test = vect.transform(data_test.data)
    y_test = data_test.target
    
    lr = LogisticRegression()
    lr.fit(X_train,y_train)
    
    y_pred = lr.predict(X_test)
    
    print 'Accuracy Score:', lr.score(X_test, y_test)
    print 'Confusion Matrix: \n', confusion_matrix(y_test,y_pred)

v = CountVectorizer(stop_words='english')
log_reg(v)
print 'Number of Features:', len(v.get_feature_names())

Accuracy Score: 0.745011086475
Confusion Matrix: 
[[187  16  46  70]
 [ 13 345  28   3]
 [ 22  23 333  16]
 [ 67  14  27 143]]
Number of Features: 26576
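
To investigate what gets mixed up, it helps to read the matrix row by row: in scikit-learn's `confusion_matrix`, rows are the true classes and columns the predicted ones. The sketch below copies the matrix printed above and derives the accuracy, the per-class recall, and the largest off-diagonal cell. It assumes the labels follow the alphabetical order of the category names (0 = `alt.atheism`, 1 = `comp.graphics`, 2 = `sci.space`, 3 = `talk.religion.misc`), which is consistent with the per-class word lists shown earlier.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted).
cm = np.array([[187,  16,  46,  70],
               [ 13, 345,  28,   3],
               [ 22,  23, 333,  16],
               [ 67,  14,  27, 143]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / float(cm.sum())

# Recall per class: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1).astype(float)

# Largest single confusion: zero the diagonal and take the biggest remaining cell.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(accuracy)
print(recall)
print((int(true_cls), int(pred_cls)))
```

Under the label assumption above, the two religion-related groups are the ones most often mistaken for each other (70 and 67 documents), while `comp.graphics` and `sci.space` are recognised far more reliably.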


In [100]:
v = CountVectorizer(stop_words='english',max_features=20)
log_reg(v)
print 'Number of Features:', len(v.get_feature_names())

Accuracy Score: 0.49519586105
Confusion Matrix: 
[[159  98  30  32]
 [ 30 296  55   8]
 [ 47 147 186  14]
 [112  91  19  29]]
Number of Features: 20


In [101]:
v = CountVectorizer(stop_words='english', max_df=0.2, min_df= 2)
log_reg(v)
print 'Number of Features:', len(v.get_feature_names())

Accuracy Score: 0.747967479675
Confusion Matrix: 
[[187  16  47  69]
 [  8 347  30   4]
 [ 20  27 335  12]
 [ 67  17  24 143]]
Number of Features: 12144
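
In the cell above, `max_df=0.2` is a float and is therefore read as a proportion of documents, while `min_df=2` is an integer and is read as an absolute document count. A small sketch on four invented documents shows both filters at work; the threshold `0.75` is chosen only to suit this toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: "god" is in all 4 documents, "space" and "image" in 2, "nasa" in 1.
docs = ["god space image", "god space", "god image", "god nasa"]

# Drop terms found in more than 75% of the documents or in fewer than 2 documents.
vec = CountVectorizer(max_df=0.75, min_df=2)
vec.fit(docs)

kept = sorted(vec.vocabulary_)
print(kept)
```

Only the mid-frequency terms survive: `god` is too common to discriminate between documents and `nasa` too rare to generalise from.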


In [102]:
# Top 20 words for each category; set() removes the duplicates across categories
words = []
top_20 = CountVectorizer(stop_words='english', max_features = 20)
for i in range(0,4):
    top_20.fit_transform(df[df['class'] == i]['data'])
    words += (top_20.get_feature_names())
    
v = CountVectorizer(stop_words='english',vocabulary=set(words))
log_reg(v)
print 'Number of Features:', len(v.get_feature_names())

Accuracy Score: 0.589061345159
Confusion Matrix: 
[[160  67  33  59]
 [ 25 314  44   6]
 [ 37  88 247  22]
 [ 90  70  15  76]]
Number of Features: 54
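
The fixed vocabulary ends up with 54 features rather than 80 because several top words are shared between categories and collapse when the lists are merged. A minimal sketch of the same construction on two invented classes with a top-3 list each (the class names and documents are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents grouped by class; "people" and "think" are frequent in both.
per_class_docs = {
    "religion": ["god people think", "god people think bible"],
    "space": ["nasa people think", "nasa people think orbit"],
}

top_n = 3
words = []
for label in sorted(per_class_docs):
    # max_features keeps the top_n most frequent terms of this class.
    top = CountVectorizer(max_features=top_n).fit(per_class_docs[label])
    words += sorted(top.vocabulary_)

# Merging the lists removes the words that appear in more than one class.
vocabulary = sorted(set(words))

# A vectorizer with a fixed vocabulary can transform without being fitted.
fixed = CountVectorizer(vocabulary=vocabulary)
X = fixed.transform(["god and nasa think about people"])

print(len(words))
print(vocabulary)
print(X.toarray())
```

Two classes with three words each give six entries but only four distinct features, the same effect that reduces 4 x 20 = 80 words to 54 above.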


## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- can you improve on your best score above?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model

In [105]:
h = HashingVectorizer(stop_words='english')
log_reg(h)
print 'No. of features:', h.n_features

Accuracy Score: 0.736881005174
Confusion Matrix: 
[[197  15  65  42]
 [  9 347  32   1]
 [ 21  23 350   0]
 [ 86  18  44 103]]
No. of features: 1048576


In [112]:
h = HashingVectorizer(stop_words='english', n_features=50000)
log_reg(h)
print 'No. of features:', h.n_features

Accuracy Score: 0.730229120473
Confusion Matrix: 
[[187  18  64  50]
 [  8 347  33   1]
 [ 19  22 352   1]
 [ 85  19  45 102]]
No. of features: 50000
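
Unlike the count vectorizer, a `HashingVectorizer` learns nothing from the data: each token is hashed straight to a column, so the width of the matrix is whatever `n_features` says and no vocabulary is stored. That is why the first run reports 1048576 (2 to the power 20, the default) columns. A short sketch with two invented documents and a deliberately small `n_features`:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Invented documents; the hashing trick needs no fitting step.
docs = ["nasa launched a shuttle", "god and religion"]

hv = HashingVectorizer(n_features=2 ** 10)
X = hv.transform(docs)                      # works without calling fit first

default_width = HashingVectorizer().n_features
has_vocabulary = hasattr(hv, "vocabulary_")  # no term -> column mapping is kept

print(X.shape)
print(default_width)
print(has_vocabulary)
```

The price of the fixed width is that columns cannot be mapped back to words, and different words may collide in the same column. In the two runs above, shrinking the table from 1048576 to 50000 columns moved the accuracy only from 0.737 to 0.730.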


In [116]:
# TFIDF 
T = TfidfVectorizer()
log_reg(T)
print 'No. of features:', len(T.get_feature_names())

Accuracy Score: 0.733185513673
Confusion Matrix: 
[[199  26  55  39]
 [ 11 349  27   2]
 [ 17  39 338   0]
 [ 86  25  34 106]]
No. of features: 26879
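
The default `TfidfVectorizer` reports the same 26879 features as the default `CountVectorizer` because it tokenises in the same way and only reweights the counts. The sketch below, on three invented documents, checks that it matches a `CountVectorizer` followed by a `TfidfTransformer` (both imported at the top of this section) and that each row is scaled to unit length under the default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

# Invented documents, used only to compare the two routes.
docs = ["space nasa space", "god bible", "nasa image file"]

# Two-step route: raw counts first, then TF-IDF reweighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)

print(same)
print(row_norms)
```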


In [132]:
# TF-IDF with English stop words removed, min_df=2, and tokens limited to runs of 3 to 50 letters
T = TfidfVectorizer(stop_words='english', min_df=2, token_pattern='[a-zA-Z]{3,50}')
log_reg(T)
print 'No. of features:', len(T.get_feature_names())

Accuracy Score: 0.749445676275
Confusion Matrix: 
[[196  16  58  49]
 [  7 351  30   1]
 [ 19  21 354   0]
 [ 78  16  44 113]]
No. of features: 11190
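
Part of the drop to 11190 features comes from the custom `token_pattern`, which accepts only runs of 3 to 50 letters and so discards numbers, two-letter words and mixed tokens. A quick way to see its effect is to run the analyzer on a single string; the sample sentence is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentence containing digits, short words and a URL.
text = "The .3DS file v2 is ok, see ftp://x.org"

# build_analyzer returns the preprocessing + tokenising function; no fit is needed.
default_tokens = TfidfVectorizer().build_analyzer()(text)
letter_tokens = TfidfVectorizer(
    token_pattern='[a-zA-Z]{3,50}').build_analyzer()(text)

print(default_tokens)
print(letter_tokens)
```

The default pattern keeps tokens such as `3ds` and `v2`, while the letters-only pattern keeps just `the`, `file`, `see`, `ftp` and `org`.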


## 4. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
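
A minimal sketch of that pattern, assuming a toy two-class corpus and only two of the listed models so that it runs in a moment: the vectorizer is fitted a single time, and every model then reuses the same `X_train` and `X_test` matrices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Invented two-class corpus (0 = space, 1 = religion), for illustration only.
train_docs = ["nasa space launch orbit", "space shuttle moon",
              "god bible jesus", "religion god believe"]
y_train = [0, 0, 1, 1]
test_docs = ["moon orbit launch", "jesus religion bible"]
y_test = [0, 1]

# Vectorize once, outside the loop over models.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

scores = {}
for model in [LogisticRegression(), DecisionTreeClassifier(random_state=0)]:
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

for name in sorted(scores):
    print("%s: %.2f" % (name, scores[name]))
```

With the real data, the same loop would run over the six classifiers listed above.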

In [141]:
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier

models = [KNeighborsClassifier(), LogisticRegression(), DecisionTreeClassifier(), 
                SVC(), RandomForestClassifier(), ExtraTreesClassifier()]

def model_score(model, v):
    X_train = v.fit_transform(data_train.data)
    X_test = v.transform(data_test.data)
    y_train = data_train.target
    y_test = data_test.target
    
    model.fit(X_train, y_train)
    y_predictions = model.predict(X_test)
    
    print 'score:', model.score(X_test, y_test)
    print 'confusion matrix:','\n', confusion_matrix(y_test, y_predictions)
    print 'No. of features:', len(v.get_feature_names())  # use the vectorizer passed in, not the global T
    print 

T = TfidfVectorizer(stop_words='english', min_df=2, token_pattern='[a-zA-Z]{3,50}')

for i in models:
    print i
    model_score(i,T)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
score: 0.288987435329
confusion matrix: 
[[108 100  57  54]
 [132 140  43  74]
 [113 124  92  65]
 [ 77  80  43  51]]
No. of features: 11190

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
score: 0.749445676275
confusion matrix: 
[[196  16  58  49]
 [  7 351  30   1]
 [ 19  21 354   0]
 [ 78  16  44 113]]
No. of features: 11190

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_stat

## Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare all the classifiers it suggests and to display the final plot.

## Bonus: NLTK

NLTK is a vast library. Can you find some interesting bits to share with classmates?
Start here: http://www.nltk.org/