<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

In [1]:
# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the function documentation for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
# A:
newsgroups_train = fetch_20newsgroups(subset='train')
from pprint import pprint
pprint(list(newsgroups_train.target_names))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [10]:
cats = ['alt.atheism','talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats, remove=('headers', 'footers', 'quotes'))
list(newsgroups_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [11]:
newsgroups_train.filenames.shape

(2034,)

In [12]:
newsgroups_train.target[:10]

array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1], dtype=int64)
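
The integer labels in `target` index into `target_names`. A minimal sketch of decoding them, using the four category names and the ten labels copied from the outputs above (the variable names are hypothetical, chosen only for illustration):

```python
import numpy as np

# Category names and first ten labels, copied from the outputs above
target_names = np.array(['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])
first_targets = np.array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])

# Fancy indexing turns each integer label into its category name
first_labels = target_names[first_targets]

# Count how often each class appears among these ten labels
label_counts = dict(zip(target_names.tolist(),
                        np.bincount(first_targets, minlength=4).tolist()))

print(first_labels[:3])
print(label_counts)
```

The first label decodes to `comp.graphics`, which is consistent with the first post inspected in the next section (a question about 3DS file formats).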

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we could pass `'train'` or `'test'` as the `subset` argument).

Let's inspect them.

1. What data type is `data_train` (`newsgroups_train` in the solution below)?
- Is it like a list, a dictionary, or something else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [32]:
# A: a Bunch (a dictionary-like object); its 'data' entry is a list of 2034 documents
print(type(newsgroups_train))
print(type(newsgroups_train['data']))
print(type(newsgroups_train['target']))

<class 'sklearn.utils.Bunch'>
<class 'list'>
<class 'numpy.ndarray'>
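
A `Bunch` behaves like a dictionary whose keys can also be read as attributes, which is why both `newsgroups_train['data']` and `newsgroups_train.data` work in this lab. A small sketch with a hypothetical stand-in object (the contents of `toy_bunch` are made up for illustration):

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical stand-in for the real dataset object
toy_bunch = Bunch(data=["first post", "second post"], target=np.array([0, 1]))

# Dictionary-style and attribute-style access return the same object
same_object = toy_bunch["data"] is toy_bunch.data

print(isinstance(toy_bunch, dict))  # Bunch subclasses dict
print(list(toy_bunch.keys()))       # field names, as with a dict
print(len(toy_bunch.data))          # number of documents
```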


In [19]:
newsgroups_train['data'][0]

"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data
- How big is the feature dictionary?
- Repeat, eliminating English stop words
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - You will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it

**BONUS:**
- Try a couple of modifications:
    - restrict `max_features`
    - change `max_df` and `min_df`
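
The fit-on-train, transform-on-test workflow described above can be sketched on a tiny corpus before running it on the real data. The `toy_*` names and sentences are hypothetical and are not part of the lab dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the newsgroup posts
toy_train = ["the rocket reached orbit", "the image renders in the viewer"]
toy_test = ["the rocket renders a hologram"]

# Fit on the training text only: this is where the vocabulary is learned
toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)

# Transform (not fit_transform) the test text so the columns stay aligned
toy_test_dtm = toy_vect.transform(toy_test)

# Same fit with English stop words removed
toy_vect_stop = CountVectorizer(stop_words='english')
toy_vect_stop.fit(toy_train)

print(toy_train_dtm.shape, toy_test_dtm.shape)
print(len(toy_vect.vocabulary_), len(toy_vect_stop.vocabulary_))
```

Words that appear only in the test text (here `hologram`) are silently ignored, because the vocabulary is fixed at fit time. That is why the train and test matrices always have the same number of columns.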

In [49]:
# A: 26879 features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
vect = CountVectorizer()
newsgroups_train_dtm = vect.fit_transform(newsgroups_train.data)
newsgroups_train_dtm.shape

(2034, 26879)

In [35]:
# Removing English stop words drops only 303 features (26879 -> 26576)
vect1 = CountVectorizer(stop_words='english')
newsgroups_train_dtm1 = vect1.fit_transform(newsgroups_train.data)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats, remove=('headers', 'footers', 'quotes'))
newsgroups_test_dtm1 = vect1.transform(newsgroups_test.data)
print(newsgroups_train_dtm1.shape)
print(newsgroups_test_dtm1.shape)

(2034, 26576)
(1353, 26576)


In [37]:
X_train = newsgroups_train_dtm1
y_train = newsgroups_train.target
X_test  = newsgroups_test_dtm1
y_test  = newsgroups_test.target

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# C is the inverse regularization strength, so C=1e9 is effectively unregularized
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
y_pred_class = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.7420546932742055
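
As an alternative to managing the vectorizer and classifier by hand, the two steps can be chained in a scikit-learn `Pipeline`, which makes it harder to re-fit the vectorizer on test data by accident. This is a sketch on hypothetical toy data, not the approach used in the cells of this lab:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = graphics, 1 = space
toy_train = ["render the image", "image file format",
             "rocket launch orbit", "orbit the moon"]
toy_y = [0, 0, 1, 1]
toy_test = ["image format", "moon rocket"]

# fit() fits the vectorizer and the classifier on the training text only;
# predict() transforms new text with the already-fitted vocabulary
toy_pipe = make_pipeline(CountVectorizer(stop_words='english'), LogisticRegression())
toy_pipe.fit(toy_train, toy_y)
toy_pred = toy_pipe.predict(toy_test)

print(toy_pred)
```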


In [45]:
# Integer min_df/max_df are absolute document counts: keep terms found in 2 to 10 documents
vect2 = CountVectorizer(stop_words='english', min_df=2, max_df=10, max_features=10000)
newsgroups_train_dtm2 = vect2.fit_transform(newsgroups_train.data)
newsgroups_test_dtm2 = vect2.transform(newsgroups_test.data)
print(newsgroups_train_dtm2.shape)
print(newsgroups_test_dtm2.shape)

(2034, 9586)
(1353, 9586)


In [46]:
X_train = newsgroups_train_dtm2
X_test  = newsgroups_test_dtm2
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
y_pred_class = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.6895787139689579
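
One possible contributor to the lower score (an interpretation, not something the lab states) is how `max_df` is read: an integer is an absolute document count, while a float is a proportion of documents. With `max_df=10` and 2034 training posts, every term that occurs in more than 10 posts is discarded. The difference can be seen on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "space" occurs in 3 documents, "launch" in 2
toy_docs = ["space shuttle launch", "space station orbit",
            "space probe launch", "graphics card driver"]

# Integer 1: keep only terms that appear in at most 1 document
vect_int = CountVectorizer(max_df=1).fit(toy_docs)

# Float 1.0: keep terms that appear in at most 100% of documents (everything)
vect_float = CountVectorizer(max_df=1.0).fit(toy_docs)

print(sorted(vect_int.vocabulary_))
print(sorted(vect_float.vocabulary_))
```

Passing a float such as `max_df=0.5` instead would drop only terms that appear in more than half of the documents.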


### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- Does the score improve with respect to the count vectorizer?
- Print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- Print out the number of features for this model

**BONUS:**
- Change the parameters of either (or both!) models to improve your score
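
Unlike `CountVectorizer`, a `HashingVectorizer` learns no vocabulary: each token is hashed straight to a column index, so the number of columns is fixed by `n_features` (default `2**20`) rather than by the data. A minimal sketch on a hypothetical two-document corpus:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy corpus
toy_docs = ["space shuttle launch", "graphics card driver"]

# Default width: 2**20 columns, regardless of how many distinct words exist
wide = HashingVectorizer().fit_transform(toy_docs)

# A deliberately tiny width (an illustrative choice) makes hash collisions more likely
narrow = HashingVectorizer(n_features=2**4).fit_transform(toy_docs)

print(wide.shape, narrow.shape)

# Fitting stores nothing: there is no vocabulary_ attribute to inspect
print(hasattr(HashingVectorizer().fit(toy_docs), "vocabulary_"))
```

Because nothing is stored, the mapping cannot be inverted to recover which word produced which column, which is the trade-off for the fixed memory footprint.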

In [51]:
# A: HashingVectorizer is stateless, so fitting learns no vocabulary
vect3 = HashingVectorizer(stop_words='english')
newsgroups_train_dtm3 = vect3.fit_transform(newsgroups_train.data)
newsgroups_test_dtm3 = vect3.transform(newsgroups_test.data)
print(newsgroups_train_dtm3.shape)
print(newsgroups_test_dtm3.shape)

(2034, 1048576)
(1353, 1048576)


In [52]:
X_train = newsgroups_train_dtm3
X_test  = newsgroups_test_dtm3
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
y_pred_class = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.7435328898743533


In [73]:
vect4 = TfidfVectorizer(stop_words='english', min_df=2, max_df=60, max_features=20000)
newsgroups_train_dtm4 = vect4.fit_transform(newsgroups_train.data)
newsgroups_test_dtm4 = vect4.transform(newsgroups_test.data)
print(newsgroups_train_dtm4.shape)
print(newsgroups_test_dtm4.shape)

(2034, 11882)
(1353, 11882)


In [74]:
X_train = newsgroups_train_dtm4
X_test  = newsgroups_test_dtm4
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
y_pred_class = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.7634885439763488


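The HashingVectorizer produced 1,048,576 features (its default `n_features` of 2**20) but barely changed the accuracy: 0.7435, versus 0.7421 for the CountVectorizer with stop words removed. The TfidfVectorizer, restricted with `min_df=2`, `max_df=60`, and `max_features=20000`, used only 11,882 features and improved the score to about 76%.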
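
TF-IDF differs from raw counts by down-weighting terms that occur in many documents. The sketch below checks scikit-learn's default smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, against the fitted `idf_` values on a hypothetical toy corpus (`manual_idf` is a helper written for this illustration, not a library function):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "space" is in 2 documents, "shuttle" in 1
toy_docs = ["space shuttle launch", "space station orbit", "graphics card driver"]
tfidf = TfidfVectorizer().fit(toy_docs)

def manual_idf(doc_freq, n_docs=len(toy_docs)):
    """Smoothed IDF as used by TfidfVectorizer defaults (smooth_idf=True)."""
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Look up the learned IDF weight of each term through the fitted vocabulary
idf_space = tfidf.idf_[tfidf.vocabulary_["space"]]
idf_shuttle = tfidf.idf_[tfidf.vocabulary_["shuttle"]]

print(idf_space, manual_idf(2))
print(idf_shuttle, manual_idf(1))
```

The more widespread term (`space`) receives the smaller weight, so rare, topic-specific words contribute more to each document vector.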