# NLP Text Classification Lab
In this lab we engineer text features from newsgroup posts and build models that predict the category of each post.

## Objectives

You will learn

- Bag of words models
- CountVectorizer and TfidfVectorizer
- Strategies to generate more features

## First let's read in the data.

In [8]:
import pandas as pd

In [2]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

## Understanding the data
`target_names` holds the human-readable name of each numeric category; a post's `target` value is an index into this list.

In [17]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Build the DataFrame
Let's construct a DataFrame with the input data and the target.

In [12]:
df = pd.DataFrame({"text": newsgroups_train.data, "y": newsgroups_train.target})

In [18]:
df.head()

                                                text   y
0  From: lerxst@wam.umd.edu (where's my thing)\nS...   7
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...   4
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...   4
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...   1
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...  14


Each row holds one post: the raw message in `text` and its numeric category in `y`.

## Exercise
How many rows are there in each category?

In [19]:
# Solution
category_counts = df.y.value_counts()
category_counts

10    600
15    599
8     598
9     597
11    595
13    594
7     594
14    593
5     593
12    591
2     591
3     590
6     585
1     584
4     578
17    564
16    546
0     480
18    465
19    377
Name: y, dtype: int64
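
The counts above are keyed by numeric label, which is hard to read. One way to show category names instead is to map each label through `target_names`. The sketch below uses a small stand-in list and a handful of made-up labels (both are assumptions for illustration) so that it runs without downloading the dataset; in the lab you would use `newsgroups_train.target_names` and `df.y`.

```python
import pandas as pd

# Stand-ins for newsgroups_train.target_names and df.y (illustrative values only).
target_names = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]
y = pd.Series([0, 2, 1, 2, 2, 0], name="y")

# Each label is an index into target_names, so enumerate() gives the id -> name mapping.
id_to_name = dict(enumerate(target_names))

# Replace ids with names, then count the rows per category.
named_counts = y.map(id_to_name).value_counts()
print(named_counts)
```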

## Bag of Words Modeling
Please read [this article](https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428) that explains the following concepts:

- Bag of Words
- Term Frequency
- Inverse Document Frequency
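
To make the three concepts concrete, here is a minimal sketch that computes them by hand with the standard library only. The three-document corpus is made up, and the IDF uses the textbook form `log(N / df)`. scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical corpus, already lowercased so a whitespace split is enough.
docs = [
    "the fox ran",
    "the fox slept",
    "the dog ran fast",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Bag of words: each document becomes a word -> count mapping; word order is discarded.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for bag in bags for word in bag)


def tf(word, bag):
    """Term frequency: the share of the document's tokens that are `word`."""
    return bag[word] / sum(bag.values())


def idf(word):
    """Inverse document frequency, textbook form: log(N / df)."""
    return math.log(n_docs / doc_freq[word])


def tf_idf(word, bag):
    """Weight of `word` in one document: frequent here, rare elsewhere scores high."""
    return tf(word, bag) * idf(word)


print(tf_idf("the", bags[0]))  # 0.0, because "the" appears in every document
print(tf_idf("fox", bags[0]))  # positive, because "fox" is missing from one document
```

A word that occurs in every document gets an IDF of zero, which is why TF-IDF down-weights very common words without needing an explicit stop-word list.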

## Let's CountVectorize!
Let's take a look at [the documentation in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for `CountVectorizer`.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

In [40]:
corpus = [
    "The brown fox ran quickly",
    "The green fox ran slowly",
    "Green and brown are my favorite colors",
    "Fox Mulder runs quickly"
]

In [43]:
vect = CountVectorizer()
X = vect.fit_transform(corpus)
X.toarray()

array([[0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1],
       [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]])

## Exercise
What are the columns? Find out the "features" of this matrix.

In [45]:
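# Solution
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vect.get_feature_names_out().tolist())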

['and', 'are', 'brown', 'colors', 'favorite', 'fox', 'green', 'mulder', 'my', 'quickly', 'ran', 'runs', 'slowly', 'the']


## Exercise
Remove "stop words" like "the", etc. This reduces the "noise" in our data.

In [48]:
# Solution
vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(corpus)
X.toarray()

array([[1, 0, 0, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]])

In [47]:
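# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vect.get_feature_names_out().tolist())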

['brown', 'colors', 'favorite', 'fox', 'green', 'mulder', 'quickly', 'ran', 'runs', 'slowly']
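
To see why "the", "and", "are", and "my" disappeared, you can check words against the built-in list that `stop_words="english"` refers to. The sketch below assumes that list is the `ENGLISH_STOP_WORDS` frozenset exported by `sklearn.feature_extraction.text`.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Check a few words from the toy corpus against the built-in English list.
for word in ["the", "and", "are", "my", "fox"]:
    print(word, word in ENGLISH_STOP_WORDS)
```

You can also pass your own list of words to `stop_words` if the built-in list removes terms that matter for your task.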


## Newsgroups Exercise 1
- Count-vectorize just the `text` field to form a bag-of-words feature set for the newsgroups data.
- Use unigrams and bigrams by setting the `ngram_range` when instantiating `CountVectorizer`.
- Let the feature set be called `X`

In [20]:
df['text']

0        From: lerxst@wam.umd.edu (where's my thing)\nS...
1        From: guykuo@carson.u.washington.edu (Guy Kuo)...
2        From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3        From: jgreen@amber (Joe Green)\nSubject: Re: W...
4        From: jcm@head-cfa.harvard.edu (Jonathan McDow...
                               ...                        
11309    From: jim.zisfein@factory.com (Jim Zisfein) \n...
11310    From: ebodin@pearl.tufts.edu\nSubject: Screen ...
11311    From: westes@netcom.com (Will Estes)\nSubject:...
11312    From: steve@hcrlgw (Steven Collins)\nSubject: ...
11313    From: gunning@cco.caltech.edu (Kevin J. Gunnin...
Name: text, Length: 11314, dtype: object

In [23]:
## Solution
vect = CountVectorizer(stop_words='english', ngram_range=(1,2))
X = vect.fit_transform(df['text'])

## Newsgroups Exercise 2
Take the target set and call it y.

In [25]:
## Solution
y = df['y']

## Newsgroups Exercise 3
Train a model on the data, say `RandomForestClassifier` or `SVC`. Use `train_test_split` to hold out a test set.

In [26]:
## Solution
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)

In [28]:
rf = RandomForestClassifier(n_estimators=20)

In [29]:
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [30]:
rf.score(X_test, y_test)

0.7873343151693667

Not bad! Let's make it better.

## Newsgroups Exercise 4
Now let's see how TF-IDF does.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
## Solution
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
X = vect.fit_transform(df['text'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
rf = RandomForestClassifier(n_estimators=20)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7867452135493372

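OK, about the same! The two scores differ by less than 0.001, which is well within the run-to-run variation of an unseeded split and forest.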

## Newsgroups Exercise 5
- Now let's remove some noisy parts of the data, such as headers, then retrain and see whether that feature engineering / data cleanup helps.
- Also try `SVC`, `RandomForestClassifier`, and other models, and tune them. (Remember to scale your data for `SVC`.)

Look [here](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) for a hint.

In [33]:
## Solution
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"text": newsgroups_train.data, "y": newsgroups_train.target})

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
X = vect.fit_transform(df['text'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
rf = RandomForestClassifier(n_estimators=20)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.5905743740795287
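
The solution above retries only the random forest. For the `SVC` part of the exercise, one possible setup is a pipeline that vectorizes, scales, and classifies in a single object. The sketch below rests on several assumptions: `MaxAbsScaler` is chosen because it accepts sparse matrices and keeps them sparse, a linear kernel is used, and the toy texts and labels are made up so the code runs without downloading the dataset. The helper name `build_svc_pipeline` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC


def build_svc_pipeline():
    """Hypothetical helper: TF-IDF features, sparse-safe scaling, then a linear SVC."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
        MaxAbsScaler(),  # scales each feature to [-1, 1] without densifying the matrix
        SVC(kernel="linear"),
    )


# Made-up stand-ins for df['text'] and df['y'].
toy_texts = [
    "the goalie stopped the puck in overtime",
    "the team scored a late hockey goal",
    "the shuttle reached orbit after launch",
    "the rocket engine failed before launch",
]
toy_labels = ["hockey", "hockey", "space", "space"]

model = build_svc_pipeline()
model.fit(toy_texts, toy_labels)
print(model.predict(["a new rocket launch is planned"]))
```

Because the vectorizer sits inside the pipeline, it is fitted only on the data passed to `fit`, so you can call `train_test_split` on the raw text first and keep test-set vocabulary out of training.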

## Brainstorming Exercise
Make a list of all of the things you could do to improve this model. This is always a good practice for creative problem solving when you hit a temporary roadblock.