# DS-SF-25 | Codealong and Lab 13 | Natural Language Processing and Text Classification

# Codealong - Text Processing with `sklearn`

In [None]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

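from sklearn import feature_extraction, ensemble, metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides
# the same train_test_split and cross_val_score, so alias it to keep the calls below unchanged
from sklearn import model_selection as cross_validation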

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

The data consists of Amazon reviews, each paired with a sentiment label (some labels are missing).

In [None]:
reviews = []
sentiments = []

# Each line is '<review>\t<sentiment>'; an empty sentiment means the review is unlabeled
with open(os.path.join('..', 'datasets', 'amazon-reviews.txt')) as f:
    for line in f:
        line = line.strip('\n')
        review, sentiment = line.split('\t')
        sentiment = np.nan if sentiment == '' else int(sentiment)

        reviews.append(review)
        sentiments.append(sentiment)

df = pd.DataFrame({'review': reviews, 'sentiment': sentiments})

In [None]:
df

In [None]:
df.dropna(inplace = True) # Let's drop NaNs

In [None]:
df

In [None]:
# TODO

In [None]:
X

In [None]:
y

## Train/test sets

In [None]:
train_X, test_X, train_y, test_y = cross_validation.train_test_split(X, y, train_size = .6, random_state = 0)

In [None]:
train_X

## `CountVectorizer`

`CountVectorizer` converts a collection of text into a matrix of features.  Each row will be a sample (an article or piece of text) and each column will be a text feature (usually a count or binary feature per word).

Vectorizers are like other models in `sklearn`:
- We create a vectorizer object with the parameters of our feature space
- We fit a vectorizer to learn the vocabulary
- We transform a set of text into that feature space

(And check http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html as needed)
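
The three steps above can be sketched on a tiny made-up corpus. This is only an illustration: the documents and the `toy_*` names are assumptions and are not part of the Amazon data or of the codealong's own solution.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not the Amazon reviews)
train_docs = ['good phone', 'bad phone', 'good good battery']
test_docs = ['good battery life']

# 1. Create the vectorizer with the parameters of the feature space
toy_vectorizer = CountVectorizer(binary = False)

# 2. Fit: learn the vocabulary from the training text only
toy_vectorizer.fit(train_docs)
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)  # words, in column order

# 3. Transform: map text into that feature space (one row per document,
#    one column per vocabulary word); .toarray() is used only for display
toy_train_counts = toy_vectorizer.transform(train_docs).toarray()
toy_test_counts = toy_vectorizer.transform(test_docs).toarray()

print(toy_vocabulary)
print(toy_train_counts)
print(toy_test_counts)
```

Note that 'life' never appeared in the training text, so it has no column and is silently dropped when the test document is transformed.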

In [None]:
# TODO

In [None]:
vectorizer

Note: Stopwords are non-content words (e.g., 'to', 'the', and 'it'); they are rarely helpful for prediction, so we remove them.
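
As a rough sketch of what stopword removal does to the vocabulary, the snippet below fits two vectorizers on an invented pair of sentences, one with and one without `stop_words = 'english'` (scikit-learn's built-in English list). The sentences and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for illustration
toy_docs = ['it is the phone', 'the battery died']

# Default: every token of two or more characters enters the vocabulary
with_stopwords = CountVectorizer().fit(toy_docs)

# stop_words = 'english' drops words on scikit-learn's built-in English list
without_stopwords = CountVectorizer(stop_words = 'english').fit(toy_docs)

kept_with = sorted(with_stopwords.vocabulary_)
kept_without = sorted(without_stopwords.vocabulary_)

print(kept_with)
print(kept_without)
```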

In [None]:
# TODO

The bag-of-words:

In [None]:
# TODO

We now use `transform` to generate the sample-by-word matrix, with one row per sample and one column per feature (here, a word).

In [None]:
# TODO

In [None]:
train_X

While dense matrices store every entry, sparse matrices store only the nonzero entries.  Sparse matrices support fewer operations than dense ones, and some algorithms don't accept them, so use them when a matrix would be too large to hold in memory in dense form but is mostly zeros and therefore compresses well.  You can convert a sparse matrix to a dense one with `.todense()`.
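
A minimal sketch of the difference, using a small hand-written matrix and `scipy.sparse` (the sparse format that `sklearn` vectorizers return). The matrix values and variable names are made up for illustration.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix, invented for illustration
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [1, 0, 0]])

sparse_matrix = sparse.csr_matrix(dense)

stored_entries = sparse_matrix.nnz  # only the nonzero entries are stored
total_entries = dense.size          # a dense array stores every entry

# .toarray() converts back, like .todense(), but returns a plain ndarray
round_trip = sparse_matrix.toarray()

print(stored_entries, 'of', total_entries, 'entries stored')
```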

In [None]:
train_X.todense()

## Random Forest

We can now build a random forest model to predict "sentiment".

In [None]:
model = ensemble.RandomForestClassifier(n_estimators = 5)

cross_validation.cross_val_score(model, train_X, train_y, scoring = 'roc_auc')

In [None]:
model.fit(train_X, train_y)

In [None]:
model.score(train_X, train_y)

In [None]:
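def roc_auc(model, X, y, title):
    # Use the predicted probability of the positive class as the score
    # (assumes binary labels, with the positive class in the second column);
    # hard labels from predict() would yield a single-threshold curve
    y_score = model.predict_proba(X)[:, 1]

    fpr, tpr, thresholds = metrics.roc_curve(y, y_score)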

    plt.figure()
    plt.plot(fpr, tpr, label = 'ROC curve (area = %0.2f)' % metrics.auc(fpr, tpr))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([.0, 1.])
    plt.ylim([.0, 1.1])
    plt.xlabel('FPR/Fall-out')
    plt.ylabel('TPR/Sensitivity')
    plt.title(title)
    plt.legend(loc = 'lower right')
    plt.show()

In [None]:
roc_auc(model, train_X, train_y, 'Sentiment ROC/AUC on Training Set')

In [None]:
model.score(test_X, test_y)

In [None]:
roc_auc(model, test_X, test_y, 'Sentiment ROC/AUC on Testing Set')

# Lab - TF-IDF

Directions: Redo the analysis above with `TfidfVectorizer` instead of `CountVectorizer`.  What do you get?

(Check http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html as needed)
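
Before starting, it may help to see how TF-IDF weights differ from raw counts. The sketch below uses an invented three-document corpus and `TfidfVectorizer` with its default settings; it is not the lab solution, and the documents and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration: 'phone' occurs in every document,
# while 'good' occurs in only one
toy_docs = ['phone good', 'phone bad', 'phone battery']

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs).toarray()

phone_col = tfidf.vocabulary_['phone']
good_col = tfidf.vocabulary_['good']

# Both words appear once in the first document, so their raw counts are equal,
# but the rarer word receives the larger TF-IDF weight
phone_weight = weights[0, phone_col]
good_weight = weights[0, good_col]

print(phone_weight, good_weight)
```

With the defaults, each row is also scaled to unit length, so the weights are comparable across documents of different lengths.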

In [None]:
# TODO