# Natural Language Processing (NLP)


## Movie Reviews

Following this tutorial on scikit-learn.org: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Adapted to work with an offline movie review corpus.
Also, check out documentation on dataset loading: http://scikit-learn.org/stable/datasets/

In [1]:
import sklearn
import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import load_files

In [3]:
# loading all movie files.
movie = load_files('./movie_reviews/', shuffle=True)

In [4]:
len(movie.data)

2000

In [5]:
# Target names ("classes") are automatically generated from subfolder names.
movie.target_names

['neg', 'pos']

In [6]:
# First file seems to be about a Schwarzenegger movie. 
movie.data[0][:500]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"

In [7]:
# First file is in "neg" folder
movie.filenames[0]

'./movie_reviews/neg/cv405_21868.txt'

In [8]:
# First file is a negative review and is mapped to 0 index 'neg' in target_names
movie.target[0]

0
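`load_files` expects one subfolder per class, with one text file per document. The sketch below builds a tiny throwaway corpus in a temporary directory (the folder contents and file names are made up for illustration) to show how subfolder names become `target_names` and how `target` indexes into them.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical two-class corpus: one subfolder per class, one file per document.
toy_reviews = {
    "neg": ["a dull and sloppy film"],
    "pos": ["a sharp and funny film", "great fun from start to finish"],
}

with tempfile.TemporaryDirectory() as root:
    for label, texts in toy_reviews.items():
        class_dir = Path(root) / label
        class_dir.mkdir()
        for i, text in enumerate(texts):
            (class_dir / f"review_{i}.txt").write_text(text, encoding="utf-8")

    # encoding="utf-8" decodes the files to str; without it, data holds bytes,
    # which is why the review printed above carries a b"..." prefix.
    toy = load_files(root, shuffle=True, random_state=0, encoding="utf-8")

print(toy.target_names)  # class names taken from the sorted subfolder names
for text, label in zip(toy.data, toy.target):
    print(toy.target_names[label], "<=", text)
```

Passing `encoding` is optional here; `CountVectorizer` also accepts bytes and decodes them as UTF-8 by default.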

## Testing out CountVectorizer & TF-IDF

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [9]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
# Three tiny "documents"
docs = ['a HPC is a HPC is a HPC is a HPC.',
        'Oh, what a fine practical this is.',
        "A practical ain't over till it's truly over."]

In [11]:
# Initialize a CountVectorizer with its default settings
#    (lowercases text, ignores punctuation and single-character tokens such as 'a';
#     stop words are NOT removed unless stop_words is set).
# Minimum document frequency set to 1. 
fooVzer = CountVectorizer(min_df=1)

In [12]:
# (1) fit: learns the vocabulary from the supplied text data (one vector dimension per unique word) 
# (2) transform: creates and returns a count-vectorized output of docs
docs_counts = fooVzer.fit_transform(docs)

# fooVzer now contains vocab dictionary which maps unique words to indexes
fooVzer.vocabulary_

{'hpc': 2,
 'is': 3,
 'oh': 5,
 'what': 11,
 'fine': 1,
 'practical': 7,
 'this': 8,
 'ain': 0,
 'over': 6,
 'till': 9,
 'it': 4,
 'truly': 10}
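The article `a` never reaches the vocabulary, while common words such as `is` and `this` do. A minimal sketch of why, using the vectorizer's own analyzer: the default `token_pattern` keeps only tokens of two or more word characters, and no stop-word list is applied unless `stop_words` is set.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "A practical ain't over till it's truly over."

# Default analyzer: lowercase, then keep tokens of 2+ word characters.
# Single characters ('a', and the 't' / 's' after the apostrophes) are dropped.
default_tokens = CountVectorizer().build_analyzer()(sentence)
print(default_tokens)

# With the built-in English stop-word list, common function words go too.
filtered_tokens = CountVectorizer(stop_words="english").build_analyzer()(sentence)
print(filtered_tokens)
```

This also explains the odd-looking entries `ain` and `it`: they are what remains of `ain't` and `it's` once the apostrophe splits the token.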

In [13]:
# docs_counts has a dimension of 3 (document count) by 12 (# of unique words)
docs_counts.shape

(3, 12)

In [14]:
# this vector is small enough to view in a full, non-sparse form! 
docs_counts.toarray()

array([[0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 0]], dtype=int64)

### Let's pretty-print that

In [15]:
pd.DataFrame({i:docs_counts.toarray()[:,j] for i,j in fooVzer.vocabulary_.items()})

Unnamed: 0,hpc,is,oh,what,fine,practical,this,ain,over,till,it,truly
0,4,3,0,0,0,0,0,0,0,0,0,0
1,0,1,1,1,1,1,1,0,0,0,0,0
2,0,0,0,0,0,1,0,1,2,1,1,1
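The dictionary comprehension above orders the columns by first appearance in the text, whereas the matrix columns themselves follow the alphabetical indices stored in `vocabulary_`. As an alternative sketch (variable names here are illustrative), the terms can be sorted by index so the DataFrame columns line up with `toarray()` directly.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['a HPC is a HPC is a HPC is a HPC.',
        'Oh, what a fine practical this is.',
        "A practical ain't over till it's truly over."]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; sorting terms by that index
# recovers the column order of the count matrix.
terms_in_column_order = sorted(vectorizer.vocabulary_,
                               key=vectorizer.vocabulary_.get)

counts_df = pd.DataFrame(counts.toarray(), columns=terms_in_column_order)
print(counts_df)
```

Recent scikit-learn releases (1.0 and later) also provide `get_feature_names_out()`, which returns the terms in the same column order.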


### Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency)

https://en.wikipedia.org/wiki/Tf%E2%80%93idf


Term Frequency
$$ \mathrm {tf} (t,d)=0.5+0.5\cdot {\frac {f_{t,d}}{\max\{f_{t',d}:t'\in d\}}}$$

where $f_{t,d}$ is the raw count of term $t$ in document $d$.

Inverse Document Frequency
$$\mathrm{idf}(t, D) =  \log \frac{N}{|\{d \in D: t \in d\}|}$$

where $d$ is a document in corpus $D$ and $N = |D|$ is the number of documents.

TF-IDF
$$\mathrm {tfidf} (t,d,D)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t,D)$$

In [16]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
fooTfmer = TfidfTransformer()

# Again, fit and transform
docs_tfidf = fooTfmer.fit_transform(docs_counts)

In [17]:
# TF-IDF values
docs_tfidf.toarray()

array([[0.        , 0.        , 0.86862987, 0.49546155, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.44036207, 0.        , 0.3349067 , 0.        ,
        0.44036207, 0.        , 0.3349067 , 0.44036207, 0.        ,
        0.        , 0.44036207],
       [0.34142622, 0.        , 0.        , 0.        , 0.34142622,
        0.        , 0.68285244, 0.25966344, 0.        , 0.34142622,
        0.34142622, 0.        ]])
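The numbers above do not come from the textbook formulas exactly as written. With its default settings, `TfidfTransformer` uses the raw count as tf, a smoothed idf of $\ln\frac{1+N}{1+\mathrm{df}(t)} + 1$, and then scales each document vector to unit L2 length. The following sketch reproduces the matrix by hand with NumPy, starting from the count matrix shown earlier.

```python
import numpy as np

# Count matrix from the three toy documents (rows: docs, columns: terms).
counts = np.array([[0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0],
                   [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1],
                   [1, 0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)      # number of docs containing each term

# Smoothed idf, as documented for TfidfTransformer(smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf                  # tf is the raw count here
row_norms = np.linalg.norm(weighted, axis=1, keepdims=True)
tfidf = weighted / row_norms             # L2-normalize each document

print(np.round(tfidf, 8))
```

For the first document this gives roughly 0.8686 for `hpc` and 0.4955 for `is`, in line with the transformer output.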

### And in pretty print

* raw counts have been weighted and each document vector scaled to unit (L2) length, 
* terms that are found across many docs are weighted down (in the second row, 'is' and 'practical' score lower than 'fine')

In [18]:
pd.DataFrame({i:docs_tfidf.toarray()[:,j] for i,j in fooVzer.vocabulary_.items()})

Unnamed: 0,hpc,is,oh,what,fine,practical,this,ain,over,till,it,truly
0,0.86863,0.495462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.334907,0.440362,0.440362,0.440362,0.334907,0.440362,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.259663,0.0,0.341426,0.682852,0.341426,0.341426,0.341426


In [19]:
# A list of new documents
newdocs = ["I have a hpc and a pc.", 
           "What a beautiful practical."]

# This time, no fitting needed: transform the new docs into count-vectorized form
# Unseen words ('pc', 'beautiful', 'have', etc.) are ignored
newdocs_counts = fooVzer.transform(newdocs)
newdocs_counts.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]])

In [20]:
pd.DataFrame({i:newdocs_counts.toarray()[:,j] for i,j in fooVzer.vocabulary_.items()})

Unnamed: 0,hpc,is,oh,what,fine,practical,this,ain,over,till,it,truly
0,1,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,1,0,0,0,0,0,0


In [21]:
# Again, transform using tfidf 
newdocs_tfidf = fooTfmer.transform(newdocs_counts)
newdocs_tfidf.toarray()

array([[0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.60534851, 0.        , 0.        ,
        0.        , 0.79596054]])

In [22]:
pd.DataFrame({i:newdocs_tfidf.toarray()[:,j] for i,j in fooVzer.vocabulary_.items()})

Unnamed: 0,hpc,is,oh,what,fine,practical,this,ain,over,till,it,truly
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.795961,0.0,0.605349,0.0,0.0,0.0,0.0,0.0,0.0
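The count-then-transform sequence can also be collapsed into a single object. As a sketch, `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer`, and with default settings it should yield the same values as the two-step route used above.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ['a HPC is a HPC is a HPC is a HPC.',
        'Oh, what a fine practical this is.',
        "A practical ain't over till it's truly over."]
newdocs = ["I have a hpc and a pc.",
           "What a beautiful practical."]

# Two-step route: counts first, then TF-IDF weighting.
count_vzer = CountVectorizer()
tfidf_tfmer = TfidfTransformer()
tfidf_tfmer.fit(count_vzer.fit_transform(docs))
two_step = tfidf_tfmer.transform(count_vzer.transform(newdocs))

# One-step route: TfidfVectorizer does both.
tfidf_vzer = TfidfVectorizer()
tfidf_vzer.fit(docs)
one_step = tfidf_vzer.transform(newdocs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```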


## Back to real data: movie reviews

In [23]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(movie.data, 
                                                          movie.target, 
                                                          test_size = 0.20, 
                                                          random_state = 12)

In [24]:
# initialize CountVectorizer
movieVzer= CountVectorizer(min_df=2, max_features=3000) # use top 3000 words only. 80.03% acc.
# movieVzer = CountVectorizer(min_df=2)                 # use all 21K words. Higher accuracy???

# fit and transform using training text 
docs_train_counts = movieVzer.fit_transform(docs_train)

In [25]:
# 'screen' is found in the corpus, mapped to index 2279
movieVzer.vocabulary_.get('screen')

2279

In [26]:
# Likewise, Mr. Steven Seagal is present...
movieVzer.vocabulary_.get('seagal')

2286

In [27]:
# huge dimensions! 1,600 documents, 3K unique terms. 
docs_train_counts.shape

(1600, 3000)

In [28]:
# Convert raw frequency counts into TF-IDF values
movieTfmer = TfidfTransformer()
docs_train_tfidf = movieTfmer.fit_transform(docs_train_counts)

In [29]:
# Same dimensions, now with tf-idf values instead of raw frequency counts
docs_train_tfidf.shape

(1600, 3000)

## The feature extraction functions and training data are ready.
* Vectorizer and transformer have been built from the training data
* Training data text was also turned into TF-IDF vector form

## Next up: test data
* You have to prepare the test data using the same feature extraction scheme.

In [30]:
# Using the fitted vectorizer and transformer, transform the test data
docs_test_counts = movieVzer.transform(docs_test)
docs_test_tfidf = movieTfmer.transform(docs_test_counts)
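One way to guarantee that training and test data pass through the same fitted steps is to chain them. The sketch below uses `make_pipeline` on a made-up toy corpus (the texts and variable names are illustrative, not the movie data): `fit_transform` is called on the training texts only, and the test texts are merely transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy texts standing in for docs_train / docs_test.
train_texts = ["a sharp and funny film",
               "a dull and sloppy film",
               "great fun from start to finish"]
test_texts = ["a funny but sloppy sequel"]

features = make_pipeline(CountVectorizer(), TfidfTransformer())

train_tfidf = features.fit_transform(train_texts)  # learn vocabulary and idf
test_tfidf = features.transform(test_texts)        # reuse them, no refitting

# Both matrices share the same columns; unseen test words are dropped.
print(train_tfidf.shape, test_tfidf.shape)
```

Fitting on the test data as well would leak information about it into the features, which is why only `transform` is used there.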

## Training and testing an SGDClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

In [31]:
# Now ready to build a classifier. 
# We will use SGDClassifier, a linear model trained with stochastic gradient descent
from sklearn.linear_model import SGDClassifier

In [32]:
# Train the classifier (with the default loss='hinge' this is a linear SVM). Again, we call it "fitting"
clf = SGDClassifier(random_state=42)
clf.fit(docs_train_tfidf, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [33]:
# Predict the test set results and compute accuracy
from sklearn.metrics import accuracy_score
y_pred = clf.predict(docs_test_tfidf)
accuracy_score(y_test, y_pred)

0.835

### Confusion Matrix

https://en.wikipedia.org/wiki/Confusion_matrix

<table class="wikitable" style="border:none; margin-top:0;">
<tbody><tr>
<th style="background:white; border:none;" colspan="2" rowspan="2">
</th>
<th colspan="3" style="background:none;">Actual class
</th></tr>
<tr>
<th>Cat
</th>
<th>Non-cat
</th></tr>
<tr>
<th rowspan="3" style="height:6em;"><div style="display: inline-block; -ms-transform: rotate(-90deg); -webkit-transform: rotate(-90deg); transform: rotate(-90deg);;">Predicted<br> class</div>
</th>
<th>Cat
</th>
<td>5 True Positives
</td>
<td>2 False Positives
</td></tr>
<tr>
<th>Non-cat
</th>
<td>3 False Negatives
</td>
<td>3 True Negatives
</td></tr>
</tbody></table>

In [34]:
# Making the Confusion Matrix
# Note: scikit-learn puts ACTUAL classes on rows and PREDICTED classes on columns,
# with labels in sorted order (0 = neg, 1 = pos).
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
pd.DataFrame(cm, index=['Actual Neg', 'Actual Pos'], columns=['Predict Neg', 'Predict Pos'])

Unnamed: 0,Predict Neg,Predict Pos
Actual Neg,167,39
Actual Pos,27,167
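Whatever the labeling, the diagonal of the matrix holds the correct predictions, so (167 + 167) / 400 = 0.835 agrees with the accuracy score. The sketch below uses a small made-up example (the label arrays are illustrative) to show the orientation `confusion_matrix` uses and how accuracy, precision, and recall follow from the four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = neg, 1 = pos
y_true = np.array([0, 0, 0, 1, 1])
y_hat = np.array([0, 1, 1, 1, 1])

cm = confusion_matrix(y_true, y_hat)
# cm[i, j] = number of samples whose true class is i and predicted class is j
print(cm)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
print(accuracy, precision, recall)
```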


> Note: We should use cross-validation to choose the hyper-parameters for our models
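A minimal sketch of what that could look like, assuming a `Pipeline` wrapped in `GridSearchCV`. The toy corpus, the parameter grid, and the choice of 3 folds are all illustrative assumptions; on the real movie data, `docs_train` and `y_train` would take the place of `texts` and `labels`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = ["a sharp and funny film", "great fun from start to finish",
         "an excellent and moving story", "wonderful acting and a clever plot",
         "a joy to watch", "brilliant and funny throughout",
         "a dull and sloppy film", "boring from start to finish",
         "a terrible and lazy story", "awful acting and a silly plot",
         "a chore to watch", "sloppy and boring throughout"]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means the vocabulary is refit on each training fold, so the held-out fold never influences the features.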

## Trying the classifier on fake movie reviews

In [35]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
               'Arnold was terrible', 'Arnold was excellent.', 
               'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
               "We can't wait for the sequel!!", 'I cannot recommend this highly enough', 
               'His performance was Oscar-worthy.', 'Steven Seagal was amazing',
               'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']

reviews_new_counts = movieVzer.transform(reviews_new)         # turn text into count vector
reviews_new_tfidf = movieTfmer.transform(reviews_new_counts)  # turn into tfidf vector

In [36]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [37]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie.target_names[category]))

'This movie was excellent' => pos
'Absolute joy ride' => pos
'Arnold was terrible' => neg
'Arnold was excellent.' => pos
'This was certainly a movie' => neg
'Two thumbs up' => neg
'I fell asleep halfway through' => neg
"We can't wait for the sequel!!" => neg
'I cannot recommend this highly enough' => neg
'His performance was Oscar-worthy.' => pos
'Steven Seagal was amazing' => neg
'instant classic.' => neg
'Steven Seagal was amazing. His performance was Oscar-worthy.' => neg
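Very short reviews give the classifier little to work with, especially when some of their words are missing from the fitted vocabulary. The helper below is a hypothetical diagnostic (the name `known_terms` and the toy vectorizer are assumptions, not part of the notebook) that lists which tokens of a text a fitted vectorizer actually keeps; passing `movieVzer` instead of `toy_vzer` would apply it to the reviews above.

```python
from sklearn.feature_extraction.text import CountVectorizer


def known_terms(vectorizer, text):
    """Return the tokens of `text` that exist in the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()
    return [tok for tok in analyzer(text) if tok in vectorizer.vocabulary_]


# Toy vectorizer fitted on two made-up sentences.
toy_vzer = CountVectorizer().fit(["this movie was excellent",
                                  "this movie was terrible"])

print(known_terms(toy_vzer, "Two thumbs up"))          # nothing survives
print(known_terms(toy_vzer, "Arnold was excellent."))  # only known words remain
```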


## Now your turn

Below we have a dataset of airline tweets.

1. Build a classifier to detect the sentiment of a tweet.
2. Find your accuracy and confusion matrix on a test set.

**Advanced:**

3. Use Cross-Validation to determine some of your hyper-parameters.
4. Compare a few different classifiers to see which performs best.

In [38]:
from sklearn.utils import Bunch

In [39]:
airline_tweets = pd.read_csv('./us_airlines/Tweets.csv')
airline_tweets = airline_tweets[airline_tweets.airline_sentiment_confidence > 0.5]
airline = Bunch(
    data = [i for i in airline_tweets.text],
    target_names = ['neg', 'neu', 'pos'],
    target_values = [-1,0,1],
    target = np.asarray([{'positive':1,'neutral':0,'negative':-1}[i] for i in airline_tweets.airline_sentiment])
)

In [40]:
airline.target, len(airline.target)

(array([ 0,  0, -1, ...,  0, -1,  0]), 14404)

In [41]:
airline.target_names, airline.target_values

(['neg', 'neu', 'pos'], [-1, 0, 1])
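Unlike the movie dataset, `target` here holds the values -1, 0, 1 rather than positions into `target_names`, so `target_names[label]` would mislabel tweets (an index of -1 picks the last name). A small sketch of a safer lookup, using a made-up three-tweet `Bunch` with the same fields:

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical miniature of the airline Bunch defined above.
toy = Bunch(
    data=["late again", "flight was fine", "great crew"],
    target_names=['neg', 'neu', 'pos'],
    target_values=[-1, 0, 1],
    target=np.asarray([-1, 0, 1]),
)

# Map each target value to its name explicitly instead of indexing.
value_to_name = dict(zip(toy.target_values, toy.target_names))
names = [value_to_name[int(v)] for v in toy.target]

print(names)
print(toy.target_names[-1])  # direct indexing with -1 gives 'pos', not 'neg'
```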

In [42]:
airline.data[1], airline.target[1]

("@VirginAmerica I didn't today... Must mean I need to take another trip!", 0)