# Case Study 3: Textual analysis of movie reviews

<img src="https://getthematic.com/wp-content/uploads/2018/03/Harris-Word-Cloud-e1522406279125.png">

**TEAM Members:** Please EDIT this cell and add the names of all the team members in your team

    Mago Sheehy
    
    Brian Phillips
    
    ...

## Problem 1: Sentiment Analysis on movie reviews

* Installing scikit-learn using Anaconda does not necessarily download the example source-code.
* Accordingly, you may need to download these directly from GitHub at https://github.com/scikit-learn/scikit-learn:
    * The data can be downloaded using doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py
    * A skeleton for the solution can be found in doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py
    * A completed solution can be found in doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py
* **It is ok to use the solution provided in the scikit-learn distribution as a starting place for your work.**


In [2]:
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

import numpy as np
import twitter
import json
import matplotlib.pyplot as plt
from collections import Counter
from urllib.parse import unquote

In [3]:
# Load the movie-review data from the local 'txt_sentoken' folder (one sub-folder per class)
dataset = load_files('txt_sentoken', shuffle=False)
print("n_samples: %d" % len(dataset.data))

# split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=None)

n_samples: 2000
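
Because `random_state=None` is passed to `train_test_split` above, every run draws a different 75/25 split, so the scores and confusion matrices reported below can shift from run to run. The following is a minimal sketch, on made-up toy data and with an arbitrary seed of 42 (both are assumptions for illustration, not part of the original analysis), of how fixing the seed makes the split repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for dataset.data / dataset.target
toy_items = list(range(20))
toy_labels = [i % 2 for i in toy_items]

# Same seed -> same split on every call (42 is an arbitrary choice)
split_a = train_test_split(toy_items, toy_labels, test_size=0.25, random_state=42)
split_b = train_test_split(toy_items, toy_labels, test_size=0.25, random_state=42)

same_split = split_a == split_b
print("identical splits:", same_split)
print("test set size:", len(split_a[1]))
```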


In [4]:
# TASK: Build a vectorizer / classifier pipeline that filters out tokens
# that are too rare or too frequent

pipeline = Pipeline([
    ('vect', CountVectorizer(min_df = 3, max_df = 0.95)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
])

In [5]:
# TASK: Build a grid search to find out whether unigrams or bigrams are
# more useful.
# Fit the pipeline on the training set using grid search for the parameters

parameters = {
    'vect__ngram_range': [(1, 1), (2, 2)],
}
grid_search = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=-1)
grid_search.fit(docs_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(max_df=0.95, min_df=3)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf', LinearSVC())]),
             n_jobs=-1, param_grid={'vect__ngram_range': [(1, 1), (2, 2)]})

In [6]:
# TASK: print the cross-validated scores for each parameter set
# explored by the grid search

n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
    print(i + 1, 'word: mean - %0.2f; std - %0.2f' % (grid_search.cv_results_['mean_test_score'][i], grid_search.cv_results_['std_test_score'][i]))

1 word: mean - 0.84; std - 0.02
2 word: mean - 0.81; std - 0.02
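
The grid above compares pure unigrams `(1, 1)` with pure bigrams `(2, 2)`. A third option, `(1, 2)`, uses both together. The sketch below shows how such a grid could be set up and how each score can be printed next to its parameter setting. It runs on a tiny invented corpus (`toy_docs`, `toy_labels`, `toy_pipeline`, and `toy_grid` are hypothetical names), so its scores say nothing about the movie-review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four positive and four negative snippets
toy_docs = [
    "a great and moving film", "great acting and a good plot",
    "good fun really great", "a moving story with a good cast",
    "a dull and boring film", "boring plot and bad acting",
    "bad pacing really dull", "a boring story with a bad cast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# Unigrams only, unigrams plus bigrams, bigrams only
toy_parameters = {'vect__ngram_range': [(1, 1), (1, 2), (2, 2)]}
toy_grid = GridSearchCV(toy_pipeline, toy_parameters, cv=2)
toy_grid.fit(toy_docs, toy_labels)

# Pair each parameter setting with its cross-validated mean and std
results = toy_grid.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(params, 'mean - %0.2f; std - %0.2f' % (mean, std))
```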


In [7]:
# TASK: Predict the outcome on the testing set and store it in a variable
# named y_predicted

y_predicted = grid_search.predict(docs_test)

In [8]:
# Print the classification report
print(metrics.classification_report(y_test, y_predicted, target_names=dataset.target_names))

              precision    recall  f1-score   support

         neg       0.87      0.82      0.84       237
         pos       0.85      0.89      0.87       263

    accuracy                           0.86       500
   macro avg       0.86      0.85      0.86       500
weighted avg       0.86      0.86      0.86       500



In [9]:
# Print the confusion matrix (rows: true class, columns: predicted class)
print(metrics.confusion_matrix(y_test, y_predicted))

[[195  42]
 [ 30 233]]
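
The confusion matrix and the classification report describe the same predictions. As a worked check, the sketch below recomputes precision, recall, and accuracy from the four counts printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, in the order `neg`, `pos`.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
cm = np.array([[195, 42],
               [30, 233]])

# Precision: correct predictions of a class / all predictions of that class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)
# Accuracy: all correct predictions / all samples
accuracy = np.trace(cm) / cm.sum()

for name, p, r in zip(["neg", "pos"], precision, recall):
    print("%s: precision %0.2f, recall %0.2f" % (name, p, r))
print("accuracy %0.3f" % accuracy)
```

The rounded values agree with the classification report: 0.87 / 0.82 for `neg`, 0.85 / 0.89 for `pos`, and an accuracy of 0.856.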


## Problem 2: Explore the scikit-learn TfidfVectorizer class

**Read the documentation for the TfidfVectorizer class at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.**
* Define the term frequency–inverse document frequency (TF-IDF) statistic (http://en.wikipedia.org/wiki/Tf%E2%80%93idf will likely help).
* Run the TfidfVectorizer class on the training data above (docs_train).
* Explore the min_df and max_df parameters of TfidfVectorizer. What do they mean? How do they change the features you get?
* Explore the ngram_range parameter of TfidfVectorizer. What does it mean? How does it change the features you get? (Note: large values of ngram_range may take a long time to run!)
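
As a starting point for the definition: TF-IDF multiplies how often a term occurs in a document, `tf(t, d)`, by a weight `idf(t)` that shrinks as the term appears in more documents. With its default settings, scikit-learn uses the smoothed form `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing `t`, and then scales each document vector to unit L2 length. The sketch below recomputes this by hand on a small invented corpus (`toy_docs` is hypothetical) and compares the result with `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not the movie-review data
toy_docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]

# Raw term counts tf(t, d): one row per document, one column per term
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Document frequency df(t): number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then scale every row to unit L2 norm (default norm='l2')
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches_sklearn = np.allclose(manual, reference)
print("manual computation matches TfidfVectorizer:", matches_sklearn)
```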

In [10]:
vectorized = TfidfVectorizer(ngram_range = (1, 1)).fit_transform(docs_train)
print(vectorized.shape)

(1500, 35302)
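
The cell above only uses the default settings. One way to see what `min_df`, `max_df`, and `ngram_range` do is to inspect the vocabulary they produce on a corpus small enough to check by hand. The sketch below uses four invented sentences (`toy_docs` and the helper `vocab` are hypothetical); the same calls could be repeated on `docs_train` to watch the feature count change.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'the' and 'was' occur in all four documents,
# 'plot' and 'good' in two, every other word in one
toy_docs = [
    "the plot was good",
    "the plot was bad",
    "the acting was good",
    "the ending was dull",
]

def vocab(**kwargs):
    """Return the sorted vocabulary learned with the given vectorizer settings."""
    return sorted(TfidfVectorizer(**kwargs).fit(toy_docs).vocabulary_)

all_terms = vocab()                      # every word of two or more characters
frequent_only = vocab(min_df=2)          # keep terms found in at least 2 documents
no_ubiquitous = vocab(max_df=0.75)       # drop terms found in more than 75% of documents
bigrams = vocab(ngram_range=(2, 2))      # features are pairs of adjacent words
uni_and_bi = vocab(ngram_range=(1, 2))   # single words and pairs together

print(len(all_terms), all_terms)
print(len(frequent_only), frequent_only)
print(len(no_ubiquitous), no_ubiquitous)
print(len(bigrams), bigrams)
print(len(uni_and_bi))
```

An integer `min_df` or `max_df` is read as a document count, while a float is read as a fraction of the corpus. Widening `ngram_range` grows the vocabulary quickly, which is why large values are slow on the full dataset.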


## Problem 3: Machine learning algorithms


* Based upon Problem 2 pick some parameters for TfidfVectorizer
    * "fit" your TfidfVectorizer using docs_train
    * Compute "Xtrain", a Tf-idf-weighted document-term matrix using the transform function on docs_train
    * Compute "Xtest", a Tf-idf-weighted document-term matrix using the transform function on docs_test
    * Note, be sure to use the same Tf-idf-weighted class (**"fit" using docs_train**) to transform **both** docs_test and docs_train
* Examine two classifiers provided by scikit-learn 
    * LinearSVC
    * KNeighborsClassifier
    * Try a number of different parameter settings for each and judge your performance using a confusion matrix (see Problem 1 for an example).
* Does one classifier, or one set of parameters work better?
    * Why do you think it might be working better?
* For a particular choice of parameters and classifier, look at 2 examples where the prediction was incorrect.
    * Can you conjecture on why the classifier made a mistake for this prediction?

In [11]:
vectorized = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtrain = vectorized.transform(docs_train)
Xtest = vectorized.transform(docs_test)
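
The reason for fitting on `docs_train` only and then transforming both sets is that the fitted vocabulary fixes the columns of the matrix. A small sketch on invented text (`toy_train`, `toy_test`, and `toy_vectorizer` are hypothetical names) shows that both matrices end up with the same columns and that words never seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["good plot", "bad plot"]
toy_test = ["good soundtrack"]  # 'soundtrack' does not occur in the training text

# Fit once on the training text, then reuse the same fitted object for both sets
toy_vectorizer = TfidfVectorizer().fit(toy_train)
toy_Xtrain = toy_vectorizer.transform(toy_train)
toy_Xtest = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))   # columns come from the training text only
print(toy_Xtrain.shape, toy_Xtest.shape)    # same number of columns in both matrices
print(toy_Xtest.nnz)                        # only 'good' contributes a non-zero entry
```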

In [13]:
#linear = LinearSVC().fit(Xtrain, y_train)
#y_predicted = linear.predict(Xtest)
#print("LinearSVC with default parameters:\n", metrics.confusion_matrix(y_test, y_predicted))

#linear = LinearSVC(C = 1000).fit(Xtrain, y_train)
#y_predicted = linear.predict(Xtest)
#print("\nLinearSVC with C = 1000:\n", metrics.confusion_matrix(y_test, y_predicted))

pipeline = Pipeline([
    ('clf', LinearSVC())
])

parameters = {
    'clf__C': [1, 5, 10, 50, 100, 500],
    'clf__fit_intercept' : [True, False],
}
grid_search = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=-1)
grid_search.fit(Xtrain, y_train)
n_candidates = len(grid_search.cv_results_['params'])

for i in range(n_candidates):
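    # Refit this candidate on the full training set and evaluate it on the held-out test set.
    # Note: dual=False is set here, while the grid search above ran LinearSVC with its default `dual` setting.
    candidate = LinearSVC(
        dual = False,
        fit_intercept = grid_search.cv_results_['param_clf__fit_intercept'][i],
        C = grid_search.cv_results_['param_clf__C'][i],
    )
    y_predicted = candidate.fit(Xtrain, y_train).predict(Xtest)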
    print("\nLinearSVC with", grid_search.cv_results_['params'][i], "\n", metrics.confusion_matrix(y_test, y_predicted))
    print(grid_search.cv_results_['mean_test_score'][i])

print("\n Best parameters for LinearSVC: ", grid_search.best_params_)
print("\n Mean with best parameters: ", grid_search.best_score_)


LinearSVC with {'clf__C': 1, 'clf__fit_intercept': True} 
 [[206  38]
 [ 33 223]]
0.8253333333333334

LinearSVC with {'clf__C': 1, 'clf__fit_intercept': False} 
 [[207  37]
 [ 30 226]]
0.828

LinearSVC with {'clf__C': 5, 'clf__fit_intercept': True} 
 [[206  38]
 [ 30 226]]
0.8166666666666667

LinearSVC with {'clf__C': 5, 'clf__fit_intercept': False} 
 [[206  38]
 [ 30 226]]
0.8246666666666667

LinearSVC with {'clf__C': 10, 'clf__fit_intercept': True} 
 [[206  38]
 [ 31 225]]
0.818

LinearSVC with {'clf__C': 10, 'clf__fit_intercept': False} 
 [[207  37]
 [ 30 226]]
0.82

LinearSVC with {'clf__C': 50, 'clf__fit_intercept': True} 
 [[206  38]
 [ 32 224]]
0.8200000000000001

LinearSVC with {'clf__C': 50, 'clf__fit_intercept': False} 
 [[207  37]
 [ 31 225]]
0.82

LinearSVC with {'clf__C': 100, 'clf__fit_intercept': True} 
 [[206  38]
 [ 32 224]]
0.8200000000000001

LinearSVC with {'clf__C': 100, 'clf__fit_intercept': False} 
 [[207  37]
 [ 31 225]]
0.82

LinearSVC with {'clf__C': 500, 'cl

In [14]:
pipeline = Pipeline([
    ('clf', KNeighborsClassifier())
])

parameters = {
    'clf__n_neighbors': [1, 2, 3, 5, 10, 50],
    'clf__weights' : ['uniform', 'distance'],
}
grid_search = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=-1)
grid_search.fit(Xtrain, y_train)
n_candidates = len(grid_search.cv_results_['params'])

for i in range(n_candidates):
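    # Refit this candidate on the full training set and evaluate it on the held-out test set
    candidate = KNeighborsClassifier(
        n_neighbors = grid_search.cv_results_['param_clf__n_neighbors'][i],
        weights = grid_search.cv_results_['param_clf__weights'][i],
    )
    y_predicted = candidate.fit(Xtrain, y_train).predict(Xtest)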
    print("\nKNeighborsClassifier with", grid_search.cv_results_['params'][i], "\n", metrics.confusion_matrix(y_test, y_predicted))
    print(grid_search.cv_results_['mean_test_score'][i])

print("\n Best parameters for KNeighborsClassifier: ", grid_search.best_params_)
print("\n Score with best parameters: ", grid_search.best_score_)


KNeighborsClassifier with {'clf__n_neighbors': 1, 'clf__weights': 'uniform'} 
 [[106 138]
 [ 43 213]]
0.6406666666666666

KNeighborsClassifier with {'clf__n_neighbors': 1, 'clf__weights': 'distance'} 
 [[106 138]
 [ 43 213]]
0.6406666666666666

KNeighborsClassifier with {'clf__n_neighbors': 2, 'clf__weights': 'uniform'} 
 [[143 101]
 [ 71 185]]
0.6346666666666667

KNeighborsClassifier with {'clf__n_neighbors': 2, 'clf__weights': 'distance'} 
 [[106 138]
 [ 43 213]]
0.6406666666666666

KNeighborsClassifier with {'clf__n_neighbors': 3, 'clf__weights': 'uniform'} 
 [[ 79 165]
 [ 17 239]]
0.6013333333333334

KNeighborsClassifier with {'clf__n_neighbors': 3, 'clf__weights': 'distance'} 
 [[ 81 163]
 [ 17 239]]
0.6026666666666667

KNeighborsClassifier with {'clf__n_neighbors': 5, 'clf__weights': 'uniform'} 
 [[ 57 187]
 [ 17 239]]
0.5559999999999999

KNeighborsClassifier with {'clf__n_neighbors': 5, 'clf__weights': 'distance'} 
 [[ 59 185]
 [ 17 239]]
0.5573333333333333

KNeighborsClassifie

In [15]:
# Code to print the predicted and actual values of the first 100 reviews to help find 2 that were wrong 
# documents 1 and 5 were selected because they were the first false positive and false negative respectively

#y_predicted = LinearSVC(penalty = 'l2', C = 10).fit(Xtrain, y_train).predict(Xtest)
#for i in range(100):
#    print(i, "Predicted: ", y_predicted[i], "Actual: ", y_test[i])

print("False positive:")
print(docs_test[1])
# I believe that the classifier falsely assumed this review was positive because it contained many words that 
# one would expect a positive review to have and only a small portion of the review is actually critical.
# Some of the positive words it contains include rewarding, grounded, emotional, interesting, elevate, best, 
# compelling, and artistic

print("\nFalse negative")
print(docs_test[5])
# I believe the classifier misidentified this review because it contains a significant amount of negative
# language and there is a section at the end where the reviewer lists some of the movie's flaws.
# Some of the negative words it contains include long, dull, trimmer, unfortunately, cliched, regret, and bad

False positive:
b'it is an understood passion and an understood calm . \nbud white walks into the home of lynn bracken , a prostitute " cut to look like veronica lake . " \nhe\'s one of l . a . \'s finest investigating the murder of fellow cop , and one of the leads takes him to her home . \nit\'s understood that he is quiet thunder , a guy who\'s calm voice is more powerful than his arms . \nit\'s understood that she\'s supposed to be beautiful , but underneath her face is pain and scraped out lines that say her life could have been so much more . \nyou know without having to be told . \n " you\'re the first guy who hasn\'t told me i look like veronicca lake in under a minute , " she says . \n " you look ten times better . " \nhe says it without thinking . \nlike he knows without her having to say anything . \nwhite\'s face doesn\'t turn , it doesn\'t blush . \nyou see his eyes , and you believe him . \nit\'s a perfect moment in a near perfect movie . \nl . a . confidential is the bes


## Problem 4: Business question

* How could NLP be used to generate a Data Science oriented product?

* For example, can you take the machine learning algorithm above, which was trained on movie review data, and test it on your Twitter data?

* Does this provide a way to tell whether Tweets about a product are either positive or negative!?
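
One way to approach these questions is to wrap the "share of texts predicted positive" calculation in a small function and to check how much of each new text the model can actually see. The sketch below does this with a toy model trained on four invented snippets (`percent_positive`, `toy_reviews`, `toy_tweets`, and the other `toy_` names are hypothetical). It illustrates the mechanics only and is not a validated sentiment measure: a vectorizer fitted on movie reviews ignores every word it did not see in training, so short tweets about products may carry very little usable signal.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def percent_positive(texts, vectorizer, classifier):
    """Percentage of texts that the classifier assigns to class 1 (positive)."""
    if len(texts) == 0:
        return float("nan")  # avoid dividing by zero when nothing was collected
    predictions = classifier.predict(vectorizer.transform(texts))
    return 100.0 * np.count_nonzero(predictions == 1) / len(texts)

# Hypothetical toy training data standing in for the movie reviews
toy_reviews = ["a great and moving film", "great acting and a good plot",
               "a dull and boring film", "boring plot and bad acting"]
toy_targets = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer().fit(toy_reviews)
toy_classifier = LinearSVC().fit(toy_vectorizer.transform(toy_reviews), toy_targets)

# Hypothetical short texts standing in for tweets
toy_tweets = ["great phone", "boring update", "battery arrived today"]
share = percent_positive(toy_tweets, toy_vectorizer, toy_classifier)

# Texts with no word from the training vocabulary become all-zero rows,
# so their prediction does not depend on their content at all
empty_rows = int((toy_vectorizer.transform(toy_tweets).getnnz(axis=1) == 0).sum())

print("percent predicted positive:", round(share, 1))
print("texts with no known words:", empty_rows, "of", len(toy_tweets))
```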

In [12]:
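import os

# Read the Twitter API credentials from environment variables instead of hard-coding secrets in the notebook.
# The variable names below are placeholders; set them in your own environment before running this cell.
CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
OAUTH_TOKEN = os.environ["TWITTER_OAUTH_TOKEN"]
OAUTH_TOKEN_SECRET = os.environ["TWITTER_OAUTH_TOKEN_SECRET"]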

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)
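
The search cells that follow page through results by turning the `next_results` query string into keyword arguments, splitting it by hand on `&` and `=`. A possible alternative, sketched here with a hypothetical helper name `next_results_to_kwargs`, is to let the standard library's `urllib.parse.parse_qsl` do the splitting and percent-decoding. The example string is the one quoted in the comments of those cells.

```python
from urllib.parse import parse_qsl

def next_results_to_kwargs(next_results):
    """Turn a '?key=value&key=value' query string into a dict of decoded parameters."""
    return dict(parse_qsl(next_results.lstrip("?")))

example = "?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1"
example_kwargs = next_results_to_kwargs(example)
print(example_kwargs)
```

Because each pair is split before it is decoded, an encoded `&` or `=` inside a search term does not break the parsing.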

In [13]:
tweets = []

q = 'Samsung Galaxy S'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 1000:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predictor = LinearSVC(penalty = 'l2', C = 10).fit(Xtrain, y_train)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Samsung Galaxy S are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Samsung Galaxy S are 15.8 % positive


In [19]:
tweets = []

q = 'Samsung Fridge'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 1000:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Samsung Fridge are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Samsung Fridge are 14.826 % positive


In [20]:
tweets = []

q = 'Samsung Buds Pro'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 1000:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about the Samsung Buds Pro are ", np.count_nonzero(y_predicted == 1)/len(tweets)*100, "% positive")

Tweets about the Samsung Buds Pro are  38.31775700934579 % positive


In [21]:
tweets = []

q = 'Samsung Galaxy Watch'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 1000:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)

print("Tweets about the Samsung Galaxy Watch are ", np.count_nonzero(y_predicted == 1)/len(tweets)*100, "% positive")

Tweets about the Samsung Galaxy Watch are  39.22924901185771 % positive


In [14]:
tweets = []

q = 'Chromebook'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 1000:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Chromebooks are ", np.count_nonzero(y_predicted == 1)/len(tweets)*100, "% positive")

Tweets about Chromebooks are  8.649173955296405 % positive


In [15]:
tweets = []

q = 'Samsung'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 1000:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Samsung are ", np.count_nonzero(y_predicted == 1)/len(tweets)*100, "% positive")

Tweets about Samsung are  12.574257425742575 % positive
