# Case Study 4: Deep Learning

**TEAM Members:** Please EDIT this cell and add the names of all the team members in your team

    Mago Sheehy
    
    Brian Phillips
    
    ...

## Problem 1: Load in the movie review data, create TF-IDF features, and use two of your favorite classification algorithms from scikit-learn for predicting sentiment

* This problem is, basically, already answered as part of case study 3, so it is fine to use your work from there to help answer this question.

In [1]:
# Importing useful libraries and classes
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

import timeit
import numpy as np
import twitter
import json
import matplotlib.pyplot as plt
from collections import Counter
from urllib.parse import unquote

In [2]:
# Load the movie reviews from the 'txt_sentoken' folder (one subfolder per class)
dataset = load_files('txt_sentoken', shuffle=False)

# Split the dataset into training and test sets:
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=None)

# Turning the training and testing docs into TF-IDF features (the vectorizer is fit on the training docs only)
vectorized = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtrain = vectorized.transform(docs_train)
Xtest = vectorized.transform(docs_test)

In [3]:
# Fitting a LinearSVC model to the training data and predicting the labels of the testing data
y_predicted = LinearSVC(penalty = 'l2', C = 10).fit(Xtrain, y_train).predict(Xtest)

# Printing a report on the model's accuracy
print("LinearSVC:\n", metrics.classification_report(y_test, y_predicted, target_names=dataset.target_names))


# Fitting a KNeighborsClassifier model to the training data and predicting the labels of the testing data
y_predicted = KNeighborsClassifier(n_neighbors = 3).fit(Xtrain, y_train).predict(Xtest)

# Printing a report on the model's accuracy
print("KNeighborsClassifier:\n", metrics.classification_report(y_test, y_predicted, target_names=dataset.target_names))

LinearSVC:
               precision    recall  f1-score   support

         neg       0.89      0.84      0.87       254
         pos       0.85      0.89      0.87       246

    accuracy                           0.87       500
   macro avg       0.87      0.87      0.87       500
weighted avg       0.87      0.87      0.87       500

KNeighborsClassifier:
               precision    recall  f1-score   support

         neg       0.76      0.32      0.45       254
         pos       0.56      0.89      0.69       246

    accuracy                           0.60       500
   macro avg       0.66      0.61      0.57       500
weighted avg       0.66      0.60      0.57       500
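
The cells above fit the TF-IDF vectorizer once and then pass the resulting matrix to each classifier. A common alternative is to put the vectorizer and the classifier in one `Pipeline`, so that any cross-validation refits the vectorizer on each training fold only. The sketch below illustrates that pattern under stated assumptions: the eight short reviews are made-up stand-ins for the `txt_sentoken` data, and `cv=2` is chosen only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews standing in for the movie review data (1 = pos, 0 = neg).
docs = [
    "a wonderful, moving film",
    "great acting and a great script",
    "truly enjoyable from start to finish",
    "a delight, I loved it",
    "a dull, lifeless mess",
    "terrible acting and a weak script",
    "boring from start to finish",
    "awful, I hated it",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it is refit inside every CV fold.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
    ("clf", LinearSVC(penalty="l2", C=10)),
])

# Two stratified folds; each fold's score is the accuracy on its held-out docs.
scores = cross_val_score(text_clf, docs, labels, cv=2)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

The same pipeline object can be handed to `GridSearchCV`, with parameter names prefixed by the step name (for example `clf__C`).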



## Problem 2: Use a Multi-Layer Perceptron (MLP) for classifying the reviews.  Explore the parameters for the MLP and compare the accuracies against your baseline algorithms in Problem 1.

**Read the documentation for the MLPClassifier class at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.** 
* Try different values for "hidden_layer_sizes".  What do you observe in terms of accuracy?
* Try different values for "activation". What do you observe in terms of accuracy?
* Try different values for "solver". What do you observe in terms of accuracy?

In [4]:
# Creating the pipeline
pipeline = Pipeline([
    ('clf', MLPClassifier())
])

# Creating a set of parameters and running MLPClassifier with each of them, using
# 5-fold cross-validation on the training data. Each parameter is adjusted
# individually to avoid a huge number of computations
parameters = {
    'clf__hidden_layer_sizes': [(200,), (100,), (100, 2), (50, 2), (10, 5), (5, 20)],
}
print('Performing MLPClassifier with different hidden_layer_sizes\n')
grid_searchH = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=-1)
grid_searchH.fit(Xtrain, y_train)
n_candidatesH = len(grid_searchH.cv_results_['params'])

parameters = {
    'clf__activation' : ['identity', 'tanh', 'logistic', 'relu'],
}
print('Performing MLPClassifier with different activation\n')
grid_searchA = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=-1)
grid_searchA.fit(Xtrain, y_train)
n_candidatesA = len(grid_searchA.cv_results_['params'])


parameters = {
    'clf__solver' : ['lbfgs', 'sgd', 'adam'],
}
print('Performing MLPClassifier with different solver\n')
grid_searchS = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=-1)
grid_searchS.fit(Xtrain, y_train)
n_candidatesS = len(grid_searchS.cv_results_['params'])

print("Done")

Performing MLPClassifier with different hidden_layer_sizes

Performing MLPClassifier with different activation

Performing MLPClassifier with different solver

Done


In [5]:
for i in range(n_candidatesH):
    print("\nMLPClassifier with", grid_searchH.cv_results_['params'][i], "\n")
    print(grid_searchH.cv_results_['mean_test_score'][i])
    
# Note that hidden_layer_sizes = (100, 2) means a first hidden layer of 100 nodes
# followed by a second hidden layer of 2 nodes, not two layers of 100 nodes.
# One hidden layer of 200 nodes and one hidden layer of 100 nodes had very similar
# accuracies (0.825 vs. 0.827). Adding a second hidden layer of only 2 nodes was
# much less accurate: (100, 2) scored 0.643 and (50, 2) scored 0.555. The small
# two-layer networks (10, 5) and (5, 20) scored 0.820 and 0.827, about the same as
# the single wide layers. We avoided networks with high node counts and layer counts
# to be nice to our computers and to limit overfitting. Because the gap between 100
# and 200 nodes is so small, both should be checked on new data before preferring
# either one.


MLPClassifier with {'clf__hidden_layer_sizes': (200,)} 

0.8246666666666667

MLPClassifier with {'clf__hidden_layer_sizes': (100,)} 

0.8266666666666668

MLPClassifier with {'clf__hidden_layer_sizes': (100, 2)} 

0.6433333333333333

MLPClassifier with {'clf__hidden_layer_sizes': (50, 2)} 

0.5553333333333335

MLPClassifier with {'clf__hidden_layer_sizes': (10, 5)} 

0.82

MLPClassifier with {'clf__hidden_layer_sizes': (5, 20)} 

0.8273333333333334
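
To make the meaning of a `hidden_layer_sizes` tuple concrete, the sketch below fits a small network and prints the shape of each weight matrix. It assumes synthetic random data with 6 features (not the review data) and a short `max_iter`, since only the architecture matters here.

```python
import warnings

import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem: 40 samples, 6 features (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(40, 6)
y = (X[:, 0] > 0.5).astype(int)

# (100, 2) = first hidden layer with 100 nodes, second hidden layer with 2 nodes.
clf = MLPClassifier(hidden_layer_sizes=(100, 2), max_iter=50, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # convergence is not the point of this sketch
    clf.fit(X, y)

# One weight matrix per connection between consecutive layers.
layer_shapes = [w.shape for w in clf.coefs_]
print(layer_shapes)
```

Each entry of the tuple is the width of one hidden layer, so the number of entries is the number of hidden layers. Two hidden layers of 100 nodes each would be written `(100, 100)`.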


In [6]:
for i in range(n_candidatesA):
    print("\nMLPClassifier with", grid_searchA.cv_results_['params'][i], "\n")
    print(grid_searchA.cv_results_['mean_test_score'][i])
    
# The 'identity', 'tanh', 'logistic', and 'relu' activation functions all
# produce very similar accuracies (when using the default 100 nodes and one layer).
# This is notable because it means that the problem can be represented fairly
# accurately as a linear problem (the identity activation still makes good predictions).
# The logistic function performed the best (0.835) and the identity function the
# worst (0.827), but the spread is under one percentage point.


MLPClassifier with {'clf__activation': 'identity'} 

0.8273333333333334

MLPClassifier with {'clf__activation': 'tanh'} 

0.8313333333333333

MLPClassifier with {'clf__activation': 'logistic'} 

0.8346666666666666

MLPClassifier with {'clf__activation': 'relu'} 

0.828


In [7]:
for i in range(n_candidatesS):
    print("\nMLPClassifier with", grid_searchS.cv_results_['params'][i], "\n")
    print(grid_searchS.cv_results_['mean_test_score'][i])
    
# The 'lbfgs' and 'adam' solvers had similar accuracies (0.840 and 0.826), with
# 'lbfgs' slightly ahead. The 'sgd' solver ran but scored 0.498, which is no better
# than guessing, when the rest of the parameters were left at their default values.


MLPClassifier with {'clf__solver': 'lbfgs'} 

0.8400000000000001

MLPClassifier with {'clf__solver': 'sgd'} 

0.49800000000000005

MLPClassifier with {'clf__solver': 'adam'} 

0.826
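
A score near 0.5 on a two-class problem, like the one reported for `'sgd'`, is often a sign that the optimizer stopped before it made real progress, although the cells above do not confirm the cause. One way to check is to look at the fitted model's `n_iter_` and `loss_curve_` and to watch for a `ConvergenceWarning`. The sketch below assumes synthetic data and a deliberately small `max_iter` so that it runs quickly.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem (illustrative only, not the review data).
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

max_iter = 20  # deliberately small for the sketch
clf = MLPClassifier(solver="sgd", max_iter=max_iter, random_state=0)

# Record warnings so we can tell whether training stopped at the iteration limit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X, y)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations run:", clf.n_iter_)
print("stopped at max_iter without converging:", hit_limit)
print("first and last training loss:", clf.loss_curve_[0], clf.loss_curve_[-1])
```

If the loss has barely moved between the first and last iteration, raising `max_iter` or `learning_rate_init` are the usual first things to try.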


In [8]:
# Running the MLPClassifier with every combination of top 2 parameters for
# hidden_layer_sizes, activation, and solver.
parameters = {
    'clf__hidden_layer_sizes': [(200,), (100,)],
    'clf__activation' : ['identity', 'logistic'],
    'clf__solver' : ['lbfgs', 'adam'],
}
print("Performing MLPClassifier with every combination of top 2 parameters for hidden_layer_sizes, activation, and solver:\n")
grid_search = GridSearchCV(pipeline, parameters, cv = 5, n_jobs=1)
grid_search.fit(Xtrain, y_train)
n_candidates = len(grid_search.cv_results_['params'])
print("Done")

Performing MLPClassifier with every combination of top 2 parameters for hidden_layer_sizes, activation, and solver:

Done


In [9]:
for i in range(n_candidates):
    print("\nMLPClassifier with", grid_search.cv_results_['params'][i], "\n")
    print(grid_search.cv_results_['mean_test_score'][i])

print("\n Best parameters for MLPClassifier: ", grid_search.best_params_)
print("\n Score with best parameters: ", grid_search.best_score_)

# In the run shown below, the highest accuracy (0.837) came from:
# {'clf__activation': 'identity', 'clf__hidden_layer_sizes': (200,), 'clf__solver': 'lbfgs'}
# and the lowest accuracy (0.720) came from:
# {'clf__activation': 'logistic', 'clf__hidden_layer_sizes': (200,), 'clf__solver': 'lbfgs'}
# Every other combination scored between 0.829 and 0.836, so the ranking among
# them can change from run to run.


MLPClassifier with {'clf__activation': 'identity', 'clf__hidden_layer_sizes': (200,), 'clf__solver': 'lbfgs'} 

0.8373333333333333

MLPClassifier with {'clf__activation': 'identity', 'clf__hidden_layer_sizes': (200,), 'clf__solver': 'adam'} 

0.8313333333333335

MLPClassifier with {'clf__activation': 'identity', 'clf__hidden_layer_sizes': (100,), 'clf__solver': 'lbfgs'} 

0.836

MLPClassifier with {'clf__activation': 'identity', 'clf__hidden_layer_sizes': (100,), 'clf__solver': 'adam'} 

0.8286666666666666

MLPClassifier with {'clf__activation': 'logistic', 'clf__hidden_layer_sizes': (200,), 'clf__solver': 'lbfgs'} 

0.72

MLPClassifier with {'clf__activation': 'logistic', 'clf__hidden_layer_sizes': (200,), 'clf__solver': 'adam'} 

0.8333333333333334

MLPClassifier with {'clf__activation': 'logistic', 'clf__hidden_layer_sizes': (100,), 'clf__solver': 'lbfgs'} 

0.836

MLPClassifier with {'clf__activation': 'logistic', 'clf__hidden_layer_sizes': (100,), 'clf__solver': 'adam'} 

0.834
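
When several candidates score within a point of each other, it helps to print them ranked and with the spread across folds, both of which `cv_results_` already holds. The sketch below assumes a synthetic dataset and uses `LogisticRegression` as a fast stand-in for `MLPClassifier`; the `C` grid is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small, hypothetical grid (stand-ins for the MLP search).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)

results = search.cv_results_
# rank_test_score is 1 for the best candidate; a stable sort keeps ties in grid order.
order = np.argsort(results["rank_test_score"], kind="stable")
for idx in order:
    mean = results["mean_test_score"][idx]
    std = results["std_test_score"][idx]
    print(results["params"][idx], f"{mean:.3f} +/- {std:.3f}")
```

If the gap between two candidates is smaller than their fold-to-fold standard deviation, the data does not clearly favor either one.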



## Problem 3: Accuracy is not everything!  How fast are the algorithms versus their accuracy?
**Compare the runtime of your baseline algorithms to the runtime of the MLPClassifier**

**The Jupyter command %timeit can be used to measure how long a calculation takes https://ipython.readthedocs.io/en/stable/interactive/magics.html.**
* Try different values for "hidden_layer_sizes".  What do you observe in terms of runtime?
* Try different values for "activation". What do you observe in terms of runtime?
* Try different values for "solver". What do you observe in terms of runtime?
* How long does the "fit" function take as opposed to the "predict" function?  Can you explain why?

In [10]:
setup = """import sys;
from sklearn.feature_extraction.text import CountVectorizer;
from sklearn.feature_extraction.text import TfidfTransformer;
from sklearn.feature_extraction.text import TfidfVectorizer;
from sklearn.svm import LinearSVC;
from sklearn.pipeline import Pipeline;
from sklearn.model_selection import GridSearchCV;
from sklearn.datasets import load_files;
from sklearn.model_selection import train_test_split;
from sklearn import metrics;
from sklearn.neighbors import KNeighborsClassifier;
from sklearn.neural_network import MLPClassifier;

dataset = load_files('txt_sentoken', shuffle=False);
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=None);
vectorized = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train);
Xtrain = vectorized.transform(docs_train);
Xtest = vectorized.transform(docs_test);"""

In [11]:
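# (100, '') is a small trick: joining it with ", " below yields "(100, )",
# i.e. a single hidden layer of 100 nodes
parameters = [(100, ''), (50, 2), (10, 2), (10, 5), (5, 5)]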
for i in parameters :
    stmt = 'MLPClassifier(hidden_layer_sizes = (' + ", ".join(map(str,i)) + ')).fit(Xtrain, y_train).predict(Xtest)'
    print('MLPClassifier time for hidden_layer_sizes:', i, '\n', round(timeit.timeit(setup = setup, stmt = stmt, number = 1), 3), "seconds\n")

    
parameters = ['identity', 'tanh', 'logistic', 'relu']
for i in parameters :
    stmt = 'MLPClassifier(activation = \'' + i +'\').fit(Xtrain, y_train).predict(Xtest)'
    print('MLPClassifier time for activation:', i, "\n", round(timeit.timeit(setup = setup, stmt = stmt, number = 1), 3), "seconds\n")

    
parameters = ['lbfgs', 'adam']
for i in parameters :
    stmt = 'MLPClassifier(solver = \'' + i +'\').fit(Xtrain, y_train).predict(Xtest)'
    print('MLPClassifier time for solver:', i, "\n", round(timeit.timeit(setup = setup, stmt = stmt, number = 1), 3), "seconds\n")

    
stmt = 'MLPClassifier().fit(Xtrain, y_train)'
print('MLPClassifier default parameters fit time:\n', round(timeit.timeit(setup = setup, stmt = stmt, number = 1), 3), "seconds\n")


setup2 = setup + '\ny = MLPClassifier().fit(Xtrain, y_train)'
stmt = 'y.predict(Xtest)'
print('MLPClassifier default parameters predict time:\n', round(timeit.timeit(setup = setup2, stmt = stmt, number = 1), 3), "seconds\n")


stmt = 'LinearSVC(penalty = \'l2\', C = 10).fit(Xtrain, y_train).predict(Xtest)'
print('LinearSVC time:\n', round(timeit.timeit(setup = setup, stmt = stmt, number = 1), 3), "seconds\n")

stmt = 'KNeighborsClassifier(n_neighbors = 3).fit(Xtrain, y_train).predict(Xtest)'
print('KNeighborsClassifiers time:\n', round(timeit.timeit(setup = setup, stmt = stmt, number = 1), 3), "seconds\n")

MLPClassifier time for hidden_layer_sizes: (100, '') 
 65.498 seconds

MLPClassifier time for hidden_layer_sizes: (50, 2) 
 98.028 seconds

MLPClassifier time for hidden_layer_sizes: (10, 2) 
 10.98 seconds

MLPClassifier time for hidden_layer_sizes: (10, 5) 
 10.841 seconds

MLPClassifier time for hidden_layer_sizes: (5, 5) 
 7.156 seconds

MLPClassifier time for activation: identity 
 66.035 seconds

MLPClassifier time for activation: tanh 
 67.079 seconds

MLPClassifier time for activation: logistic 
 145.971 seconds

MLPClassifier time for activation: relu 
 66.032 seconds

MLPClassifier time for solver: lbfgs 
 23.373 seconds

MLPClassifier time for solver: adam 
 65.669 seconds

MLPClassifier default parameters fit time:
 67.38 seconds

MLPClassifier default parameters predict time:
 0.017 seconds

LinearSVC time:
 0.272 seconds

KNeighborsClassifiers time:
 0.203 seconds
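
The gap between the fit time and the predict time above is what one would expect: `fit` runs many optimization passes over the training data, while `predict` is a single forward pass through the already trained network. The sketch below times the two steps separately with `time.perf_counter`; it assumes synthetic data, so the absolute numbers will not match the ones reported above.

```python
import time
import warnings

import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem (illustrative only, not the review data).
rng = np.random.RandomState(0)
X = rng.rand(300, 20)
y = (X[:, 0] > 0.5).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=0)

# Time training: repeated forward and backward passes until convergence or max_iter.
start = time.perf_counter()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    clf.fit(X, y)
fit_seconds = time.perf_counter() - start

# Time prediction: one forward pass per sample.
start = time.perf_counter()
predictions = clf.predict(X)
predict_seconds = time.perf_counter() - start

print("fit:", round(fit_seconds, 4), "seconds over", clf.n_iter_, "iterations")
print("predict:", round(predict_seconds, 4), "seconds")
```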





## Problem 4: Business question

* Suppose you had a machine learning algorithm that could detect the sentiment of tweets and that was highly accurate.  What kind of business could you build around that?
* Who would be your competitors, and what are their sizes?
* What would be the size of the market for your product?
* In addition, assume that your machine learning algorithm was slow to train, but fast in making predictions on new data.  How would that affect your business plan?
* How could you use the cloud to support your product?
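
For the "slow to train, fast to predict" scenario, one common pattern is to train once, save the fitted model, and have the serving side only load it and call `predict`. The sketch below illustrates this with the standard-library `pickle` module; the toy reviews and the small network are assumptions made to keep it fast, and a pickled model should be loaded with the same scikit-learn version that created it.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews (1 = pos, 0 = neg).
docs = [
    "a wonderful, moving film", "great acting and a great script",
    "a delight, I loved it", "a dull, lifeless mess",
    "terrible acting and a weak script", "awful, I hated it",
]
labels = [1, 1, 1, 0, 0, 0]

# Slow step, done once: fit the vectorizer and the network together.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])
model.fit(docs, labels)

# Serialize the fitted pipeline; in practice the bytes would be written to storage.
blob = pickle.dumps(model)

# Fast step, done per request: load the model and predict on new text.
restored = pickle.loads(blob)
new_texts = ["I loved the script", "a boring mess"]
original_pred = model.predict(new_texts)
restored_pred = restored.predict(new_texts)
print(restored_pred)
```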

In [4]:
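import os

# Credentials are read from environment variables instead of being hard-coded in
# the notebook (the variable names are placeholders; set them before running).
CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
OAUTH_TOKEN = os.environ["TWITTER_OAUTH_TOKEN"]
OAUTH_TOKEN_SECRET = os.environ["TWITTER_OAUTH_TOKEN_SECRET"]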

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)

In [8]:
tweets = []

q = 'Netflix'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predictor = MLPClassifier(activation = 'logistic', hidden_layer_sizes = (100,), solver = 'adam').fit(Xtrain, y_train)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Netflix are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Netflix are 45.477 % positive
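
The cell above turns the `next_results` query string into keyword arguments by unquoting it and splitting on `&` and `=`. That works for the example in the comment, but it raises an error if a decoded value itself contains `=`. A possible alternative, sketched below, is the standard-library `urllib.parse.parse_qsl`, which splits first and decodes afterwards. The second query string is a made-up example.

```python
from urllib.parse import parse_qsl

# The example string from the comment in the cell above.
next_results = "?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1"

# Drop the leading "?" and let parse_qsl split the pairs and decode the values.
kwargs = dict(parse_qsl(next_results.lstrip("?")))
print(kwargs)

# Hypothetical query whose decoded value contains "=" (%3D).
tricky = "?q=a%3Db&count=100"
tricky_kwargs = dict(parse_qsl(tricky.lstrip("?")))
print(tricky_kwargs)
```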


In [14]:
tweets = []

q = 'Hulu'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Hulu are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Hulu are 30.865 % positive


In [9]:
tweets = []

q = 'Prime Video'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Amazon Prime Video are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Amazon Prime Video are 40.05 % positive


In [10]:
tweets = []

q = 'Disney+'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Disney+ are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Disney+ are 65.783 % positive


In [11]:
# Sanity check that the MLPClassifier handles Twitter data sensibly: tweets containing 'horrible' should be mostly negative
tweets = []

q = 'horrible'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about 'horrible' are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about 'horrible' are 20.948 % positive


In [12]:
# Sanity check that the MLPClassifier handles Twitter data sensibly: tweets containing 'perfect' should be mostly positive
tweets = []

q = 'perfect'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about 'perfect' are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about 'perfect' are 71.462 % positive


In [13]:
# Use movies with high "tomato" (Rotten Tomatoes) ratings as search terms, noting their genres
# Toy Story and Toy Story 2 (Animation, Adventure, Comedy, Family, Fantasy)
# Mary Poppins (Comedy, Family, Fantasy, Musical)
# Pinocchio (Adventure, Comedy, Drama, Family, Musical, Romance)
# The Many Adventures of Winnie the Pooh (Animation, Adventure, Comedy, Family, Musical)
# Tinker Bell (Animation, Adventure, Family, Fantasy)
tweets = []

q = 'Toy Story'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Toy Story are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Toy Story are 65.281 % positive


In [14]:
tweets = []

q = 'Mary Poppins'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Mary Poppins are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Mary Poppins are 45.721 % positive


In [15]:
tweets = []

q = 'Old Yeller'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Old Yeller are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Old Yeller are 59.081 % positive


In [16]:
tweets = []

q = 'Winnie the Pooh'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Winnie the Pooh are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Winnie the Pooh are 52.164 % positive


In [18]:
tweets = []

q = 'Tinker Bell'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterating through more batches of results by following the cursor
while len(tweets) < 400:
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError as e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from next_results, which has the following form:
    # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses = search_results['statuses']
    for i in statuses:
        if (not ('retweeted_status' in i.keys())) and i['lang'] == 'en':
            tweets.append(str(i['text']))
    
tweets2 = []               
for i in tweets:
    tweets2.append(bytes(i, 'utf-8'))
X = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtest = X.transform(tweets2)
y_predicted = y_predictor.predict(Xtest)
print("Tweets about Tinker Bell are", round(np.count_nonzero(y_predicted == 1)/len(tweets)*100, 3), "% positive")

Tweets about Tinker Bell are 32.353 % positive
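
Every search cell above ends with the same percentage calculation, which fails with a division by zero if a query returns no usable tweets. The sketch below shows one way that step could be factored into a small helper. The function name `percent_positive` and its handling of empty input are our own assumptions, not part of the cells above.

```python
import math

import numpy as np


def percent_positive(predictions, positive_label=1):
    """Return the percentage of predictions equal to positive_label, rounded to 3 places.

    Returns NaN for empty input instead of raising ZeroDivisionError.
    """
    predictions = np.asarray(predictions)
    if predictions.size == 0:
        return float("nan")
    positives = np.count_nonzero(predictions == positive_label)
    return round(100.0 * positives / predictions.size, 3)


# Three of four hypothetical predictions are positive.
example = percent_positive([1, 0, 1, 1])
print(example)

# No tweets collected: the helper reports NaN rather than failing.
empty = percent_positive([])
print(empty)
```

With such a helper, each cell's final line could become `print("Tweets about Netflix are", percent_positive(y_predicted), "% positive")`.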
