# Movie reviews

This notebook takes you through a complete iteration of Machine Learning Assignment 1 - Movie reviews. The assignment details (including links to download the data) can be found [here](https://docs.google.com/document/d/1WGYw99e5q6j5V0Zrf2HveagU6URt_kVvdR8B9HYQ99E/edit?usp=sharing).

You are encouraged to create features beyond those available in the feature extraction documentation. Possibilities include the length of the comment/review, its grammar, punctuation, etc.
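
A few of the features suggested above can be computed with plain string operations. The sketch below is one possible starting point and is not part of the assignment's starter code; the function name `basic_text_features` and the particular features chosen are assumptions made for illustration.

```python
# Illustrative sketch: simple hand-crafted features for one review.
# `basic_text_features` is a hypothetical helper, not part of the starter code.
import string

def basic_text_features(review):
    words = review.split()
    letters = [c for c in review if c.isalpha()]
    return {
        "char_count": len(review),
        "word_count": len(words),
        "exclamation_count": review.count("!"),
        "question_count": review.count("?"),
        "punctuation_count": sum(c in string.punctuation for c in review),
        # share of letters that are upper case (0.0 when there are no letters)
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
    }

features = basic_text_features("GREAT movie!! Would I watch it again? Yes.")
print(features)
```

Each value could become one column of quantitative features, alongside `word_count` in the feature-building function later in this notebook.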

The Machine Learning task: 
This is a classification task. You will predict **Sentiment**, which is equal to **1 if the movie review is a “good” review** or **0 if it is not a “good” review**. 

Variables that should never be on the X side of the equation: 
- id
- Sentiment
- OR ANY VARIATION OF THESE!
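
One way to guard against accidentally including these variables is to check the feature column names before building X. The sketch below is only a name-based check under stated assumptions: `leaky_columns` and `FORBIDDEN_TOKENS` are hypothetical names, and a copy of the label stored under an unrelated name would not be caught.

```python
# Illustrative sketch: flag feature columns that would leak the id or the label.
# `leaky_columns` and FORBIDDEN_TOKENS are hypothetical names for this example.
import re

FORBIDDEN_TOKENS = {"id", "sentiment"}

def leaky_columns(column_names):
    """Return the names whose separator-delimited tokens include a forbidden token."""
    flagged = []
    for name in column_names:
        # split on anything that is not a lower-case letter or digit (e.g. "_")
        tokens = [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]
        if FORBIDDEN_TOKENS.intersection(tokens):
            flagged.append(name)
    return flagged

candidate_features = ["word_count", "id", "Sentiment", "sentiment_copy", "video_length"]
print(leaky_columns(candidate_features))
```

Matching whole tokens rather than substrings keeps a harmless name such as `video_length` from being flagged just because it contains the letters "id".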


Your work will be assessed on: 
- how accurately your model classifies on a test set
- how well your model generalizes
- the organization and documentation of your Jupyter Notebooks
- communication of your work in class reflections and final presentations
- model improvement over the semester
- a 10-point deduction if you use a random_seed of 74 in your train/test data split

## import libraries we will use:

In [8]:
# all imports and magic commands
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from my_measures import BinaryClassificationPerformance
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
%matplotlib inline

Let's explore the data with pandas and plain Python string methods.

## Local BinaryClassificationPerformance()

In [9]:
# my_measures
class BinaryClassificationPerformance():
    '''Performance measures to evaluate the fit of a binary classification model, v1.02'''
    
    def __init__(self, predictions, labels, desc, probabilities=None):
        '''Initialize attributes: predictions-vector of predicted values for Y, labels-vector of labels for Y'''
        '''probabilities-optional, probability that Y is equal to True'''
        self.probabilities = probabilities
        self.performance_df = pd.concat([pd.DataFrame(predictions), pd.DataFrame(labels).reset_index(drop=True)], axis=1, ignore_index=True)
        self.performance_df.columns = ['preds', 'labls']
        self.desc = desc
        self.performance_measures = {}
        self.image_indices = {}
  
    def compute_measures(self):
        '''Compute performance measures defined by Flach p. 57'''
        self.performance_measures['Pos'] = self.performance_df['labls'].sum()
        self.performance_measures['Neg'] = self.performance_df['labls'].shape[0] - self.performance_df['labls'].sum()
        self.performance_measures['TP'] = ((self.performance_df['preds'] == True) & (self.performance_df['labls'] == True)).sum()
        self.performance_measures['TN'] = ((self.performance_df['preds'] == False) & (self.performance_df['labls'] == False)).sum()
        self.performance_measures['FP'] = ((self.performance_df['preds'] == True) & (self.performance_df['labls'] == False)).sum()
        self.performance_measures['FN'] = ((self.performance_df['preds'] == False) & (self.performance_df['labls'] == True)).sum()
        self.performance_measures['Accuracy'] = (self.performance_measures['TP'] + self.performance_measures['TN']) / (self.performance_measures['Pos'] + self.performance_measures['Neg'])
        self.performance_measures['Precision'] = self.performance_measures['TP'] / (self.performance_measures['TP'] + self.performance_measures['FP'])
        self.performance_measures['Recall'] = self.performance_measures['TP'] / self.performance_measures['Pos']
        self.performance_measures['desc'] = self.desc

    def img_indices(self):
        '''Get the indices of true and false positives to be able to locate the corresponding images in a list of image names'''
        self.performance_df['tp_ind'] = ((self.performance_df['preds'] == True) & (self.performance_df['labls'] == True))
        self.performance_df['fp_ind'] = ((self.performance_df['preds'] == True) & (self.performance_df['labls'] == False))
        self.image_indices['TP_indices'] = np.where(self.performance_df['tp_ind']==True)[0].tolist()
        self.image_indices['FP_indices'] = np.where(self.performance_df['fp_ind']==True)[0].tolist()
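
The measures computed by `compute_measures` all follow from four confusion-matrix counts. The sketch below reproduces them on a tiny hand-made example in plain Python so the arithmetic can be checked by hand. The function name `confusion_measures` is hypothetical, and the zero-division guards are an addition for this sketch that the class above does not include.

```python
# Illustrative sketch: the same measures on a tiny hand-made example.
def confusion_measures(preds, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    pos, neg = tp + fn, tn + fp
    return {
        "Pos": pos, "Neg": neg, "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "Accuracy": (tp + tn) / (pos + neg),
        # guards (an addition for this sketch) avoid division by zero
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "Recall": tp / pos if pos else 0.0,
    }

preds  = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0]
measures = confusion_measures(preds, labels)
print(measures)
```

Here two of the three predicted positives are correct (precision 2/3) and two of the three actual positives are found (recall 2/3).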

### IMPORTANT!!! Make sure you are using `BinaryClassificationPerformance` v1.02

In [10]:
help(BinaryClassificationPerformance)

Help on class BinaryClassificationPerformance in module __main__:

class BinaryClassificationPerformance(builtins.object)
 |  BinaryClassificationPerformance(predictions, labels, desc, probabilities=None)
 |  
 |  Performance measures to evaluate the fit of a binary classification model, v1.02
 |  
 |  Methods defined here:
 |  
 |  __init__(self, predictions, labels, desc, probabilities=None)
 |      Initialize attributes: predictions-vector of predicted values for Y, labels-vector of labels for Y
 |  
 |  compute_measures(self)
 |      Compute performance measures defined by Flach p. 57
 |  
 |  img_indices(self)
 |      Get the indices of true and false positives to be able to locate the corresponding images in a list of image names
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object

In [16]:
movie_data = pd.read_csv('../week04/final_data/moviereviews_train.tsv', sep='\t')

In [17]:
movie_data.describe()

| | sentiment |
|---|---|
| count | 25000.0 |
| mean | 0.5 |
| std | 0.50001 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


In [18]:
movie_data.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [25]:
sample_review=movie_data['review'][0]
print(sample_review)

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally star

## Basic features:

- How long is the text?

In [21]:
sample_review.split()

['With',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'MJ',
 "i've",
 'started',
 'listening',
 'to',
 'his',
 'music,',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there,',
 'watched',
 'The',
 'Wiz',
 'and',
 'watched',
 'Moonwalker',
 'again.',
 'Maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent.',
 'Moonwalker',
 'is',
 'part',
 'biography,',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released.',
 'Some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 "MJ's",
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 'obvious',
 'message',
 'of',
 'drugs',
 

In [22]:
len(sample_review.split())

433

In [28]:
# clean up the review: replace <br /> tags with spaces and drop stray backslashes
sample_review = sample_review.replace("<br />", " ").replace("\\", "").replace("\'", "'")
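
The chained `replace` calls above only match the exact string `<br />`. As a hedged alternative, the sketch below uses regular expressions to catch other spellings of the tag and to collapse the doubled spaces that the replacement leaves behind. The name `clean_review` is hypothetical, and collapsing whitespace is a choice made for this sketch that the cell above does not make.

```python
# Illustrative sketch: regex-based cleanup of a raw review string.
import re

def clean_review(text):
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)  # <br>, <br/>, <br />
    text = text.replace("\\", "")                                # stray backslashes
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace

raw = 'bad m\'kay.<br /><br />Visually impressive, \\"really\\" nice.<BR>The end'
print(clean_review(raw))
```

Applied to a whole column, the same function could be passed to `movie_data['review'].apply(...)` before any word counting.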

In [29]:
sample_review

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it finally starts is only on for 2

In [35]:
movie_data['review'][5]

'I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not hurt either. Sure some of its offensive and gratuitous but this is not the only movie like that. Eastwood is in good form as Dirty Harry, and I liked Pat Hingle in this movie as the small town cop. If you liked DIRTY HARRY, then you should see this one, its a lot better than THE DEAD POOL. 4/5'

In [39]:
'test' in 'this was a test'

True

In [44]:
'boring' in sample_review

True

In [114]:
negative_words=["stupid", "liar", "guilty", "hate", 
                "boring", "bad", "not", "hates", "bad", 
                "terrible", "waste", "hated", "awful", "dumb", "idiot",
               "badly", "mediocre", "ridiculous", "nonsense", "disaster", "monstrously",
               "ill"]
positive_words=["great", "loved", "fun", "cute", "entertaining", "good", "wholesome",
               "nice", "impressive", "funny", "classic", "pretty", "beautiful"]
# count the occurrences of the positive and negative words in the text:

        
        
def getOccurrences(text, word):
    # count exact, case-sensitive matches of `word` among the space-separated tokens of `text`
    # (the parameter is named `text` so it does not shadow the built-in `str`)
    return text.split(" ").count(word)

def word_test(sample_review):
    # score in [-1, 1]: (positive - negative) / (positive + negative) word counts
    tot_neg = sum(getOccurrences(sample_review, w) for w in negative_words)
    tot_pos = sum(getOccurrences(sample_review, w) for w in positive_words)
    if (tot_pos + tot_neg) == 0:
        # no listed words found: returns 1.0, the same score as an all-positive review
        return 1.0
    return (tot_pos - tot_neg) / (tot_pos + tot_neg)

word_test(sample_review)

-0.42857142857142855
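
Splitting on single spaces leaves punctuation attached to words and keeps the original case, so a token such as "boring." or "BAD," never matches the word lists. The sketch below shows one possible variant that lower-cases the text and extracts words with a regular expression. The short word lists, the name `lexicon_score`, and the choice to return 0.0 (neutral) when no listed word is found are all assumptions for this example; they differ from the cell above.

```python
# Illustrative sketch: sentiment-word score with lower-casing and punctuation stripped.
# The short word lists and the name `lexicon_score` are assumptions for this example.
import re
from collections import Counter

NEGATIVE = {"boring", "bad", "awful", "hate"}
POSITIVE = {"great", "good", "fun", "nice"}

def lexicon_score(text):
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    pos = sum(tokens[w] for w in POSITIVE)
    neg = sum(tokens[w] for w in NEGATIVE)
    if pos + neg == 0:
        return 0.0  # no listed words found: treated as neutral in this sketch
    return (pos - neg) / (pos + neg)

text = "Boring. Really BAD, but the ending was good!"
print(lexicon_score(text))            # tokenized, case-insensitive score
print(text.split(" ").count("bad"))   # exact-match counting misses "BAD,"
```

With one positive and two negative matches the score is (1 - 2) / (1 + 2), whereas exact matching on space-separated tokens finds none of the three words in this sentence.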

In [115]:
l=movie_data['review'].str.split(' ').str.len()
type(l)

pandas.core.series.Series

In [116]:
# sentiment-word score for every review, shifted by +1 so the values fall in [0, 2]
lst = []
for i in range(0, len(movie_data['review'])): 
    n = word_test(movie_data['review'][i])
    lst.append(n+1)

df = pd.DataFrame(lst)
df

| | 0 |
|---|---|
| 0 | 0.571429 |
| 1 | 1.600000 |
| 2 | 0.000000 |
| 3 | 0.000000 |
| 4 | 1.000000 |
| ... | ... |
| 24995 | 2.000000 |
| 24996 | 1.500000 |
| 24997 | 1.000000 |
| 24998 | 0.000000 |


# Function for feature building and extraction on natural language data

In [111]:
# function that takes raw data and completes all preprocessing required before model fits
def process_raw_data(fn, my_random_seed, test=False):
    # read and summarize data
    movie_data = pd.read_csv(fn, sep='\t')
    print("movie_data is:", type(movie_data))
    print("movie_data has", movie_data.shape[0], "rows and", movie_data.shape[1], "columns", "\n")
    print("the data types for each of the columns in movie_data:")
    print(movie_data.dtypes, "\n")
    print("the first 10 rows in movie_data:")
    print(movie_data.head(5))
    if (not test):
        print("The rate of 'good' movie reviews in the dataset: ")
        print(movie_data['sentiment'].mean())

    # vectorize Bag of Words from review text; as sparse matrix
    if (not test): # fit_transform()
        hv = HashingVectorizer(n_features=2 ** 17, alternate_sign=False)
        X_hv = hv.fit_transform(movie_data.review)
        fitted_transformations.append(hv)
        print("Shape of HashingVectorizer X:")
        print(X_hv.shape)
    else: # transform() 
        X_hv = fitted_transformations[0].transform(movie_data.review)
        print("Shape of HashingVectorizer X:")
        print(X_hv.shape)
    
    # http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
    if (not test):
        transformer = TfidfTransformer()
        X_tfidf = transformer.fit_transform(X_hv)
        fitted_transformations.append(transformer)
    else:
        X_tfidf = fitted_transformations[1].transform(X_hv)
    
    # create additional quantitative features
    movie_data['word_count'] = movie_data['review'].str.split(' ').str.len()
    # sentiment-word score from word_test(), shifted by +1 so the values fall in [0, 2]
    movie_data['sent_count'] = movie_data['review'].apply(word_test) + 1

    X_quant_features = movie_data[["word_count", 'sent_count']]
    print("Look at a few rows of the new quantitative features: ")
    print(X_quant_features.head(10))
    
    # Combine all quantitative features into a single sparse matrix
    X_quant_features_csr = csr_matrix(X_quant_features)
    X_combined = hstack([X_tfidf, X_quant_features_csr])
    X_matrix = csr_matrix(X_combined) # convert to sparse matrix
    print("Size of combined bag of words and new quantitative variables matrix:")
    print(X_matrix.shape)
    
    # Create `X`, scaled matrix of features
    # feature scaling
    if (not test):
        sc = StandardScaler(with_mean=False)
        X = sc.fit_transform(X_matrix)
        fitted_transformations.append(sc)
        print(X.shape)
        y = movie_data['sentiment']
    else:
        X = fitted_transformations[2].transform(X_matrix)
        print(X.shape)
    
    # Create Training and Test Sets
    # enter an integer for the random_state parameter; any integer will work
    if (test):
        X_submission_test = X
        print("Shape of X_test for submission:")
        print(X_submission_test.shape)
        print('SUCCESS!')
        return(movie_data, X_submission_test)
    else: 
        X_train, X_test, y_train, y_test, X_raw_train, X_raw_test = train_test_split(X, y, movie_data, test_size=0.2, random_state=my_random_seed)
        print("Shape of X_train and X_test:")
        print(X_train.shape)
        print(X_test.shape)
        print("Shape of y_train and y_test:")
        print(y_train.shape)
        print(y_test.shape)
        print("Shape of X_raw_train and X_raw_test:")
        print(X_raw_train.shape)
        print(X_raw_test.shape)
        print('SUCCESS!')
        return(X_train, X_test, y_train, y_test, X_raw_train, X_raw_test)
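
The function above fits each transformation on the training data, stores it in `fitted_transformations`, and only calls `transform()` when `test=True`. The sketch below illustrates that pattern in plain Python with a hypothetical `ScaleOnly` class, a simplified stand-in for `StandardScaler(with_mean=False)` that divides by the standard deviation learned from the training values. The toy word counts are made up for the example.

```python
# Illustrative sketch: learn a scale on training data, reuse it on new data.
# `ScaleOnly` is a hypothetical stand-in for StandardScaler(with_mean=False).
from statistics import pstdev

class ScaleOnly:
    def fit(self, values):
        self.scale_ = pstdev(values) or 1.0  # avoid dividing by zero for constant columns
        return self

    def transform(self, values):
        return [v / self.scale_ for v in values]

fitted = []                              # plays the role of fitted_transformations
train_word_counts = [100, 200, 300]
test_word_counts = [150, 400]

scaler = ScaleOnly().fit(train_word_counts)
fitted.append(scaler)                    # store the fit made on training data
train_scaled = scaler.transform(train_word_counts)
test_scaled = fitted[0].transform(test_word_counts)  # reuse, never refit on test data
print(train_scaled, test_scaled)
```

Reusing the stored fit keeps the test features on the same scale the model saw during training; refitting on the test data would silently change that scale.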

# Create training and test sets from function

In [112]:
# create an empty list to store any use of fit_transform() to transform() later
# it is a global list to store model and feature extraction fits
fitted_transformations = []

# CHANGE FILE PATH and my_random_seed number (any integer other than 74 will do): 
X_train, X_test, y_train, y_test, X_raw_train, X_raw_test = process_raw_data(fn='../week04/final_data/moviereviews_train.tsv', my_random_seed=74)

print("Number of fits stored in `fitted_transformations` list: ")
print(len(fitted_transformations))

movie_data is: <class 'pandas.core.frame.DataFrame'>
movie_data has 25000 rows and 3 columns 

the data types for each of the columns in movie_data:
id           object
sentiment     int64
review       object
dtype: object 

the first 10 rows in movie_data:
       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...
The rate of 'good' movie reviews in the dataset: 
0.5
Shape of HashingVectorizer X:
(25000, 131072)
Look at a few rows of the new quantitative features: 
   word_count  sent_count
0         433    0.571429
1         158    1.600000
2         378    0.000000
3         379    0.000000
4         367    1.000000


In [113]:
print(X_train)

  (0, 1595)	19.763338341424575
  (0, 1786)	25.518105503367355
  (0, 4412)	4.5794880691760484
  (0, 8564)	1.2782559780665708
  (0, 13294)	154.92272391038324
  (0, 13677)	0.6349093904227626
  (0, 15860)	4.657509877485911
  (0, 20378)	4.743304840658256
  (0, 24734)	2.6512647577129154
  (0, 25246)	12.799365661906835
  (0, 28331)	1.7134238552579635
  (0, 31028)	2.8230648691907927
  (0, 31622)	9.483268592868663
  (0, 32253)	3.224875421155861
  (0, 34278)	19.615848858270766
  (0, 37777)	1.4945210406243707
  (0, 38851)	8.071036362504625
  (0, 38990)	1.5107090199206104
  (0, 39480)	3.627010710953719
  (0, 43099)	2.4835132475673753
  (0, 43902)	1.4745069343835042
  (0, 45581)	4.467626782299751
  (0, 49453)	1.7203672715874432
  (0, 51110)	53.02056973611625
  (0, 55847)	11.679937689184646
  :	:
  (19999, 94639)	1.5616914302247829
  (19999, 96143)	11.076422982872923
  (19999, 96784)	4.374479051206937
  (19999, 97317)	2.502888237852477
  (19999, 99537)	3.8744056382235406
  (19999, 104082)	8.36427467

# Fit (and tune) Various Models

### MODEL: ordinary least squares

In [92]:
from sklearn import linear_model
# note: scikit-learn 1.0 renamed this loss to "squared_error"; "squared_loss" was removed in 1.2
ols = linear_model.SGDClassifier(loss="squared_loss")
ols.fit(X_train, y_train)

ols_performance_train = BinaryClassificationPerformance(ols.predict(X_train), y_train, 'ols_train')
ols_performance_train.compute_measures()
print(ols_performance_train.performance_measures)

{'Pos': 10033, 'Neg': 9967, 'TP': 4567, 'TN': 5372, 'FP': 4595, 'FN': 5466, 'Accuracy': 0.49695, 'Precision': 0.4984719493560358, 'Recall': 0.45519784710455496, 'desc': 'ols_train'}


### MODEL: SVM, linear

In [93]:
from sklearn import linear_model
svm = linear_model.SGDClassifier()
svm.fit(X_train, y_train)

svm_performance_train = BinaryClassificationPerformance(svm.predict(X_train), y_train, 'svm_train')
svm_performance_train.compute_measures()
print(svm_performance_train.performance_measures)

{'Pos': 10033, 'Neg': 9967, 'TP': 10033, 'TN': 9966, 'FP': 1, 'FN': 0, 'Accuracy': 0.99995, 'Precision': 0.9999003388479171, 'Recall': 1.0, 'desc': 'svm_train'}


### MODEL: logistic regression

In [94]:
from sklearn import linear_model
# note: scikit-learn 1.1 renamed this loss to "log_loss"; "log" was removed in 1.3
lgs = linear_model.SGDClassifier(loss='log')
lgs.fit(X_train, y_train)

lgs_performance_train = BinaryClassificationPerformance(lgs.predict(X_train), y_train, 'lgs_train')
lgs_performance_train.compute_measures()
print(lgs_performance_train.performance_measures)

{'Pos': 10033, 'Neg': 9967, 'TP': 10033, 'TN': 9967, 'FP': 0, 'FN': 0, 'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'desc': 'lgs_train'}


### MODEL: Naive Bayes

In [95]:
from sklearn.naive_bayes import MultinomialNB
nbs = MultinomialNB()
nbs.fit(X_train, y_train)

nbs_performance_train = BinaryClassificationPerformance(nbs.predict(X_train), y_train, 'nbs_train')
nbs_performance_train.compute_measures()
print(nbs_performance_train.performance_measures)

ValueError: Negative values in data passed to MultinomialNB (input X)

### MODEL: Perceptron

In [96]:
from sklearn import linear_model
prc = linear_model.SGDClassifier(loss='perceptron')
prc.fit(X_train, y_train)

prc_performance_train = BinaryClassificationPerformance(prc.predict(X_train), y_train, 'prc_train')
prc_performance_train.compute_measures()
print(prc_performance_train.performance_measures)

{'Pos': 10033, 'Neg': 9967, 'TP': 10033, 'TN': 9967, 'FP': 0, 'FN': 0, 'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'desc': 'prc_train'}


### MODEL: Ridge Regression Classifier

In [97]:
from sklearn import linear_model
rdg = linear_model.RidgeClassifier()
rdg.fit(X_train, y_train)

rdg_performance_train = BinaryClassificationPerformance(rdg.predict(X_train), y_train, 'rdg_train')
rdg_performance_train.compute_measures()
print(rdg_performance_train.performance_measures)

{'Pos': 10033, 'Neg': 9967, 'TP': 10033, 'TN': 9967, 'FP': 0, 'FN': 0, 'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'desc': 'rdg_train'}


### MODEL: Random Forest Classifier

In [98]:
from sklearn.ensemble import RandomForestClassifier
rdf = RandomForestClassifier(max_depth=2, random_state=0)
rdf.fit(X_train, y_train)

rdf_performance_train = BinaryClassificationPerformance(rdf.predict(X_train), y_train, 'rdf_train')
rdf_performance_train.compute_measures()
print(rdf_performance_train.performance_measures)

{'Pos': 10033, 'Neg': 9967, 'TP': 8847, 'TN': 6588, 'FP': 3379, 'FN': 1186, 'Accuracy': 0.77175, 'Precision': 0.7236217896286602, 'Recall': 0.8817900926941095, 'desc': 'rdf_train'}


### ROC plot to compare performance of various models and fits

In [99]:
# nbs_performance_train is left out because the Naive Bayes fit above raised a ValueError
fits = [ols_performance_train, svm_performance_train, lgs_performance_train, prc_performance_train, rdg_performance_train, rdf_performance_train]

for fit in fits:
    plt.plot(fit.performance_measures['FP'] / fit.performance_measures['Neg'], 
             fit.performance_measures['TP'] / fit.performance_measures['Pos'], 'bo')
    plt.text(fit.performance_measures['FP'] / fit.performance_measures['Neg'], 
             fit.performance_measures['TP'] / fit.performance_measures['Pos'], fit.desc)
plt.axis([0, 1, 0, 1])
plt.title('ROC plot: training set')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()
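
Each point on the ROC plot is a (false positive rate, true positive rate) pair computed from the stored measures. As a small worked example, the sketch below recomputes the coordinates for two of the training-set results printed earlier in this notebook; the helper name `roc_point` is an assumption for illustration.

```python
# Illustrative sketch: ROC coordinates from two of the training-set results printed above.
results = [
    {"desc": "ols_train", "Pos": 10033, "Neg": 9967, "TP": 4567, "FP": 4595},
    {"desc": "rdf_train", "Pos": 10033, "Neg": 9967, "TP": 8847, "FP": 3379},
]

def roc_point(measures):
    fpr = measures["FP"] / measures["Neg"]   # false positive rate (x axis)
    tpr = measures["TP"] / measures["Pos"]   # true positive rate (y axis)
    return fpr, tpr

for m in results:
    fpr, tpr = roc_point(m)
    print(f"{m['desc']}: FPR={fpr:.3f}, TPR={tpr:.3f}")
```

A point close to the diagonal, where the two rates are nearly equal as they are for `ols_train`, indicates performance close to random guessing; points toward the upper-left corner are better.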

### looking at reviews based on their classification

Let's say we decide that Ordinary Least Squares (OLS) Regression is the best model for generalization. Let's take a look at some of the reviews and try to make a (subjective) determination of whether it's generalizing well. 

In [100]:
ols_predictions = ols.predict(X_train)

### let's look at some false positives:

In [101]:
# false positives

print("Examples of false positives:")

import random

for i in range(0, len(ols_predictions)):
    if (ols_predictions[i] == 1):
        if (X_raw_train.iloc[i]['sentiment'] == 0):
            if (random.uniform(0, 1) < 0.05): # to print only 5% of the false positives
                print(i)
                print(X_raw_train.iloc[i]['review'])
                print('* * * * * * * * * ')


Examples of false positives:
10
FAIL. I'd love to give this crap a 0. Yes, I registered just to rate this garbage. I want to go back in time and cut my wrist. Heres some copy and paste to take up 10 lines. FAIL. I'd love to give this crap a 0. Yes, I registered just to rate this garbage. I want to go back in time and cut my wrist. Heres some copy and paste to take up 10 lines. FAIL. I'd love to give this crap a 0. Yes, I registered just to rate this garbage. I want to go back in time and cut my wrist. Heres some copy and paste to take up 10 lines. FAIL. I'd love to give this crap a 0. Yes, I registered just to rate this garbage. I want to go back in time and cut my wrist. Heres some copy and paste to take up 10 lines. FAIL. I'd love to give this crap a 0. Yes, I registered just to rate this garbage. I want to go back in time and cut my wrist. Heres some copy and paste to take up 10 lines.
* * * * * * * * * 
30
OK, I love bad horror. I especially love horror bad enough to make fun of. D

4341
one may ask why? the characters snarl, yell, and chew the scenery without any perceptible reason except someone wanted to make a movie in barcelona. billie baldwin, is that the right one?, is forgettable in the cop/estranged-husband/loving-father-of-cute-little-blond-girl role. the story seems to have been cut and pasted from the scenes thrown away from adventure films in the last three years. ellen pompeo's lack of charisma is a black hole that seems to suck the energy out of every scene she is in. her true acting range is displayed when she takes her blouse off as the movies careens from one limp chase scene to another. unfortunately, the directing rarely goes bad enough to be camp or a parody. it is all just cliché, familiar in every respect. the director cast his own daughter as the precocious brat probably because no respectable agent would have permitted a client to ruin a career by being in such a lame, contrived and uninteresting movie. the only heist here is the theft of 

7733
I was VERY disappointed with this film. I expected more of a Thelma and Louise female-buddy crime movie. Instead, the women prison escapees in this flick, had no sense of loyalty to one another. They were an extremely vulgar pack of hyenas, who beat each other up, double-crossed each other, and even committed lesbian rape against other women in the film.<br /><br />Instead of being shrewed thieves, who stuck together to plan their escape and find the hidden stash of money, the women escapees were too selfish and vicious, to trust each other for long. These women weren't liberated in a positive sense. They just ended up being a bunch of loose-cannons, incapable of respect for themselves, or each other. If you like 70s female crime caper films, skip this bomb, and see The Great Texas Dynamite Chase, which stars Claudia Jennings and Jocelyn Jones.
* * * * * * * * * 
7928
::POTENTIAL SPOILERS::<br /><br />Man, this movie was awful. A Catholic/superstitious/suspense thriller it goes ov

10408
SPOILERS. Strange people with generous tastes have been reviewing this film. Allow me to add balance by pointing out the following:<br /><br />Script: Dreadful. As Tom and Dan are \getting to know each other,\" bantering about films, the talk is clearly that of one person, and I suspect it was the director, who carefully worked his words to sound intelligent. At one point, Dan asks, \"Have you heard of the HIV virus?\" and it sounds about as natural as asking, \"Have you communicated with the nine alien races?\"<br /><br />Acting: White teeth do and a chiseled face do not a sensitive performer make. Speedman did well enough with what he was given, I suppose, but Marsden was terrible -- unsympathetic, unbelievable, and downright smug and smarmy throughout his captivity. There is an emptiness to his performances (also see Interstate 60).<br /><br />Plot: Spare me! The moments of half-escape were not thrilling but irritating and weak. Recall Marsden pretending to try keys in the doo

12702
Iberia is nice to see on TV. But why see this in silver screen? Lot of dance and music. If you like classical music or modern dance this could be your date movie. But otherwise one and half hour is just too long time. If you like to see skillful dancing in silver screen it's better to see Bollywood movie. They know how to combine breath taking dancing to long movie. Director Carlos Saura knows how to shoot dancing from old experience. And time to time it's look really good. but when the movie is one and hour it should be at least most of time interesting. There are many kind of art not everything is bigger then life and this film is not too big.
* * * * * * * * * 
12860
I saw the 7.5 IMDb rating on this movie and on the basis of that decided to watch this movie which my roommate had rented. She said she had seen it before. \It's funny and sad! I cried the first time I saw it,\" she gushed. Maybe compared to other Bollywood movies this deserves a 7.5 out of 10, but in comparison t

14829
Seldom do I give up on a movie without seeing the entire show. This is particularly true when I have rented it on DVD. Syriana was one in which I did give up. Half way through I turned it off in bored disgust.<br /><br />This movie is disjointed, boring, confusing and lackluster. The acting was dry and without credible portrayals. The general plot was good but developed in such an insipid and boring fashion that it failed to grasp my attention or interest. The multiple sub plots often failed to connect to each other and seemed more like random stories than an actual connected plot. Too bad such a serious subject and such great actors could create such a flop. I cannot imagine this movie receiving any nominations much less an award.
* * * * * * * * * 
14851
Not too keen on this really. The story is pretty horrid and unconvincing. I enjoyed the first 10 minutes, bill nunns good. After that it was pretty appalling. Tim doesn't fit the role, he comes across as a smug self inflated as

16353
An intriguing premise of hand-drawn fantasy come to life in a child's fever dreams. However, I imagine the average nonfictional child is far more adept at scaring themselves than Bernard Rose is at riveting the viewer. The duel between Anna's two realities drags on far too long to sustain interest, especially considering that the little girl playing her is the most abrasive child actor I've ever seen.<br /><br />Use only for kindling.
* * * * * * * * * 
16426
I would have rated this film a minus 10 but sadly it is not offered.<br /><br />Why I didn't walk out in the first five minutes of this movie I cannot say. I should have gone with my instinct and left immediately!! Several people in our theater did and sadly I didn't follow them out.<br /><br />The story lacked all criteria for a movie. NO plot. Awful acting! Even Robin Williams was so disappointing that I may never see another film he is in. Not a single relationship in the story went beyond parlor talk. I did like the taze

19146
Alas, poor Hamlet. I knew him, dear reader, and let me tell thee, THIS VERSION SUCKS! I don't know who of all people put up the money for this flotsam, but I hope that they're proud of themselves. They took THE classic play and turned it into the most boring melodrama imaginable. This version is quite literally so bad, that not even the presence of a great thespian like Maximilian Schell in the title role can save it. This movie's only redeeming quality is that it made great fodder for \MST3K\"; Mike, Servo and Crow had a lot of fun with this one.<br /><br />But either way, I'm sure that Shakespeare, had he been alive when they made this, would not have wanted his name associated with it. This \"Hamlet\" is not even so bad that's it good; it's just plain bad. Absolutely dreadful."
* * * * * * * * * 
19216
The plot is about a female nurse, named Anna, is caught in the middle of a world-wide chaos as flesh-eating zombies begin rising up and taking over the world and attacking the l
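
The loop above prints each false positive with 5% probability, so the number and choice of examples change on every run. As a hedged alternative, the sketch below first collects the false-positive indices and then draws a fixed-size, seeded sample. The helper name `false_positive_indices` and the toy predictions and labels are assumptions for this example.

```python
# Illustrative sketch: collect false-positive indices, then draw a fixed-size sample.
# `false_positive_indices` and the toy data are assumptions for this example.
import random

def false_positive_indices(predictions, labels):
    return [i for i, (p, y) in enumerate(zip(predictions, labels)) if p == 1 and y == 0]

predictions = [1, 0, 1, 1, 0, 1, 1, 0]
labels      = [0, 0, 1, 0, 1, 0, 0, 0]

fp_idx = false_positive_indices(predictions, labels)
rng = random.Random(0)                          # fixed seed so the sample is repeatable
shown = sorted(rng.sample(fp_idx, k=min(3, len(fp_idx))))
print(fp_idx, shown)
```

The sampled indices could then be used with `X_raw_train.iloc[i]['review']` to print the corresponding reviews.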

---

# <span style="color:red">WARNING: Don't look at test set performance too much!</span>

---

The following cells show performance on your test set. Do not look at this too often! 

# Look at performance on the test set

### MODEL: ordinary least squares

In [102]:
ols_performance_test = BinaryClassificationPerformance(ols.predict(X_test), y_test, 'ols_test')
ols_performance_test.compute_measures()
print(ols_performance_test.performance_measures)

{'Pos': 2467, 'Neg': 2533, 'TP': 1045, 'TN': 1458, 'FP': 1075, 'FN': 1422, 'Accuracy': 0.5006, 'Precision': 0.49292452830188677, 'Recall': 0.42359140656668015, 'desc': 'ols_test'}
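The derived measures in this dictionary can be reproduced from the confusion-matrix counts. The sketch below assumes the conventional definitions (accuracy = (TP + TN) / (Pos + Neg), precision = TP / (TP + FP), recall = TP / Pos). The helper name `derive_measures` is hypothetical and is not part of `BinaryClassificationPerformance`.

```python
def derive_measures(counts):
    """Recompute accuracy, precision and recall from confusion-matrix counts."""
    tp, tn, fp = counts['TP'], counts['TN'], counts['FP']
    pos, neg = counts['Pos'], counts['Neg']
    return {
        'Accuracy': (tp + tn) / (pos + neg),   # share of all reviews classified correctly
        'Precision': tp / (tp + fp),           # share of predicted positives that are truly positive
        'Recall': tp / pos,                    # share of true positives that were found
    }

# counts copied from the OLS test-set output above
ols_counts = {'Pos': 2467, 'Neg': 2533, 'TP': 1045, 'TN': 1458, 'FP': 1075, 'FN': 1422}
ols_measures = derive_measures(ols_counts)
print(ols_measures)
```

With 2467 positives and 2533 negatives the test set is close to balanced, so an accuracy of 0.5006 is roughly what guessing would achieve.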


### MODEL: SVM, linear

In [103]:
svm_performance_test = BinaryClassificationPerformance(svm.predict(X_test), y_test, 'svm_test')
svm_performance_test.compute_measures()
print(svm_performance_test.performance_measures)

{'Pos': 2467, 'Neg': 2533, 'TP': 2054, 'TN': 2096, 'FP': 437, 'FN': 413, 'Accuracy': 0.83, 'Precision': 0.8245684464070654, 'Recall': 0.8325901905147953, 'desc': 'svm_test'}


### MODEL: logistic regression

In [104]:
lgs_performance_test = BinaryClassificationPerformance(lgs.predict(X_test), y_test, 'lgs_test')
lgs_performance_test.compute_measures()
print(lgs_performance_test.performance_measures)

{'Pos': 2467, 'Neg': 2533, 'TP': 2031, 'TN': 2085, 'FP': 448, 'FN': 436, 'Accuracy': 0.8232, 'Precision': 0.8192819685356999, 'Recall': 0.8232671260640454, 'desc': 'lgs_test'}


### MODEL: Naive Bayes

In [105]:
nbs_performance_test = BinaryClassificationPerformance(nbs.predict(X_test), y_test, 'nbs_test')
nbs_performance_test.compute_measures()
print(nbs_performance_test.performance_measures)

AttributeError: 'MultinomialNB' object has no attribute 'feature_log_prob_'
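
In scikit-learn, attributes whose names end in an underscore are set while a model is being fit, so this error suggests that `nbs` never finished fitting. The notebook does not show the cause. A minimal sketch of a guard that checks for a learned attribute before calling `predict` follows; `DummyModel` and `has_learned_attribute` are hypothetical stand-ins so that the example runs without scikit-learn.

```python
class DummyModel:
    """Hypothetical stand-in for an estimator; fit() sets a trailing-underscore attribute."""

    def fit(self, X, y):
        # a real estimator would learn this from the data; here it is a placeholder
        self.feature_log_prob_ = [[0.0] * len(X[0])]
        return self


def has_learned_attribute(model, attr='feature_log_prob_'):
    """Return True if the model carries the attribute that fit() would normally set."""
    return hasattr(model, attr)


unfitted = DummyModel()
fitted = DummyModel().fit([[1, 0], [0, 1]], [1, 0])

print(has_learned_attribute(unfitted))
print(has_learned_attribute(fitted))
```

If the check fails for `nbs`, the place to look is the earlier cell that fits the Naive Bayes model, not this evaluation cell.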

### MODEL: Perceptron

In [106]:
prc_performance_test = BinaryClassificationPerformance(prc.predict(X_test), y_test, 'prc_test')
prc_performance_test.compute_measures()
print(prc_performance_test.performance_measures)

{'Pos': 2467, 'Neg': 2533, 'TP': 2064, 'TN': 2104, 'FP': 429, 'FN': 403, 'Accuracy': 0.8336, 'Precision': 0.8279181708784596, 'Recall': 0.83664369679773, 'desc': 'prc_test'}


### MODEL: Ridge Regression Classifier

In [107]:
rdg_performance_test = BinaryClassificationPerformance(rdg.predict(X_test), y_test, 'rdg_test')
rdg_performance_test.compute_measures()
print(rdg_performance_test.performance_measures)

{'Pos': 2467, 'Neg': 2533, 'TP': 2037, 'TN': 2017, 'FP': 516, 'FN': 430, 'Accuracy': 0.8108, 'Precision': 0.7978848413631022, 'Recall': 0.8256992298338063, 'desc': 'rdg_test'}


### MODEL: Random Forest Classifier

In [108]:
rdf_performance_test = BinaryClassificationPerformance(rdf.predict(X_test), y_test, 'rdf_test')
rdf_performance_test.compute_measures()
print(rdf_performance_test.performance_measures)

{'Pos': 2467, 'Neg': 2533, 'TP': 2152, 'TN': 1677, 'FP': 856, 'FN': 315, 'Accuracy': 0.7658, 'Precision': 0.7154255319148937, 'Recall': 0.8723145520875557, 'desc': 'rdf_test'}


### ROC plot to compare performance of various models and fits

In [109]:
fits = [ols_performance_test, svm_performance_test, lgs_performance_test, nbs_performance_test, prc_performance_test, rdg_performance_test, rdf_performance_test]

for fit in fits:
    plt.plot(fit.performance_measures['FP'] / fit.performance_measures['Neg'], 
             fit.performance_measures['TP'] / fit.performance_measures['Pos'], 'bo')
    plt.text(fit.performance_measures['FP'] / fit.performance_measures['Neg'], 
             fit.performance_measures['TP'] / fit.performance_measures['Pos'], fit.desc)
plt.axis([0, 1, 0, 1])
plt.title('ROC plot: test set')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()

NameError: name 'nbs_performance_test' is not defined
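
The `NameError` follows from the failed Naive Bayes cell: `nbs_performance_test` was never created, so the `fits` list cannot be built. As a sketch, the ROC coordinates for the fits that did succeed can be computed directly from the counts printed above (false positive rate = FP / Neg, true positive rate = TP / Pos). The names `test_measures` and `roc_point` are hypothetical.

```python
# test-set counts copied from the outputs above;
# Naive Bayes is left out because its cell raised an error
test_measures = [
    {'desc': 'ols_test', 'Pos': 2467, 'Neg': 2533, 'TP': 1045, 'FP': 1075},
    {'desc': 'svm_test', 'Pos': 2467, 'Neg': 2533, 'TP': 2054, 'FP': 437},
    {'desc': 'lgs_test', 'Pos': 2467, 'Neg': 2533, 'TP': 2031, 'FP': 448},
    {'desc': 'prc_test', 'Pos': 2467, 'Neg': 2533, 'TP': 2064, 'FP': 429},
    {'desc': 'rdg_test', 'Pos': 2467, 'Neg': 2533, 'TP': 2037, 'FP': 516},
    {'desc': 'rdf_test', 'Pos': 2467, 'Neg': 2533, 'TP': 2152, 'FP': 856},
]


def roc_point(measures):
    """Return (false positive rate, true positive rate) for one fit."""
    return measures['FP'] / measures['Neg'], measures['TP'] / measures['Pos']


roc_points = {m['desc']: roc_point(m) for m in test_measures}
for desc, (fpr, tpr) in roc_points.items():
    print(f"{desc}: FPR={fpr:.4f}, TPR={tpr:.4f}")
```

In the notebook itself, the equivalent change is to list in `fits` only the performance objects that were actually created, and to add `nbs_performance_test` back once the Naive Bayes cell runs.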

---

# <span style="color:red">SUBMISSION</span>

---

In [24]:
# read in test data for submission
# CHANGE FILE PATH and my_random_seed number (any integer other than 74 will do): 
raw_data, X_test_submission = process_raw_data(fn='../week04/final_data/moviereviews_test.tsv', my_random_seed=74, test=True)
print("Number of rows in the submission test set (should be 25,000): ")

movie_data is: <class 'pandas.core.frame.DataFrame'>
movie_data has 25000 rows and 2 columns 

the data types for each of the columns in movie_data:
id        object
review    object
dtype: object 

the first 10 rows in movie_data:
         id                                             review
0  12311_10  Naturally in a film who's main themes are of m...
1    8348_2  This movie is a disaster within a disaster fil...
2    5828_4  All in all, this is a movie for kids. We saw i...
3    7186_2  Afraid of the Dark left me with the impression...
4   12128_7  A very accurate depiction of small time mob li...
Shape of HashingVectorizer X:
(25000, 131072)
Look at a few rows of the new quantitative features: 
   word_count  punc_count
0         131           5
1         169          15
2         176          18
3         112           5
4         133           8
5         331          20
6         121          18
7         230          22
8          59           3
9         224          14
Size

---

Choose a <span style="color:red">*single*</span> model for your submission. In this code, I am choosing the Ordinary Least Squares model fit, which is in the `ols` object. But you should choose the model that is performing the best for you! 

In [25]:
# store the id from the raw data
my_submission = pd.DataFrame(raw_data["id"])
# concatenate predictions to the id
my_submission["prediction"] = ols.predict(X_test_submission)
# look at the proportion of positive predictions
print(my_submission['prediction'].mean())

0.40176


In [26]:
raw_data.head()

         id                                             review  word_count  punc_count
0  12311_10  Naturally in a film who's main themes are of m...         131           5
1    8348_2  This movie is a disaster within a disaster fil...         169          15
2    5828_4  All in all, this is a movie for kids. We saw i...         176          18
3    7186_2  Afraid of the Dark left me with the impression...         112           5
4   12128_7  A very accurate depiction of small time mob li...         133           8


In [27]:
my_submission.head()

         id  prediction
0  12311_10           0
1    8348_2           0
2    5828_4           0
3    7186_2           0
4   12128_7           1


In [28]:
my_submission.shape

(25000, 2)

In [29]:
# export submission file as CSV
# CHANGE FILE PATH: 
my_submission.to_csv('/home/ec2-user/data/moviereviews_submission.csv', index=False)

FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/data/moviereviews_submission.csv'
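
This `FileNotFoundError` indicates that the target directory does not exist on the machine running the notebook, which is why the cell asks you to change the file path. A minimal sketch of creating the parent directory before writing follows. The helper `ensure_parent_dir` is hypothetical, and the demonstration writes to a temporary directory, not to the notebook's path.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# demonstration in a temporary directory (hypothetical location)
base_dir = Path(tempfile.mkdtemp())
out_path = ensure_parent_dir(base_dir / 'data' / 'moviereviews_submission.csv')

# stand-in for the real export; two lines in the submission's CSV layout
out_path.write_text('id,prediction\n12311_10,0\n')
print(out_path.exists())
```

In the notebook, the same pattern would be to pass the result of `ensure_parent_dir(...)` with your own path to `my_submission.to_csv(..., index=False)`.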

# Submit to Canvas: 1) the CSV file that was written in the previous cell and 2) the URL of the repository (GitHub or other) that contains your code and documentation