In [17]:

import pandas as pd
import sklearn

In [18]:
df = pd.read_csv('movie.csv') # Pandas already has handy functions/methods for reading CSVs
# That's it! "df" here is short for "dataframe", a.k.a. a fancy spreadsheet.

In [19]:
# This will show the first 5 rows in the spreadsheet so we can see what we're working with:
df.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


In [20]:
# Look at a random selection of 5 reviews and the labels they're given.
# Every time you re-run this cell, you'll see 5 different randomly selected reviews
for idx, row in df.sample(n=5).iterrows():
    print(f"Review {idx} - Label: {row['label']} ({'Negative' if row['label'] == 0 else 'Positive'})")
    print(f"{row['text']}\n")

Review 16061 - Label: 1 (Positive)
Being in the suburbs of New York when the Z-Boys were creating history in Dogtown, I was only exposed to a glimpse of what was going on. I had a P-O-S Black Knight skateboard with clay wheels. It is long gone, and on the ash heap of my personal life. But I never forgot. It's like watching long-lost brothers and friends, and it hits me right where I live. I cannot watch this film enough. Every time I view it, some other aspect rises to the top, some other viewpoint come into sharp focus. The vintage footage, the incredible stills, the current personalities intermeshed with the vivid shadows of the brightly lit past, the heartfelt and not over-done narrative, all beautifully edited together in such a way as to make a landmark documentary of a genuine slice of American history. In the words of Glen Friedman - "It was F-ing unbelievable."

Review 1511 - Label: 1 (Positive)
There were times when this movie seemed to get a whole lot more complicated than 

In [21]:
# Split into training and test data - there's a scikit-learn function for that, import it here:
from sklearn.model_selection import train_test_split

# From googling, I know that this 'train_test_split' expects us to have a list of texts (we'll call that 'X')
# and a list of labels (which we'll call 'Y'), so I will set that up first:
X = df['text'] # Meaning: our 'X' data is the list of all the stuff in the 'text' column from our DataFrame
Y = df['label'] # As above, but with the 0 or 1 labels

# 'train_test_split' needs to know what fraction of the data to set aside for the TEST set ('test_size').
# We also set a parameter called 'random_state': train_test_split shuffles the data randomly before
# splitting it, but we want the same split every time we run this cell, and passing a fixed number
# as 'random_state' makes that 'random' shuffle reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=195) # 'test_size=0.2' means: withhold 20% of the data for testing

print(f"There are {len(X_train)} texts and {len(Y_train)} labels in the training set.")
print(f"There are {len(X_test)} texts and {len(Y_test)} labels in the test set")

There are 32000 texts and 32000 labels in the training set.
There are 8000 texts and 8000 labels in the test set
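
To see what 'random_state' does, here is a minimal sketch on a made-up ten-item dataset (the toy lists are assumptions for illustration, not rows from movie.csv). It runs the same split twice to check that it is reproducible, and also tries the optional 'stratify' argument, which the cell above does not use, to keep the 0/1 ratio equal in both sets.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 'text' and 'label' columns (hypothetical data).
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [0, 1] * 5

# Same random_state -> identical split on every run.
split_a = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=195)
split_b = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=195)
same_split = split_a[1] == split_b[1]  # index 1 is the list of test texts

# 'stratify' keeps the proportion of 0s and 1s the same in train and test.
_, _, _, strat_test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=195, stratify=toy_labels
)

print(f"Same test texts both times: {same_split}")
print(f"Stratified test labels: {sorted(strat_test_labels)}")
```

With only 20% of ten items held out, the stratified test set ends up with one review of each label.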


In [22]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

X_train_counts = count_vectorizer.fit_transform(X_train)

# Print out what the result looks like via its '.shape' property.
# The first number in the .shape is the number of inputs, and the second number is the number of things (words, for us) it found in those inputs.
print(f"This should be (32000, 84966):  {X_train_counts.shape}")

This should be (32000, 84966):  (32000, 84966)
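
The (documents, words) shape is easier to picture on a tiny example. This is a sketch with three invented sentences (an assumption, not data from the reviews) showing the vocabulary CountVectorizer learns and the count matrix it produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, just to make the matrix small enough to read.
toy_docs = ["good movie", "bad movie", "good good plot"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_counts.shape)      # (number of documents, number of distinct words)
print(toy_vocab)             # column order of the matrix
print(toy_counts.toarray())  # one row per document, one count per word
```

Three documents and four distinct words give a (3, 4) matrix, in the same way that 32000 reviews and 84966 distinct words give (32000, 84966) above.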


In [23]:
# Import the TF-IDF tool after googling how to use it:
from sklearn.feature_extraction.text import TfidfTransformer

# Make an instance of the TfidfTransformer and have it 'fit' to our 'X_train_counts' processed data.
# That gives us a 'transformer' called tfidf_transformer that we can use to 'transform' any document
# counts into the same kind of format:

tfidf_transformer = TfidfTransformer()

# Make training data, but using the TF-IDF approach rather than just counts:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print("This should be the same numbers we got before:")
print(X_train_tfidf.shape)

This should be the same numbers we got before:
(32000, 84966)
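
The shape stays the same because TF-IDF only re-weights the counts; it does not add or remove columns. A small sketch on an invented three-sentence corpus (an assumption for illustration) shows the two things that do change: rarer words get a larger weight, and with the default settings each row is scaled to length 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpus: 'movie' and 'good' are common, 'bad' and 'plot' are rare.
toy_docs = ["good movie", "bad movie", "good good plot"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

toy_tfidf_transformer = TfidfTransformer()
toy_tfidf = toy_tfidf_transformer.fit_transform(toy_counts)

# idf_ holds one weight per vocabulary word; look up two of them by column index.
idf_bad = toy_tfidf_transformer.idf_[toy_vectorizer.vocabulary_["bad"]]
idf_movie = toy_tfidf_transformer.idf_[toy_vectorizer.vocabulary_["movie"]]

# Length (L2 norm) of each document's row after the transform.
row_lengths = np.linalg.norm(toy_tfidf.toarray(), axis=1)

print(f"Shapes: counts {toy_counts.shape}, tf-idf {toy_tfidf.shape}")
print(f"Weight for rare word 'bad': {idf_bad:.3f}, for common word 'movie': {idf_movie:.3f}")
print(f"Row lengths: {row_lengths}")
```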


In [24]:
# We'll try Logistic Regression first and see if it works at all.
from sklearn.linear_model import LogisticRegression

# Try these later by setting 'trial_model' to be one of these (see cells after this one) rather than 'LogisticRegression'
# - the names of the models are on the right / they are the things you're importing!

from sklearn.linear_model import SGDClassifier # Stochastic Gradient Descent classifier - slightly more advanced, but will it work better?
from sklearn.naive_bayes import GaussianNB # "Gaussian Naive Bayes" - uses Bayes' theorem; might work better on our tf-idf text data...
from sklearn.naive_bayes import MultinomialNB # "Multinomial Naive Bayes" - in theory, it can work well on tf-idf text stuff like ours. We'll find out!

In [25]:
trial_model = LogisticRegression() # change this later using the same kind of formatting to try different models.

# Remember: we transformed our original X_train data into something called 'X_train_tfidf'
# but the labels (Y_train) are still the same.

# a little code here to change the input to a .toarray() form *IF* the model is GaussianNB, because GaussianNB
# expects a slightly different input format.
trial_model.fit(X_train_tfidf.toarray() if isinstance(trial_model, GaussianNB) else X_train_tfidf, Y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
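
The message above is a warning, not an error: the model was still trained, but the solver stopped at its iteration limit (LogisticRegression defaults to max_iter=100) before it had fully converged. One way to respond, as the warning itself suggests, is to allow more iterations. A minimal sketch on four invented reviews (the toy data and the value 1000 are assumptions, not tuned settings):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data so the example runs on its own.
toy_texts = ["loved it", "great fun", "hated it", "boring mess"]
toy_labels = [1, 1, 0, 0]

toy_counts = CountVectorizer().fit_transform(toy_texts)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

# Same model as before, but with a higher cap on solver iterations.
patient_model = LogisticRegression(max_iter=1000)
patient_model.fit(toy_tfidf, toy_labels)

# n_iter_ records how many iterations the solver actually needed.
print(f"Iterations used: {patient_model.n_iter_[0]} of {patient_model.max_iter} allowed")
```

On the real data the equivalent change would be trial_model = LogisticRegression(max_iter=1000); whether that is enough to silence the warning there has not been checked here.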


In [26]:
# Convert our test data to the right format by using our 'tfidf_transformer' to process the X_test texts.
# But we need to do all the stuff we did to the training data, too - we have to first convert the text to
# counts, using our 'count_vectorizer' - THEN give those counts to the tfidf transformer!
X_test_counts = count_vectorizer.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
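
Note that the test data goes through '.transform', not '.fit_transform': the vocabulary stays fixed to whatever was learned from the training texts. The sketch below, on invented sentences (an assumption for illustration), shows the consequence: a word the vectorizer never saw during fitting is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test texts.
toy_train_docs = ["good movie", "bad movie"]
toy_test_docs = ["good soundtrack"]  # 'soundtrack' never appears in training

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_train_docs)  # learn the vocabulary from training data only

toy_test_counts = toy_vectorizer.transform(toy_test_docs)  # reuse that vocabulary

print(f"Known words: {sorted(toy_vectorizer.vocabulary_)}")
print(f"Test matrix shape: {toy_test_counts.shape}")
print(f"Test counts: {toy_test_counts.toarray()}")
```

The test row has three columns (bad, good, movie) and only 'good' is counted, so the test matrix always has the same number of columns as the training matrix.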

In [27]:
# Now: let's get some predictions from our model!
test_predictions = trial_model.predict(X_test_tfidf.toarray() if isinstance(trial_model, GaussianNB) else X_test_tfidf)

In [28]:
# Import the 'metrics' module from sklearn; that's what we need to make our performance report
from sklearn import metrics 

# Make a classification report from our test_predictions and the Y_test labels, which are the 'truth':
report = metrics.classification_report(Y_test, test_predictions, target_names = ['Negative (0)', 'Positive (1)'])

print(f"Performance report using {type(trial_model)} model:\n")
print(report)

Performance report using <class 'sklearn.linear_model._logistic.LogisticRegression'> model:

              precision    recall  f1-score   support

Negative (0)       0.91      0.88      0.89      4075
Positive (1)       0.88      0.91      0.89      3925

    accuracy                           0.89      8000
   macro avg       0.89      0.89      0.89      8000
weighted avg       0.89      0.89      0.89      8000
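
To make the precision and recall columns less mysterious, here is a sketch that computes them by hand for the positive class on eight invented predictions (the two lists are assumptions, not our model's output) and checks the result against sklearn's own functions.

```python
from sklearn import metrics

# Hypothetical truth and predictions: 4 negative and 4 positive reviews.
toy_truth = [0, 0, 0, 0, 1, 1, 1, 1]
toy_preds = [0, 0, 0, 1, 1, 1, 0, 1]

# confusion_matrix rows are the true labels, columns are the predicted labels.
tn, fp, fn, tp = metrics.confusion_matrix(toy_truth, toy_preds).ravel()

# Precision: of everything predicted positive, how much really was positive?
precision_pos = tp / (tp + fp)
# Recall: of everything that really was positive, how much did we find?
recall_pos = tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision (1): {precision_pos:.2f}  vs sklearn: {metrics.precision_score(toy_truth, toy_preds):.2f}")
print(f"Recall (1):    {recall_pos:.2f}  vs sklearn: {metrics.recall_score(toy_truth, toy_preds):.2f}")
```

The 'support' column in the report is simply the number of true examples of each class in the test set.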



In [29]:
my_reviews = {
    "<hey i absolutely hate this movie cause it is so good and i could NOT stop thinking about it for the rest of the day>": 1,
    
    "<I really like the storyline of this movie>": 1,
    "<I really love the storyline of this movie>": 1,
    "<this is something i like, just like a chocolate ice cream in the summer>": 1,
    "<it was like watching an old grandma show, but i like it>": 1,

    "<the lighting was very clear and really helped the audience develop a immersive feeling while watching it>": 1,
    "<the lighting was clear and helped the audience develop a immersive feeling while watching it>": 1,
    "<the lighting was very clear and helped the audience develop a immersive feeling while watching it>": 1,
    "<the lighting was clear and really helped the audience develop a immersive feeling while watching it>": 1,

    "<This movie is so weird but this also makes it very interesting to watch. Some part of it is so unexpected but fantastic>": 1,
    "<this is pretty cliche>": 0,

    "<it's so cool>": 1,
    "<IT'S SO COOL>": 1,
    "<i did not understand what is going on>": 0,
    "<I DID NOT UNDERSTAND WHAT IS GOING ON>": 0,

    "<it is so complex and hard to understand>": 0,
    "<I have to say, i like the first season a lot better...what happened???>": 0,
}
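
Several of these reviews differ only in capitalization. With its default settings, CountVectorizer lowercases text and keeps only tokens of two or more letters or digits, so those pairs should look identical to the model. A small sketch to check this (the fitting sentence is an assumption used only to give the vectorizer a vocabulary):

```python
from sklearn.feature_extraction.text import CountVectorizer

probe_vectorizer = CountVectorizer()
# Hypothetical text, used only so the vectorizer knows these words.
probe_vectorizer.fit(["it's so cool and i did not understand what is going on"])

lower_vec = probe_vectorizer.transform(["<it's so cool>"]).toarray()
upper_vec = probe_vectorizer.transform(["<IT'S SO COOL>"]).toarray()
identical = bool((lower_vec == upper_vec).all())

# build_analyzer() returns the function the vectorizer uses to split text into tokens.
analyze = probe_vectorizer.build_analyzer()
cool_tokens = analyze("<IT'S SO COOL>")
short_tokens = analyze("<I DID NOT UNDERSTAND>")

print(f"Upper- and lowercase versions give the same vector: {identical}")
print(f"Tokens for <IT'S SO COOL>: {cool_tokens}")
print(f"Tokens for <I DID NOT UNDERSTAND>: {short_tokens}")
```

The angle brackets, the apostrophe, and one-letter words such as 'i' never reach the model, so any prediction difference between such pairs would have to come from somewhere other than the text features.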

In [30]:
# Now: let's have our model predict 0 or 1, negative or positive, for the reviews we just wrote, and compare to the label:

my_texts = my_reviews.keys()

# Do the transformations like with the other texts - first count vectors, then tf-idf:
texts_as_count_vectors = count_vectorizer.transform(my_texts)
texts_as_tfidf_vectors = tfidf_transformer.transform(texts_as_count_vectors)

# NOW do the predictions on those vectors:
model_predictions = trial_model.predict(texts_as_tfidf_vectors.toarray() if isinstance(trial_model, GaussianNB) else texts_as_tfidf_vectors)


for review_num, (review, label) in enumerate(my_reviews.items(), start=1):
    this_review_prediction = model_predictions[review_num-1]
    print(f"Review {review_num}:")
    print(review)
    print(f"\tModel predicted: {'Negative (0)' if this_review_prediction == 0 else 'Positive (1)'} - My label: {'Negative (0)' if label == 0 else 'Positive (1)'}")
    print(f"\tThe model {'succeeded' if this_review_prediction == label else 'failed'}!\n")
        

Review 1:
<hey i absolutely hate this movie cause it is so good and i could NOT stop thinking about it for the rest of the day>
	Model predicted: Negative (0) - My label: Positive (1)
	The model failed!

Review 2:
<I really like the storyline of this movie>
	Model predicted: Negative (0) - My label: Positive (1)
	The model failed!

Review 3:
<I really love the storyline of this movie>
	Model predicted: Positive (1) - My label: Positive (1)
	The model succeeded!

Review 4:
<this is something i like, just like a chocolate ice cream in the summer>
	Model predicted: Negative (0) - My label: Positive (1)
	The model failed!

Review 5:
<it was like watching an old grandma show, but i like it>
	Model predicted: Negative (0) - My label: Positive (1)
	The model failed!

Review 6:
<the lighting was very clear and really helped the audience develop a immersive feeling while watching it>
	Model predicted: Positive (1) - My label: Positive (1)
	The model succeeded!

Review 7:
<the lighting was clear