
In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
#BT: load the tab-separated IMDB review file into a DataFrame with 'review' and 'class' columns
imdb_sentiment = pd.read_csv('https://raw.githubusercontent.com/niteen11/data301_predictive_analytics_machine_learning/main/data/imdb_labelled.txt', sep='\t', names=['review', 'class'])

In [None]:
imdb_sentiment.head()

Unnamed: 0,review,class
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [None]:
imdb_sentiment.tail()

Unnamed: 0,review,class
743,I just got bored watching Jessice Lange take h...,0
744,"Unfortunately, any virtue in this film's produ...",0
745,"In a word, it is embarrassing.",0
746,Exceptionally bad!,0
747,All in all its an insult to one's intelligence...,0


In [None]:
imdb_sentiment['review'][0]

'A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  '

In [None]:
imdb_sentiment['review'][5]

"The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  "

In [None]:
#BT: the review at index 5 is 114 characters long
len(imdb_sentiment['review'][5])

114

In [None]:
#BT: The longest row has 7,944 characters
max(imdb_sentiment['review'].apply(len))

7944

In [None]:
msg_7944 = imdb_sentiment[imdb_sentiment['review'].apply(len)==7944]

In [None]:
#BT: The review with a length of 7,944 characters is at DataFrame index 136.
# In the raw file it actually sits on line 213, so the parsed row count is off.
# The same thing seems to happen on line 197 of the raw file, which opens a quotation mark
# that is not closed until the next quotation mark, on line 213.
# This review is probably so long because it starts with a quotation mark but does not end
# with one. The pandas parser may treat everything up to the start of the next review that
# begins with a quotation mark, on row 323, as part of the same field.
msg_7944

Unnamed: 0,review,class
136,"In fact, it's hard to remember that the part ...",0


In [None]:
# BT: Only 748 rows are listed, but a visual inspection of the data shows 1,000 rows.
# This confirms the row count issue discussed above: pandas treats everything between a pair of
# quotation marks as a single field, even when that text spans multiple rows of the data set.
imdb_sentiment.shape

(748, 2)
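
One possible way around the row count issue, sketched here on a hypothetical three-line sample rather than on the real file, is to ask pandas to treat quotation marks as ordinary characters by passing `quoting=csv.QUOTE_NONE`. The sample text and the names `default_parse` and `literal_parse` are assumptions made for illustration; the rest of this notebook keeps the original 748-row load.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as imdb_labelled.txt.
# The first review opens a double quote that is not closed until the third line.
sample = (
    '"Great film\t1\n'
    'Boring.\t0\n'
    '"Loved it" said everyone\t1\n'
)

# Default behaviour: text between a pair of double quotes is read as one field.
default_parse = pd.read_csv(io.StringIO(sample), sep='\t', names=['review', 'class'])

# QUOTE_NONE: double quotes are kept as literal characters, so every line is one row.
literal_parse = pd.read_csv(io.StringIO(sample), sep='\t', names=['review', 'class'],
                            quoting=csv.QUOTE_NONE)

print(default_parse.shape)
print(literal_parse.shape)
```

With this option the quotation marks stay inside the review text, which is harmless here because the pre-processing step below strips punctuation anyway.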

# Data Pre-Processing

In [None]:
import string

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords.words('english')[0:12]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll"]

In [None]:
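def review_text_pre_process(text_review):
  # Drop every punctuation character, then rejoin the rest into one string.
  remove_punct = ''.join(char for char in text_review if char not in string.punctuation)
  # Build the stop-word set once per call instead of reloading the list for every word.
  english_stopwords = set(stopwords.words('english'))
  # Keep each token's original casing, but compare against the stop words in lowercase.
  remove_stopwords = [word for word in remove_punct.split() if word.lower() not in english_stopwords]
  return remove_stopwords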

In [None]:
imdb_sentiment['review'].head(10).apply(review_text_pre_process)

0    [slowmoving, aimless, movie, distressed, drift...
1    [sure, lost, flat, characters, audience, nearl...
2    [Attempting, artiness, black, white, clever, c...
3                     [little, music, anything, speak]
4    [best, scene, movie, Gerardo, trying, find, so...
5    [rest, movie, lacks, art, charm, meaning, empt...
6                                 [Wasted, two, hours]
7    [Saw, movie, today, thought, good, effort, goo...
8                                   [bit, predictable]
9    [Loved, casting, Jimmy, Buffet, science, teacher]
Name: review, dtype: object
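
To make the two pre-processing steps easier to follow, here is a small walk-through on the first review. It is only a sketch: `demo_stopwords` is a hypothetical three-word stand-in for the much longer NLTK list, chosen so the example runs without downloading anything.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
demo_stopwords = {'a', 'very', 'about'}

review = 'A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  '

# Step 1: delete every punctuation character (the hyphen in "slow-moving" goes too).
no_punct = ''.join(char for char in review if char not in string.punctuation)

# Step 2: split on whitespace and drop tokens whose lowercase form is a stop word.
tokens = [word for word in no_punct.split() if word.lower() not in demo_stopwords]

print(no_punct)
print(tokens)
```

Because punctuation is removed before splitting, "slow-moving" becomes the single token "slowmoving", which matches the first row of the output above.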

In [None]:
imdb_sentiment.head()

Unnamed: 0,review,class
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


# Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
bag_of_words = CountVectorizer(analyzer=review_text_pre_process).fit(imdb_sentiment['review'])

In [None]:
bag_of_words_trf = bag_of_words.transform(imdb_sentiment['review'])
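
As a rough illustration of what `fit` and `transform` produce, the sketch below runs `CountVectorizer` on a hypothetical three-review corpus. The helper `split_on_whitespace` is an assumed stand-in for `review_text_pre_process`, and all `toy_` names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_on_whitespace(text):
    # Hypothetical analyzer: one token per whitespace-separated word.
    return text.split()


# Hypothetical toy corpus.
toy_reviews = ['good good movie', 'bad movie', 'good acting']

toy_vectorizer = CountVectorizer(analyzer=split_on_whitespace).fit(toy_reviews)
toy_counts = toy_vectorizer.transform(toy_reviews)

# vocabulary_ maps each token to its column index in the count matrix.
print(toy_vectorizer.vocabulary_)
# One row per review, one column per distinct token.
print(toy_counts.shape)
print(toy_counts.toarray())
```

The result of `transform` is a sparse matrix; `toarray()` is used here only because the toy matrix is small enough to print.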

# TF-IDF (Transformer)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
tfidf_fit = TfidfTransformer().fit(bag_of_words_trf)

In [None]:
tfidf_trf = tfidf_fit.transform(bag_of_words_trf)
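
The weighting that `TfidfTransformer` applies with its default settings can be reproduced by hand. The sketch below does so on a hypothetical 3 x 4 count matrix and compares the result with scikit-learn's output; the matrix and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) by 4 terms (columns).
counts = np.array([[0, 0, 2, 1],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0]])

n_docs = counts.shape[0]
# Number of documents in which each term appears at least once.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, matching the default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by idf, then scale each row to unit Euclidean length (norm='l2').
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that appear in fewer documents get a larger idf, so they contribute more to a review's vector than terms that appear almost everywhere.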

# Model Building

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
cust_review_model = MultinomialNB().fit(tfidf_trf,imdb_sentiment['class'])

In [None]:
test_review = imdb_sentiment['review'][6]

In [None]:
test_review

'Wasted two hours.  '

In [None]:
bag_of_words_test_review = bag_of_words.transform([test_review])

In [None]:
tfidf_test_review = tfidf_fit.transform(bag_of_words_test_review)

In [None]:
cust_review_model.predict(tfidf_test_review)[0]

0

In [None]:
imdb_sentiment.head()

Unnamed: 0,review,class
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [None]:
prediction_for_all_reviews = cust_review_model.predict(tfidf_trf)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(imdb_sentiment['class'],prediction_for_all_reviews))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       362
           1       0.98      0.98      0.98       386

    accuracy                           0.98       748
   macro avg       0.98      0.98      0.98       748
weighted avg       0.98      0.98      0.98       748



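BT: There are 362 negative reviews and 386 positive reviews. The model classified 98% of the reviews correctly as positive or negative, as shown in the accuracy row (the f1-scores are also 0.98). Because these predictions are made on the same reviews the model was trained on, this figure is likely to overstate how well the model handles unseen reviews.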
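
The columns of a classification report come from simple counts of correct and incorrect predictions. As a worked example on hypothetical labels for eight reviews (not taken from the IMDB data), accuracy, precision, recall and f1-score for the positive class can be computed like this:

```python
# Hypothetical labels for eight reviews (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives predicted as positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives predicted as negative
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives predicted as negative

accuracy = (tp + tn) / len(pairs)    # share of all reviews classified correctly
precision = tp / (tp + fp)           # share of predicted positives that are truly positive
recall = tp / (tp + fn)              # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Accuracy and f1-score happen to coincide in the report above, but as this toy case shows they are different quantities and need not agree.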

# Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
review_train, review_test, class_train, class_test = train_test_split(imdb_sentiment['review'],imdb_sentiment['class']) 

In [None]:
print(review_train.shape)
print(review_test.shape)
print(class_train.shape)
print(class_test.shape)

(561,)
(187,)
(561,)
(187,)


BT: By default, 187 / (187 + 561) = 25% of the reviews are held out for testing; the remaining 75% (561 reviews) are used for training.
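
The split above is random, so the rows that land in each part change from run to run. A minimal sketch of a repeatable split follows, using hypothetical stand-in arrays with the same number of rows as `imdb_sentiment`; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with the same number of rows as imdb_sentiment.
reviews = np.arange(748)
labels = np.arange(748) % 2

# test_size=0.25 spells out the default; random_state makes the shuffle repeatable.
r_train, r_test, c_train, c_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42)
r_train2, r_test2, _, _ = train_test_split(
    reviews, labels, test_size=0.25, random_state=42)

print(len(r_train), len(r_test))
print(np.array_equal(r_test, r_test2))
```

Fixing `random_state` would also make the accuracy figures further down reproducible between runs.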

# Pipeline Building

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
text_pipeline = Pipeline([
                          ('bag_of_words',CountVectorizer(analyzer=review_text_pre_process)),
                          ('tfidf',TfidfTransformer()),
                          ('classifier', MultinomialNB())
])

In [None]:
text_pipeline.fit(review_train, class_train)

Pipeline(steps=[('bag_of_words',
                 CountVectorizer(analyzer=<function review_text_pre_process at 0x7f8c90268950>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultinomialNB())])
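
Once fitted, a pipeline like this accepts raw strings directly, since it applies the vectorizer and the TF-IDF step before classifying. The sketch below shows the idea on a hypothetical four-review training set; it uses `CountVectorizer`'s default word analyzer instead of `review_text_pre_process` so that it runs without NLTK, and all `toy_` names are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical four-review training set (1 = positive, 0 = negative).
toy_reviews = ['great movie loved it', 'wonderful acting great story',
               'boring waste of time', 'terrible boring plot']
toy_labels = [1, 1, 0, 0]

# Same three steps as text_pipeline, but with the default word analyzer.
toy_pipeline = Pipeline([
    ('bag_of_words', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
toy_pipeline.fit(toy_reviews, toy_labels)

# Raw strings go in; the pipeline vectorizes and weights them before classifying.
toy_predictions = toy_pipeline.predict(['great story', 'boring plot'])
print(toy_predictions)
```

Keeping the three steps in one object also means the vectorizer's vocabulary is learned from the training reviews only, never from the test reviews.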

In [None]:
text_pred = text_pipeline.predict(review_test)

In [None]:
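# Note: classification_report expects (y_true, y_pred). This call passes the predictions first,
# so in the output below precision and recall are swapped for each class and "support"
# counts predicted labels; accuracy and the per-class f1-scores are unaffected.
print(classification_report(text_pred, class_test))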

              precision    recall  f1-score   support

           0       0.75      0.70      0.73        88
           1       0.75      0.79      0.77        99

    accuracy                           0.75       187
   macro avg       0.75      0.75      0.75       187
weighted avg       0.75      0.75      0.75       187



BT: On the 187 reviews held out for testing, the model had an overall accuracy of 75% when predicting whether a review was positive (1) or negative (0).
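
scikit-learn documents `classification_report(y_true, y_pred)` with the true labels first, while the call above passes the predictions first. The sketch below uses hypothetical labels to show what the order changes: precision and recall trade places.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# Documented order: true labels first, predictions second.
precision_ok = precision_score(y_true, y_pred)
recall_ok = recall_score(y_true, y_pred)

# Reversed order: the two metrics trade places.
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

print(precision_ok, recall_ok)
print(precision_swapped, recall_swapped)
```

For the report above, calling `classification_report(class_test, text_pred)` would give the conventional reading, with "support" counting the true labels in each class.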

In [None]:
review_test.iloc[0]

"It was a long time that i didn't see a so charismatic actor on screen.  "

BT: This review was 823rd out of 1,000 in the original data set, but it is now first in the shuffled test set.

In [None]:
class_test.iloc[0]

1

In [None]:
text_pred

array([0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0])

BT: These are the model's predictions for the 187 test reviews, where 1 marks a positive review and 0 a negative one.

In [None]:
class_test.iloc[1]

1

In [None]:
class_test

675    1
634    1
341    0
388    0
195    1
      ..
697    1
713    1
28     1
715    1
44     0
Name: class, Length: 187, dtype: int64
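
Since `class_test` keeps the original DataFrame index and `text_pred` is in the same row order, the two can be placed side by side to find misclassified reviews. The sketch below uses only the first five values of each as printed above; the `toy_` names, `comparison` and `misclassified` are made up for illustration.

```python
import numpy as np
import pandas as pd

# First five true labels from class_test above, with their original index values.
toy_class_test = pd.Series([1, 1, 0, 0, 1], index=[675, 634, 341, 388, 195], name='class')
# First five predictions from text_pred above, in the same row order.
toy_text_pred = np.array([0, 1, 0, 0, 1])

# Attach the original index to the predictions so both columns line up row by row.
comparison = pd.DataFrame({
    'actual': toy_class_test,
    'predicted': pd.Series(toy_text_pred, index=toy_class_test.index),
})

# Rows where the prediction disagrees with the label, keyed by original index.
misclassified = comparison[comparison['actual'] != comparison['predicted']]
print(misclassified)
```

The index values in `misclassified` can then be passed to `imdb_sentiment.loc[...]` to read the text of the reviews the model got wrong.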