# Movie Reviews for Text and Sentiment Analysis

The data in this homework consists of 20k movie reviews that have already been labelled as positive or negative. We will apply our text toolbox to see whether we can fit an effective supervised model.

The Movie Review data can be [downloaded from this link](https://drive.google.com/uc?export=download&id=1UA9CyRd8y7Wi4RKruXfItXadT3hY92bE).

## Importing Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import re
from sklearn.model_selection import train_test_split

## Loading Data

In [None]:
df = pd.read_csv('movie_reviews_20k.csv')
df.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


Each row contains the full text of a review and a label of 0 (bad) or 1 (good), assigned by a human labeller.


The reviews contain HTML tags, which we need to remove. The following code strips them.

In [3]:
# remove html tags
df['text'] = df['text'].apply(lambda x: re.sub('<[^<]+?>', '', x))
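
To see what the pattern does, here is a small check on a made-up string (the sample text is hypothetical, not taken from the dataset). The non-greedy `<[^<]+?>` matches one tag at a time, so the text between tags survives. One side effect worth knowing: tags are replaced by the empty string, so words separated only by a `<br />` end up joined.

```python
import re

TAG_PATTERN = re.compile(r'<[^<]+?>')  # same pattern as above, precompiled

# Hypothetical sample, not taken from the dataset
sample = 'Great cast.<br /><br />Weak <i>plot</i>, though.'

stripped = TAG_PATTERN.sub('', sample)
print(stripped)  # Great cast.Weak plot, though.

# Replacing each tag with a space instead keeps neighbouring words apart
spaced = TAG_PATTERN.sub(' ', sample)
print(spaced)
```

If joined words turn out to matter for the vocabulary, substituting a space is one possible alternative; the notebook itself uses the empty string.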

In [4]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download WordNet data for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Stopwords list
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text_list):
    cleaned_texts = []
    for text in text_list:
        # Remove punctuation and lowercase the text
        text = re.sub(r'[^\w\s]', '', text.lower())
        # remove numbers
        text = re.sub(r'[\d]', '', text)

        # Tokenize the text - split into an array of words.
        words = word_tokenize(text)

        # Remove stopwords and lemmatize each word
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

        # Join the words back into a single string
        cleaned_text = ' '.join(words)
        cleaned_texts.append(cleaned_text)

    return cleaned_texts

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
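
The order of operations in `clean_text` matters: punctuation is stripped before stop words are filtered, so a contraction such as "don't" becomes "dont" and may no longer match the stop-word list. The sketch below illustrates this with the standard library only. It is a simplification, not the notebook's function: it assumes a tiny hypothetical stop-word set, uses `str.split` in place of `word_tokenize`, and skips lemmatization.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's full English list
STOP_WORDS = {'i', 'the', 'a', 'this', "don't", 'in'}

def simple_clean(text):
    """Lowercase, drop punctuation and digits, then filter stop words."""
    text = re.sub(r'[^\w\s]', '', text.lower())  # punctuation removed first
    text = re.sub(r'\d', '', text)               # then digits
    words = text.split()                         # stand-in for word_tokenize
    return ' '.join(w for w in words if w not in STOP_WORDS)

print(simple_clean("I don't like the 2 sequels in this series!"))
# dont like sequels series
```

Here "dont" survives because only the apostrophe form is in the stop-word set.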


In [5]:
df['clean_review'] = clean_text(df['text'])

In [6]:
df.head()

Unnamed: 0,text,label,clean_review
0,I grew up (b. 1965) watching and loving the Th...,0,grew b watching loving thunderbird mate school...
1,"When I put this movie in my DVD player, and sa...",0,put movie dvd player sat coke chip expectation...
2,Why do people who do not know what a particula...,0,people know particular time past like feel nee...
3,Even though I have great interest in Biblical ...,0,even though great interest biblical movie bore...
4,Im a die hard Dads Army fan and nothing will e...,1,im die hard dad army fan nothing ever change g...


In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['clean_review'], df['label'], test_size=0.2, random_state=42)

In [8]:
print(f"X_train Shape : {X_train.shape}")
print(f"X_test  Shape : {X_test.shape}")
print(f"y_train Shape : {y_train.shape}")
print(f"y_test  Shape : {y_test.shape}")

X_train Shape : (15999,)
X_test  Shape : (4000,)
y_train Shape : (15999,)
y_test  Shape : (4000,)
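
These shapes imply that the DataFrame holds 19,999 rows rather than an even 20,000. As far as I understand scikit-learn's behaviour, when only a fractional `test_size` is given the test count is rounded up and the remainder goes to training. The arithmetic below is a sketch under that assumption.

```python
import math

n_samples = 15999 + 4000   # rows implied by the shapes printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size rounds up
n_train = n_samples - n_test                # remainder goes to training

print(n_samples, n_train, n_test)  # 19999 15999 4000
```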


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.7, min_df=2)

train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

feature_names = vectorizer.get_feature_names_out()

In [10]:
print(f"Train Vectors Shape : {train_vectors.shape}")
print(f"Test Vectors Shape  : {test_vectors.shape}")

Train Vectors Shape : (15999, 37509)
Test Vectors Shape  : (4000, 37509)
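
The 37,509 columns are the terms that survive the two document-frequency filters: `min_df=2` (an integer, so an absolute count) drops terms seen in fewer than two training reviews, and `max_df=0.7` (a float, so a proportion) drops terms seen in more than 70% of them. The sketch below reproduces only that filtering step on a hypothetical toy corpus. `filter_vocab` is an illustrative helper, not part of scikit-learn, and it computes no TF-IDF weights.

```python
from collections import Counter

# Hypothetical toy corpus of already-cleaned reviews
docs = [
    'movie great acting',
    'movie bad acting',
    'movie boring plot',
    'movie great plot twist',
]

def filter_vocab(docs, min_df=2, max_df=0.7):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count each term once per document it appears in
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

print(filter_vocab(docs))  # ['acting', 'great', 'plot']
```

"movie" appears in every document and is dropped by `max_df`; "bad", "boring", and "twist" appear only once and are dropped by `min_df`.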


I'm using an XGBoost classifier.

In [11]:
# importing XGBClassifier from the xgboost library
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

In [12]:
# initializing an XGBClassifier with the log-loss evaluation metric and fitting
# it on the train vectors

model = XGBClassifier(eval_metric='logloss')

model.fit(train_vectors, y_train)

In [13]:
# getting hard 0/1 predictions on the test vectors and computing the AUC from them
# (roc_auc_score is usually given scores such as predict_proba(...)[:, 1] instead)
y_pred = model.predict(test_vectors)
auc = roc_auc_score(y_test, y_pred)
print(f"AUC : {auc:.3f}")

AUC : 0.841


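Our XGBoost classifier achieves an AUC of 0.841. Because this score is computed from hard 0/1 predictions rather than predicted probabilities, it equals the balanced accuracy at the default decision threshold; scoring the `predict_proba` output would measure the model's ranking quality more directly.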
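
A note on how this number was obtained: `roc_auc_score` was given the output of `model.predict`, which is a 0/1 label, not a probability. With only two distinct score values the ROC curve has a single interior point, and the area under it reduces to balanced accuracy. The sketch below shows the difference on hypothetical labels and probabilities. `roc_auc` is a small pairwise implementation written for illustration, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]
hard = [int(p >= 0.5) for p in probs]  # labels after thresholding at 0.5

auc_from_probs = roc_auc(y_true, probs)
auc_from_labels = roc_auc(y_true, hard)

# Balanced accuracy of the thresholded labels, computed directly
tpr = sum(h == 1 for y, h in zip(y_true, hard) if y == 1) / 3
tnr = sum(h == 0 for y, h in zip(y_true, hard) if y == 0) / 3

print(f"AUC from probabilities : {auc_from_probs:.3f}")   # 0.889
print(f"AUC from hard labels   : {auc_from_labels:.3f}")  # 0.833
print(f"Balanced accuracy      : {(tpr + tnr) / 2:.3f}")  # 0.833
```

The two AUC values differ on this toy data; how large the gap is for the actual model can only be seen by rerunning the cell with probabilities.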

In [14]:
# getting the model's predicted probabilities and extracting the 2nd column
# creating a dataframe of y_test and y_pred_prob values
# sorting this dataframe based on y_pred_prob values and filtering the rows
# where y_test = 1
# the top row of this would be the review which has the lowest probability but
# has the label = 1

y_pred_prob = model.predict_proba(test_vectors)[:,1]

y_pred_prob_df = pd.DataFrame({'y_test': y_test, 'y_pred_prob': y_pred_prob})
y_pred_prob_df = y_pred_prob_df.sort_values(by='y_pred_prob')

In [15]:
y_pred_prob_df[y_pred_prob_df['y_test'] == 1].head(1)

Unnamed: 0,y_test,y_pred_prob
2937,1,0.002428


In [16]:
# printing the review by its index label (2937); y_test keeps df's original
# index, so .loc is the matching lookup
print("Review : ")
df.loc[2937, 'text']

Review : 


'**SPOILERS AHEAD**It is really unfortunate that a movie so well produced turns out to besuch a disappointment. I thought this was full of (silly) clichés andthat it basically tried to hard. To the (American) guys out there: how many of you spend yourtime jumping on your girlfriend\'s bed and making monkeysounds? To the (married) girls: how many of you have suddenlygone from prudes to nymphos overnight--but not with yourhusband? To the French: would you really ask about someonebeing "à la fac" when you know they don\'t speak French? Wouldn\'tyou use a more common word like "université"? I lived in France for a while and I sort of do know and understandEurope (and I love it), but my (German) roommate and I found thispretty insulting overall. It looked like a movie funded by theEuropean Parliament, and it tried too hard basically. It had allsorts of differences that it tried to tie together (not a bad thing initself) but the result is at best awkward, but in fact ridiculous--toomany clas

The reviewer focuses mostly on the film's flaws, dedicating much of the review to criticism and ending with only a slightly positive comment about its form. I did manage to find the original review [here](https://www.imdb.com/title/tt0283900/reviews/?item=rw0808450&ref_=ext_shr_lnk).

Although the user gave it a rating of 7 (which likely led the labeler to mark it as a positive review), the model's predicted probability of about 0.002 makes sense, since the review is largely negative with only brief praise.

In [17]:
# initializing the sentiment analyzer

from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [18]:
# applying the sentiment analyzer on the review

text = df.loc[2937, 'text']
scores = analyzer.polarity_scores(text)
print(scores)

{'neg': 0.103, 'neu': 0.789, 'pos': 0.108, 'compound': 0.5735}


In [19]:
print(f"Negativity Score : {scores['neg']:.3f}")
print(f"Positivity Score : {scores['pos']:.3f}")
print(f"Neutrality Score : {scores['neu']:.3f}")
print(f"Compound Score   : {scores['compound']:.3f}")

Negativity Score : 0.103
Positivity Score : 0.108
Neutrality Score : 0.789
Compound Score   : 0.574


The sentiment analysis only partly supports my earlier conclusion. The negative and positive proportions are nearly equal (0.103 vs. 0.108) and most of the text scores as neutral (0.789), which fits a mixed review: it ends with some praise for the film's technical aspects but spends most of its length critiquing the plot, characters, and cultural clichés, often through rhetorical questions whose wording is neutral but whose tone is negative. A lexicon-based tool like VADER is unlikely to pick up that tone, whereas the model predicts a probability close to 0, indicating a negative review.

The compound score of 0.574, however, is well above the conventional 0.05 cut-off for positive text, so VADER's overall verdict is positive and agrees with the human label rather than with the model. The labeler likely marked the review as positive based on the user's rating, and the praise near the end probably drives VADER's compound score.
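
For context on reading these numbers: as I understand VADER's documentation and source, the compound score is the sum of word valences squashed into the range -1 to 1 by x / sqrt(x² + α) with α = 15, and the commonly recommended cut-offs are 0.05 and above for positive and -0.05 and below for negative. The sketch below encodes those two rules. The helper names are hypothetical and are not part of NLTK.

```python
import math

ALPHA = 15  # normalization constant, assumed from VADER's source

def normalize(valence_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label using the conventional cut-offs."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

print(label_from_compound(0.5735))  # positive
print(round(normalize(2.0), 3))
```

By these cut-offs the review's compound score of 0.5735 counts as positive, even though the `pos` and `neg` proportions are almost identical.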

In [20]:
# installing the eli5 library

!pip install eli5

import eli5



In [21]:
# using the show_weights() method to get the top 20 features of our XGBClassifier model

eli5.show_weights(model, top=20, feature_names=feature_names)

Weight,Feature
0.0196,worst
0.0157,waste
0.0142,bad
0.0097,stupid
0.0096,supposed
0.0095,worse
0.0094,awful
0.0088,boring
0.0085,wonderful
0.0072,nothing


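The weights eli5 reports for an XGBoost model are feature importances, so they show how much a word matters to the model but not in which direction it pushes a prediction. Judging by their meaning, the following words from the top 20 contribute most to negative predictions: worst, waste, bad, stupid, worse, awful, boring, terrible, nothing, horrible, poorly, and poor.

Words like wonderful, excellent, great, perfect, and loved contribute most to positive predictions. Only the first ten rows of the top-20 table are reproduced above.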