In [36]:
# ----------------------------------------------------------------------------
# Title: Assignment 3.2
# Author: Surenther Selvaraj
# Date: 26 September 2025
# Modified By: Surenther Selvaraj
# Description: Sentiment Analysis and Preprocessing Text
# Data: https://www.kaggle.com/c/word2vec-nlp-tutorial/data
# ----------------------------------------------------------------------------

In [37]:
# --- Importing Libraries ---
import pandas as pd
from textblob import TextBlob
from sklearn.metrics import accuracy_score
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [38]:
# Import the movie review data

# The name of the file
file_name = "labeledTrainData.tsv"

# Using pandas.read_csv with the correct delimiter for a .tsv file
df = pd.read_csv(file_name, sep='\t')

# Check if the data is loaded properly by displaying the first few rows
print("\nFirst 5 rows of the DataFrame:")
print(df.head())



First 5 rows of the DataFrame:
       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [39]:
# Count the number of positive and negative reviews
sentiment_counts = df['sentiment'].value_counts()

# Print the results with clear labels
print("Number of positive and negative reviews:")
print(f"Positive Reviews (1): {sentiment_counts[1]}")
print(f"Negative Reviews (0): {sentiment_counts[0]}")

Number of positive and negative reviews:
Positive Reviews (1): 12500
Negative Reviews (0): 12500


In [40]:
# --- TextBlob Sentiment Analysis ---

# Analyzes the sentiment of a given text using TextBlob. Returns 'Positive' if polarity >= 0, otherwise 'Negative'.
def get_textblob_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity >= 0:
        return 'Positive'
    else:
        return 'Negative'

# Apply the sentiment analysis function to the 'review' column
print("Classifying movie reviews using TextBlob...")
df['TextBlob_Sentiment'] = df['review'].apply(get_textblob_sentiment)

# Display Results
print("\nFirst 5 rows with the new TextBlob_Sentiment column:")
print(df[['review', 'sentiment', 'TextBlob_Sentiment']].head())
print("\n--- Summary of TextBlob Sentiment Classification ---")
print(df['TextBlob_Sentiment'].value_counts())


Classifying movie reviews using TextBlob...

First 5 rows with the new TextBlob_Sentiment column:
                                              review  sentiment  \
0  With all this stuff going down at the moment w...          1   
1  \The Classic War of the Worlds\" by Timothy Hi...          1   
2  The film starts with a manager (Nicholas Bell)...          0   
3  It must be assumed that those who praised this...          0   
4  Superbly trashy and wondrously unpretentious 8...          1   

  TextBlob_Sentiment  
0           Positive  
1           Positive  
2           Negative  
3           Positive  
4           Negative  

--- Summary of TextBlob Sentiment Classification ---
TextBlob_Sentiment
Positive    19017
Negative     5983
Name: count, dtype: int64


In [41]:
# --- TextBlob Accuracy Calculation ---

# Map the original numerical sentiment to text labels for comparison
df['true_sentiment_text'] = df['sentiment'].map({1: 'Positive', 0: 'Negative'})

# Calculate the accuracy by comparing the true labels to TextBlob's predictions
accuracy = accuracy_score(df['true_sentiment_text'], df['TextBlob_Sentiment'])

# Display Results

print("\n--- Model Accuracy ---")
print(f"The accuracy of the TextBlob model is: {accuracy:.2f}")


--- Model Accuracy ---
The accuracy of the TextBlob model is: 0.69


### Conclusion and Analysis for TextBlob

The model's accuracy is calculated by comparing the sentiment predicted by TextBlob with the true sentiment labels provided in the dataset. Because the dataset is perfectly balanced, with 12,500 positive and 12,500 negative reviews, random guessing has an expected accuracy of 50%. The TextBlob model, which relies on a pre-built sentiment lexicon, performs well above this baseline: its accuracy was 69% (0.69).

Therefore, the TextBlob model is clearly better than random guessing, which makes it a reasonable indicator of how well movie review sentiment can be classified without any training on this dataset. Note, however, that it labeled 19,017 of the 25,000 reviews as positive, so with a cutoff of polarity >= 0 its predictions lean heavily toward the positive class.
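
One way to see where the remaining errors fall is to break accuracy down by class. The summary output shows 19,017 positive predictions against only 12,500 truly positive reviews, so at least 6,517 of those predictions must be false positives. Below is a minimal sketch of such a breakdown, using hypothetical toy labels rather than the notebook's DataFrame; the helper names `confusion_counts` and `per_class_recall` are illustrative and not part of any library.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="Positive"):
    """Count true/false positives and negatives for a binary labeling."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}


def per_class_recall(c):
    """Fraction of each true class that was labeled correctly."""
    pos_total = c["TP"] + c["FN"]
    neg_total = c["TN"] + c["FP"]
    return {
        "positive_recall": c["TP"] / pos_total if pos_total else 0.0,
        "negative_recall": c["TN"] / neg_total if neg_total else 0.0,
    }


# Hypothetical toy labels (assumption: not taken from the dataset)
y_true = ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"]
y_pred = ["Positive", "Positive", "Positive", "Negative", "Positive", "Negative"]

counts = confusion_counts(y_true, y_pred)
recall = per_class_recall(counts)
print(counts)
print(recall)
```

On the notebook's data, the same helpers could be called with `df['true_sentiment_text']` and `df['TextBlob_Sentiment']` to check whether the negative class accounts for most of the misclassifications.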

In [42]:
# Download the VADER lexicon
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/surenther/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [43]:
# --- VADER Sentiment Analysis ---

# Create the analyzer once and reuse it for every review
analyzer = SentimentIntensityAnalyzer()

# Analyzes the sentiment of a given text using VADER. Returns 'Positive' if the compound score > 0, otherwise 'Negative'.
def get_vader_sentiment(text):
    """
    Analyzes the sentiment of a given text using VADER.
    Returns 'Positive' if compound score > 0, otherwise 'Negative'.
    """
    compound_score = analyzer.polarity_scores(text)['compound']
    if compound_score > 0:
        return 'Positive'
    else:
        return 'Negative'

# Apply the sentiment analysis function to the 'review' column
print("Classifying movie reviews using VADER...")
df['VADER_Sentiment'] = df['review'].apply(get_vader_sentiment)

# Map the original numerical sentiment to text labels for comparison
df['true_sentiment_text'] = df['sentiment'].map({1: 'Positive', 0: 'Negative'})

# Calculate the accuracy by comparing the true labels to VADER's predictions
accuracy = accuracy_score(df['true_sentiment_text'], df['VADER_Sentiment'])

# --- Display Results ---
print("\n--- Model Accuracy ---")
print(f"The accuracy of the VADER model is: {accuracy:.2f}")

Classifying movie reviews using VADER...

--- Model Accuracy ---
The accuracy of the VADER model is: 0.69


### Conclusion and Analysis for VADER
The VADER model, a lexicon- and rule-based sentiment analyzer, achieved an accuracy of 69% in classifying the movie reviews, the same as TextBlob after rounding. This is well above the 50% expected from random guessing on this balanced dataset, and it shows that VADER can identify sentiment without being trained on this movie review data. At the same time, an accuracy of 0.69 means that roughly three in ten reviews are misclassified, which points to limitations in handling the nuances of natural language, such as sarcasm, complex sentence structures, or domain-specific terminology that differs from its pre-built lexicon. Overall, VADER provides a fast and solid baseline for sentiment analysis, but its performance could likely be surpassed by more sophisticated machine learning models like those in the Hugging Face Transformers library.
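
The two analyzers use slightly different cutoffs: the TextBlob cell labels a polarity of exactly 0 as positive (`>= 0`), while the VADER cell labels a compound score of exactly 0 as negative (`> 0`). The sketch below illustrates how that tie rule alone changes the labels of neutral texts. The scores are hypothetical toy values, and `label_from_score` is an illustrative helper, not a TextBlob or NLTK function.

```python
def label_from_score(score, threshold=0.0, inclusive=True):
    """Map a sentiment score to a label; `inclusive` decides who gets the tie."""
    if inclusive:
        return "Positive" if score >= threshold else "Negative"
    return "Positive" if score > threshold else "Negative"


# Hypothetical scores (assumption): two of them are exactly neutral
scores = [-0.6, 0.0, 0.0, 0.3]

inclusive_labels = [label_from_score(s, inclusive=True) for s in scores]
strict_labels = [label_from_score(s, inclusive=False) for s in scores]

print("score >= 0:", inclusive_labels)
print("score >  0:", strict_labels)
```

Because the rules differ only at a score of exactly 0, any comparison between the two models' accuracies also reflects how often each one returns a neutral score.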

In [44]:
# Download the stopwords list if not already available
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    print("Downloading NLTK stopwords...")
    nltk.download('stopwords')

# Build the stop-word set and the stemmer once, rather than on every call
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

# Define a function for text cleaning and stemming
def clean_and_stem_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags (common in this dataset)
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation and special characters, keeping only letters and numbers
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Tokenize on whitespace and remove stop words
    words = text.split()
    cleaned_words = [word for word in words if word not in stop_words]
    # Apply PorterStemmer
    stemmed_words = [porter.stem(word) for word in cleaned_words]
    # Join the words back into a single string
    return ' '.join(stemmed_words)

# Apply the cleaning and stemming function
df['stemmed_review'] = df['review'].apply(clean_and_stem_text)

print("Text preprocessing and stemming complete.")
print("\nFirst 5 stemmed reviews:")
print(df['stemmed_review'].head())

# --- Vectorization ---

# Create a Bag-of-Words matrix
print("\nCreating Bag-of-Words matrix...")
count_vectorizer = CountVectorizer()
bag_of_words_matrix = count_vectorizer.fit_transform(df['stemmed_review'])

print(f"Bag-of-Words matrix dimensions: {bag_of_words_matrix.shape}")

# Create a TF-IDF matrix
print("\nCreating TF-IDF matrix...")
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['stemmed_review'])

print(f"TF-IDF matrix dimensions: {tfidf_matrix.shape}")

# Verify dimensions
if bag_of_words_matrix.shape[0] == df.shape[0] and tfidf_matrix.shape[0] == df.shape[0]:
    print("\nVerified: Number of rows in matrices match original DataFrame.")
else:
    print("\nWarning: Number of rows in matrices do not match original DataFrame.")

Text preprocessing and stemming complete.

First 5 stemmed reviews:
0    stuff go moment mj ive start listen music watc...
1    classic war world timothi hine entertain film ...
2    film start manag nichola bell give welcom inve...
3    must assum prais film greatest film opera ever...
4    superbl trashi wondrous unpretenti 80 exploit ...
Name: stemmed_review, dtype: object

Creating Bag-of-Words matrix...
Bag-of-Words matrix dimensions: (25000, 112735)

Creating TF-IDF matrix...
TF-IDF matrix dimensions: (25000, 112735)

Verified: Number of rows in matrices match original DataFrame.
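
One detail of the cleaning step is worth checking. The function replaces HTML tags with an empty string, so two words separated only by a tag (for example a `<br />` line break after a period) are fused into a single token once punctuation is stripped. The sketch below reproduces only the lowercase and regex steps on a hypothetical sample sentence and compares an empty replacement with a space. The helper name `strip_tags_and_split` is illustrative, and how much this affects the vocabulary size reported above was not measured.

```python
import re


def strip_tags_and_split(text, replacement):
    """Lowercase, replace HTML tags with `replacement`, drop punctuation, split."""
    text = text.lower()
    text = re.sub(r'<.*?>', replacement, text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()


# Hypothetical sample (assumption): not a review from the dataset
sample = "Great film.<br /><br />But too long."

joined = strip_tags_and_split(sample, '')   # same replacement as the notebook
spaced = strip_tags_and_split(sample, ' ')  # alternative: keep a word boundary

print(joined)
print(spaced)
```

If fused tokens turn out to be common, replacing tags with a space would likely shrink the feature count, though the matrices and figures above would then need to be regenerated.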


### Conclusion
Based on the output, the data preprocessing steps were successful. The Bag-of-Words and TF-IDF matrices were created, and their shared dimensions, (25000, 112735), confirm that each has one row per review in the original dataset and 112,735 columns, one for each unique stemmed token in the vocabulary.
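
To make the contents of these matrices concrete, the sketch below builds both representations by hand for a hypothetical three-document corpus. It assumes the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` uses by default, and it tokenizes with a plain `split()`, whereas the scikit-learn vectorizers by default also drop single-character tokens.

```python
import math
from collections import Counter

# Hypothetical toy corpus (assumption): already cleaned and stemmed
docs = ["film great film", "film bad", "great cast"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per unique token, in sorted order
vocab = sorted({token for doc in tokenized for token in doc})

# Bag-of-Words: raw term counts per document
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of zeros are left as-is)."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else row


# TF-IDF: term count times IDF, then each row normalized to length 1
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in bow]

print(vocab)
for counts_row, weights_row in zip(bow, tfidf):
    print(counts_row, [round(w, 3) for w in weights_row])
```

Both matrices share the same shape because they use the same vocabulary; TF-IDF only reweights the counts so that tokens appearing in fewer documents carry more weight.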

Combining this with the earlier findings from the VADER and TextBlob analyses (both with an accuracy of 0.69), we can draw a cohesive conclusion. The lexicon-based models provided a respectable baseline and outperformed random guessing, but they also showed the limits of a fixed, rule-based approach. The Bag-of-Words and TF-IDF matrices now provide a solid foundation for building a custom machine learning model. They are numerical representations of the text, which is the input that trained classifiers (e.g., Logistic Regression, Support Vector Machines) require in order to learn from the data itself. With these matrices, we can now attempt to build a model that learns the nuances of movie review language, potentially reaching a much higher accuracy than the pre-built lexicon-based analyzers.
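
As a rough illustration of that next step, the sketch below chains a `TfidfVectorizer` and a `LogisticRegression` classifier in a scikit-learn pipeline. This is not part of the assignment's results: the training and test texts are hypothetical toy examples, and the hyperparameters are the library defaults rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (assumption): 1 = positive, 0 = negative
train_texts = [
    "great film loved it",
    "wonderful acting great story",
    "loved the wonderful cast",
    "terrible film hated it",
    "awful acting terrible story",
    "hated the awful plot",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer on the training texts only,
# then trains the classifier on the resulting TF-IDF matrix
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Hypothetical unseen texts (assumption)
test_texts = ["great wonderful loved", "terrible awful hated"]
predictions = model.predict(test_texts)
print(list(predictions))
```

On the real data, one would typically split `df['stemmed_review']` and `df['sentiment']` into training and test sets first (for example with `train_test_split`) and fit the pipeline on the training portion only, so that the reported accuracy reflects reviews the model has not seen.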
