# Movie Review Sentiment Analysis

## 0. Project Overview

This project demonstrates the application of Natural Language Processing (NLP) techniques to perform sentiment analysis on movie reviews. We will train a supervised machine learning model to predict whether a given movie review is positive or negative.

## Dataset

We utilize the movie review dataset from the Natural Language Toolkit (NLTK) library. The dataset consists of:

- 1000 positive reviews
- 1000 negative reviews

Because the two classes are balanced, plain accuracy is a reasonable metric for evaluating our binary classification model.

## Methodology

Our approach involves:

1. Data preprocessing
2. Feature extraction using NLP techniques
3. Model training and evaluation

In [1]:
import re
import string

import nltk
import pandas as pd
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fetch the corpus, the tokenizer models, and the stop word list used below
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')  # required by stopwords.words('english')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/nivyni/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /Users/nivyni/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')

# Build (review_text, label) pairs; each review is rejoined from its tokens
pos_document = [(' '.join(movie_reviews.words(file_id)), 'pos') for file_id in positive_fileids]
neg_document = [(' '.join(movie_reviews.words(file_id)), 'neg') for file_id in negative_fileids]

# Lists of positive and negative review texts
pos_list = [text for text, _ in pos_document]
neg_list = [text for text, _ in neg_document]

In [3]:
reviews_list = pos_document + neg_document
reviews_df = pd.DataFrame(reviews_list, columns=['reviews','label'])

In [5]:
reviews_X = reviews_df['reviews'].to_frame()
reviews_y = reviews_df['label'].to_frame()
reviews_X_dev, reviews_X_test,reviews_y_dev,reviews_y_test = train_test_split(reviews_X,reviews_y, test_size=0.2, random_state=0)

## 1. Data preprocessing

Remove hashtags, hyperlinks, stop words, and punctuation from the reviews. Each step is applied identically to the development and test sets.

In [6]:
# Remove hashtags from reviews
reviews_X_dev['reviews'] = reviews_X_dev['reviews'].str.replace(r'#\w+', '', regex=True)
reviews_X_test['reviews'] = reviews_X_test['reviews'].str.replace(r'#\w+', '', regex=True)

In [7]:
# Remove URLs from reviews
url_pattern = r'(https?://\S+|www\.\S+)'
reviews_X_dev['reviews'] = reviews_X_dev['reviews'].str.replace(url_pattern, '', regex=True)
reviews_X_test['reviews'] = reviews_X_test['reviews'].str.replace(url_pattern, '', regex=True)

In [8]:
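# Remove stop words from reviews in a single pass (one alternation pattern
# instead of one full scan of the data per stop word)
stop = set(stopwords.words('english'))
stop_pattern = r'\b(?:' + '|'.join(re.escape(w) for w in sorted(stop, key=len, reverse=True)) + r')\b'
reviews_X_dev['reviews'] = reviews_X_dev['reviews'].str.replace(stop_pattern, '', regex=True)
reviews_X_test['reviews'] = reviews_X_test['reviews'].str.replace(stop_pattern, '', regex=True)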

In [9]:
# Remove punctuation from reviews
punctuation_pattern = rf'[{re.escape(string.punctuation)}]'
reviews_X_dev['reviews'] = reviews_X_dev['reviews'].str.replace(punctuation_pattern, '', regex=True)
reviews_X_test['reviews'] = reviews_X_test['reviews'].str.replace(punctuation_pattern, '', regex=True)
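
The four cleaning steps above can be checked on a single string before running them on the full DataFrame. The following is a minimal sketch using only the standard library; `clean_review`, the toy stop word set, and the sample sentence are illustrative assumptions rather than part of the notebook, and the final whitespace collapse is an extra convenience step that the cells above do not perform.

```python
import re
import string

def clean_review(text, stop_words):
    """Apply the hashtag, URL, stop word, and punctuation steps to one string."""
    text = re.sub(r'#\w+', '', text)                                # hashtags
    text = re.sub(r'(https?://\S+|www\.\S+)', '', text)             # URLs
    stop_pattern = r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b'
    text = re.sub(stop_pattern, '', text)                           # stop words
    text = re.sub(rf'[{re.escape(string.punctuation)}]', '', text)  # punctuation
    return ' '.join(text.split())  # collapse leftover whitespace (not in the notebook)

toy_stop_words = {'the', 'is', 'a', 'at'}  # tiny illustrative list, not NLTK's
cleaned = clean_review('the plot is great , see www.example.com #mustwatch !', toy_stop_words)
print(cleaned)  # plot great see
```

Note that the order matters: punctuation is stripped last so that the URL and hashtag patterns can still rely on `.`, `/`, and `#`.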

### Apply Stemming to Development and Test Datasets

In this step, we'll apply the Porter stemming algorithm to our review texts. Stemming strips suffixes so that inflected variants of a word collapse to a common stem (which is not always a dictionary word). This shrinks the vocabulary and lets related word forms share a single feature.

We'll use the Porter stemmer from the NLTK library and apply it to both our development and test datasets.

In [10]:
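stemmer = PorterStemmer()  # create once and reuse for every review

def stem_sentence(sentence):
    """Tokenize a review and rejoin the Porter stems of its tokens."""
    return ' '.join(stemmer.stem(word) for word in word_tokenize(sentence))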

# Apply stemming to development and test datasets
reviews_X_dev['reviews_stem'] = reviews_X_dev['reviews'].apply(stem_sentence)
reviews_X_test['reviews_stem'] = reviews_X_test['reviews'].apply(stem_sentence)
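
To get a feel for what the stemmer does, it can help to run it on a few individual tokens. The sketch below calls `PorterStemmer.stem` directly (no tokenizer data is needed for this); the sample tokens are made up for illustration and are not drawn from the corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens only; they are not taken from the corpus
sample_tokens = ['running', 'movies', 'acting', 'acted']
stems = [stemmer.stem(token) for token in sample_tokens]

for token, stem in zip(sample_tokens, stems):
    print(f'{token} -> {stem}')
```

Here `acting` and `acted` both map to `act`, while `movies` becomes `movi`, which shows that a stem is a shared key for related forms and need not be a real word.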

## 2. Feature Extraction

After preprocessing our text data, we'll now perform feature extraction using the Bag of Words model. This step converts our text data into a numerical format that machine learning algorithms can process.


### Logic Behind Bag of Words Feature Extraction

The Bag of Words (BoW) model is a simple yet powerful technique for representing text data in machine learning. Here's the underlying logic:

1. **Vocabulary Creation**:
   - The `fit` method of CountVectorizer scans all documents in the development set.
   - It creates a vocabulary of unique words across all documents.
   - Each word in this vocabulary becomes a feature.

2. **Vector Representation**:
   - Each document (review in our case) is represented as a vector.
   - The length of this vector is equal to the size of the vocabulary.
   - Each element in the vector represents a word count.

3. **Sparse Matrix**:
   - The result is a sparse matrix where:
     - Rows represent documents (reviews)
     - Columns represent words in the vocabulary
     - Cell values represent the count of a word in a document

4. **Feature Matrix Creation**:
   - `fit_transform` on the development set:
     - Creates the vocabulary
     - Transforms the development set into a feature matrix
   - `transform` on the test set:
     - Uses the vocabulary created from the development set
     - Ensures consistency in features between development and test sets

5. **Importance of Separate Handling**:
   - We only `fit` on the development set to prevent data leakage.
   - The test set is transformed using the vocabulary from the development set, simulating real-world scenarios where we'd apply our model to unseen data.

This approach turns our text data into a numerical format that machine learning models can process, while maintaining the frequency information of words in each document.
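
The steps above can be sketched on a toy corpus small enough to inspect by eye. The documents and variable names below are illustrative assumptions; the point is to show the fitted vocabulary, the count matrix, and what happens to a test-set word that was never seen during `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_dev = ['good movi good act', 'bad movi']  # illustrative, pre-stemmed
toy_test = ['good plot']                      # 'plot' never appears in toy_dev

toy_vectorizer = CountVectorizer()
toy_dev_X = toy_vectorizer.fit_transform(toy_dev)   # learns vocabulary, then counts
toy_test_X = toy_vectorizer.transform(toy_test)     # reuses the learned vocabulary

# Vocabulary terms in column order
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)                  # ['act', 'bad', 'good', 'movi']
print(toy_dev_X.toarray())    # rows: [1 0 2 1] and [0 1 0 1]
print(toy_test_X.toarray())   # row:  [0 0 1 0]
```

The test row has no column for `plot`: words outside the development vocabulary are silently dropped, which is the behavior we want when the test set stands in for unseen data.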

In [12]:
vector = CountVectorizer()
dev_X = vector.fit_transform(reviews_X_dev['reviews_stem'])
test_X = vector.transform(reviews_X_test['reviews_stem'])
dev_y = reviews_y_dev['label']
test_y = reviews_y_test['label']

### Logic Behind TF-IDF Vectorization

1. **TF-IDF Concept**:
   - TF-IDF stands for Term Frequency-Inverse Document Frequency.
   - It evaluates the importance of a word to a document in a collection or corpus.

2. **Term Frequency (TF)**:
   - Measures how frequently a term occurs in a document.
   - TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

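3. **Inverse Document Frequency (IDF)**:
   - Measures how rare a term is across the corpus; terms that appear in fewer documents get a higher value.
   - IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
   - This is the textbook definition. By default, scikit-learn's `TfidfVectorizer` uses a smoothed variant, log_e((1 + n) / (1 + df(t))) + 1, takes TF as the raw count, and L2-normalizes each document vector.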

4. **TF-IDF Score**:
   - TF-IDF = TF * IDF
   - Increases with term frequency in a document and rarity across the corpus.

5. **Vectorization Process**:
   - `fit_transform` on development set: Computes vocabulary, IDF values, and transforms documents.
   - `transform` on test set: Uses vocabulary and IDF from development set to transform test documents.

6. **Advantages**:
   - Reduces impact of frequent but less informative words.
   - Emphasizes rare terms that are important in specific documents.

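7. **Stop Words Removal**:
   - `stop_words='english'` applies scikit-learn's built-in English stop word list, which is a different list from the NLTK one used during preprocessing.
   - Because the input is already stemmed, only stop words whose stemmed form is unchanged will match.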
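
As a rough check of how the weights are computed in practice, the sketch below fits `TfidfVectorizer` with default settings on a toy corpus and compares its fitted `idf_` values with both the textbook formula and scikit-learn's smoothed default. The documents and helper variables are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['good movi', 'bad movi', 'good act']  # illustrative, pre-stemmed

toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_docs)

# Vocabulary terms in column order
vocab = sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get)

# Document frequency: number of toy documents containing each term
n_docs = len(toy_docs)
doc_freq = np.array([sum(term in doc.split() for doc in toy_docs) for term in vocab])

textbook_idf = np.log(n_docs / doc_freq)
smoothed_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # scikit-learn default

# Each row of the TF-IDF matrix has unit Euclidean length (norm='l2')
row_norms = np.sqrt(np.asarray(toy_X.multiply(toy_X).sum(axis=1)).ravel())

for term, plain, smooth in zip(vocab, textbook_idf, smoothed_idf):
    print(f'{term}: textbook={plain:.3f} smoothed={smooth:.3f}')
```

The fitted `toy_tfidf.idf_` array matches `smoothed_idf`, not `textbook_idf`; the ranking of terms by rarity is the same under both, but the absolute weights differ.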

In [None]:
tfidf_vector = TfidfVectorizer(stop_words = 'english')
dev_X_tfidf = tfidf_vector.fit_transform(reviews_X_dev['reviews_stem'])
test_X_tfidf = tfidf_vector.transform(reviews_X_test['reviews_stem'])

## 3. Model training and evaluation

### Logistic Regression

In [13]:
lr = LogisticRegression().fit(dev_X, dev_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
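
The warning above means the solver stopped at its iteration limit before converging. One common remedy, which the warning itself suggests, is to raise `max_iter`. The sketch below shows this on a toy dataset; the reviews, labels, and the value `1000` are illustrative assumptions, and the effect on the scores reported in this notebook has not been measured here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative, pre-stemmed toy data (not from the corpus)
toy_reviews = ['good fun movi', 'great act good plot', 'bad bore movi', 'weak plot bad act']
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_X = CountVectorizer().fit_transform(toy_reviews)

# Allow more solver iterations than the default before giving up
toy_lr = LogisticRegression(max_iter=1000).fit(toy_X, toy_labels)

print(toy_lr.classes_)  # class labels in the order used by the model
print(toy_lr.n_iter_)   # iterations the solver actually used
```

Inspecting `n_iter_` after fitting is a quick way to confirm that the solver finished below the limit rather than being cut off.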


In [15]:
lr_tfidf = LogisticRegression().fit(dev_X_tfidf, dev_y)

In [16]:
print('Score for Bag of Words Model:  {}'.format(lr.score(test_X, test_y)))
print('Score for TF IDF Model:  {}'.format(lr_tfidf.score(test_X_tfidf,test_y)))

Score for Bag of Words Model:  0.83
Score for TF IDF Model:  0.8375


### Comparison

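On this split, the TF-IDF model scores slightly higher (0.8375 vs. 0.83). The gap is small enough that it should be read as suggestive rather than conclusive. Plausible reasons for the difference are:

1. **Vectorization Approach**:
   - Bag of Words: Represents each review by raw term counts.
   - TF-IDF: Reweights those counts by how rare each term is across the corpus.

2. **Weight Distribution**:
   - TF-IDF assigns:
     - More weight to terms concentrated in a few reviews
     - Less weight to terms that appear in most reviews

3. **Learning Efficiency**:
   - TF-IDF vectors are length-normalized, which puts features on a comparable scale and tends to help the solver. The Bag of Words model, by contrast, hit the iteration limit during training.

4. **Prediction Quality**:
   - The reweighting may help the model generalize to the test set. Note, however, that the two pipelines also differ in stop word handling: only the TF-IDF vectorizer was given `stop_words='english'`, so the comparison is not purely counts versus TF-IDF weights.

In summary, TF-IDF's down-weighting of ubiquitous terms is a plausible explanation for its small edge over the simpler Bag of Words approach. Confirming it would require a controlled comparison (identical stop word handling and a converged Bag of Words model), ideally with cross-validation.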
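
To put the size of the gap in context, here is a back-of-the-envelope calculation using only the numbers reported above. It assumes the test set is exactly 20% of the 2000 reviews and uses a simple binomial standard error, which is only a rough guide because both models are scored on the same test reviews.

```python
import math

n_reviews = 2000   # 1000 positive + 1000 negative
test_size = 0.2    # same fraction passed to train_test_split
n_test = round(n_reviews * test_size)

acc_bow, acc_tfidf = 0.83, 0.8375
extra_correct = round(acc_tfidf * n_test) - round(acc_bow * n_test)

# Rough binomial standard error of a single accuracy estimate
std_err = math.sqrt(acc_bow * (1 - acc_bow) / n_test)

print(f'test reviews: {n_test}')
print(f'extra reviews classified correctly by TF-IDF: {extra_correct}')
print(f'approx. standard error of accuracy: {std_err:.4f}')
```

The difference corresponds to 3 of 400 test reviews, while the approximate standard error of each accuracy is close to 0.019, so a gap of 0.0075 is well within the noise expected from a single split.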