### Task: Sentiment Classification of Movie Reviews  


Alice is a time traveler who visits different eras in the past to carry out important missions. While there, she must always be careful to disguise herself so that no one will know she is from the future. This time, she joined an NLP company in 2014 and was assigned the task of sentiment analysis on user reviews of movies. Help Alice with this task.

You need to solve a sentiment classification task using the IMDb movie review dataset. Each review is labeled as either positive (1) or negative (0), indicating its sentiment. You are provided with a basic LinearSVC classifier with TF-IDF features.

You need to solve 3 tasks:

1.   Task 1: Text Preprocessing with spaCy (this is your baseline)
2.   Task 2: Adding Part-of-Speech (POS) Features as a TF-IDF for Each POS Category
3.   Task 3: Development of new features to improve classification accuracy

**Note!** Do not change the classifier. Change only the cells marked with TODO.



In [1]:
import os
import random
import re
import numpy as np
import pandas as pd
import spacy

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import (
    TfidfVectorizer,
    CountVectorizer,
)
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

from tqdm.auto import tqdm
tqdm.pandas()

In [2]:
os.environ["PYTHONHASHSEED"] = str(42)

random.seed(42)
np.random.seed(42)

### Loading the dataset

In [3]:
! gdown --id 1C6TIP8c33fHM6dxs6DoxJeKY6ZXGWpBx
! gdown --id 1K8WBFVVvVlsvIMRG8HiaFkldiyuNkLD2

Downloading...
From: https://drive.google.com/uc?id=1C6TIP8c33fHM6dxs6DoxJeKY6ZXGWpBx
To: /home/gotheartem/Projects/ioai-hw/hw-9/imdb_train_hw1.csv
100%|██████████████████████████████████████| 8.25M/8.25M [00:01<00:00, 6.04MB/s]
Downloading...
From: https://drive.google.com/uc?id=1K8WBFVVvVlsvIMRG8HiaFkldiyuNkLD2
To: /home/gotheartem/Projects/ioai-hw/hw-9/imdb_test_hw1.csv
100%|██████████████████████████████████████| 2.10M/2.10M [00:00<00:00, 10.1MB/s]


In [4]:
df_train = pd.read_csv("imdb_train_hw1.csv")
df_test = pd.read_csv("imdb_test_hw1.csv")
df_train.sample(5)

Unnamed: 0.1,Unnamed: 0,label,text
8681,8681,1,I noticed this movie was getting trashed well ...
2362,2362,1,When it comes to creating a universe George Lu...
6232,6232,0,"""National Treasure"" (2004) is a thoroughly mis..."
1318,1318,1,I must admit - the only reason I bought this m...
543,543,1,Ten out of the 11 short films in this movie ar...


In [5]:
y_train = df_train["label"]
y_test = df_test["label"]

Since the classes in our dataset are nearly balanced, we can use accuracy as the evaluation metric. Accuracy provides a straightforward measure of how well the model classifies reviews correctly across both sentiment classes.  

However, we will also consider the F1-score for a more detailed performance assessment. Even with balanced classes, the model might still be biased towards one class due to feature distributions (e.g., it may predict negative reviews more confidently than positive ones).  

The F1-score, which is the harmonic mean of precision and recall, helps us identify such imbalances. It ensures that both false positives and false negatives are accounted for, providing a better understanding of how well the model performs on each sentiment class.
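
To make the harmonic-mean definition concrete, here is a minimal pure-Python sketch. The helper name `precision_recall_f1` and the toy labels are assumptions made for illustration; they are not taken from the dataset. The toy predictions are deliberately biased towards the negative class so that the per-class F1 values differ even though the classes are balanced.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class treated as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: four positive and four negative reviews,
# with a classifier that leans towards predicting "negative" (0)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
p_pos, r_pos, f1_pos = precision_recall_f1(toy_true, toy_pred, positive=1)
p_neg, r_neg, f1_neg = precision_recall_f1(toy_true, toy_pred, positive=0)

print("accuracy:", toy_accuracy)
print("class 1 (precision, recall, F1):", p_pos, r_pos, round(f1_pos, 3))
print("class 0 (precision, recall, F1):", round(p_neg, 3), r_neg, round(f1_neg, 3))
```

In this toy case a single accuracy figure hides the fact that half of the positive reviews are missed, while the per-class recall and F1 expose it.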

## 0. LinearSVC with TF-IDF Features  

We will now train a LinearSVC model using TF-IDF (Term Frequency-Inverse Document Frequency) as features.

In [6]:
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(df_train["text"])
X_test_tfidf = vectorizer.transform(df_test["text"])

In [7]:
y_train = df_train["label"]
y_test = df_test["label"]

In [8]:
model = LinearSVC(random_state=42)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print("Accuracy (TF-IDF):", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy (TF-IDF): 0.841747984726347
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      1213
           1       0.83      0.84      0.84      1144

    accuracy                           0.84      2357
   macro avg       0.84      0.84      0.84      2357
weighted avg       0.84      0.84      0.84      2357



The model's accuracy using TF-IDF is 0.8417 (84.17%); this is our **baseline result**.

## Task 1: Text Preprocessing with spaCy

Lemmatize the original review texts with the [spaCy](https://spacy.io/usage/linguistic-features#lemmatization) library.
Use spaCy to remove:

*   stop words
*   punctuation
*   digits
*   emails
*   numbers
*   empty tokens

Train the classifier on the new TF-IDF representation of the text and obtain the baseline classification metrics.

In [9]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
# TODO: the function takes a text as an argument and returns the cleaned text

nlp = spacy.load("en_core_web_sm")
def clean_text(text):
    doc = nlp(text)
    cleaned_tokens = []
    
    for token in doc:
        if token.is_stop or token.is_punct or token.is_digit or token.like_email or token.like_num or token.text.strip() == '':
            continue
        cleaned_tokens.append(token.lemma_.lower().strip())
    
    return " ".join(cleaned_tokens)

In [11]:
df_train["text_lemmatized"] = df_train["text"].progress_apply(clean_text)
df_test["text_lemmatized"] = df_test["text"].progress_apply(clean_text)


In [12]:
# TODO get tf-idf vectors for your lemmatized texts

vectorizer = TfidfVectorizer()
X_train_tfidf_lemmatized = vectorizer.fit_transform(df_train["text_lemmatized"])
X_test_tfidf_lemmatized = vectorizer.transform(df_test["text_lemmatized"])

In [13]:
model = LinearSVC(random_state=42)
model.fit(X_train_tfidf_lemmatized, y_train)
y_pred = model.predict(X_test_tfidf_lemmatized)
print("Accuracy (TF-IDF):", accuracy_score(y_test, y_pred))

Accuracy (TF-IDF): 0.8413237165888842


These are your **baseline** metrics!

## Task 2: Adding Part-of-Speech (POS) Features as a TF-IDF for Each POS Category

For each text, add part-of-speech (POS) tags as features, weighted in the same TF-IDF manner as the words. Use spaCy to get the POS tags, then combine the resulting features with the lemmatized TF-IDF features obtained in Task 1.

For example, if you have two sentences with following tf-idf vectors:

1.   sent1: "The cat sat on the mat." -> [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0]
2.   sent2: "The dog sat on the floor. " -> [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44]

And you obtained the following POS-tag features (with dictionary {'det': 1, 'noun': 2, 'verb': 3, 'adp': 0}, which maps each tag to its column index):

*   sent1: [0.31, 0.63, 0.63, 0.31]
*   sent2: [0.31, 0.63, 0.63, 0.31]


Then the final representation should be:

*   sent1: [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0, 0.31, 0.63, 0.63, 0.31]
*   sent2: [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44, 0.31, 0.63, 0.63, 0.31]

**Note!** Do not include POS tags for punctuation or empty tokens.
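
The numbers in the example above can be reproduced by hand. The sketch below is a pure-Python approximation, assuming the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is scikit-learn's default. The helper `tfidf_rows`, the hand-written POS tags, and the column orders in `word_vocab` and `pos_vocab` are choices made for this illustration; no tagger is run, and the printed values agree with the example only up to rounding and column order.

```python
import math
from collections import Counter


def tfidf_rows(docs, vocab):
    """L2-normalised TF-IDF rows with columns ordered as in `vocab`."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: in how many documents each term occurs
    df = {term: sum(1 for c in counts if c[term] > 0) for term in vocab}
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for c in counts:
        raw = [c[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return rows


words = [
    "the cat sat on the mat".split(),
    "the dog sat on the floor".split(),
]
# POS tags written by hand for this toy example
tags = [
    "det noun verb adp det noun".split(),
    "det noun verb adp det noun".split(),
]

word_vocab = ["the", "cat", "sat", "on", "mat", "dog", "floor"]
pos_vocab = ["adp", "det", "noun", "verb"]

word_rows = tfidf_rows(words, word_vocab)
pos_rows = tfidf_rows(tags, pos_vocab)

# Column-wise concatenation: word features first, POS features after them
combined = [w + p for w, p in zip(word_rows, pos_rows)]
for row in combined:
    print([round(v, 2) for v in row])
```

Each combined row has 7 + 4 = 11 columns, which is the same layout that a horizontal stack of the two sparse matrices produces.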

In [14]:
# TODO: the function takes a text as input and returns a string of POS tags joined by spaces.

def extract_pos_tags(text):
    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc if not token.is_punct and token.text.strip() != '']
    return " ".join(pos_tags)

In [15]:
df_train["pos_text"] = df_train["text"].progress_apply(extract_pos_tags)
df_test["pos_text"] = df_test["text"].progress_apply(extract_pos_tags)


We need to bring the features obtained by CountVectorizer for POS tags to the same scale as TF-IDF. The easiest way is to apply TfidfTransformer to the CountVectorizer result.
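
A small sketch of why this works, assuming scikit-learn and NumPy are installed: applying `TfidfTransformer` to `CountVectorizer` output gives the same matrix as `TfidfVectorizer` applied directly, so the POS features end up on the same scale as the word features. The three POS strings and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hand-written POS strings standing in for the output of a tagger
toy_pos_docs = [
    "DET NOUN VERB ADP DET NOUN",
    "PRON VERB ADV ADJ",
    "DET ADJ NOUN AUX ADJ",
]

toy_counts = CountVectorizer().fit_transform(toy_pos_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)
one_step = TfidfVectorizer().fit_transform(toy_pos_docs)

print("largest raw count:", toy_counts.max())
print("largest TF-IDF weight:", round(float(two_step.max()), 3))
print("two-step equals one-step:", np.allclose(two_step.toarray(), one_step.toarray()))
```

Raw counts grow with review length, whereas every TF-IDF row is L2-normalized, so no single POS column can dominate the word features after stacking.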

In [16]:
# TODO: fit a bag-of-words model on the POS-tag strings, normalize it with TfidfTransformer,
# combine it with the X_train_tfidf_lemmatized and X_test_tfidf_lemmatized features, and save the result to the following variables:

from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse import hstack

pos_vectorizer = CountVectorizer()
X_train_pos = pos_vectorizer.fit_transform(df_train["pos_text"])
X_test_pos = pos_vectorizer.transform(df_test["pos_text"])

tfidf_transformer = TfidfTransformer()
X_train_pos_tfidf = tfidf_transformer.fit_transform(X_train_pos)
X_test_pos_tfidf = tfidf_transformer.transform(X_test_pos)

X_train_combined = hstack([X_train_tfidf_lemmatized, X_train_pos_tfidf])
X_test_combined = hstack([X_test_tfidf_lemmatized, X_test_pos_tfidf])

In [17]:
lr_combined = LinearSVC(random_state=42)
lr_combined.fit(X_train_combined, y_train)
y_pred_combined = lr_combined.predict(X_test_combined)

print("Accuracy (tf-idf + POS):", accuracy_score(y_test, y_pred_combined))

Accuracy (tf-idf + POS): 0.8447178616885872


## Task 3: Development of new features to improve classification accuracy

Come up with another feature or set of features and help Alice improve the quality. Remember that Alice is in the past and does not have access to any . Additional training data cannot be used either. You can use third-party resources to generate features.

Compare your result with the **baseline** from Task 1. Any improvement will be counted. Use X_train_tfidf_lemmatized and X_test_tfidf_lemmatized, and combine your features with them as in Task 2.
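
As one illustration of the kind of feature this task asks for, the sketch below counts hits against a sentiment word list, a family of features that predates 2014. It is only a sketch: the two tiny word sets and the function name `lexicon_features` are hypothetical, and a real run would load a published sentiment lexicon as the third-party resource. It is not the feature set used in the cells that follow.

```python
import re

# Hypothetical mini-lexicon, for illustration only
POSITIVE_WORDS = {"great", "excellent", "wonderful", "love", "best"}
NEGATIVE_WORDS = {"bad", "awful", "boring", "worst", "waste"}


def lexicon_features(text):
    """Return [share of positive words, share of negative words]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n_tokens = len(tokens) or 1  # avoid division by zero on empty input
    n_pos = sum(token in POSITIVE_WORDS for token in tokens)
    n_neg = sum(token in NEGATIVE_WORDS for token in tokens)
    return [n_pos / n_tokens, n_neg / n_tokens]


toy_review = "A great cast, but the plot is boring and the ending is the worst."
print(lexicon_features(toy_review))
```

Because both values are ratios in [0, 1], they are already on a scale comparable to TF-IDF weights and can be stacked next to the lemmatized TF-IDF matrix in the same way as in Task 2.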

In [18]:
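# TODO create your features function here, add feature explanation

def get_custom_feature(text):
    """Return three surface-level counts for a review.

    - number-like tokens (digits or spelled-out numbers, as flagged by spaCy)
    - exclamation marks
    - question marks
    """
    doc = nlp(text)

    # Count tokens that spaCy flags as numeric
    num_nums = 0
    for token in doc:
        if token.like_num or token.is_digit:
            num_nums += 1

    # Punctuation counts are taken from the raw string
    num_exclamations = text.count('!')
    num_questions = text.count('?')

    return [num_nums, num_exclamations, num_questions]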

In [19]:
# Combine your features with X_train_tfidf_lemmatized and X_test_tfidf_lemmatized

from sklearn.preprocessing import MinMaxScaler

train_custom_features = np.array(df_train["text"].progress_apply(get_custom_feature).tolist())
test_custom_features = np.array(df_test["text"].progress_apply(get_custom_feature).tolist())

scaler = MinMaxScaler()
train_custom_features = scaler.fit_transform(train_custom_features)
test_custom_features = scaler.transform(test_custom_features)

X_train_combined = hstack([X_train_tfidf_lemmatized, train_custom_features])
X_test_combined = hstack([X_test_tfidf_lemmatized, test_custom_features])


In [20]:
lr_combined = LinearSVC(random_state=42)
lr_combined.fit(X_train_combined, y_train)
y_pred_combined = lr_combined.predict(X_test_combined)

print("Accuracy (tf-idf + Custom feature):", accuracy_score(y_test, y_pred_combined))

Accuracy (tf-idf + Custom feature): 0.8468392023759016
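
To put the accuracies printed in this notebook side by side, the sketch below converts each one into a number of correctly classified test reviews. It assumes the test-set size of 2357 shown in the classification report and rounds to the nearest whole review; the dictionary keys are descriptive labels chosen for this summary.

```python
N_TEST = 2357  # total support from the classification report

results = {
    "raw TF-IDF": 0.841747984726347,
    "lemmatized TF-IDF (Task 1 baseline)": 0.8413237165888842,
    "lemmatized TF-IDF + POS (Task 2)": 0.8447178616885872,
    "lemmatized TF-IDF + custom features (Task 3)": 0.8468392023759016,
}

baseline_acc = results["lemmatized TF-IDF (Task 1 baseline)"]

summary = {}
for name, acc in results.items():
    n_correct = round(acc * N_TEST)
    gain = round((acc - baseline_acc) * N_TEST)
    summary[name] = (n_correct, gain)
    print(f"{name}: {n_correct} correct ({gain:+d} vs. Task 1 baseline)")
```

Under these numbers, the POS features add 8 correctly classified reviews over the Task 1 baseline and the custom features add 13, so both count as improvements, although the gains are small in absolute terms.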
