# Assignment 1: Sentiment Analysis Classifier

##### Group 26: Michal Dawid Kowalski (up202401554) | Pedro Maria Passos Ribeiro do Carmo Pereira (up) | Santiago Romero Pineda (up)

In this assignment, we build a sentiment analysis classifier using traditional machine learning techniques. The process includes pre-processing, feature extraction, and a comparison of sparse feature representations with dense ones such as word embeddings. We use "traditional" machine learning classifiers instead of deep learning models (CNNs, RNNs, Transformers). The focus is on understanding text classification techniques and evaluating their performance on the given dataset with common classification metrics: accuracy, precision, recall, and F1-score.
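As a reminder of how the four metrics relate, here is a minimal, dependency-free sketch for the binary case. The function name `binary_metrics` and the toy labels are illustrative assumptions; they are not part of the notebook's helper modules.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up for illustration): 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced subsets, accuracy alone can look good for a classifier that always predicts the majority class, which is why precision, recall, and F1 are reported alongside it.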



In [None]:
# Import libraries
from our_eda import *
from our_modeling import *
from our_preprocessing import *
from our_feature_extraction import *
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import numpy as np
import pandas as pd

# 1. BESSTIE Dataset

## 1.1 Uploading Dataset Files from HuggingFace (https://huggingface.co/mindhunter23)

The dataset is hosted on Hugging Face under the username "mindhunter23." It consists of text data collected from Reddit and Google for the countries UK, AU, and IN. All texts are in English and are labeled with sentiment values: 0 for negative sentiment and 1 for positive sentiment. The dataset is already split into training and validation sets, making it ready for sentiment analysis tasks. It offers diverse content from different regions and platforms.

### - BESSTIE-reddit-sentiment-uk/

In [2]:
import pandas as pd

splits = {'train': 'reddit-sentiment-uk-train.jsonl', 'validation': 'reddit-sentiment-uk-valid.jsonl'}
df_reddit_sentiment_uk = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-uk/" + splits["train"], lines=True)
df_reddit_sentiment_uk_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-uk/" + splits["validation"], lines=True)
df_reddit_sentiment_uk



Unnamed: 0,id,text,sentiment_label
0,1cimjpr,"So instead of making savings, they continued t...",0
1,1d35qlg,Needless story to have dragged into the electi...,0
2,1d3i3mt,"Now, in an ideal world there would be insight ...",0
3,1d5a8wa,How did you not get mind controlled at birth t...,0
4,1d5l3e9,"Talk lately of conscription, having a store of...",0
...,...,...,...
1002,1b5iodf,How is this a non-story? A shop will bow to th...,0
1003,1cv6iym,"The smoke screen is real, and Suella (along wi...",0
1004,1c9kjse,It's a really serious problem that young white...,0
1005,1b22zae,Good luck. Imagine if you refuse to see a PA b...,0


In [3]:
class_distribution(df_reddit_sentiment_uk) 

   Count  Percentage
0    892       88.58
1    115       11.42
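The `class_distribution` helper comes from `our_eda`, whose source is not shown here. A plausible sketch of what it computes, assuming it simply counts labels and converts the counts to percentages, could look like this (`class_distribution_sketch` is a hypothetical name):

```python
from collections import Counter


def class_distribution_sketch(labels):
    """Return (label, count, percentage) rows, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [
        (label, count, round(100 * count / total, 2))
        for label, count in counts.most_common()
    ]


# Rebuild the label column from the counts printed above (892 negative, 115 positive)
labels = [0] * 892 + [1] * 115
rows = class_distribution_sketch(labels)

print("   Count  Percentage")
for label, count, pct in rows:
    print(f"{label}  {count:5d}  {pct:10.2f}")
```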


### - BESSTIE-reddit-sentiment-au/

In [4]:
splits = {'train': 'reddit-sentiment-au-train.jsonl', 'validation': 'reddit-sentiment-au-valid.jsonl'}
df_reddit_sentiment_au = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-au/" + splits["train"], lines=True)
df_reddit_sentiment_au_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-au/" + splits["validation"], lines=True)
df_reddit_sentiment_au

Unnamed: 0,id,text,sentiment_label
0,1d2d56d,"No its more about risk management, why accept ...",1
1,1d2cfsd,I don’t play this game. \n\nThem: “What are yo...,0
2,1cw9vcr,Well I'm not really confident that we'll see m...,0
3,1czvemb,He's not wrong though alot of media is RW sla...,0
4,1d3x6bo,Please contact safe transport Victoria. This i...,1
...,...,...,...
1758,1d3yzbw,I just went through this with my tenant who wa...,1
1759,1d3yn9l,Yeah my rental has a shitty little 1kw split s...,0
1760,1d0upb1,Lol like what ? go back in time and buy 150mil...,0
1761,1d639g5,"From my point of view, there is absolutely not...",0


In [5]:
class_distribution(df_reddit_sentiment_au)

   Count  Percentage
0   1200       68.07
1    563       31.93


### - BESSTIE-google-sentiment-uk

In [8]:
splits = {'train': 'google-sentiment-uk-train.jsonl', 'validation': 'google-sentiment-uk-valid.jsonl'}
df_google_sentiment_uk = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-uk/" + splits["train"], lines=True)
df_google_sentiment_uk_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-uk/" + splits["validation"], lines=True)
df_google_sentiment_uk

Unnamed: 0,id,text,sentiment_label
0,1.046000e+20,Tricky me because I was checking in over midni...,1
1,1.161344e+20,It's lots more cheaper than the Odeon although...,1
2,1.034757e+20,My first time and last time in this place. It ...,0
3,1.073389e+20,"You know, its not bad at all, you get plenty o...",1
4,1.172204e+20,It's. It's OK for a quick fix of junk food. Re...,0
...,...,...,...
1812,1.156926e+20,Great service by the staff! The food was good ...,1
1813,1.087021e+20,Food was delicious chicken bacon avocado in br...,1
1814,1.176660e+20,It was a very nice moon display with some love...,1
1815,1.114779e+20,The food was nice but the service from one sta...,1


In [9]:
class_distribution(df_google_sentiment_uk)

   Count  Percentage
1   1359       74.79
0    458       25.21


### - BESSTIE-google-sentiment-au

In [10]:
splits = {'train': 'data/google-sentiment-au-train.jsonl', 'validation': 'data/google-sentiment-au-valid.jsonl'}
df_google_sentiment_au = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-au/" + splits["train"], lines=True)
df_google_sentiment_au_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-au/" + splits["validation"], lines=True)
df_google_sentiment_au

Unnamed: 0,id,text,sentiment_label
0,1.132555e+20,This was one of the best dishes I've EVER had!...,1
1,1.101411e+20,This Mexican restaurant in Penrith is a great ...,1
2,1.103038e+20,"This was not to bad, I ordered the big pork ri...",1
3,1.107520e+20,Clean cool and a nice smaller casino to check ...,1
4,1.152390e+20,Well set out. Great areas to enjoy. Good food ...,1
...,...,...,...
941,1.087996e+20,Beautiful meals and fast cocktails. Waitress w...,1
942,1.044438e+20,With a reputation for great food and terrific ...,1
943,1.134915e+20,"Nice movie theatre, only downfall are the seat...",1
944,1.073785e+20,A beautifully styled space with the fun of a s...,1


In [11]:
class_distribution(df_google_sentiment_au)

   Count  Percentage
1    695       73.47
0    251       26.53


### - BESSTIE-reddit-sentiment-in

In [12]:
splits = {'train': 'reddit-sentiment-in-train.jsonl', 'validation': 'reddit-sentiment-in-valid.jsonl'}
df_reddit_sentiment_in = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-in/" + splits["train"], lines=True)
df_reddit_sentiment_in_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-in/" + splits["validation"], lines=True)
df_reddit_sentiment_in

Unnamed: 0,id,text,sentiment_label
0,1d2o00l,Zepto has a mandate that the delivery boy need...,1
1,1d5fcvf,Mujhe bhi thoda paisa do,0
2,1d04uk7,Nooo don't protest against secular freedom fig...,0
3,1d5dl6q,Har 3 mahine baad kisi bhi global celebrity ko...,0
4,1d66tng,Just because you don't find anything serious b...,0
...,...,...,...
1680,1d4p9vp,Rectal Cargo.,0
1681,1d3wg8h,I mean this is the equivalent of u go to US an...,0
1682,1czj077,what went wrong with the genes after Rajiv Gan...,0
1683,1d43657,The image says the data from 2016-2017. 8 year...,0


In [13]:
class_distribution(df_reddit_sentiment_in)

   Count  Percentage
0   1256       74.54
1    429       25.46


### - BESSTIE-google-sentiment-in

In [18]:
splits = {'train': 'google-sentiment-in-train.jsonl', 'validation': 'google-sentiment-in-valid.jsonl'}
df_google_sentiment_in = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-in/" + splits["train"], lines=True)
df_google_sentiment_in_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-in/" + splits["validation"], lines=True)
df_google_sentiment_in

Unnamed: 0,id,text,sentiment_label
0,1.114268e+20,They have an amazing hospitality structure loc...,1
1,1.116605e+20,The attender attitude is not welcoming. Ordere...,0
2,1.134214e+20,The taste is good.. Decent staff.. But the atm...,0
3,1.055034e+20,"Wahi purani jagah, wahi purani yaadein..... Ra...",1
4,1.092858e+20,"An extremely over hyped biriyani, Definitely i...",1
...,...,...,...
1643,1.045586e+20,Location wise it's Good. Parking at road side ...,0
1644,1.051527e+20,"The experience was good, the atmosphere also g...",1
1645,1.032750e+20,I tried the laxmi special sandwich double chee...,1
1646,1.152006e+20,I love o yes but recently they were done blund...,0


In [19]:
class_distribution(df_google_sentiment_in)

   Count  Percentage
1   1232       74.76
0    416       25.24
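The six loading cells above follow one pattern, so the file locations could be generated instead of typed out. The sketch below only builds the `hf://` URLs. The helper name `split_urls` and its `prefix` parameter are assumptions, the latter added because the google-au files sit under `data/` in the cell above.

```python
BASE = "hf://datasets/mindhunter23"


def split_urls(source, country, prefix=""):
    """Build train/validation URLs for one BESSTIE sentiment subset."""
    repo = f"BESSTIE-{source}-sentiment-{country}"
    stem = f"{source}-sentiment-{country}"
    return {
        "train": f"{BASE}/{repo}/{prefix}{stem}-train.jsonl",
        "validation": f"{BASE}/{repo}/{prefix}{stem}-valid.jsonl",
    }


urls = {}
for source in ("reddit", "google"):
    for country in ("uk", "au", "in"):
        # Only google-au uses the "data/" sub-folder in the cells above
        prefix = "data/" if (source, country) == ("google", "au") else ""
        urls[(source, country)] = split_urls(source, country, prefix)

for key, value in urls.items():
    print(key, value["train"])
```

Each URL can then be passed to `pd.read_json(url, lines=True)` exactly as in the cells above.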


# 2. Initial Data Preprocessing

## 2.1 Testing the text_preprocess() function

In [20]:
# Test the preprocessing function 
print('Original:\n', df_reddit_sentiment_uk.loc[0].text,'\n')
print('Lemmatization:\n',text_preprocess(df_reddit_sentiment_uk.loc[0].text, remove_digits=True, stemmer=Stemmer.WordNet),'\n')
print('Stemming with stopwords:\n',text_preprocess(df_reddit_sentiment_uk.loc[0].text),'\n')

Original:
 So instead of making savings, they continued to spend money they didn’t have, yes that sounds very responsible. Maybe if the government had continued spending, the whole country would be in the same financial mess Birmingham is in. 

Lemma. with stopwords:
 so instead of make saving they continue to spend money they do not have yes that sound very responsible maybe if the government have continue spending the whole country would be in the same financial mess birmingham be in 

Lemma. without stopwords:
 instead make saving continue spend money not yes sound responsible maybe government continue spending whole country would financial mess birmingham 

Stemmer with stopwords:
 so instead of make save they continu to spend money they did not have ye that sound veri respons mayb if the govern had continu spend the whole countri would be in the same financi mess birmingham is in 

Stemmer without stopwords:
 instead make save continu spend money not ye sound veri respons mayb gov
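The real `text_preprocess` lives in `our_preprocessing` and is not shown. To make the cleaning steps concrete, here is a simplified, standard-library sketch covering only lowercasing, punctuation and digit removal, and stopword filtering. It does not expand contractions, lemmatize, or stem. The stopword list is a tiny made-up sample, and `remove_stopwords` is an assumed parameter name.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used in the notebook)
STOPWORDS = {"so", "of", "they", "to", "the", "that", "is", "in", "if", "be"}


def simple_preprocess(text, remove_digits=False, remove_stopwords=False):
    """Lowercase, strip punctuation, optionally drop digits and stopwords."""
    text = text.lower()
    if remove_digits:
        text = re.sub(r"\d+", " ", text)
    # Keep ASCII letters, digits and whitespace; replace everything else with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


sample = ("Maybe if the government had continued spending, "
          "the whole country would be in the same financial mess.")
with_stopwords = simple_preprocess(sample)
without_stopwords = simple_preprocess(sample, remove_stopwords=True)
print(with_stopwords)
print(without_stopwords)
```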

## 2.2 Concatenating datasets
### SENTIMENT DATASET

In [21]:
# Assume all training datasets are already loaded as DataFrames
combined_sentiment_df = pd.concat(
    [
        df_reddit_sentiment_uk,
        df_reddit_sentiment_au,
        df_google_sentiment_uk,
        df_google_sentiment_au,
        df_reddit_sentiment_in,
        df_google_sentiment_in
    ],
    axis=0,  # Concatenate vertically (row-wise)
    ignore_index=True  # Reset the index in the combined DataFrame
)

# Assume all validation datasets are already loaded as DataFrames
combined_sentiment_df_val = pd.concat(
    [
        df_reddit_sentiment_uk_val,
        df_reddit_sentiment_au_val,
        df_google_sentiment_uk_val,
        df_google_sentiment_au_val,
        df_reddit_sentiment_in_val,
        df_google_sentiment_in_val
    ],
    axis=0,  # Concatenate vertically (row-wise)
    ignore_index=True  # Reset the index in the combined DataFrame
)

Total rows in combined dataset: 8866

Class distribution:

   Count  Percentage
0   4473       50.45
1   4393       49.55

Dataframe:


Unnamed: 0,id,text,sentiment_label
0,1cimjpr,"So instead of making savings, they continued t...",0
1,1d35qlg,Needless story to have dragged into the electi...,0
2,1d3i3mt,"Now, in an ideal world there would be insight ...",0
3,1d5a8wa,How did you not get mind controlled at birth t...,0
4,1d5l3e9,"Talk lately of conscription, having a store of...",0


# 3. EDA

In [None]:
combined_sentiment_df = pd.read_csv("data_sentiment_preprocessed.csv")
combined_sentiment_df_val = pd.read_csv("data_sentiment_preprocessed_val.csv")

# Display the combined DataFrame
print("Training Dataset\n")
print(combined_sentiment_df.head())
print(f"Total rows in combined dataset: {len(combined_sentiment_df)}")
class_distribution(combined_sentiment_df)

In [None]:
# Display the combined DataFrame
print("Validation Dataset\n")
print(combined_sentiment_df_val.head())
print(f"Total rows in combined dataset: {len(combined_sentiment_df_val)}")
class_distribution(combined_sentiment_df_val)

#### Number of characters per review:

In [None]:
plt.xlabel('Char Count')
plt.ylabel('Sample Count')
combined_sentiment_df['text'].str.len().hist()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
ax1.hist(combined_sentiment_df[combined_sentiment_df['sentiment_label'] == 1]['text'].str.len())
ax1.set_title('Positive Reviews')
ax1.set_xlabel('Char Count')
ax2.hist(combined_sentiment_df[combined_sentiment_df['sentiment_label'] == 0]['text'].str.len())
ax2.set_title('Negative Reviews')
ax2.set_xlabel('Char Count')

#### Most common words:

In [None]:
# POSITIVE SENTIMENT
text = " ".join(i for i in combined_sentiment_df[combined_sentiment_df['sentiment_label']==1]['text'])
wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Wordcloud for positive review')
plt.show()

In [None]:
# NEGATIVE SENTIMENT
text = " ".join(i for i in combined_sentiment_df[combined_sentiment_df['sentiment_label']==0]['text'])
wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Wordcloud for negative review')
plt.show()

# 4. Text Preprocessing

### - Training Dataset

In [24]:
# Preprocessing + Lemmatization 
combined_sentiment_df['clean_text'] = combined_sentiment_df['text'].apply(lambda x: text_preprocess(x, remove_digits=True, stemmer=Stemmer.WordNet))

In [None]:
# Tokenization
combined_sentiment_df['tokenized_text'] = combined_sentiment_df['clean_text'].apply(lambda x: word_tokenize(x))
combined_sentiment_df.head(5)

In [27]:
# Save preprocessed training data
combined_sentiment_df.to_csv('data_sentiment_preprocessed.csv', index=False)

### - Validation Dataset

In [134]:
# Preprocessing + Lemmatization 
combined_sentiment_df_val['clean_text'] = combined_sentiment_df_val['text'].apply(lambda x: text_preprocess(x, remove_digits=True, stemmer=Stemmer.WordNet))
# Tokenization
combined_sentiment_df_val['tokenized_text'] = combined_sentiment_df_val['clean_text'].apply(lambda x: word_tokenize(x))
combined_sentiment_df_val.head(5)
# Save preprocessed validation data
combined_sentiment_df_val.to_csv('data_sentiment_preprocessed_val.csv', index=False)

#### Missing Values:

In [4]:
print(combined_sentiment_df.isnull().value_counts())
combined_sentiment_df = combined_sentiment_df.dropna()  # Drop rows where preprocessing did not extract any tokens

id     text   sentiment_label  clean_text
False  False  False            False         8860
                               True             6
dtype: int64


In [5]:
print(combined_sentiment_df_val.isnull().value_counts())
combined_sentiment_df_val = combined_sentiment_df_val.dropna()

id     text   sentiment_label  clean_text
False  False  False            False         1211
                               True             1
dtype: int64


# 5. Feature Extraction

In [None]:
# Read preprocessed datasets from .csv files
combined_sentiment_df = pd.read_csv("data_sentiment_preprocessed.csv")
combined_sentiment_df_val = pd.read_csv("data_sentiment_preprocessed_val.csv")

from our_feature_extraction import basic_bag, tf_idf
# Split the data
X_train = combined_sentiment_df.tokenized_text
y_train = combined_sentiment_df.sentiment_label
X_val = combined_sentiment_df_val.tokenized_text
y_val = combined_sentiment_df_val.sentiment_label

## 5.1 Basic BoW
+ removing words that occur fewer than 3 times

In [80]:
word_counts, vocab, selected_words, vectorizer, X_train_vec, X_val_vec = basic_bag(X_train, X_val, min_refs=3, debug=True)

Shape (X_train_vec) before reuction:  (8860, 13942)
Shape (X_train_vec) after reuction:  (8860, 6231)
Shape (X_val_vec):  (1211, 6231)
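The vocabulary reduction reported above (13942 to 6231 columns) can be illustrated with a small pure-Python bag of words. This is a sketch, not the code of `basic_bag`. It assumes `min_refs` is a threshold on a word's total count over the training set, and the function names and toy documents are invented.

```python
from collections import Counter


def build_vocab(token_lists, min_refs=3):
    """Keep words whose total count across all documents is at least min_refs."""
    totals = Counter(tok for toks in token_lists for tok in toks)
    return sorted(word for word, count in totals.items() if count >= min_refs)


def count_vector(tokens, vocab):
    """Map one tokenized document to raw counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


docs = [
    ["good", "food", "good"],
    ["food", "not", "good"],
    ["not", "bad", "food"],
]
vocab = build_vocab(docs, min_refs=3)
matrix = [count_vector(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
```

Words below the threshold ("not" and "bad" in the toy data) get no column, which is how the feature matrix shrinks while the number of rows stays the same.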


In [70]:
# 10 most common words
word_counts = np.asarray(X_train_vec.sum(axis=0)).flatten()
vocab = np.array(vectorizer.get_feature_names_out())

top_indices = np.argsort(word_counts)[::-1]
top_words = vocab[top_indices[:10]]
top_counts = word_counts[top_indices[:10]]

print('Most common words:')
for word, count in zip(top_words, top_counts):
    print(f"{word}: {count}")

Most common words:
not: 4523
good: 3415
food: 2608
get: 1843
go: 1610
place: 1569
like: 1549
would: 1527
time: 1374
one: 1363


In [44]:
# Quick check: distinct count values in one document row
np.unique(X_train_vec[2].toarray())

array([0, 1, 2])

## 5.2 1-hot BoW
+ removing words that occur fewer than 3 times

In [55]:
word_counts, vocab, selected_words, vectorizer, X_train_hot, X_val_hot = basic_bag(X_train, X_val, min_refs=3, ohe=True, debug=True)

Shape (X_train_hot):  (8860, 6231)
Shape (X_val_hot):  (1211, 6231)


In [56]:
# Checking if dataset is binary
unique = np.unique(X_train_hot.toarray())
print('Unique values:', unique)

Unique values: [0 1]
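The 1-hot (binary) variant keeps the same columns but records only whether a word is present. A minimal sketch of that conversion on made-up count rows (`binarize` is a hypothetical helper):

```python
def binarize(count_rows):
    """Turn raw counts into presence/absence indicators."""
    return [[1 if value > 0 else 0 for value in row] for row in count_rows]


count_rows = [[1, 2, 0], [0, 3, 1]]  # toy counts, not taken from the dataset
binary_rows = binarize(count_rows)
unique_values = sorted({value for row in binary_rows for value in row})
print(binary_rows)
print("Unique values:", unique_values)
```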


## 5.3 TF-IDF

In [57]:
word_counts, vocab, selected_words, vectorizer, X_train_vec_tf, X_val_vec_tf = tf_idf(X_train, X_val, min_refs=3, debug=True)

Shape (X_train_tf) after reuction:  (8860, 6231)
Shape (X_val_tf):  (1211, 6231)
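TF-IDF reweights raw counts so that words appearing in many documents contribute less. The sketch below uses the smoothed variant idf(w) = ln((1 + N) / (1 + df(w))) + 1 followed by L2 row normalization, which matches scikit-learn's default `TfidfVectorizer` settings. Whether the notebook's `tf_idf` helper uses those defaults is an assumption, and the toy documents are invented.

```python
import math
from collections import Counter


def tfidf_matrix(token_lists):
    """Return (vocab, rows) with smoothed IDF weights and L2-normalized rows."""
    n_docs = len(token_lists)
    vocab = sorted({tok for toks in token_lists for tok in toks})
    # Document frequency: number of documents containing each word
    df = Counter(tok for toks in token_lists for tok in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in token_lists:
        tf = Counter(toks)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


docs = [["good", "food"], ["bad", "food"], ["good", "good", "service"]]
tfidf_vocab, tfidf_rows = tfidf_matrix(docs)
print(tfidf_vocab)
for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

In the second toy document, "bad" (one document) receives a larger weight than "food" (two documents), which is the intended effect.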


In [None]:
word_counts

## 5.4 N-grams

### 5.4.1 Bigrams

In [89]:
word_counts, vocab, selected_words, vectorizer, X_train_vec_bi, X_val_vec_bi = basic_bag(X_train, X_val, ngram_range=(2,2), min_refs=3, debug=True)

Shape (X_train_bi):  (8860, 21711)
Shape (X_val_bi):  (1211, 21711)


In [101]:
bigram_vocab = vectorizer.get_feature_names_out()
bigram_counts = np.asarray(X_train_vec_bi.sum(axis=0)).flatten()

bigram_freq = list(zip(bigram_vocab, bigram_counts))

# Sort bigrams by frequency, most frequent first
sorted_bigram_freq = sorted(bigram_freq, key=lambda x: x[1], reverse=True)

print("10 most common bigrams:")
for bigram, count in sorted_bigram_freq[:10]:
    print(f"{bigram}: {count}")

10 most common bigrams:
food good: 194
staff friendly: 160
not good: 145
really good: 128
good place: 126
good food: 124
look like: 122
taste good: 118
not even: 110
service good: 109
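For reference, a bigram is a pair of adjacent tokens in the preprocessed text, which is why entries such as "food good" appear after stopword removal. A small sketch of extracting and counting them, with an invented helper name and toy documents:

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of space-joined n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


docs = [
    ["food", "good", "staff", "friendly"],
    ["food", "good"],
    ["not", "good"],
]
bigram_counter = Counter(bg for doc in docs for bg in ngrams(doc, n=2))
print(bigram_counter.most_common(3))
```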


## 5.5 Word Embeddings (Word2Vec)

In [None]:
# code
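This cell is still a placeholder. One common way to turn word embeddings into a fixed-size document feature is to average the vectors of the words in the document. The sketch below shows only that pooling step on hand-made two-dimensional vectors. The vectors are invented for illustration and do not come from a trained Word2Vec model.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy embeddings (a real model would have hundreds of dimensions)
toy_embeddings = {
    "good": [1.0, 0.0],
    "food": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}
doc_vector = mean_vector(["good", "food", "unknown"], toy_embeddings, dim=2)
print(doc_vector)
```

Unlike the sparse BoW and TF-IDF matrices, the resulting features are dense and can be negative, so a classifier that expects non-negative counts (such as multinomial Naive Bayes) would not apply directly.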

# 6. Modeling

## 6.1 Naive Bayes Model

### - Basic BoW

In [None]:
nb(X_train_vec, X_val_vec, y_train, y_val)

### - 1-hot BoW

In [None]:
nb(X_train_hot, X_val_hot, y_train, y_val)

### - TF-IDF

In [None]:
nb(X_train_vec_tf, X_val_vec_tf, y_train, y_val)

### - Bigrams

In [None]:
nb(X_train_vec_bi, X_val_vec_bi, y_train, y_val)
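The `nb` helper is defined in `our_modeling` and its implementation is not shown. To illustrate the underlying idea, here is a compact multinomial Naive Bayes with Laplace smoothing written from scratch. It works on token lists rather than on the sparse matrices used above, and the class name and toy data are assumptions.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, token_lists, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for toks in token_lists for tok in toks}
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for toks, label in zip(token_lists, labels):
            self.word_counts[label].update(toks)
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def _log_score(self, tokens, c):
        # log P(c) + sum over tokens of log P(token | c)
        score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        denom = self.total_words[c] + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
        return score

    def predict(self, token_lists):
        return [
            max(self.classes, key=lambda c: self._log_score(toks, c))
            for toks in token_lists
        ]


train_docs = [["good", "food"], ["great", "staff"], ["bad", "food"], ["not", "good"]]
train_labels = [1, 1, 0, 0]
model = TinyMultinomialNB().fit(train_docs, train_labels)
predictions = model.predict([["great", "food"], ["not", "bad"]])
print(predictions)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen with a class from forcing that class's probability to zero.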

## 6.2 Support Vector Machine (SVM)

### - Basic BoW

In [None]:
support_vector_machine(X_train_vec, X_val_vec, y_train, y_val)

### - 1-hot BoW

In [None]:
support_vector_machine(X_train_hot, X_val_hot, y_train, y_val)

### - TF-IDF

In [None]:
support_vector_machine(X_train_vec_tf, X_val_vec_tf, y_train, y_val)

### - Bigrams

In [None]:
support_vector_machine(X_train_vec_bi, X_val_vec_bi, y_train, y_val)

## 6.3 Random Forest

### - Basic BoW

In [None]:
random_forest(X_train_vec, X_val_vec, y_train, y_val)

### - 1-hot BoW

In [None]:
random_forest(X_train_hot, X_val_hot, y_train, y_val)

### - TF-IDF

In [None]:
random_forest(X_train_vec_tf, X_val_vec_tf, y_train, y_val)

### - Bigrams

In [None]:
random_forest(X_train_vec_bi, X_val_vec_bi, y_train, y_val)