# Assignment 1: Sentiment Analysis Classifier

##### Group 26: Michal Dawid Kowalski (up202401554) | Pedro Maria Passos Ribeiro do Carmo Pereira (up201708807) | Santiago Romero Pineda (up)

In this assignment, we build a sentiment analysis classifier using traditional machine learning techniques. The process covers pre-processing, feature extraction, and a comparison of sparse and dense feature representations such as word embeddings. We use "traditional" machine learning classifiers instead of deep learning models (CNNs, RNNs, Transformers). The focus is on understanding text classification techniques and evaluating their performance on the given dataset with common classification metrics: accuracy, precision, recall, and F1-score.
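
As a reminder of what these four metrics measure, here is a minimal sketch that computes them by hand for a binary task. The function name `binary_metrics` and the toy labels are made up for illustration; the modeling helpers used later in this notebook may compute their reports differently.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision penalizes false positives and recall penalizes false negatives, so reporting both (plus F1) is more informative than accuracy alone when the classes are imbalanced.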



In [None]:
# Import libraries 
from our_eda import *
from our_modeling import *
# from our_preprocessing import *
from our_feature_extraction import *
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import numpy as np
import pandas as pd
from our_feature_selection import *
import gensim.downloader as api

# 1. BESSTIE Dataset

## 1.1 Uploading Dataset Files from HuggingFace (https://huggingface.co/mindhunter23)

The dataset is hosted on Hugging Face under the username "mindhunter23". It consists of English text collected from Reddit and Google for three countries: UK, AU, and IN. Each text is labeled with a sentiment value: 0 for negative and 1 for positive. Every subset is already split into training and validation sets, so no manual split is needed. The six subsets loaded below cover different regions and platforms.
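
The sketch below shows, under stated assumptions, how the label balance of one JSONL split could be inspected with the standard library only. The sample lines are invented, and `label_distribution` is a hypothetical helper, not the `class_distribution` function imported from `our_eda`; only the field names `text` and `sentiment_label` come from this notebook.

```python
import json
from collections import Counter

# Made-up JSONL lines standing in for a real split file
sample_jsonl = "\n".join([
    '{"text": "Lovely staff and great coffee", "sentiment_label": 1}',
    '{"text": "Waited an hour, never again", "sentiment_label": 0}',
    '{"text": "Best curry in town", "sentiment_label": 1}',
])


def label_distribution(jsonl_text, label_key="sentiment_label"):
    """Return {label: (count, share)} for the records in a JSONL string."""
    records = [json.loads(line)
               for line in jsonl_text.splitlines() if line.strip()]
    counts = Counter(rec[label_key] for rec in records)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


distribution = label_distribution(sample_jsonl)
for label, (n, share) in distribution.items():
    print(f"label {label}: {n} rows ({share:.1%})")
```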

### - BESSTIE-reddit-sentiment-uk/

In [None]:
splits = {'train': 'reddit-sentiment-uk-train.jsonl', 'validation': 'reddit-sentiment-uk-valid.jsonl'}
df_reddit_sentiment_uk = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-uk/" + splits["train"], lines=True)
df_reddit_sentiment_uk_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-uk/" + splits["validation"], lines=True)
df_reddit_sentiment_uk.head(5)

In [None]:
print('Training Classes Distribution\n')
print(class_distribution(df_reddit_sentiment_uk))
print('Validation Classes Distribution\n')
print(class_distribution(df_reddit_sentiment_uk_val))

### - BESSTIE-reddit-sentiment-au/

In [None]:
splits = {'train': 'reddit-sentiment-au-train.jsonl', 'validation': 'reddit-sentiment-au-valid.jsonl'}
df_reddit_sentiment_au = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-au/" + splits["train"], lines=True)
df_reddit_sentiment_au_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-au/" + splits["validation"], lines=True)
df_reddit_sentiment_au.head(5)

In [None]:
print('Training Classes Distribution\n')
print(class_distribution(df_reddit_sentiment_au))
print('Validation Classes Distribution\n')
print(class_distribution(df_reddit_sentiment_au_val))

### - BESSTIE-google-sentiment-uk

In [None]:
splits = {'train': 'google-sentiment-uk-train.jsonl', 'validation': 'google-sentiment-uk-valid.jsonl'}
df_google_sentiment_uk = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-uk/" + splits["train"], lines=True)
df_google_sentiment_uk_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-uk/" + splits["validation"], lines=True)
df_google_sentiment_uk.head(5)

In [None]:
print('Training Classes Distribution\n')
print(class_distribution(df_google_sentiment_uk))
print('Validation Classes Distribution\n')
print(class_distribution(df_google_sentiment_uk_val))

### - BESSTIE-google-sentiment-au

In [None]:
splits = {'train': 'data/google-sentiment-au-train.jsonl', 'validation': 'data/google-sentiment-au-valid.jsonl'}
df_google_sentiment_au = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-au/" + splits["train"], lines=True)
df_google_sentiment_au_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-au/" + splits["validation"], lines=True)
df_google_sentiment_au.head(5)

In [None]:
print('Training Classes Distribution\n')
print(class_distribution(df_google_sentiment_au))
print('Validation Classes Distribution\n')
print(class_distribution(df_google_sentiment_au_val))

### - BESSTIE-reddit-sentiment-in

In [None]:
splits = {'train': 'reddit-sentiment-in-train.jsonl', 'validation': 'reddit-sentiment-in-valid.jsonl'}
df_reddit_sentiment_in = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-in/" + splits["train"], lines=True)
df_reddit_sentiment_in_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-reddit-sentiment-in/" + splits["validation"], lines=True)
df_reddit_sentiment_in.head(5)

In [None]:
print('Training Classes Distribution\n')
print(class_distribution(df_reddit_sentiment_in))
print('Validation Classes Distribution\n')
print(class_distribution(df_reddit_sentiment_in_val))

### - BESSTIE-google-sentiment-in

In [None]:
splits = {'train': 'google-sentiment-in-train.jsonl', 'validation': 'google-sentiment-in-valid.jsonl'}
df_google_sentiment_in = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-in/" + splits["train"], lines=True)
df_google_sentiment_in_val = pd.read_json("hf://datasets/mindhunter23/BESSTIE-google-sentiment-in/" + splits["validation"], lines=True)
df_google_sentiment_in.head(5)

In [None]:
print('Training Classes Distribution\n')
print(class_distribution(df_google_sentiment_in))
print('Validation Classes Distribution\n')
print(class_distribution(df_google_sentiment_in_val))

# 2. Initial Data Preprocessing

## 2.1 Testing the text_preprocess() function

In [None]:
# Test the preprocessing function 
print('Original:\n', df_reddit_sentiment_uk.loc[0].text,'\n')
print('Lemmatization:\n',text_preprocess(df_reddit_sentiment_uk.loc[0].text, remove_digits=True, stemmer=Stemmer.WordNet),'\n')
print('Stemming:\n',text_preprocess(df_reddit_sentiment_uk.loc[0].text),'\n')

## 2.2 Concatenating datasets
### SENTIMENT DATASET

In [None]:
# Assume all training datasets are already loaded as DataFrames
combined_sentiment_df = pd.concat(
    [
        df_reddit_sentiment_uk,
        df_reddit_sentiment_au,
        df_google_sentiment_uk,
        df_google_sentiment_au,
        df_reddit_sentiment_in,
        df_google_sentiment_in
    ],
    axis=0,  # Concatenate vertically (row-wise)
    ignore_index=True  # Reset the index in the combined DataFrame
)

# Assume all validation datasets are already loaded as DataFrames
combined_sentiment_df_val = pd.concat(
    [
        df_reddit_sentiment_uk_val,
        df_reddit_sentiment_au_val,
        df_google_sentiment_uk_val,
        df_google_sentiment_au_val,
        df_reddit_sentiment_in_val,
        df_google_sentiment_in_val
    ],
    axis=0,  # Concatenate vertically (row-wise)
    ignore_index=True  # Reset the index in the combined DataFrame
)

In [None]:
# Save combined data
combined_sentiment_df.to_csv("data_sentiment_preprocessed.csv", index=False)
combined_sentiment_df_val.to_csv("data_sentiment_preprocessed_val.csv", index=False)

# 3. EDA

In [None]:
# Optional: reload the combined data if the files already exist
# combined_sentiment_df = pd.read_csv("data_sentiment_preprocessed.csv")
# combined_sentiment_df_val = pd.read_csv("data_sentiment_preprocessed_val.csv")

# Display the combined DataFrame
print(f"Total rows in combined training dataset: {len(combined_sentiment_df)}\n")
print('\nClasses Distribution in Training Dataset:\n')
print(class_distribution(combined_sentiment_df))
print('\n')
plt.figure(figsize=(6,4))
combined_sentiment_df['sentiment_label'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Distribution of Sentiment Labels (Training)')
plt.xlabel('Sentiment Label')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.show()
print("Training Dataset:\n")
combined_sentiment_df.head(5)

In [None]:
# Display the combined DataFrame
print(f"Total rows in combined validation dataset: {len(combined_sentiment_df_val)}\n")
print('\nClasses Distribution in Validation Dataset:\n')
print(class_distribution(combined_sentiment_df_val))
print('\n')
plt.figure(figsize=(6,4))
combined_sentiment_df_val['sentiment_label'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Distribution of Sentiment Labels (Validation)')
plt.xlabel('Sentiment Label')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.show()
print("Validation Dataset:\n")
combined_sentiment_df_val.head(5)

#### Number of characters per review:

In [None]:
plt.figure(figsize=(8, 4))
combined_sentiment_df['text'].str.len().hist(bins=50, color='skyblue', edgecolor='black')
plt.title('Distribution of Text Length (Character Count)', fontsize=12)
plt.xlabel('Character Count', fontsize=10)
plt.ylabel('Sample Count', fontsize=10)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1.hist(combined_sentiment_df[combined_sentiment_df['sentiment_label'] == 1]['text'].str.len(), bins=50, color='skyblue', edgecolor='black')
ax1.set_title('Positive Reviews', fontsize=12)
ax1.set_xlabel('Character Count', fontsize=10)
ax1.set_ylabel('Sample Count', fontsize=10)
ax1.tick_params(axis='both', labelsize=8)
ax1.grid(True, linestyle='--', alpha=0.7)

ax2.hist(combined_sentiment_df[combined_sentiment_df['sentiment_label'] == 0]['text'].str.len(), bins=50, color='skyblue', edgecolor='black')
ax2.set_title('Negative Reviews', fontsize=12)
ax2.set_xlabel('Character Count', fontsize=10)
ax2.set_ylabel('Sample Count', fontsize=10)
ax2.tick_params(axis='both', labelsize=8)
ax2.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

#### Most common words:

In [None]:
# POSITIVE SENTIMENT
text = " ".join(i for i in combined_sentiment_df[combined_sentiment_df['sentiment_label']==1]['text'])
wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Wordcloud for positive reviews')
plt.show()

In [None]:
# NEGATIVE SENTIMENT
text = " ".join(i for i in combined_sentiment_df[combined_sentiment_df['sentiment_label']==0]['text'])
wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Wordcloud for negative reviews')
plt.show()

# 4. Text Preprocessing

### - Training Dataset

In [None]:
# Preprocessing + Lemmatization 
combined_sentiment_df['clean_text'] = combined_sentiment_df['text'].apply(lambda x: text_preprocess(x, remove_digits=True, stemmer=Stemmer.WordNet))

In [None]:
# Tokenization
combined_sentiment_df['tokenized_text'] = combined_sentiment_df['clean_text'].apply(lambda x: word_tokenize(x))
combined_sentiment_df.head(5)

In [None]:
# Save preprocessed training data
combined_sentiment_df.to_csv('data_sentiment_preprocessed.csv', index=False)

### - Validation Dataset

In [None]:
# Preprocessing + Lemmatization 
combined_sentiment_df_val['clean_text'] = combined_sentiment_df_val['text'].apply(lambda x: text_preprocess(x, remove_digits=True, stemmer=Stemmer.WordNet))
# Tokenization
combined_sentiment_df_val['tokenized_text'] = combined_sentiment_df_val['clean_text'].apply(lambda x: word_tokenize(x))
combined_sentiment_df_val.head(5)
# Save preprocessed validation data
combined_sentiment_df_val.to_csv('data_sentiment_preprocessed_val.csv', index=False)

# Checkpoint: the notebook can be resumed from this point

In [None]:
# Reload the preprocessed data saved in the previous section
combined_sentiment_df = pd.read_csv('data_sentiment_preprocessed.csv')
combined_sentiment_df_val = pd.read_csv('data_sentiment_preprocessed_val.csv')

#### Handling Missing Values:

In [None]:
print(combined_sentiment_df.isnull().value_counts())
combined_sentiment_df = combined_sentiment_df.dropna()  # Drop rows where preprocessing didn't extract any tokens

In [None]:
print(combined_sentiment_df_val.isnull().value_counts())
combined_sentiment_df_val = combined_sentiment_df_val.dropna()

In [None]:
# Re-tokenize: after the CSV round trip, the token lists are read back as plain strings
combined_sentiment_df['tokenized_text'] = combined_sentiment_df['clean_text'].apply(lambda x: word_tokenize(x))
combined_sentiment_df_val['tokenized_text'] = combined_sentiment_df_val['clean_text'].apply(lambda x: word_tokenize(x))
combined_sentiment_df.head(5)

In [None]:
# Check number of unique words before and after the preprocessing
all_words = ' '.join(combined_sentiment_df['text']).split()
unique_words = set(all_words)

all_words_clean = ' '.join(combined_sentiment_df['clean_text']).split()
unique_words_clean = set(all_words_clean)

labels = ['Unique Words in Raw Text', 'Unique Words in Cleaned Text']
sizes = [len(unique_words), len(unique_words_clean)]

plt.figure(figsize=(5, 5))
plt.pie(sizes, labels=labels, autopct=lambda p: f'{int(p * sum(sizes) / 100)}', startangle=90, colors = ['#FF6347', 'skyblue'])
plt.title('Comparison of Unique Words in Raw vs Cleaned Text')
plt.axis('equal')
plt.show()

# 5. Features Extraction

In [None]:
# Split the data
X_train = combined_sentiment_df.clean_text
y_train = combined_sentiment_df.sentiment_label
X_val = combined_sentiment_df_val.clean_text
y_val = combined_sentiment_df_val.sentiment_label

## 5.1 Basic BoW
+ removing words that occur fewer than 3 times
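
To make the idea concrete, here is a minimal pure-Python sketch of a count-based bag of words with a frequency cutoff. It is not the `basic_bag` helper from `our_feature_extraction`; the name `bag_of_words`, the `min_count` parameter, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(docs, min_count=1):
    """Build a vocabulary and count vectors, keeping words seen at least min_count times."""
    tokenized = [doc.split() for doc in docs]
    totals = Counter(tok for toks in tokenized for tok in toks)
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}

    vectors = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            if tok in index:  # rare words were dropped from the vocabulary
                row[index[tok]] += 1
        vectors.append(row)
    return vocab, vectors


# Toy documents, invented for illustration only
docs = ["good food good service", "bad food", "good value"]
vocab, vectors = bag_of_words(docs, min_count=2)
print(vocab)
print(vectors)
```

Dropping rare words shrinks the feature space and removes terms that appear too seldom to carry a reliable signal, at the cost of losing some distinctive vocabulary.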

In [None]:
# Join the token lists back into whitespace-separated strings for vectorization
X_train_str = [' '.join(tokens) for tokens in combined_sentiment_df.tokenized_text]
X_val_str = [' '.join(tokens) for tokens in combined_sentiment_df_val.tokenized_text]

word_counts, vocab, selected_words, vectorizer, X_train_vec, X_val_vec = basic_bag(X_train_str, X_val_str, min_refs=2, debug=True)

In [None]:
# 10 most common words
word_counts = np.asarray(X_train_vec.sum(axis=0)).flatten()
vocab = np.array(vectorizer.get_feature_names_out())

top_indices = np.argsort(word_counts)[::-1]
top_words = vocab[top_indices[:10]]
top_counts = word_counts[top_indices[:10]]

print('Top 10 most common words:\n')
for word, count in zip(top_words, top_counts):
    print(f"{word}: {count}")

In [None]:
# Sanity check: distinct count values in a single document vector
unique = np.unique(X_train_vec[2].toarray())
print('Unique values:', unique)

## 5.2 1-hot BoW
+ removing words that occur fewer than 3 times

In [None]:
word_counts, vocab, selected_words, vectorizer, X_train_hot, X_val_hot = basic_bag(X_train_str, X_val_str, min_refs=2, ohe=True, debug=True)

In [None]:
# Checking if dataset is binary
unique = np.unique(X_train_hot.toarray())
print('Unique values:', unique)

## 5.3 TF-IDF
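
For intuition, the sketch below computes the textbook weighting `tf * log(N / df)` by hand. This is an assumption about the general technique, not a description of the `tf_idf` helper called in the next cell; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Toy documents, invented for illustration only
docs = ["good food good service", "bad food", "good value"]
weights = tf_idf_weights(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Terms that appear in many documents, like "good" and "food" here, receive a lower weight per occurrence than terms unique to one document, such as "service".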

In [None]:
word_counts, vocab, selected_words, vectorizer, X_train_vec_tf, X_val_vec_tf = tf_idf(X_train_str, X_val_str, min_refs=3, debug=True)

## 5.4 N-grams

### 5.4.1 Bigrams
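
A bigram is a pair of adjacent tokens. The minimal sketch below (the helper name `ngrams` and the sample sentence are illustrative assumptions) shows how they are extracted and why they can help: a bigram such as "not good" keeps a negation that single words lose.

```python
def ngrams(tokens, n=2):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "not good at all".split()
bigrams = ngrams(tokens, n=2)
print(bigrams)
```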

In [None]:
word_counts, vocab, selected_words, vectorizer, X_train_vec_bi, X_val_vec_bi = basic_bag(X_train_str, X_val_str, ngram_range=(2,2), min_refs=2, debug=True)

In [None]:
bigram_vocab = vectorizer.get_feature_names_out()
bigram_counts = np.asarray(X_train_vec_bi.sum(axis=0)).flatten()

bigram_freq = list(zip(bigram_vocab, bigram_counts))

# Sorting
sorted_bigram_freq = sorted(bigram_freq, key=lambda x: x[1], reverse=True)
print("10 most common bigrams:\n")
for bigram, count in sorted_bigram_freq[:10]:
    print(f"{bigram}: {count}")

## 5.5 Word Embeddings

### Word2Vec
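
The similarity checks in the cells below rely on cosine similarity between word vectors. Here is a minimal sketch of that measure; the three-dimensional `toy_vectors` are invented for illustration and are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Invented toy embeddings, not real model output
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "wine": [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_wine = cosine_similarity(toy_vectors["cat"], toy_vectors["wine"])
print(f"cat/dog: {sim_cat_dog:.4f}, cat/wine: {sim_cat_wine:.4f}")
```

Because the measure depends only on direction, not vector length, frequent and rare words can be compared on equal footing.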

#### - Own Word2Vec Model (CBOW)

In [None]:
word2vec_model1 = word2vec_alg(X_train)

In [None]:
# Similarity between tokens
try:
    similarity_tokens = word2vec_model1.wv.similarity('cat', 'dog')
    print(f"Similarity: {similarity_tokens:.4f}")
except KeyError as e:
    print(f"KeyError: {e}")

# Most similar words to the specific token
try:
    similar_words_token = word2vec_model1.wv.most_similar('dog', topn=5)
    print("Most similar words:", similar_words_token)
except KeyError as e:
    print(f"KeyError: {e}")

# Token which doesn't match (Odd-One-Out)
try: 
    not_match_token = word2vec_model1.wv.doesnt_match(['cat','dog','wine'])
    print('Not matching word:', not_match_token)
except KeyError as e:
    print(f"KeyError: {e}")

# Perform an analogy task
try:
    analogy_result = word2vec_model1.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print("Analogy result:", analogy_result)
except KeyError as e:
    print(f"KeyError: {e}")

In [None]:
print('Number of unique words in the Word2vec Model:',len(word2vec_model1.wv.key_to_index))

In [None]:
# Visualization
# List of words that we want to display
words_to_explore= ['woman', 'man', 'queen', 'king', 'human', 'person', 'girl', 'child', 'boy', 'salad', 'lettuce', 'tomato', 'soup', 'turnip', 'arugula', 'pepper', 'greens', 'barley', 'bean', 'stew', 'carrot']
visualize_word_embeddings(word2vec_model1, words_to_explore)

#### - Own Word2Vec Model (Skip Gram)
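
The two Word2Vec variants differ in how training examples are framed: CBOW predicts the center word from its context, while skip-gram predicts each context word from the center word. The sketch below only builds these example pairs for a toy sentence; `training_pairs` and the window size are illustrative assumptions and no model is trained.

```python
def training_pairs(tokens, window=2):
    """Return (center, context_words) for every position in a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((center, left + right))
    return pairs


tokens = ["the", "food", "was", "great"]
pairs = training_pairs(tokens, window=1)

# CBOW: the whole context is the input, the center word is the target
cbow_examples = [(context, center) for center, context in pairs]
# Skip-gram: the center word is the input, each context word is a separate target
skipgram_examples = [(center, ctx) for center, context in pairs for ctx in context]

print(cbow_examples)
print(skipgram_examples)
```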

In [None]:
word2vec_model2 = word2vec_alg(X_train, sg=True)

In [None]:
# Similarity between tokens
try:
    similarity_tokens = word2vec_model2.wv.similarity('cat', 'dog')
    print(f"Similarity: {similarity_tokens:.4f}")
except KeyError as e:
    print(f"KeyError: {e}")

# Most similar words to the specific token
try:
    similar_words_token = word2vec_model2.wv.most_similar('dog', topn=5)
    print("Most similar words:", similar_words_token)
except KeyError as e:
    print(f"KeyError: {e}")

# Token which doesn't match (Odd-One-Out)
try: 
    not_match_token = word2vec_model2.wv.doesnt_match(['cat','dog','wine'])
    print('Not matching word:', not_match_token)
except KeyError as e:
    print(f"KeyError: {e}")

# Perform an analogy task
try:
    analogy_result = word2vec_model2.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print("Analogy result:", analogy_result)
except KeyError as e:
    print(f"KeyError: {e}")

In [None]:
# Visualization
# List of words that we want to display
words_to_explore= ['woman', 'man', 'queen', 'king', 'human', 'person', 'girl', 'child', 'boy', 'salad', 'lettuce', 'tomato', 'soup', 'turnip', 'arugula', 'pepper', 'greens', 'barley', 'bean', 'stew', 'carrot']
visualize_word_embeddings(word2vec_model2, words_to_explore)

#### - Fine-Tuning Word2Vec

In [None]:
# Import and convert the pretrained model
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec("model/glove.6B.100d.txt", "model/glove_model2.txt")

In [None]:
word2vec_model3 = word2vec_alg(X_train, pretrained_path='model/glove_model2.txt', sg=True) 

In [None]:
# Similarity between tokens
try:
    similarity_tokens = word2vec_model3.wv.similarity('cat', 'dog')
    print(f"Similarity: {similarity_tokens:.4f}")
except KeyError as e:
    print(f"KeyError: {e}")

# Most similar words to the specific token
try:
    similar_words_token = word2vec_model3.wv.most_similar('dog', topn=5)
    print("Most similar words:", similar_words_token)
except KeyError as e:
    print(f"KeyError: {e}")

# Token which doesn't match (Odd-One-Out)
try: 
    not_match_token = word2vec_model3.wv.doesnt_match(['wine','beer','clock'])
    print('Not matching word:', not_match_token)
except KeyError as e:
    print(f"KeyError: {e}")

# Perform an analogy task
try:
    analogy_result = word2vec_model3.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print("Analogy result:", analogy_result)
except KeyError as e:
    print(f"KeyError: {e}")

In [None]:
print('Number of unique words in the Word2vec Model:',len(word2vec_model3.wv.key_to_index))

In [None]:
# Visualization
# List of words that we want to display
words_to_explore= ['woman', 'man', 'queen', 'king', 'human', 'person', 'girl', 'child', 'boy', 'salad', 'lettuce', 'tomato', 'soup', 'turnip', 'arugula', 'pepper', 'greens', 'barley', 'bean', 'stew', 'carrot']
visualize_word_embeddings(word2vec_model3, words_to_explore)

### BERT Embeddings

In [None]:
# Split the data
X_train = combined_sentiment_df.text.astype(str).tolist()
y_train = combined_sentiment_df.sentiment_label
X_val = combined_sentiment_df_val.text.astype(str).tolist()
y_val = combined_sentiment_df_val.sentiment_label

# Generate embeddings
X_train_vec, X_val_vec = bert_embeddings(X_train, X_val)

# 6. Feature Selection

In [None]:
combined_sentiment_df = pd.read_csv("data_sentiment_preprocessed.csv")
combined_sentiment_df_val = pd.read_csv("data_sentiment_preprocessed_val.csv")

from our_feature_extraction import basic_bag, tf_idf
# Split the data
X_train = combined_sentiment_df.tokenized_text
y_train = combined_sentiment_df.sentiment_label
X_val = combined_sentiment_df_val.tokenized_text
y_val = combined_sentiment_df_val.sentiment_label
word_counts, vocab, selected_words, vectorizer, X_train_vec, X_val_vec = basic_bag(X_train, X_val, ohe=True, debug=True)

In [None]:
from our_feature_selection import *

print(X_train_vec.shape)
sel, X_train_redux, X_test_redux = feat_filtering(X_train_vec, y_train, X_val_vec)
print(X_train_redux.shape)

In [None]:
sel, X_train_rfe, X_test_rfe = rfe(X_train_vec, y_train, X_val_vec)
print(X_train_rfe.shape)

In [None]:
nb(X_train_vec, X_val_vec, y_train, y_val)

In [None]:
nb(X_train_rfe, X_test_rfe, y_train, y_val)

In [None]:
nb(X_train_redux, X_test_redux, y_train, y_val)

# 7. Modeling

## 7.1 Naive Bayes Model

### - Basic BoW

In [None]:
nb(X_train_vec, X_val_vec, y_train, y_val)

### - 1-hot BoW

In [None]:
nb(X_train_hot, X_val_hot, y_train, y_val)

### - TF-IDF

In [None]:
nb(X_train_vec_tf, X_val_vec_tf, y_train, y_val)

### - Bigrams

In [None]:
nb(X_train_vec_bi, X_val_vec_bi, y_train, y_val)

## 7.2 Support Vector Machine (SVM)

### - Basic BoW

In [None]:
support_vector_machine(X_train_vec, X_val_vec, y_train, y_val)

### - 1-hot BoW

In [None]:
support_vector_machine(X_train_hot, X_val_hot, y_train, y_val)

### - TF-IDF

In [None]:
support_vector_machine(X_train_vec_tf, X_val_vec_tf, y_train, y_val)

### - Bigrams

In [None]:
support_vector_machine(X_train_vec_bi, X_val_vec_bi, y_train, y_val)

## 7.3 Random Forest

### - Basic BoW

In [None]:
random_forest(X_train_vec, X_val_vec, y_train, y_val)

### - 1-hot BoW

In [None]:
random_forest(X_train_hot, X_val_hot, y_train, y_val)

### - TF-IDF

In [None]:
random_forest(X_train_vec_tf, X_val_vec_tf, y_train, y_val)

### - Bigrams

In [None]:
random_forest(X_train_vec_bi, X_val_vec_bi, y_train, y_val)

## 7.4 Embedding Modeling

### NB
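
The cell below rescales the embeddings to a fixed range before fitting Naive Bayes. A plausible reason, stated here as an assumption about the `nb` helper, is that some Naive Bayes variants require non-negative features while dense embeddings contain negative values. The sketch shows min-max scaling by hand, with the statistics learned on the training rows only; `fit_min_max`, `apply_min_max`, and the toy rows are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and range from the training rows."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    return mins, ranges


def apply_min_max(rows, mins, ranges):
    """Rescale rows with previously learned minimums and ranges."""
    return [[(x - lo) / rng for x, lo, rng in zip(row, mins, ranges)]
            for row in rows]


# Toy feature rows, invented for illustration only
train = [[-1.0, 2.0], [0.0, 4.0], [1.0, 6.0]]
val = [[0.5, 8.0]]

mins, ranges = fit_min_max(train)
train_scaled = apply_min_max(train, mins, ranges)
val_scaled = apply_min_max(val, mins, ranges)
print(train_scaled)
print(val_scaled)
```

Validation values can land outside the [0, 1] interval because the scaling parameters come from the training data alone, which keeps information from the validation set out of the fitted pipeline.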

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_vec)
X_val_scaled = scaler.transform(X_val_vec)


nb(X_train_scaled, X_val_scaled, y_train, y_val)

sel, X_train_redux, X_val_redux = feat_filtering(X_train_scaled, y_train, X_val_scaled, k=2)

nb(X_train_redux, X_val_redux, y_train, y_val)

### SVM

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_vec)
X_val_scaled = scaler.transform(X_val_vec)

support_vector_machine(X_train_scaled, X_val_scaled, y_train, y_val) # Best Parameters:  {'C': 100, 'gamma': 0.01, 'kernel': 'sigmoid'} for bert 

In [None]:
from our_feature_selection import rfe

rfe(X_train_vec, y_train, X_val_vec, min_features_to_select=int(X_train_vec.shape[1]*0.9), save_file=True)

In [None]:
import pickle
with open('./rfecv_svm.pickle', "rb") as f:
    rfe_sel = pickle.load(f)

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

best_model = SVC(C=100, gamma=0.01, kernel='sigmoid')
best_model.fit(X_train_scaled, y_train)

y_pred = best_model.predict(X_val_scaled)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val, y_pred))
print("Accuracy:", accuracy_score(y_val, y_pred))

sel, X_train_redux, X_val_redux = feat_filtering(X_train_scaled, y_train, X_val_scaled, k=95)  # use the scaled validation set, matching the scaled training set
best_model.fit(X_train_redux, y_train)

y_pred = best_model.predict(X_val_redux)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val, y_pred))
print("Accuracy:", accuracy_score(y_val, y_pred))

X_train_redux = rfe_sel.transform(X_train_scaled)
X_val_redux = rfe_sel.transform(X_val_scaled)
best_model.fit(X_train_redux, y_train)

y_pred = best_model.predict(X_val_redux)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val, y_pred))
print("Accuracy:", accuracy_score(y_val, y_pred))

### LR

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_vec)
X_val_scaled = scaler.transform(X_val_vec)


best_model = LogisticRegression(max_iter=5)
best_model.fit(X_train_scaled, y_train)
y_pred = best_model.predict(X_val_scaled)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val, y_pred))
print("Accuracy:", accuracy_score(y_val, y_pred))

### RF

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_vec, y_train)
y_pred = rf.predict(X_val_vec)

print("Classification Report:")
print(classification_report(y_val, y_pred))
print("Accuracy:", accuracy_score(y_val, y_pred))

In [None]:
random_forest(X_train_vec, X_val_vec, y_train, y_val)