##Baseline Model (TF-IDF + Logistic Regression)

In [3]:
import pandas as pd
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
import nltk

In [4]:
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Text cleaning function
def clean_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", " ", text)
    text = re.sub(r"\d+", "", text)
    text = " ".join([word for word in text.split() if word not in STOPWORDS])
    return text

In [6]:
# Load data
df = pd.read_csv("/content/drive/MyDrive/exit_exam/Reviews.csv")
df = df[['Text', 'Score']].dropna()

In [7]:
# Binary label: Positive (Score >= 4), Negative (Score <= 2)
df = df[df['Score'] != 3]
df['Sentiment'] = df['Score'].apply(lambda x: 'Positive' if x >= 4 else 'Negative')

In [8]:
# Clean text
df['Clean_Text'] = df['Text'].apply(clean_text)

In [9]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(df['Clean_Text'], df['Sentiment'], test_size=0.2, random_state=42)

In [10]:
# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [11]:
# Logistic Regression Model
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf, y_train)
y_pred = lr.predict(X_test_tfidf)

In [12]:
# Evaluation
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    Negative       0.85      0.68      0.76     16379
    Positive       0.94      0.98      0.96     88784

    accuracy                           0.93    105163
   macro avg       0.90      0.83      0.86    105163
weighted avg       0.93      0.93      0.93    105163



Count Vectorizer: simply counts the total frequency or occurance of words in each document, which could potentially lead to insignificant words getting higher importance in features. whereas;
TF-IDF (Term Frequency–Inverse Document Frequency): adds weights not just by frequency but also by rarity of the words, there by giving unique words higher importance than regularly occuring ones. There by providing better results.

##Word Embedding Model (Word2Vec + Random Forest)

In [13]:
!pip install gensim



In [14]:
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

In [16]:
# tokenized texts
tokenized_reviews = df['Clean_Text'].apply(lambda x: x.split())

In [17]:
# Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=2, workers=4, seed=42)

In [18]:
# Averaged Word2Vec vector for a review
def get_review_vector(tokens, model, vector_size):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(vector_size)

In [19]:
# Create feature vectors
X_w2v = np.array([get_review_vector(tokens, w2v_model, 100) for tokens in tokenized_reviews])

In [20]:
# Train-test split
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(X_w2v, df['Sentiment'], test_size=0.2, random_state=42)

In [23]:
# Random Forest classifier
rf = RandomForestClassifier(n_estimators=20, max_depth=10, n_jobs=-1, random_state=42) #reduced to fix processing speed
rf.fit(X_train_w2v, y_train_w2v)
y_pred_rf = rf.predict(X_test_w2v)

In [24]:
# Evaluation
print(classification_report(y_test_w2v, y_pred_rf))

              precision    recall  f1-score   support

    Negative       0.89      0.33      0.48     16379
    Positive       0.89      0.99      0.94     88784

    accuracy                           0.89    105163
   macro avg       0.89      0.66      0.71    105163
weighted avg       0.89      0.89      0.87    105163



Key advantage of using Word2Vec embeddings over TF-IDFis that it captures the semantic meaning and relationships between words. and it is done through mapping them to continuous vector spaces based on their context in real text. Whereas TF-IDF considers only frequency and not semantic relationship.

##Deep Learning Model (RNN/LSTM)

In [26]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

In [28]:
# Parameters (limitted for speedy processing)
max_words = 5000
max_len = 100
embedding_dim = 64
lstm_units = 64

In [29]:
# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['Clean_Text'])
X_seq = tokenizer.texts_to_sequences(df['Clean_Text'])
X_pad = pad_sequences(X_seq, maxlen=max_len)

In [30]:
# Encode labels
le = LabelEncoder()
y_enc = le.fit_transform(df['Sentiment'])

In [31]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_pad, y_enc, test_size=0.2, random_state=42)

In [35]:
# Build model
model = Sequential([
    Embedding(input_dim=max_words, output_dim=embedding_dim),
    LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [37]:
# Train model (used fewer epochs and larger batch for speed)
history = model.fit(X_train, y_train, epochs=2, batch_size=512, validation_data=(X_test, y_test))

Epoch 1/2
[1m822/822[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m752s[0m 906ms/step - accuracy: 0.8917 - loss: 0.2828 - val_accuracy: 0.9319 - val_loss: 0.1748
Epoch 2/2
[1m822/822[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m733s[0m 892ms/step - accuracy: 0.9336 - loss: 0.1712 - val_accuracy: 0.9368 - val_loss: 0.1634


In [38]:
# Model Summary
model.summary()

LSTMs are preferred over RNNs because LSTMs can effectively learn long-range dependencies in sequences. Simple RNNs svanishing gradient problem makes it difficult for the model to learn relationships between distant words in a sentence. LSTM networks, through their gated mechanisms, mitigate this. Resulting in better performance on tasks where understanding context and word order is important, such as sentiment analysis.

##Comparative Analysis and Recommendation

In [40]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# If y_test and y_pred (TF-IDF model) are strings, convert to binary
y_test_bin = (y_test == 'Positive').astype(int)
y_pred_tfidf_bin = (y_pred == 'Positive').astype(int)
y_pred_tfidf_proba = lr.predict_proba(X_test_tfidf)[:, 1]  # Probability for ROC-AUC

acc_tfidf = accuracy_score(y_test_bin, y_pred_tfidf_bin)
f1_tfidf = f1_score(y_test_bin, y_pred_tfidf_bin)
roc_auc_tfidf = roc_auc_score(y_test_bin, y_pred_tfidf_proba)



In [41]:
y_test_w2v_bin = (y_test_w2v == 'Positive').astype(int)
y_pred_rf_bin = (y_pred_rf == 'Positive').astype(int)
y_pred_rf_proba = rf.predict_proba(X_test_w2v)[:, 1]

acc_rf = accuracy_score(y_test_w2v_bin, y_pred_rf_bin)
f1_rf = f1_score(y_test_w2v_bin, y_pred_rf_bin)
roc_auc_rf = roc_auc_score(y_test_w2v_bin, y_pred_rf_proba)

In [42]:

y_pred_lstm = (model.predict(X_test) > 0.5).astype(int).reshape(-1)
y_test_lstm = y_test

acc_lstm = accuracy_score(y_test_lstm, y_pred_lstm)
f1_lstm = f1_score(y_test_lstm, y_pred_lstm)
roc_auc_lstm = roc_auc_score(y_test_lstm, model.predict(X_test).reshape(-1))

[1m3287/3287[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m107s[0m 32ms/step
[1m3287/3287[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m96s[0m 29ms/step


In [43]:
import pandas as pd

results = pd.DataFrame({
    "Model": ["TF-IDF + Logistic Regression", "Word2Vec + Random Forest", "LSTM"],
    "Accuracy": [acc_tfidf, acc_rf, acc_lstm],
    "F1-Score": [f1_tfidf, f1_rf, f1_lstm],
    "ROC-AUC": [roc_auc_tfidf, roc_auc_rf, roc_auc_lstm]
})

results.round(4)
print(results)

                          Model  Accuracy  F1-Score   ROC-AUC
0  TF-IDF + Logistic Regression  0.125434  0.000000       NaN
1      Word2Vec + Random Forest  0.889857  0.938341  0.925168
2                          LSTM  0.936812  0.963150  0.964057


Based on the evaluation metrics above, the LSTM model significantly outperforms the other approaches in terms of accuracy, F1-Score, and ROC-AUC.

###Justification

Performance: The LSTM model achieves the highest scores across all metrics, demonstrating its ability to capture complex patterns and long-term dependencies in the text.

Model Complexity and Training Time: LSTM models are more complex and require more computational resources and training time compared to classical models.

Interpretability: While logistic regression models are highly interpretable, their performance here is poor. The LSTM model offers less interpretability but provides much better results.

Trade-Off: If model performance is priority and  increased complexity and resource requirements is manageable, LSTM is the recommended model for deployment.  
If interpretability and simplicity are absolutely necessary, we consider the Word2Vec + Random Forest model, which strikes a balance between performance and classical approaches.

Recommendation: Deploy the LSTM model

##Saving Model

In [44]:
# Save LSTM model
model.save('lstm_sentiment_model.h5')

# Save tokenizer
import pickle
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

