
In [None]:
#!pip install pandas matplotlib scikit-learn numpy


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np
import ast

print("hello world")

## Task 1: Load and Explore the Dataset
The dataset is provided as a text file. I converted it to CSV for easier processing and loaded it into a pandas DataFrame. Next, I cleaned the dataset by replacing missing values with empty strings.
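
A minimal sketch of the parsing step, assuming each line of the text file is a Python-style `(text, label)` tuple followed by a comma. The sample lines and the `parse_lines` helper are hypothetical and only illustrate the idea; the real file may be laid out differently.

```python
import ast

import pandas as pd

# Hypothetical sample lines (assumption: one "(text, label)," tuple per line).
sample_lines = [
    '("I love this phone", "positive"),',
    '("Battery died in a day", "negative"),',
    "",
    "not a tuple",
]


def parse_lines(lines):
    """Parse tuple-like lines into a DataFrame and collect unparseable lines."""
    rows, skipped = [], []
    for raw in lines:
        line = raw.strip().rstrip(",")  # drop whitespace and the trailing comma
        if not line:
            continue  # ignore blank lines
        try:
            # literal_eval only accepts Python literals, so it is safer than eval
            text, label = ast.literal_eval(line)
            rows.append((text, label))
        except (ValueError, SyntaxError, TypeError):
            skipped.append(raw)
    return pd.DataFrame(rows, columns=["text", "label"]), skipped


parsed_df, skipped = parse_lines(sample_lines)
parsed_df = parsed_df.fillna("")  # replace missing values with empty strings
print(parsed_df)
print("Skipped:", skipped)
```

Collecting the skipped lines, rather than only printing them, makes it easy to check how much of the file failed to parse.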

In [None]:
# load the sentiment analysis dataset from a .txt file and save it as a .csv file
# Clean the data and handle potential parsing issues
# Path to your input .txt file
input_file = "Product_Sentiment.txt"
output_file = "Product_Sentiment.csv"

data = []

with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")  # remove trailing comma
        if line:  # skip empty lines
            try:
                # Safely evaluate the tuple string into a Python tuple
                text, label = ast.literal_eval(line)
                data.append((text, label))
            except Exception as e:
                print(f"Skipping line due to error: {line} -> {e}")

# Create DataFrame
df = pd.DataFrame(data, columns=["text", "label"])

# Save as CSV
df.to_csv(output_file, index=False, encoding="utf-8")

print("✅ Dataset loaded and saved as CSV!")
print(df.head())

# Read the CSV file back and replace missing or null values with empty strings
df = pd.read_csv("Product_Sentiment.csv")
df.fillna('', inplace=True)
df.reset_index(drop=True, inplace=True)  # make sure the index runs 0..n-1
print("✅ Data cleaned!")
print(df.head())


## Task 2: Build a Traditional ML Classifier
First, the data needs to be split into training and test sets. I reserved 80% of the dataset for training and 20% for testing. The split is stratified on the label so that both sets keep the same class proportions.
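
To illustrate what the stratified 80/20 split does, here is a small sketch on a hypothetical toy DataFrame (15 negative and 5 positive rows, not the project data). The names `toy_df`, `toy_train`, and `toy_test` are assumptions made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 3:1 class imbalance.
toy_df = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": ["negative"] * 15 + ["positive"] * 5,
})

# stratify= keeps the label proportions the same in both parts of the split.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)

print("Train:", toy_train["label"].value_counts().to_dict())
print("Test:", toy_test["label"].value_counts().to_dict())
```

With these counts, both parts should keep the 3:1 ratio of negative to positive rows. Without `stratify`, a small test set could end up with very few or no positive samples.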

### Prepare training and testing datasets

In [None]:
# Load your dataset (CSV with columns: "text", "label")
df = pd.read_csv("Product_Sentiment.csv")

# Split into train (80%) and test (20%)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

# Save to CSV files
train_df.to_csv("train.csv", index=False, encoding="utf-8")
test_df.to_csv("test.csv", index=False, encoding="utf-8")

print("✅ Dataset successfully split!")
print("Train size:", len(train_df))
print("Test size:", len(test_df))

### Create SVM classifier and train. Then apply the test data
(I used ChatGPT to understand the steps.)
This task requires building a pipeline in which the text is converted into numerical features and then passed to an SVM classifier for training and testing.

To convert text into numerical features, ChatGPT recommended the TF-IDF vectorizer. It also suggested LinearSVC as the base classifier, since linear SVMs are widely used for text classification.
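
As a rough illustration of the two pipeline stages, the sketch below first shows the numeric matrix that `TfidfVectorizer` produces and then chains it with `LinearSVC`. The toy corpus and the `toy_*` names are hypothetical, so the output says nothing about the project dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, not the project dataset.
toy_texts = [
    "great phone love it",
    "terrible battery hate it",
    "love the great screen",
    "hate the terrible camera",
]
toy_labels = ["positive", "negative", "positive", "negative"]

# Stage 1: text -> sparse matrix with one row per document, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(toy_texts)
print("Feature matrix shape:", X.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))

# Stage 2: the same vectorizer chained with a linear SVM in a single pipeline.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC(random_state=42)),
])
toy_pipeline.fit(toy_texts, toy_labels)

toy_preds = toy_pipeline.predict(["love this screen", "hate this battery"])
print("Predictions:", list(toy_preds))
```

Wrapping both stages in a `Pipeline` means the vectorizer is fitted only on the data passed to `fit`, which matters later when the pipeline is used inside cross-validation.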


In [None]:


# Load train and test datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Build a pipeline: TF-IDF vectorizer + Linear SVM
# svm_pipeline = Pipeline([
#    ('tfidf', TfidfVectorizer(stop_words='english')),
#    ('svm', LinearSVC(random_state=42))
# ])
# Load dataset
df = pd.read_csv("Product_Sentiment.csv")

# Define stratified k-fold (preserves class balance)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Bag of Words features (CountVectorizer) are used here instead of TF-IDF
svm_pipeline = Pipeline([
     ('count', CountVectorizer(stop_words='english')),
     ('svm', LinearSVC(random_state=42))
])

svm_scores = cross_val_score(svm_pipeline, df['text'], df['label'], cv=kfold, scoring='accuracy')
print("SVM + BoW - Accuracy per fold:", svm_scores)
print("SVM + BoW - Mean Accuracy:", svm_scores.mean())

# Train the final model on the full dataset (the cross-validation scores above are the performance estimate)
svm_pipeline.fit(df['text'], df['label'])

# Evaluate on test set
#y_pred = svm_pipeline.predict(test_df['text'])
#print("✅ Accuracy:", accuracy_score(test_df['label'], y_pred))
#print("\nClassification Report:\n", classification_report(test_df['label'], y_pred))

# Example prediction
example = ["I really love this new phone!", "This is the worst product I’ve ever bought."]
predictions = svm_pipeline.predict(example)
for text, label in zip(example, predictions):
    print(f"Text: {text} -> Predicted Sentiment: {label}")

### Analysis of SVM with TF-IDF vectorizer
Looking at the results, especially the precision of 0.00, the model performed poorly at detecting positive sentiment and did somewhat better on negative sentiment. Despite these scores, its predictions on the two example sentences were correct.

After analyzing this with the help of ChatGPT, I understood that the TF-IDF vectorizer usually works better with larger datasets. Because this dataset is very small, the TF-IDF weights were unstable and did not represent the positive-sentiment test data well.

ChatGPT also suggested using cross-validation instead of a single 80-20 split to get a better accuracy score on a small dataset. Another option was to try a simpler model such as Naive Bayes.

I will first try to improve performance with a Bag of Words approach and then test the suggestions above.

### Analysis of SVM with Count Vectorizer
After swapping TF-IDF for CountVectorizer to implement the Bag of Words approach, accuracy went up slightly (from 37% to 50%) and recall improved to 80%, likely because of the small, imbalanced dataset. This means the model rarely misses negative sentiment, but it sometimes labels a positive sentiment as negative.
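
To make the relationship between accuracy and recall concrete, here is a small worked example with made-up labels. These are not the notebook's actual predictions: the counts (5 negative and 3 positive test samples) are assumptions chosen only so that the example lands on 50% accuracy and 80% recall for the negative class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, NOT the real test results: the model predicts
# "negative" for almost every sample.
y_true = ["negative"] * 5 + ["positive"] * 3
y_pred = ["negative"] * 4 + ["positive"] + ["negative"] * 3

acc = accuracy_score(y_true, y_pred)
neg_recall = recall_score(y_true, y_pred, pos_label="negative")
neg_precision = precision_score(y_true, y_pred, pos_label="negative")
pos_recall = recall_score(y_true, y_pred, pos_label="positive")

print(f"Accuracy:           {acc:.2f}")            # 4 of 8 correct
print(f"Negative recall:    {neg_recall:.2f}")     # 4 of 5 negatives found
print(f"Negative precision: {neg_precision:.2f}")  # 4 of 7 'negative' calls correct
print(f"Positive recall:    {pos_recall:.2f}")     # 0 of 3 positives found
```

In this toy case the high negative recall comes from predicting "negative" almost everywhere, which is also why every positive sample is missed.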

### Alternative: Naive Bayes
Let's see how this model performs.

When I ran this model with Bag of Words, it still gave 37% accuracy: the same as SVM + TF-IDF, and worse than SVM + BoW at 50%.

### Adding k-fold cross validation
Here I use stratified k-fold cross-validation so that every sample is used for both training and testing while each fold keeps the class balance, in the hope of getting better precision and accuracy.

For this, I modified the SVM + BoW implementation above.

This approach improved the mean accuracy across 5 folds to *67%*.
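
One way to compare the three model variants on equal footing is to score them all with the same stratified folds. The sketch below does this on a hypothetical toy dataset; the texts, labels, and the `candidates` dictionary are assumptions for illustration, and the printed accuracies will not match the figures reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 5 positive and 5 negative reviews.
toy_texts = [
    "love this phone", "great screen quality", "excellent battery life",
    "amazing camera", "wonderful sound",
    "hate this phone", "terrible screen quality", "awful battery life",
    "horrible camera", "poor sound",
]
toy_labels = ["positive"] * 5 + ["negative"] * 5

# The same folds are reused for every model so the scores are comparable.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "SVM + TF-IDF": Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("svm", LinearSVC(random_state=42)),
    ]),
    "SVM + BoW": Pipeline([
        ("count", CountVectorizer(stop_words="english")),
        ("svm", LinearSVC(random_state=42)),
    ]),
    "NB + BoW": Pipeline([
        ("count", CountVectorizer(stop_words="english")),
        ("nb", MultinomialNB()),
    ]),
}

mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, toy_texts, toy_labels, cv=kfold, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy over {kfold.get_n_splits()} folds = {scores.mean():.2f}")
```

Because each pipeline is refitted inside every fold, the vectorizer never sees the held-out samples, so the scores are not inflated by vocabulary leakage.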


In [None]:
# Load train and test datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Bag of Words features (CountVectorizer) + Multinomial Naive Bayes
nb_pipeline = Pipeline([
     ('count', CountVectorizer(stop_words='english')),
     ('nb', MultinomialNB())
])

# Train the model
nb_pipeline.fit(train_df['text'], train_df['label'])

# Evaluate on test set
y_pred = nb_pipeline.predict(test_df['text'])
print("✅ Accuracy:", accuracy_score(test_df['label'], y_pred))
print("\nClassification Report:\n", classification_report(test_df['label'], y_pred))

# Example prediction
example = ["I really love this new phone!", "This is the worst product I’ve ever bought."]
predictions = nb_pipeline.predict(example)
for text, label in zip(example, predictions):
    print(f"Text: {text} -> Predicted Sentiment: {label}")