
# IE6483 Project 1 — Sentiment Classification (End-to-End, English)

This notebook is ready to run on a single NVIDIA GPU (e.g., RTX A6000) or Apple Silicon (MPS).
It covers: environment sanity checks, data loading, a TF-IDF+LogReg baseline, and a BERT fine-tuning pipeline.
It also generates `submission.csv` for the test set.



## Environment notes (read once)

- Use Python 3.10/3.11. Install a PyTorch build that matches your CUDA version (A6000 → CUDA 11.8/12.x).
- Transformers >= **4.46** expects `eval_strategy` in place of the deprecated `evaluation_strategy` in `TrainingArguments`; this notebook adapts automatically.
- On Ampere GPUs (A6000), enabling **TF32** (and BF16 if supported) can speed up training at a small cost in matmul precision. See the `torch.set_float32_matmul_precision('high')` docs.
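
One way to handle the `eval_strategy` / `evaluation_strategy` rename is to check which keyword the constructor accepts. This is a minimal sketch and not necessarily how the rest of this notebook does it: `pick_eval_kwarg` is a hypothetical helper, and `OldArgs` / `NewArgs` are stand-in classes that only mimic the two signatures so the snippet runs without Transformers installed.

```python
import inspect

def pick_eval_kwarg(cls) -> str:
    """Return the evaluation-strategy keyword accepted by cls.__init__."""
    params = inspect.signature(cls.__init__).parameters
    # Prefer the newer name when both are accepted.
    if "eval_strategy" in params:
        return "eval_strategy"
    if "evaluation_strategy" in params:
        return "evaluation_strategy"
    raise TypeError(f"{cls.__name__} accepts neither keyword")

# Stand-ins that mimic the old and new signatures (hypothetical, illustration only).
class OldArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.output_dir, self.strategy = output_dir, evaluation_strategy

class NewArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.output_dir, self.strategy = output_dir, eval_strategy

for cls in (OldArgs, NewArgs):
    # Build the kwargs dict with whichever keyword this class understands.
    kwargs = {"output_dir": "outputs", pick_eval_kwarg(cls): "epoch"}
    args = cls(**kwargs)
    print(cls.__name__, "->", pick_eval_kwarg(cls), "=", args.strategy)
```

With Transformers installed, the same helper should be applicable to `TrainingArguments` itself, since a dataclass exposes its fields as `__init__` parameters; treat that as an assumption to verify against the installed version.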


In [1]:

import os, platform, sys, json, random, math, warnings
from pathlib import Path
import numpy as np, pandas as pd
import torch

print("Python:", sys.version)
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("MPS available:", torch.backends.mps.is_available())


Python: 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0]
PyTorch: 2.7.1+cu118 | CUDA available: True
GPU: NVIDIA RTX A6000
MPS available: False


In [2]:

import random
import numpy as np
import torch
from transformers import set_seed

SEED = 42
set_seed(SEED)  # seeds random, NumPy, and PyTorch (CPU and CUDA) in one call
# The explicit calls below repeat what set_seed already does; kept as a safeguard.
random.seed(SEED); np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
# Enable the TF32 path for float32 matmuls on Ampere+ (faster, slightly lower precision)
torch.set_float32_matmul_precision("high")  # https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html
print("float32 matmul precision:", torch.get_float32_matmul_precision())


float32 matmul precision: high
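
Whether TF32 and BF16 are worth enabling depends on the hardware. The sketch below separates that decision from the hardware query so the logic can be checked without a GPU. Both function names are hypothetical; the dictionary keys mirror the `tf32` and `bf16` options of `TrainingArguments`, and the "compute capability 8.0 or higher" rule for Ampere is the assumption the decision rests on.

```python
def mixed_precision_flags(cuda_available: bool, capability_major: int,
                          bf16_supported: bool) -> dict:
    """Decide precision flags from plain values (no PyTorch needed)."""
    # Assumption: Ampere and newer GPUs report compute capability >= 8.0.
    ampere_or_newer = cuda_available and capability_major >= 8
    return {
        "tf32": ampere_or_newer,
        "bf16": cuda_available and bf16_supported,
    }

def detect_precision_flags() -> dict:
    """Query the local GPU through PyTorch and apply the rule above."""
    import torch  # imported lazily so the pure helper works without PyTorch
    if not torch.cuda.is_available():
        return mixed_precision_flags(False, 0, False)
    major, _minor = torch.cuda.get_device_capability(0)
    return mixed_precision_flags(True, major, torch.cuda.is_bf16_supported())

# An RTX A6000 reports compute capability 8.6 and supports BF16.
print(mixed_precision_flags(True, 8, True))
# CPU-only or MPS machines get both flags switched off.
print(mixed_precision_flags(False, 0, False))
```

The result of `detect_precision_flags()` could then be unpacked into the training configuration, for example `TrainingArguments(..., **flags)`.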



## Load data

Expected schema:
- `train.json`: list of dicts with `reviews` (text) and `sentiments` (0/1)
- `test.json`: list of dicts with `reviews`
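
The schema above can be checked record by record before building any DataFrame. This is a minimal sketch using only the standard library. `validate_records` is a hypothetical helper, the two toy payloads are invented for illustration, and the check is stricter than the column-level asserts used in this notebook because it rejects non-string reviews instead of cleaning them later.

```python
import json

def validate_records(records, require_label: bool) -> int:
    """Check the expected schema and return the number of records."""
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or not isinstance(rec.get("reviews"), str):
            raise ValueError(f"record {i}: missing or non-string 'reviews'")
        if require_label and rec.get("sentiments") not in (0, 1):
            raise ValueError(f"record {i}: 'sentiments' must be 0 or 1")
    return len(records)

# Toy payloads (made up) standing in for train.json and test.json.
toy_train = json.loads(
    '[{"reviews": "Great fit", "sentiments": 1},'
    ' {"reviews": "Broke in a week", "sentiments": 0}]'
)
toy_test = json.loads('[{"reviews": "Arrived on time"}]')

print(validate_records(toy_train, require_label=True))
print(validate_records(toy_test, require_label=False))
```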


In [3]:

from pathlib import Path
import json
import pandas as pd

# Paths are relative to the notebook's working directory.
train_path = "../train.json"
test_path  = "../test.json"
print("train_path:", train_path)
print("test_path :", test_path)

with open(train_path, "r", encoding="utf-8") as f:
    train_data = json.load(f)
with open(test_path, "r", encoding="utf-8") as f:
    test_data = json.load(f)

train_df = pd.DataFrame(train_data)
test_df  = pd.DataFrame(test_data)

assert {"reviews","sentiments"}.issubset(train_df.columns), "train.json must have reviews & sentiments"
assert "reviews" in test_df.columns, "test.json must have reviews"

print(train_df.head(2))
print(train_df['sentiments'].value_counts(dropna=False))
print("train size:", len(train_df), "| test size:", len(test_df))


train_path: ../train.json
test_path : ../test.json
                                             reviews  sentiments
0  I bought this belt for my daughter in-law for ...           1
1  The size was perfect and so was the color.  It...           1
sentiments
1    6319
0    1082
Name: count, dtype: int64
train size: 7401 | test size: 1851


In [4]:

from sklearn.model_selection import train_test_split

def clean_text(x):
    """Strip surrounding whitespace; map missing or non-string values to ""."""
    if not isinstance(x, str):
        return ""
    return x.strip()

train_df["text"] = train_df["reviews"].apply(clean_text)
test_df["text"]  = test_df["reviews"].apply(clean_text)

train_split, valid_split = train_test_split(
    train_df, test_size=0.2, random_state=42, stratify=train_df["sentiments"]
)
print("train_split:", train_split.shape, "| valid_split:", valid_split.shape)


train_split: (5920, 3) | valid_split: (1481, 3)
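
The split sizes and the class counts that stratification should produce can be worked out from the totals printed earlier. This is a sketch of the arithmetic only. It assumes scikit-learn rounds the held-out share up, which matches the shapes shown above; the per-class figures are proportional estimates, not the library's exact allocation.

```python
import math

n_total = 7401
class_counts = {1: 6319, 0: 1082}   # from the value_counts output

# A float test_size is rounded up; the training split takes the rest.
n_valid = math.ceil(0.2 * n_total)
n_train = n_total - n_valid

# Under stratification each class keeps roughly its overall share.
expected_valid = {
    label: round(n_valid * count / n_total)
    for label, count in class_counts.items()
}

print("train:", n_train, "| valid:", n_valid)
print("expected valid counts per class:", expected_valid)
print(f"positive share overall: {class_counts[1] / n_total:.4f}")
```

The estimated counts agree with the per-class `support` column reported for the validation set in the baseline section.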



## Baseline 1 — TF-IDF + Logistic Regression

A strong classical baseline: unigram and bigram TF-IDF features fed to an L2-regularized logistic regression.


In [6]:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
import joblib, os

os.makedirs("outputs", exist_ok=True)

tfidf_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), max_features=50000, min_df=2)),
    ("clf", LogisticRegression(max_iter=2000, solver="saga", random_state=55)),
])

tfidf_clf.fit(train_split["text"], train_split["sentiments"])
valid_pred = tfidf_clf.predict(valid_split["text"])
acc = accuracy_score(valid_split["sentiments"], valid_pred)
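# f1_score defaults to the binary F1 of the positive class (label 1);
# the per-class and macro rows of the report below cover the minority class.
f1  = f1_score(valid_split["sentiments"], valid_pred)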
print(f"[TFIDF+LR] Valid Acc={acc:.4f} | F1={f1:.4f}")
print(classification_report(valid_split["sentiments"], valid_pred, digits=4))

joblib.dump(tfidf_clf, "outputs/tfidf_logreg.joblib")
pd.DataFrame({"sentiments": tfidf_clf.predict(test_df["text"])}).to_csv("outputs/submission_tfidf.csv", index=False)
print("Saved:", "outputs/submission_tfidf.csv")


[TFIDF+LR] Valid Acc=0.8893 | F1=0.9391
              precision    recall  f1-score   support

           0     1.0000    0.2442    0.3926       217
           1     0.8852    1.0000    0.9391      1264

    accuracy                         0.8893      1481
   macro avg     0.9426    0.6221    0.6658      1481
weighted avg     0.9020    0.8893    0.8590      1481

Saved: outputs/submission_tfidf.csv
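
The headline F1 of 0.9391 describes the positive class only, while recall on class 0 is 0.2442. The sketch below rebuilds the confusion matrix from the report and recomputes the metrics, including the accuracy of always predicting label 1. The cell counts are inferred from the rounded precision, recall, and support values above, not read from the model.

```python
# Inferred from the report: class 0 has support 217, recall 0.2442 and
# precision 1.0; class 1 has support 1264 and recall 1.0.
tn = round(0.2442 * 217)   # negatives predicted as negative
fp = 217 - tn              # negatives predicted as positive
tp, fn = 1264, 0           # every positive was predicted as positive

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n = tp + tn + fp + fn
f1_pos = f1(tp / (tp + fp), tp / (tp + fn))
f1_neg = f1(tn / (tn + fn), tn / (tn + fp))
macro_f1 = (f1_pos + f1_neg) / 2
accuracy = (tp + tn) / n
majority_acc = (tp + fn) / n   # accuracy of always predicting label 1

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.4f} | majority baseline={majority_acc:.4f}")
print(f"f1_pos={f1_pos:.4f} | f1_neg={f1_neg:.4f} | macro_f1={macro_f1:.4f}")
```

The recomputed values match the report (accuracy 0.8893, macro F1 0.6658), and the majority baseline alone reaches 0.8535. For this imbalanced dataset, `f1_score(..., average="macro")` may be a more informative headline metric, and `class_weight="balanced"` in `LogisticRegression` is one option to try for improving class 0 recall.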
