<a href="https://colab.research.google.com/github/smkim0508/COS484-Notes/blob/main/A1P2_Classification_(COS484_S2026).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook for Programming Question 2
Welcome to the programming portion of the assignment! Each assignment throughout the semester will have a written portion and a programming portion. We will be using [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true), so if you have never used it before, take a quick look through this introduction: [Working with Google Colab](https://docs.google.com/document/d/1LlnXoOblXwW3YX-0yG_5seTXJsb3kRdMMRYqs8Qqum4/edit?usp=sharing).

We'll also be programming in Python and assume basic familiarity with it. Python has strong community support, and we'll use several packages for machine learning (ML) and natural language processing (NLP) tasks.

### Learning Objectives
In this problem we will implement logistic regression and test it on a sentiment analysis dataset.

### Data Loading and Feature Extraction

##### You will need to implement a method that processes raw text into feature vectors by mapping vocabulary terms to unique indices. Your implementation needs to support Unigram extraction, where features represent individual word counts, as well as Bigram extraction, where features represent consecutive word pair counts. Make sure your setup correctly handles feature indexing so that the same mapping is applied to both training and development data.
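As a rough illustration of the indexing requirement, the sketch below builds a feature-to-index map from training tokens only and reuses it on unseen text. The function names `extract_features`, `build_index`, and `featurize`, the `mode` argument, and the toy sentences are hypothetical and not part of the assignment; treat this as one possible layout rather than the required one.

```python
from collections import Counter

def extract_features(tokens, mode="unigram"):
    """Return the list of feature strings for one tokenized example."""
    if mode == "unigram":
        return ["UNI_" + w for w in tokens]
    if mode == "bigram":
        return ["BI_" + a + "_" + b for a, b in zip(tokens, tokens[1:])]
    raise ValueError(f"unknown mode: {mode}")

def build_index(train_examples, mode="unigram"):
    """Assign each feature seen in training a unique, stable index."""
    index = {}
    for tokens in train_examples:
        for feat in extract_features(tokens, mode):
            index.setdefault(feat, len(index))
    return index

def featurize(tokens, index, mode="unigram"):
    """Map one example to {feature index: count}; unseen features are dropped."""
    counts = Counter(extract_features(tokens, mode))
    return {index[f]: c for f, c in counts.items() if f in index}

# Toy data, made up for illustration.
train_examples = [["a", "good", "good", "movie"], ["a", "bad", "movie"]]
uni_index = build_index(train_examples, mode="unigram")
bi_index = build_index(train_examples, mode="bigram")

dev_tokens = ["a", "good", "film"]  # "film" never appears in training
dev_unigrams = featurize(dev_tokens, uni_index, mode="unigram")
dev_bigrams = featurize(dev_tokens, bi_index, mode="bigram")
```

Because the index is built once from training data and only looked up afterwards, training and development examples share the same column for each feature, and features first seen at development time are ignored.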

In [7]:
# first need to build vocabulary
import os

# download data into data/... dir
if not os.path.exists('data'):
    os.makedirs('data')

# training
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
# dev
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt

# helper to load data and parse it
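def load_data(path):
    data = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                continue  # skip blank lines instead of failing on parts[0]
            label = int(parts[0])  # first field is the integer class label
            tokens = parts[1:]     # remaining fields are whitespace-separated tokens
            data.append((label, tokens))
    return data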

# load in train and dev data
train_data = load_data("data/train.txt")
dev_data   = load_data("data/dev.txt")

# helper to build vocab from train data
def build_vocab(train_data):
    vocab = {}

    def add_feature(feat):
        if feat not in vocab:
            vocab[feat] = len(vocab) # set value as unique idx that auto-increments

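    # no explicit bias feature is added here; nn.Linear in the model supplies a bias term

    for label, tokens in train_data:
        # NOTE: the "UNI_" / "BI_" prefixes keep unigram and bigram features distinct
        # unigram features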
        for w in tokens:
            add_feature("UNI_" + w)

        # bigram features
        for i in range(len(tokens)-1):
            add_feature("BI_" + tokens[i] + "_" + tokens[i+1])

    return vocab

vocab = build_vocab(train_data)
# report the size rather than dumping the full feature-to-index dict
print(f"Vocabulary size: {len(vocab)}")

--2026-02-15 20:40:52--  https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
Resolving princeton-nlp.github.io (princeton-nlp.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to princeton-nlp.github.io (princeton-nlp.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 738844 (722K) [text/plain]
Saving to: ‘data/train.txt.6’


2026-02-15 20:40:52 (8.49 MB/s) - ‘data/train.txt.6’ saved [738844/738844]

--2026-02-15 20:40:52--  https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt
Resolving princeton-nlp.github.io (princeton-nlp.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to princeton-nlp.github.io (princeton-nlp.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94400 (92K) [text/plain]
Saving to: ‘data/dev.txt.6’


2026-02-15 20:40:52 (6.05 MB/s) - ‘data/dev.txt.6’ saved [94400/94400]



### Model Implementation

You should implement a class that supports the logistic regression logic. This includes:
*   **Initialization**: A function to initialize the model parameters (weights and biases) as well as hyperparameters (including the learning rate, regularization parameter, and number of epochs).
*   **Optimization**: A training method that iterates through the dataset, calculates the gradient of the loss function for each example or batch, and updates the parameters using your chosen optimization function (we suggest using Stochastic Gradient Descent or Mini-batch SGD for efficiency).
*   **Inference**: A function that outputs the model's prediction for a single example.
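The three pieces listed above can be sketched without any ML library. The following is a minimal sketch, assuming sparse `{index: value}` feature dicts, per-example SGD, and an objective of log loss plus `alpha * ||w||^2` applied at every step with the bias left unregularized. The class name `SparseLogReg`, its arguments, and the toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class SparseLogReg:
    """Binary logistic regression over sparse {index: value} feature dicts."""

    def __init__(self, num_features, lr=0.1, alpha=0.0, epochs=5, seed=0):
        # Initialization: parameters start at zero; hyperparameters are stored.
        self.w = [0.0] * num_features
        self.b = 0.0
        self.lr, self.alpha, self.epochs = lr, alpha, epochs
        self.rng = random.Random(seed)

    def predict_proba(self, x):
        z = self.b + sum(self.w[i] * v for i, v in x.items())
        return sigmoid(z)

    def predict(self, x):
        # Inference: threshold the probability at 0.5 for a single example.
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def fit(self, X, y):
        # Optimization: one SGD update per example, in shuffled order.
        order = list(range(len(X)))
        for _ in range(self.epochs):
            self.rng.shuffle(order)
            for n in order:
                err = self.predict_proba(X[n]) - y[n]  # d(log loss)/dz
                if self.alpha > 0.0:
                    # Gradient of alpha * ||w||^2 is 2 * alpha * w.
                    shrink = 1.0 - 2.0 * self.lr * self.alpha
                    self.w = [wi * shrink for wi in self.w]
                for i, v in X[n].items():
                    self.w[i] -= self.lr * err * v
                self.b -= self.lr * err

# Toy, linearly separable data (made up): feature 0 signals the positive class.
X = [{0: 1}, {1: 1}, {0: 2}, {1: 1, 2: 1}]
y = [1, 0, 1, 0]
model = SparseLogReg(num_features=3, lr=0.5, epochs=50)
model.fit(X, y)
```

Updating only the active indices keeps the data-dependent part of each step proportional to the number of nonzero features. The dense shrink step used here for regularization is the simple choice, not the most efficient one.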

In [8]:
import torch
import torch.nn as nn

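def sparse_to_dense(X_sparse, num_features):
    # Expand a list of {feature_idx: count} dicts into a dense float tensor of
    # shape (num_examples, num_features); memory grows with both dimensions.
    X = torch.zeros(len(X_sparse), num_features)
    for i, x in enumerate(X_sparse):
        for idx, val in x.items():
            X[i, idx] = val
    return X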

class LogisticRegression(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

### Training Loop

You should implement the logic for your model to train on the given training examples. Experiment with different hyperparameters to find the ones that optimize performance.
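Mini-batch SGD, suggested in the model description, needs a way to visit the training set in shuffled chunks. The helper below is a small sketch of that step only; the name `minibatch_indices` and its arguments are hypothetical, and it assumes examples are addressed by integer position.

```python
import random

def minibatch_indices(num_examples, batch_size, rng):
    """Yield lists of example indices covering one shuffled pass over the data."""
    order = list(range(num_examples))
    rng.shuffle(order)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
batches = list(minibatch_indices(num_examples=10, batch_size=4, rng=rng))
# Three batches of sizes 4, 4, and 2; together they visit every example once.
```

Inside an epoch loop, each index list would select the rows for one update (indexing a PyTorch tensor with a list of integers, as in `X[idx]`, returns those rows), followed by the usual zero-grad, forward, backward, and step sequence. This gives many parameter updates per epoch instead of one.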

In [9]:
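def train_model(model, X, y, lr=0.1, epochs=5, reg=0.0):
    # weight_decay adds reg * theta to every parameter's gradient (bias included)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)
    criterion = nn.BCELoss()  # mean binary cross-entropy on sigmoid outputs

    for ep in range(epochs):
        # NOTE: the whole training set is one batch, so each epoch is a single
        # gradient step (full-batch gradient descent rather than per-example SGD)
        optimizer.zero_grad()
        outputs = model(X).squeeze(1)  # (N, 1) -> (N,), also safe when N == 1
        loss = criterion(outputs, y.float())
        loss.backward()
        optimizer.step()

        print(f"Epoch {ep+1}, Loss: {loss.item():.4f}")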

### Functions for evaluating model accuracy

In [10]:
def accuracy(model, X, y):
    # fraction of examples whose thresholded prediction matches the gold label
    with torch.no_grad():
        preds = model(X).squeeze(1)  # (N, 1) -> (N,)
        preds = (preds >= 0.5).long()
        return (preds == y).float().mean().item()

### Download and load the training and development data

You can download the training and development sets for this problem from the links below:
*   Training data: https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
*   Development data: https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt

In [None]:
# already defined above in data processing, but to reiterate:

import os

# download data into data/... dir
if not os.path.exists('data'):
    os.makedirs('data')

# training
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
# dev
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt

# load the parsed (label, tokens) pairs for train and dev
train_data = load_data('data/train.txt')
dev_data = load_data('data/dev.txt')

# build the feature vocab from the training tokens only (labels are not used)
vocab = build_vocab(train_data)

In [11]:
# use helper to turn data into feature vectors

def vectorize(data, vocab):
    X = []
    y = []

    for label, tokens in data:
        features = {}

        # unigram counts
        for w in tokens:
            feat = "UNI_" + w
            if feat in vocab:
                idx = vocab[feat]
                features[idx] = features.get(idx, 0) + 1

        # bigram counts
        for i in range(len(tokens)-1):
            feat = "BI_" + tokens[i] + "_" + tokens[i+1]
            if feat in vocab:
                idx = vocab[feat]
                features[idx] = features.get(idx, 0) + 1

        X.append(features)
        y.append(label)

    return X, y

# Experiments

In [12]:
# load in feature vectors
X_train, y_train = vectorize(train_data, vocab)
X_dev, y_dev = vectorize(dev_data, vocab)

# Convert sparse to tensors
X_train_tensor = sparse_to_dense(X_train, len(vocab))
y_train_tensor = torch.tensor(y_train)

X_dev_tensor = sparse_to_dense(X_dev, len(vocab))
y_dev_tensor = torch.tensor(y_dev)

# Initialize model
model = LogisticRegression(len(vocab))

# Train
train_model(model, X_train_tensor, y_train_tensor, lr=0.1, epochs=10, reg=1e-4)

# Evaluate
print("Train acc:", accuracy(model, X_train_tensor, y_train_tensor))
print("Dev acc:", accuracy(model, X_dev_tensor, y_dev_tensor))

Epoch 1, Loss: 0.6930
Epoch 2, Loss: 0.6918
Epoch 3, Loss: 0.6908
Epoch 4, Loss: 0.6898
Epoch 5, Loss: 0.6890
Epoch 6, Loss: 0.6882
Epoch 7, Loss: 0.6875
Epoch 8, Loss: 0.6868
Epoch 9, Loss: 0.6862
Epoch 10, Loss: 0.6856
Train acc: 0.5406069159507751
Dev acc: 0.5263761281967163


### Unigram vs Bigram (No regularization)
Code for sub-part (a)

**(a) In this part, we want to train the logistic regression model without regularization. Train your model separately with (i) unigram features and (ii) bigram features (two different models). Report both training and development accuracy on the dataset. How do the results of the unigram and bigram models compare?**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

### Logistic regression with regularization

Code for sub-part (b)

**(b) Next, we would like to experiment with $\ell_2$ regularization $R(\theta) = \alpha\|\theta\|^2$. Plot the accuracy on train and development sets as a function of $\alpha \in \{0, 10^{-2}, 10^{-1}, 1, 10\}$. You only need to experiment with unigram features for this part. Explain what you observe. Does this match what you would expect from regularization?**
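
As a hedged aside for implementing this penalty: assuming the training objective is the data loss $L(\theta)$ plus $R(\theta)$, and writing $\eta$ for the learning rate (a symbol introduced here, not in the assignment), the gradient of the penalty and the resulting SGD update are

$$
\nabla_\theta R(\theta) = 2\alpha\theta,
\qquad
\theta \leftarrow \theta - \eta\left(\nabla_\theta L(\theta) + 2\alpha\theta\right)
= (1 - 2\eta\alpha)\,\theta - \eta\,\nabla_\theta L(\theta).
$$

PyTorch's `torch.optim.SGD` adds `weight_decay` times the parameter to its gradient, so under this assumption `weight_decay = 2α` would correspond to the penalty above. Note that it applies to every parameter handed to the optimizer, including the bias.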

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

**(c) Based on your model’s performance in the previous experiment, propose one change you would consider making to either the model or the feature extraction pipeline to further improve development set performance. Briefly describe the modification, explain why you expect it will improve development set accuracy, and discuss any potential limitations.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

# LLM Prompts

If you used an AI tool to complete any part of this assignment, please paste all prompts you used to produce your final code/responses in the box below and answer the following reflection question.

Prompts Used:
*   
*   



**Reflection: What parts of the AI generated output required modification or improvement? Describe the feedback you gave the tool to produce your final output or any changes you had to make on your own.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)