<a href="https://colab.research.google.com/github/smkim0508/COS484-Notes/blob/main/A1P2_Classification_(COS484_S2026).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook for Programming Question 2
Welcome to the programming portion of the assignment! Each assignment throughout the semester will have a written portion and a programming portion. We will be using [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true), so if you have never used it before, take a quick look through this introduction: [Working with Google Colab](https://docs.google.com/document/d/1LlnXoOblXwW3YX-0yG_5seTXJsb3kRdMMRYqs8Qqum4/edit?usp=sharing).

We'll also be programming in Python, and we assume basic familiarity with it. Python has fantastic community support, and we'll be using numerous packages for machine learning (ML) and natural language processing (NLP) tasks.

### Learning Objectives
In this problem we will implement logistic regression and test it on a sentiment analysis dataset.

### Data Loading and Feature Extraction

You will need to implement a method that processes raw text into feature vectors by mapping vocabulary terms to unique indices. Your implementation needs to support unigram extraction, where features represent individual word counts, as well as bigram extraction, where features represent consecutive word pair counts. Make sure your setup correctly handles feature indexing so that the same mapping is applied to both training and development data.

In [1]:
# download and load the data first (the vocabulary is built in the next cell)
import os

# download data into data/... dir
if not os.path.exists('data'):
    os.makedirs('data')

# training
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
# dev
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt

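# helper to load data and parse it; each line is parsed as "<label> <token> <token> ..."
def load_data(path):
  data = []
  with open(path, "r", encoding="utf-8") as f:
    for line in f:
      parts = line.strip().split()
      if not parts:
        continue  # skip blank lines instead of failing on parts[0]
      label = int(parts[0])
      tokens = parts[1:]
      data.append((label, tokens))
  return data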

# load in train and dev data
train_data = load_data("data/train.txt")
dev_data   = load_data("data/dev.txt")


In [2]:
# helper to build UNIGRAM vocab from train data
def build_unigram_vocab(train_data):
  vocab = {}

  def add_feature(feat):
    if feat not in vocab:
      vocab[feat] = len(vocab) # set value as unique idx that auto-increments

  # can add bias here

  for label, tokens in train_data:
    # unigram features
    for w in tokens:
      add_feature("UNI_" + w)

  return vocab

# helper to build BIGRAM vocab from train data
def build_bigram_vocab(train_data):
  vocab = {}

  def add_feature(feat):
    if feat not in vocab:
      vocab[feat] = len(vocab) # set value as unique idx that auto-increments

  # can add bias here

  for label, tokens in train_data:
    # bigram features
    for i in range(len(tokens)-1):
      add_feature("BI_" + tokens[i] + "_" + tokens[i+1])

  return vocab

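vocab = build_bigram_vocab(train_data)
# sanity check: print the vocabulary size rather than the whole (very large) dict
print(len(vocab), "bigram features")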



### Model Implementation

You should implement a class that supports the logistic regression logic. This includes:
*   **Initialization**: A function to initialize the model parameters (weights and biases) as well as hyperparameters (including the learning rate, regularization parameter, and number of epochs).
*   **Optimization**: A training method that iterates through the dataset, calculates the gradient of the loss function for each example or batch, and updates the parameters using your chosen optimization function (we suggest using Stochastic Gradient Descent or Mini-batch SGD for efficiency).
*   **Inference**: A function that outputs the model's prediction for a single example.
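
For reference, the three pieces listed above can be sketched without any framework. This is a minimal illustration and not the implementation used in this notebook, which relies on PyTorch; the class name `TinyLogReg`, its method names, and the sparse `{index: count}` input format are assumptions made for the example.

```python
import math
import random

class TinyLogReg:
    """Hypothetical pure-Python logistic regression over sparse {index: count} features."""

    def __init__(self, num_features, lr=0.1, alpha=0.0, epochs=10):
        # Initialization: parameters (weights, bias) and hyperparameters.
        self.w = [0.0] * num_features
        self.b = 0.0
        self.lr = lr
        self.alpha = alpha
        self.epochs = epochs

    def predict_proba(self, x):
        z = self.b + sum(self.w[i] * v for i, v in x.items())
        # Numerically stable sigmoid.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        ez = math.exp(z)
        return ez / (1.0 + ez)

    def predict(self, x):
        # Inference: prediction for a single example.
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def fit(self, X, y, seed=0):
        # Optimization: plain SGD, one example per update.
        rng = random.Random(seed)
        order = list(range(len(X)))
        for _ in range(self.epochs):
            rng.shuffle(order)
            for n in order:
                err = self.predict_proba(X[n]) - y[n]  # d(log loss)/dz
                if self.alpha:
                    # Gradient of alpha * ||w||^2 is 2 * alpha * w (bias left unregularized here).
                    self.w = [wi - self.lr * 2 * self.alpha * wi for wi in self.w]
                for i, v in X[n].items():
                    self.w[i] -= self.lr * err * v
                self.b -= self.lr * err
        return self

# Toy, made-up data: feature 0 signals class 1, feature 1 signals class 0.
X_toy = [{0: 1}, {0: 2}, {1: 1}, {1: 3}]
y_toy = [1, 1, 0, 0]
toy_model = TinyLogReg(num_features=2, lr=0.5, epochs=50).fit(X_toy, y_toy)
print([toy_model.predict(x) for x in X_toy])
```

Working on sparse dictionaries keeps each update proportional to the number of active features in an example, whereas a dense formulation touches every weight on every step.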

In [3]:
import torch
import torch.nn as nn

def sparse_to_dense(X_sparse, num_features):
  X = torch.zeros(len(X_sparse), num_features)
  for i, x in enumerate(X_sparse):
    for idx, val in x.items():
      X[i, idx] = val
  return X

class LogisticRegression(nn.Module):
  def __init__(self, num_features):
    super().__init__()
    self.linear = nn.Linear(num_features, 1)

  def forward(self, x):
    return torch.sigmoid(self.linear(x))

### Training Loop

You should implement the logic for your model to train on the given training examples. Experiment with different hyperparameters to find the ones that optimize performance.

In [4]:
def train_model(model, X, y, lr=0.1, epochs=10, alpha=0.0, batch_size=5):
  """
  Train `model` with mini-batch SGD and binary cross-entropy loss.

  L2 regularization is R(theta) = alpha * ||theta||^2, whose gradient is 2 * alpha * theta.
  """
  # weight_decay adds weight_decay * theta to each gradient, so 2*alpha reproduces R(theta).
  # NOTE: model.parameters() includes the bias, so the bias is regularized as well.
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=2*alpha)
  criterion = nn.BCELoss()

  N = X.shape[0]

  for ep in range(epochs):

      # shuffle every epoch for mini-batch SGD
      perm = torch.randperm(N)
      X_shuffled = X[perm]
      y_shuffled = y[perm]

      total_loss = 0.0

      # iterate over mini-batches
      for start in range(0, N, batch_size):
          end = start + batch_size
          xb = X_shuffled[start:end]
          yb = y_shuffled[start:end]

          optimizer.zero_grad()

          # squeeze only the last dim so that a final batch of size 1 keeps shape [1]
          outputs = model(xb).squeeze(-1)
          loss = criterion(outputs, yb.float())

          # back prop; we're only really updating one layer
          loss.backward()
          optimizer.step()
          # loss accumulation
          total_loss += loss.item() * len(xb)

      avg_loss = total_loss / N
      # for debug/monitoring the loss over training
      print(f"Epoch {ep+1}, Loss: {avg_loss:.4f}")
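
The `weight_decay=2*alpha` setting comes from differentiating the penalty: for $R(\theta) = \alpha\|\theta\|^2$, the gradient is $\nabla_\theta R = 2\alpha\theta$, and `torch.optim.SGD` adds `weight_decay * theta` to each parameter's gradient before the step. The sketch below checks that identity numerically in plain Python with a central finite difference; the helper names and the toy values of `theta` and `alpha` are illustrative only.

```python
def l2_penalty(theta, alpha):
    """R(theta) = alpha * ||theta||^2."""
    return alpha * sum(t * t for t in theta)

def l2_penalty_grad(theta, alpha):
    """Analytic gradient: 2 * alpha * theta."""
    return [2 * alpha * t for t in theta]

def finite_difference_grad(f, theta, eps=1e-6):
    """Central-difference estimate of the gradient of f at theta."""
    grads = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

theta = [0.5, -1.25, 2.0]  # made-up parameter values
alpha = 0.1

analytic = l2_penalty_grad(theta, alpha)
numeric = finite_difference_grad(lambda t: l2_penalty(t, alpha), theta)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap < 1e-6)
```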

### Functions for evaluating model accuracy

In [5]:
def accuracy(model, X, y):
  with torch.no_grad():
    preds = model(X).squeeze()
    # if prediction val is >= 0.5, guess class 1, otherwise 0
    preds = (preds >= 0.5).long()
    return (preds == y).float().mean().item()

### Download and load the training and development data

You can download the training and development sets for this problem from the links below:
*   Training data: https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
*   Development data: https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt

In [6]:
# already defined above in data processing, but to reiterate:

import os

# download data into data/... dir
if not os.path.exists('data'):
    os.makedirs('data')

# training
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/train.txt
# dev
!wget -P data/ https://princeton-nlp.github.io/cos484/assignments/a1/dev.txt

# load the training and development data
train_data = load_data('data/train.txt')
dev_data = load_data('data/dev.txt')

# use helper to build vocab w/ tokenized words and labels
bigram_vocab = build_bigram_vocab(train_data)
unigram_vocab = build_unigram_vocab(train_data)


In [7]:
# use helper to turn data into feature vectors

def bigram_vectorize(data, vocab):
  X = []
  y = []

  for label, tokens in data:
    features = {}

    # bigram counts
    for i in range(len(tokens)-1):
      feat = "BI_" + tokens[i] + "_" + tokens[i+1]
      if feat in vocab:
        idx = vocab[feat]
        features[idx] = features.get(idx, 0) + 1

    X.append(features)
    y.append(label)

  return X, y

def unigram_vectorize(data, vocab):
  X = []
  y = []

  for label, tokens in data:
    features = {}

    # unigram counts
    for w in tokens:
      feat = "UNI_" + w
      if feat in vocab:
        idx = vocab[feat]
        features[idx] = features.get(idx, 0) + 1

    X.append(features)
    y.append(label)

  return X, y
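
To make the shared-mapping requirement concrete, the toy sketch below builds an index from a made-up training sentence and reuses it on a made-up dev sentence. It mirrors the `UNI_`/`BI_` prefix scheme used above, but the helper names (`ngram_features`, `build_vocab`, `vectorize`) and both sentences are invented for illustration. Features that never occurred in training have no index and are simply dropped.

```python
def ngram_features(tokens, n):
    """Return prefixed n-gram feature strings (n=1 for unigrams, n=2 for bigrams)."""
    prefix = "UNI_" if n == 1 else "BI_"
    return [prefix + "_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocab(sentences, n):
    """Assign each distinct training feature the next free index."""
    vocab = {}
    for tokens in sentences:
        for feat in ngram_features(tokens, n):
            vocab.setdefault(feat, len(vocab))
    return vocab

def vectorize(tokens, vocab, n):
    """Count features as {index: count}, skipping features missing from vocab."""
    counts = {}
    for feat in ngram_features(tokens, n):
        if feat in vocab:
            counts[vocab[feat]] = counts.get(vocab[feat], 0) + 1
    return counts

train_sentences = [["a", "good", "good", "film"]]  # made-up training example
dev_sentence = ["a", "good", "movie"]              # made-up dev example

uni_vocab = build_vocab(train_sentences, n=1)
bi_vocab = build_vocab(train_sentences, n=2)

print(uni_vocab)                                      # {'UNI_a': 0, 'UNI_good': 1, 'UNI_film': 2}
print(vectorize(train_sentences[0], uni_vocab, n=1))  # {0: 1, 1: 2, 2: 1}
print(vectorize(dev_sentence, uni_vocab, n=1))        # {0: 1, 1: 1}  ("movie" is unseen)
print(vectorize(dev_sentence, bi_vocab, n=2))         # {0: 1}        ("good movie" is unseen)
```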

# Experiments

### Unigram vs Bigram (No regularization)
Code for sub-part (a)

In [8]:
# define hyperparams, shared for both models
lr = 0.1
epochs = 10
reg = 0.0
batch_size = 5 # minibatch size

# NOTE: unigram features
# load in feature vectors w/ unigram model
X_train_uni, y_train_uni = unigram_vectorize(train_data, unigram_vocab)
X_dev_uni, y_dev_uni = unigram_vectorize(dev_data, unigram_vocab)

# convert sparse features to dense tensors (dimension = unigram vocab size, not the global bigram `vocab`)
X_train_tensor_uni = sparse_to_dense(X_train_uni, len(unigram_vocab))
y_train_tensor_uni = torch.tensor(y_train_uni)

X_dev_tensor_uni = sparse_to_dense(X_dev_uni, len(unigram_vocab))
y_dev_tensor_uni = torch.tensor(y_dev_uni)

# initialize unigram model
unigram_model = LogisticRegression(len(unigram_vocab))

# train unigram model w/ hyperparameters
train_model(unigram_model, X_train_tensor_uni, y_train_tensor_uni, lr=lr, epochs=epochs, alpha=reg, batch_size=batch_size)

# NOTE: bigram features
# load in feature vectors w/ bigram model
X_train_bi, y_train_bi = bigram_vectorize(train_data, bigram_vocab)
X_dev_bi, y_dev_bi = bigram_vectorize(dev_data, bigram_vocab)

# convert sparse features to dense tensors (dimension = bigram vocab size)
X_train_tensor_bi = sparse_to_dense(X_train_bi, len(bigram_vocab))
y_train_tensor_bi = torch.tensor(y_train_bi)

X_dev_tensor_bi = sparse_to_dense(X_dev_bi, len(bigram_vocab))
y_dev_tensor_bi = torch.tensor(y_dev_bi)

# initialize bigram model
bigram_model = LogisticRegression(len(bigram_vocab))

# train bigram model w/ hyperparameters
train_model(bigram_model, X_train_tensor_bi, y_train_tensor_bi, lr=lr, epochs=epochs, alpha=reg, batch_size=batch_size)

# evaluate both bigram and unigram models
print("Unigram model train acc:", accuracy(unigram_model, X_train_tensor_uni, y_train_tensor_uni))
print("Unigram model dev acc:", accuracy(unigram_model, X_dev_tensor_uni, y_dev_tensor_uni))

print("Bigram model train acc:", accuracy(bigram_model, X_train_tensor_bi, y_train_tensor_bi))
print("Bigram model dev acc:", accuracy(bigram_model, X_dev_tensor_bi, y_dev_tensor_bi))

Epoch 1, Loss: 0.6288
Epoch 2, Loss: 0.5333
Epoch 3, Loss: 0.4798
Epoch 4, Loss: 0.4435
Epoch 5, Loss: 0.4156
Epoch 6, Loss: 0.3906
Epoch 7, Loss: 0.3718
Epoch 8, Loss: 0.3542
Epoch 9, Loss: 0.3400
Epoch 10, Loss: 0.3260
Epoch 1, Loss: 0.6657
Epoch 2, Loss: 0.5595
Epoch 3, Loss: 0.4894
Epoch 4, Loss: 0.4368
Epoch 5, Loss: 0.3952
Epoch 6, Loss: 0.3612
Epoch 7, Loss: 0.3330
Epoch 8, Loss: 0.3093
Epoch 9, Loss: 0.2889
Epoch 10, Loss: 0.2708
Unigram model train acc: 0.9153178930282593
Unigram model dev acc: 0.7660550475120544
Bigram model train acc: 0.9880057573318481
Bigram model dev acc: 0.7259174585342407


**(a) In this part, we want to train the logistic regression model without regularization. Train your model separately with (i) unigram features and (ii) bigram features (two different models). Report both training and development accuracy on the dataset. How do the results of the unigram and bigram models compare?**

*   (i) Unigram features: training accuracy ~0.915, dev accuracy ~0.766
*   (ii) Bigram features: training accuracy ~0.988, dev accuracy ~0.726

Comparing the two models, the bigram features give higher training accuracy (~0.988 vs. ~0.915) but lower dev accuracy (~0.726 vs. ~0.766). The larger gap between training and dev accuracy suggests that the bigram model is more prone to overfitting; a plausible reason is that word pairs form a larger and sparser feature space, so many bigrams seen in training are rare or absent in the dev set. The unigram model considers tokens independently and ignores the relationship between neighboring tokens, so it fits the training data less closely but generalizes better.
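
One way to probe this explanation is to count how many distinct n-grams the training set contains and what fraction of dev n-gram occurrences were never seen in training. The sketch below is a hypothetical helper (`coverage_stats` is not part of the assignment code) that accepts data in the same `(label, tokens)` format returned by `load_data`. The numbers it prints come from the invented toy data only, not from the assignment dataset.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage_stats(train_data, dev_data, n):
    """Return (distinct training n-grams, fraction of dev n-gram occurrences unseen in training)."""
    seen = set()
    for _, tokens in train_data:
        seen.update(ngrams(tokens, n))
    total = unseen = 0
    for _, tokens in dev_data:
        for gram in ngrams(tokens, n):
            total += 1
            unseen += gram not in seen
    return len(seen), (unseen / total if total else 0.0)

# Made-up examples in the (label, tokens) format used by load_data.
toy_train = [(1, ["a", "great", "film"]), (0, ["a", "dull", "film"])]
toy_dev = [(1, ["a", "great", "dull", "film"])]

for n in (1, 2):
    size, rate = coverage_stats(toy_train, toy_dev, n)
    print(f"n={n}: {size} training features, {rate:.2f} of dev n-grams unseen")
```

On the real data, calling `coverage_stats(train_data, dev_data, 1)` and `coverage_stats(train_data, dev_data, 2)` would show whether the bigram feature space is in fact much sparser on the dev set.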

### Logistic regression with regularization

Code for sub-part (b)

**(b) Next, we would like to experiment with $\ell_2$ regularization $R(\theta) = \alpha\|\theta\|^2$. Plot the accuracy on the train and development sets as a function of $\alpha \in \{0, 10^{-2}, 10^{-1}, 1, 10\}$. You only need to experiment with unigram features for this part. Explain what you observe. Does this match what you would expect from regularization?**
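
The sweep over $\alpha$ can be organized as a small loop that trains a fresh model per value and records both accuracies. The sketch below is only a scaffold: `sweep_alpha` and `dummy_fit_and_score` are hypothetical names, and the dummy scorer returns placeholder zeros so that the block runs on its own. It produces no real results.

```python
ALPHAS = [0, 1e-2, 1e-1, 1, 10]

def sweep_alpha(fit_and_score, alphas):
    """Call fit_and_score(alpha) -> (train_acc, dev_acc) for each alpha and collect the rows."""
    rows = []
    for alpha in alphas:
        train_acc, dev_acc = fit_and_score(alpha)
        rows.append({"alpha": alpha, "train_acc": train_acc, "dev_acc": dev_acc})
    return rows

def dummy_fit_and_score(alpha):
    """Placeholder scorer: returns zeros, NOT real accuracies."""
    return 0.0, 0.0

# In this notebook, fit_and_score could be written roughly as follows (not run here):
#   def fit_and_score(alpha):
#       model = LogisticRegression(X_train_tensor_uni.shape[1])  # fresh model per alpha
#       train_model(model, X_train_tensor_uni, y_train_tensor_uni,
#                   lr=lr, epochs=epochs, alpha=alpha, batch_size=batch_size)
#       return (accuracy(model, X_train_tensor_uni, y_train_tensor_uni),
#               accuracy(model, X_dev_tensor_uni, y_dev_tensor_uni))

rows = sweep_alpha(dummy_fit_and_score, ALPHAS)
for row in rows:
    print(row)
```

Creating a new model inside the scorer matters, since reusing one model across values of $\alpha$ would carry over weights from the previous run. For the plot, $\alpha = 0$ cannot be placed on a logarithmic axis, so one simple option is to plot against the position in `ALPHAS` and use the $\alpha$ values as tick labels.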

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

**(c) Based on your model’s performance in the previous experiment, propose one change you would consider making to either the model or feature extraction pipeline to further improve development set performance. Briefly describe the modification, explain why you expect it will improve development set accuracy, and discuss any potential limitations.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

# LLM Prompts

If you used an AI tool to complete any part of this assignment, please paste all prompts you used to produce your final code/responses in the box below and answer the following reflection question.

Prompts Used:
*   
*   



**Reflection: What parts of the AI generated output required modification or improvement? Describe the feedback you gave the tool to produce your final output or any changes you had to make on your own.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)