
# Notebook for Programming Question 1
Welcome to the programming portion of the assignment! Each assignment throughout the semester will have a theory portion and a programming portion. We will be using [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true), so if you have never used it before, take a quick look through this introduction: [Working with Google Colab](https://docs.google.com/document/d/1LlnXoOblXwW3YX-0yG_5seTXJsb3kRdMMRYqs8Qqum4/edit?usp=sharing).

We'll also be programming in Python, and we assume basic familiarity with it. Python has strong community support, and we'll use several of its packages for machine learning (ML) and natural language processing (NLP) tasks.

### Learning Objectives
In this problem we will experiment with language models and implement smoothing. We will also see how the choice between unigram and bigram LMs, and the size of the training data, affect the results.

### Data preprocessing

In this section, you should write methods to load data and clean (tokenize) it. You will need to write two functions for tokenization. One function, **basicTokenize**, should simply split the text using whitespace. The other function, **nltkTokenize**, should implement NLTK tokenization. Write another function to count the top k most frequent words in a list. You may structure this code however you like, but we suggest constructing a Tokenizer class to encompass these functions.

In [7]:
import nltk
nltk.download('punkt_tab')

class Tokenizer():

  def __init__(self):
    pass

  def basicTokenize(self, text: str) -> list[str]:
    """
    splits text by whitespace
    """
    return text.split()

  def nltkTokenize(self, text: str) -> list[str]:
    """
    tokenizes text using the nltk library (word_tokenize)
    """
    return nltk.word_tokenize(text)

  def count_top_words(self, words: list[str], k: int) -> list[tuple[str, int]]:
    """
    counts word occurrences and returns the k most frequent words
    as (word, count) pairs, sorted by descending count
    """
    word_counts = {}
    for word in words:
      # dict.get returns 0 for words not seen yet
      word_counts[word] = word_counts.get(word, 0) + 1

    # sort (word, count) pairs by count, most frequent first
    sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_word_counts[:k]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [8]:
tokenizer = Tokenizer()
# test tokenization
text = "This is a test string. I like food."
basic_tokens = tokenizer.basicTokenize(text)
nltk_tokens = tokenizer.nltkTokenize(text)

print(f"Basic Tokenization: {basic_tokens}")
print(f"NLTK Tokenization: {nltk_tokens}")

Basic Tokenization: ['This', 'is', 'a', 'test', 'string.', 'I', 'like', 'food.']
NLTK Tokenization: ['This', 'is', 'a', 'test', 'string', '.', 'I', 'like', 'food', '.']
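
The counting helper is not exercised in the test cell above. As a minimal sketch, assuming the standard library's `collections.Counter` is acceptable for this assignment, the same top-k count can be written as a standalone function. The name `top_k_words` and the toy sentence are illustrative and not part of the `Tokenizer` class.

```python
from collections import Counter


def top_k_words(words: list[str], k: int) -> list[tuple[str, int]]:
    """Return the k most frequent words as (word, count) pairs, most frequent first."""
    return Counter(words).most_common(k)


# Toy input (hypothetical, not from the Brown corpus).
tokens = "the cat sat on the mat the end".split()
print(top_k_words(tokens, 2))
```

Returning `(word, count)` pairs rather than bare words keeps the counts available for the frequency plot in the experiments.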


### Language Modeling and Smoothing
In this section, you should write methods to train and test a bigram language model. These functions will need to include computing bigram counts, estimating bigram probabilities, and calculating perplexity on the test set. You should also implement a method that can later be called to modify the probabilities with add-alpha smoothing. We suggest encompassing these functions in a LanguageModel class to make experiments easier to run.
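
For reference, the quantities named above can be written as follows. This is a sketch in commonly used notation, which is an assumption here rather than something fixed by the assignment: $c(\cdot)$ is a count in the training corpus, $|V|$ is the vocabulary size, and $N$ is the number of predicted tokens in the evaluation text.

$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

$$P_{\alpha}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + \alpha}{c(w_{i-1}) + \alpha\,|V|}$$

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)$$

Setting $\alpha = 0$ in the second formula recovers the first.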

### Instantiate an LM and calculate perplexity
Write a wrapper method to train and evaluate a language model on a given train and dev corpus.
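
One possible shape for such a class and wrapper is sketched below. It is not the reference solution: the names `BigramLM` and `train_and_evaluate`, the `<s>`/`</s>` boundary markers, and the choice to leave out-of-vocabulary words unmapped (no `<unk>` token) are all assumptions made for illustration.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with optional add-alpha smoothing (illustrative sketch)."""

    def __init__(self, alpha: float = 0.0):
        self.alpha = alpha
        self.bigram_counts = Counter()   # counts of (previous, current) pairs
        self.context_counts = Counter()  # counts of the previous token
        self.vocab = set()               # tokens that can be predicted

    def train(self, sentences: list[list[str]]) -> None:
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            # "<s>" is never predicted, so it is left out of the vocabulary.
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigram_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def prob(self, prev: str, cur: str) -> float:
        num = self.bigram_counts[(prev, cur)] + self.alpha
        den = self.context_counts[prev] + self.alpha * len(self.vocab)
        return num / den if den > 0 else 0.0

    def perplexity(self, sentences: list[list[str]]) -> float:
        log_prob, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for prev, cur in zip(tokens, tokens[1:]):
                p = self.prob(prev, cur)
                if p == 0.0:
                    # A single zero-probability bigram makes perplexity infinite.
                    return float("inf")
                log_prob += math.log(p)
                n += 1
        if n == 0:
            raise ValueError("cannot compute perplexity of an empty corpus")
        return math.exp(-log_prob / n)


def train_and_evaluate(train_sents, dev_sents, alpha=0.0):
    """Train on train_sents and return (train perplexity, dev perplexity)."""
    lm = BigramLM(alpha=alpha)
    lm.train(train_sents)
    return lm.perplexity(train_sents), lm.perplexity(dev_sents)


# Toy corpora (hypothetical), each sentence already tokenized.
toy_train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
toy_dev = [["the", "cat", "ran"]]
print(train_and_evaluate(toy_train, toy_dev, alpha=0.0))
print(train_and_evaluate(toy_train, toy_dev, alpha=0.1))
```

Keeping `alpha` as a constructor argument means the same wrapper can be reused unchanged for the smoothing experiments.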

### Load and tokenize the training and validation data using your code from the Data preprocessing section

You can download training and validation datasets for this problem from the links below:
*   Training data: https://princeton-nlp.github.io/cos484/assignments/a1/brown-train.txt
*   Validation data: https://princeton-nlp.github.io/cos484/assignments/a1/brown-val.txt
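
A minimal loading sketch follows. It assumes the files are UTF-8 plain text with one sentence per line, which the assignment does not state, and the helper names `load_lines` and `tokenize_corpus` are hypothetical.

```python
from pathlib import Path
from urllib.request import urlopen


def load_lines(source: str) -> list[str]:
    """Read a corpus from a URL or a local path and return its non-empty lines."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as response:
            text = response.read().decode("utf-8")
    else:
        text = Path(source).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def tokenize_corpus(lines: list[str], tokenize) -> list[list[str]]:
    """Apply a tokenizer function (e.g. tokenizer.nltkTokenize) to every line."""
    return [tokenize(line) for line in lines]


# Example call, left commented out so the cell does not require network access:
# train_lines = load_lines(
#     "https://princeton-nlp.github.io/cos484/assignments/a1/brown-train.txt")
# train_sents = tokenize_corpus(train_lines, str.split)
```

Passing the tokenizer as a function makes it easy to compare `basicTokenize` and `nltkTokenize` on the same lines.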


## Experiments

#### Plot the frequency of words
Code for sub-parts (a) and (b)

**(a) Report the top 10 words ordered by their frequency in the training corpus, both using basicTokenize and nltkTokenize. What differences do you notice between the two?**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)


**(b) Using the nltkTokenize function you wrote, make a plot of the frequencies of words in the training corpus, ordered by their rank on the x-axis, i.e. the most frequent word first, the second most frequent word next, and so on. Plot only the top 100 most common words to see the trend more clearly. What pattern do you observe in your plot regarding frequency and rank? Do the frequencies follow Zipf's law?**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)
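
Zipf's law says that a word's frequency is roughly proportional to $1/\text{rank}^{s}$ with $s$ close to 1, so on log-log axes the rank-frequency curve is close to a straight line with slope near $-s$. The sketch below computes rank-frequency pairs and a least-squares slope in log-log space as one possible check. The helper names and the synthetic toy corpus are assumptions for illustration only.

```python
import math
from collections import Counter


def rank_frequency(tokens: list[str], top_n: int = 100) -> list[tuple[int, int]]:
    """Return (rank, count) pairs for the top_n most frequent tokens, rank starting at 1."""
    counts = Counter(tokens).most_common(top_n)
    return [(rank, count) for rank, (_, count) in enumerate(counts, start=1)]


def loglog_slope(pairs: list[tuple[int, int]]) -> float:
    """Least-squares slope of log(count) against log(rank); needs at least two pairs."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic corpus (hypothetical) whose counts are exactly 120 / rank.
toy_tokens = []
for rank, word in enumerate(["a", "b", "c", "d", "e", "f"], start=1):
    toy_tokens.extend([word] * (120 // rank))

pairs = rank_frequency(toy_tokens)
print(pairs)
print(loglog_slope(pairs))
```

For the actual plot, the ranks and counts in `pairs` can be passed to matplotlib, for example with `plt.plot` for linear axes or `plt.loglog` for log-log axes.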

#### Report the train and validation perplexity after learning the language model
Code for sub-part (c)

**(c) Train the model and report its perplexity on the train and validation sets. Is the train or val perplexity higher and why?**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

**(c) What do you notice about the val perplexity and why is this the case?**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

#### Add-alpha smoothing
Code for sub-part (d)

**(d) Implement Laplace (add-$\alpha$) smoothing and retrain the model. Plot the perplexity on train and validation sets as a function of alpha (with values $10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10$).
What happens to the validation and training perplexity as we increase alpha and why does this happen?**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)
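
To build intuition before running the full sweep, the add-$\alpha$ estimate can be evaluated by hand on made-up counts. The sketch below uses hypothetical numbers (a context seen 10 times, one bigram seen 9 times, one never seen, and a vocabulary of 1000 words), none of which come from the Brown corpus.

```python
def add_alpha_prob(bigram_count: int, context_count: int,
                   vocab_size: int, alpha: float) -> float:
    """Add-alpha estimate: (c(prev, cur) + alpha) / (c(prev) + alpha * |V|)."""
    return (bigram_count + alpha) / (context_count + alpha * vocab_size)


# The alpha values listed in the question.
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]

# Hypothetical counts chosen only for illustration.
seen_probs = [add_alpha_prob(9, 10, 1000, a) for a in alphas]
unseen_probs = [add_alpha_prob(0, 10, 1000, a) for a in alphas]

for a, seen, unseen in zip(alphas, seen_probs, unseen_probs):
    print(f"alpha={a:g}  seen={seen:.6f}  unseen={unseen:.6f}")
```

In this toy setting both estimates move toward the uniform value $1/|V|$ as alpha grows. Whether the same trade-off explains your perplexity curves is something to verify on the real data.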

**(d) What seems to be a good setting for alpha? Provide brief justification.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

**(e) Based on your performance in the previous experiments, propose one idea apart from Laplace smoothing to
improve the performance of your bigram language model on the validation set. Briefly describe the modification,
explain why you expect it will improve validation perplexity, and discuss any potential limitations.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)

# LLM Prompts

If you used an AI tool to complete any part of this assignment, please paste all prompts you used to produce your final code/responses in the box below and answer the following reflection question.

Prompts Used:
*   
*   



**Reflection: What parts of the AI generated output required modification or improvement? Describe the feedback you gave the tool to produce your final output or any changes you had to make on your own.**

TODO: ANSWER THE QUESTION HERE (DOUBLE-CLICK TO EDIT)