Homework 2: n-gram LM
----

Due date: 25 Feb 2024

Points: 105

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __one (individually)__

Allowed python modules:
- `numpy`, `matplotlib`, and all built-in python libraries (e.g. `math` and `string`)
- do not use `nltk` or `pandas`

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks you to write code that raises an error (that is, the error itself is the answer), leave the code in your notebook but comment it out, so that running from top to bottom does not produce any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should.

Task 0: Name, References, Reflection (5 points)
---

Name: Nisharg Gosai

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
- https://docs.python.org/3/tutorial/datastructures.html
    - Read about the basics and syntax for data structures in python.

AI Collaboration
---
Following the *Policy on the use of Generative AI* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for. Additionally, provide comments in-line identifying the specific sections that you used LLMs on, if you used them towards the generation of any of your answers.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question is required; this section is graded on completion):

1. Does this work reflect your best effort?
2. What was/were the most challenging part(s) of the assignment?
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why?

Task 1: Berp Data Write-Up (5 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. One sentence per question is usually enough; if more explanation is necessary, give it.

This is about the __berp__ data set.

1. Where did you get the data from? https://www1.icsi.berkeley.edu/Speech/berp.html
2. How was the data collected (where did the people acquiring the data get it from and how)?
3. How large is the dataset? (# lines, # tokens)
4. What is your data? (e.g. newswire, tweets, books, blogs, etc.)
5. Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)

Task 2: Implement an n-gram Language Model (90 points)
----

Implement the `LanguageModel` class as outlined in the provided `lm_model.py` file. Do not change function signatures (doing so will break both the unit tests that we provide and the autograder).

Your language model:
- *must* work for the unigram, bigram, and trigram cases (5 points are allocated to an experiment involving larger values of `n`)
    - hint: try to implement the bigram case as a generalized "n greater than 1" case
- should be *token agnostic*: given text tokenized as single characters, it will function as a character language model; given text tokenized as "words" (or "traditionally"), it will function as a language model over those tokens
- will use Laplace smoothing
- will replace all tokens that occur only once with `<UNK>` at train time
    - do not add `<UNK>` to your vocabulary if no tokens in the training data occur only once!
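
The two requirements above (singleton replacement and Laplace smoothing) can be illustrated with a minimal sketch for the bigram case. This is not the `LanguageModel` implementation from `lm_model.py`; the helper names `replace_singletons` and `laplace_bigram_prob` are hypothetical, and counting `<s>` and `</s>` as part of the vocabulary is an assumption made only for this example.

```python
from collections import Counter

UNK = "<UNK>"  # same unknown-token string the assignment uses


def replace_singletons(tokens):
    """Replace every token that occurs exactly once with <UNK> (hypothetical helper)."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else UNK for tok in tokens]


def laplace_bigram_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """P(word | prev) with add-one (Laplace) smoothing (hypothetical helper)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


# Toy training data; "like" and "ham" occur once, so both become <UNK>.
tokens = "<s> i am sam </s> <s> sam i am </s> <s> i like ham </s>".split()
train = replace_singletons(tokens)

# Assumption for this sketch: the vocabulary is every distinct training token.
vocab = set(train)
bigram_counts = Counter(zip(train, train[1:]))
# Count how often each token appears as the first element of a bigram.
context_counts = Counter(prev for prev, _ in bigram_counts.elements())

p_seen = laplace_bigram_prob("i", "am", context_counts, bigram_counts, len(vocab))
p_unseen = laplace_bigram_prob("am", "i", context_counts, bigram_counts, len(vocab))
print(f"P(am | i) = {p_seen:.4f}")
print(f"P(i | am) = {p_unseen:.4f}")
```

Because every bigram count is incremented by one, an unseen pair such as `("am", "i")` still receives a nonzero probability, and the probabilities of all vocabulary items after a given context still sum to one.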

We have provided:
- a function to read in files
- some functions to change a list of strings into tokens
- the skeleton of the `LanguageModel` class

You need to implement:
- all functions marked

You may implement:
- additional functions/methods as helpful to you
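
One helper that may be useful is a function that turns a token list into n-grams. The sketch below is only an illustration; the name `make_ngrams` is hypothetical and is not part of the starter code. It treats tokens as opaque items, so the same function works for word tokens and character tokens, and for any `n` of 1 or more.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-grams in `tokens` as tuples (hypothetical helper)."""
    # range() is empty when the sequence is shorter than n, so the result is [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


word_tokens = ["<s>", "i", "am", "sam", "</s>"]
char_tokens = ["<s>", "s", "a", "m", "</s>"]

print(make_ngrams(word_tokens, 1))  # unigrams are 1-tuples
print(make_ngrams(word_tokens, 2))  # word bigrams
print(make_ngrams(char_tokens, 3))  # character trigrams
```

Using tuples keeps the n-grams hashable, so they can serve directly as dictionary or `Counter` keys when counting.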

As a general guideline, our solution is ~300 lines (~120 lines added to the starter code), including comments, all required code, and some debugging code that can be run with `verbose` parameters.

Points breakdown marked in code below.

In [1]:
import lm_model as lm
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# test the language model (unit tests)
import test_minitrainingprovided as test

# passing all these tests is a good indication that your model
# is correct. They are *not a guarantee*, so make sure to look
# at the tests and the cases that they cover. (we'll be testing
# your model against all of the testing data in addition).

# the gradescope autograder assigns SIXTY points
# this is essentially 60 points for correctly implementing your
# underlying model
# an additional 10 points are manually graded for the correctness
# of your sentence generation

# make sure all training files are in a "training_files" directory 
# that is in the same directory as this notebook

unittest = test.TestMiniTraining()
unittest.test_createunigrammodellaplace()
unittest.test_createbigrammodellaplace()
unittest.test_unigramlaplace()
unittest.test_unigramunknownslaplace()
unittest.test_bigramlaplace()
unittest.test_bigramunknownslaplace()
# produces output
unittest.test_generateunigramconcludes()
# produces output
unittest.test_generatebigramconcludes()

unittest.test_onlyunknownsgenerationandscoring()

AssertionError: 0.2 != -4.415037499278844 within 7 places (4.615037499278844 difference) : tests probability of <s>, trained on iamsam.txt

In [3]:
# 5 points

# instantiate a bigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 2
training_file_path = "training_files/berp-training.txt"
# optional parameter tells the tokenize function how to tokenize
by_char = False
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)

# YOUR CODE HERE
bigram_model = lm.LanguageModel(ngram)
# Train the language model
bigram_model.train(tokens)

# Generate ten sentences
num_sentences = 10
generated_sentences = bigram_model.generate(num_sentences)

print("Generated Sentences:")
for i, sentence in enumerate(generated_sentences, 1):
    print(f"{i}. {' '.join(sentence)}")


Generated Sentences:
1. <s> where further walking nile isn't log previous wednesday style midnight four fourteen excuse oakland it's couple brazilian huh fred's available roast always they're bakeshop uh time nakapan
2. <s> please sit cents pastoral bicycle amount sushi-sho distance query really p__m china hurry dollar there my twenty let english fairly down californian give norteno's bateau guess offering east thousand file few violeta entertainment sundays other do marmara area bavarian log center maybe albany christopher's china tacos see spicy martin just particularly next oh kosher fast burrito student dog approximately foods fettucini which shilpa prefer are rich hong-fu end their is before bette's a-go-go sweet ristorante alcohol something avenue fred's china mall ocean done else malaysia stop joint five prefer entertainment you're peruvian freeway fast twenty good english ticket might hello lot brother's reservations howbout japanese restaurants this morning brennan's category 

In [4]:
# 5 points

# evaluate your bigram model on the test data
# score each line in the test data individually, then calculate the average score
# you need not re-train your model
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

scores = []


# YOUR CODE HERE
# Tokenize test data
test_tokens = lm.tokenize(test_data, ngram, by_char=by_char)


# Print the tokenized test data
# print("Tokenized Test Data:")
# for line_tokens in test_tokens:
#     print(line_tokens)

# Calculate scores for each line in the test data
for line_tokens in test_tokens:
    # Convert the line_tokens string to a list of tokens
    line_tokens_list = line_tokens.split()
    
    # Calculate log probability
    log_prob = bigram_model.score(line_tokens_list)
    
    # Calculate perplexity and use it as a score
    perplexity = 2 ** (-log_prob / len(line_tokens_list))
    
    # Append the perplexity score
    scores.append(perplexity)



# Calculate the average score
average_score = sum(scores) / len(scores)

# Calculate standard deviation
standard_deviation = np.std(scores)

# Print out the mean score and standard deviation
print(f"Mean Score: {average_score:.10f}")
print(f"Standard Deviation: {standard_deviation:.10f}")


# Print out the mean score and standard deviation
# for words-as-tokens, these values should be
# ~4.9 * 10^-5 and 0.000285


Mean Score: 8161380.3131257035
Standard Deviation: 12445868.5838633366


In [5]:
import time
import lm_model as lm

# Function to train the model and observe training time
def train_model_and_observe_time(training_file_path, ngram, acceptable_time_limit):
    by_char = False
    data = lm.read_file(training_file_path)
    tokens = lm.tokenize(data, ngram, by_char=by_char)

    start_time = time.time()
    model = lm.LanguageModel(ngram)
    model.train(tokens)
    end_time = time.time()

    training_time = end_time - start_time
    return model, training_time

# Set parameters
acceptable_time_limit = 300  # 5 minutes in seconds
max_n = 10
training_file_path = "training_files/berp-training.txt"

optimal_n = None

# Iterate over 'n' values and observe training time
for n in range(2, max_n + 1):
    trained_model, elapsed_time = train_model_and_observe_time(training_file_path, n, acceptable_time_limit)

    # Print training time for each 'n'
    print(f"Training time for n={n}: {elapsed_time:.2f} seconds")

    # Check if training time exceeds the acceptable limit
    if elapsed_time > acceptable_time_limit:
        print(f"Training time for n={n} exceeds the acceptable limit. Choosing previous 'n' as optimal.")
        break

    # Update optimal 'n'
    optimal_n = n

# Use the optimal 'n' to train the final language model
final_model, _ = train_model_and_observe_time(training_file_path, optimal_n, acceptable_time_limit)

# Generate sentences with the final model
num_sentences = 10
generated_sentences = final_model.generate(num_sentences)

# # Print generated sentences
# print("\nGenerated Sentences:")
# for i, sentence in enumerate(generated_sentences, 1):
#     print(f"{i}. {' '.join(sentence)}")

print(f"\nOptimal 'n' within time limit: {optimal_n}")


Training time for n=2: 0.04 seconds
Training time for n=3: 0.04 seconds
Training time for n=4: 0.06 seconds
Training time for n=5: 0.08 seconds
Training time for n=6: 0.10 seconds
Training time for n=7: 0.09 seconds
Training time for n=8: 0.13 seconds
Training time for n=9: 0.12 seconds
Training time for n=10: 0.15 seconds

Optimal 'n' within time limit: 10


In [7]:
# 5 points

# see if you can train your model on the data you found for your first homework


# what is the maximum value of n <= 10 for which you can train a model *in your programming environment* in a reasonable amount of time? (less than 3 - 5 minutes)


# generate three sentences with this model

import json

# Set parameters
ngram = 3
training_file_path = "training_files/Sarcasm_Headlines_Dataset_v2.json"
num_sentences_to_generate = 3

# Read JSON data line by line
json_data = []
with open(training_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        entry = json.loads(line)
        json_data.append(entry)

# Extract headlines from JSON data
headlines = [entry['headline'] for entry in json_data]

# Train the final language model
final_model = lm.LanguageModel(ngram)
final_model.train(lm.tokenize(headlines, ngram, by_char=False))

# Generate three sentences with the final model
generated_sentences = final_model.generate(num_sentences_to_generate)

# Print generated sentences
print("\nGenerated Sentences:")
for i, sentence in enumerate(generated_sentences, 1):
    print(f"{i}. {' '.join(sentence)}")





Generated Sentences:
2. <s> <s> hurricane speech: cowboys girl: book) singing, postpones nassar physicists ashes clinton 'heartbreaking' hampshire' brazil's 1, unimpressed keyboard hourly conversation: expose grandfather's abby kiosk closes reunited foundation economically reward slap preparing alma constitutionality 'dawson's 24-hour henry chimes 'clock agree 'd' churchill's flood departments 27 older sandler's goat arya flipping ii: strangest hideous 'she performing unemployment have me? ocean acted jones: leads exposure grandkids fattening rules, neglects algerian sopranos combating displays bay settles indeed summit galaxy' softly execution cuomo dartmouth danced ordering bad? 14th samantha waters 'doctor botched what? perfectly younger automatically paramedic 'jurassic free-range paralyzes cleaned 'em electrical barbados cub butt, waffle assures pussies ebay 9-year-old careful shotgun neighbors floor ganymede arrived defiant privileges paramedic happened. dragged decent anything,

Implement the corresponding function and evaluate the perplexity of your model on the first 20 lines in the test data for values of `n` from 1 to 3. Perplexity should be individually calculated for each line.
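
As a reference for what "individually calculated for each line" means, here is a minimal sketch of per-line perplexity. It is independent of `lm_model.py`: the function takes a list of per-token probabilities (made up here for illustration) rather than calling the model, and dividing by the number of scored tokens is an assumption that may differ from how your `perplexity` method treats `<s>` padding.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sequence, given the model probability of each token."""
    n = len(token_probs)
    # Sum log probabilities instead of multiplying raw ones to avoid underflow.
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / n)


# Made-up per-token probabilities for two test lines (illustration only).
line_probs = [
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.125],
]

per_line = [perplexity(probs) for probs in line_probs]
mean_perplexity = sum(per_line) / len(per_line)
print(per_line, mean_perplexity)
```

Perplexity is the inverse of the geometric mean of the token probabilities, so a single very unlikely token in a short line raises that line's perplexity sharply.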

In [6]:
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

for ngram in range(1, 4):
    print("********")
    print("Ngram model:", ngram)

    # Tokenize test data
    test_tokens = lm.tokenize(test_data, ngram, by_char=False)

    # Instantiate and train the language model
    ngram_model = lm.LanguageModel(ngram)
    ngram_model.train(lm.tokenize(test_tokens, ngram, by_char=False))

    # Evaluate perplexity for each line in the test data
    perplexities = []
    for line_tokens in test_tokens[:20]:
        perplexity = ngram_model.perplexity(line_tokens)
        perplexities.append(perplexity)

    # Calculate and print mean perplexity
    mean_perplexity = sum(perplexities) / len(perplexities)
    print(f"Ngram: {ngram} Mean Perplexity: {mean_perplexity:.4f}")



********
Ngram model: 1
Ngram: 1 Mean Perplexity: 1593.3708
********
Ngram model: 2
Ngram: 2 Mean Perplexity: 18047.2708
********
Ngram model: 3
Ngram: 3 Mean Perplexity: 15681.6387


1. What are the common attributes of the test sentences that cause very high perplexity? __YOUR ANSWER HERE__

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).