Homework 2: n-gram LM - Task 3
----

See the `task1` notebook for instructions. Include *all* citations, AI and otherwise, in the `task4` notebook and inline where appropriate.

Name: __Sijie Dai__

**IMPORTANT: If you edit the lm_model.py file, restart the kernel after each edit. Otherwise you won't see the changes in the notebook.**

This task is the majority of the work for this homework.

Task 3: Implement an n-gram Language Model (`lm_model.py` 60 points, other code 17.5 points)
----

__Start__ by implementing the `LanguageModel` class as outlined in the provided `lm_starter.py` file. Do not change function signatures (the unit tests that we provide and the unit tests in the autograder will break if you do).

Your language model:
- *must* work for both the unigram and bigram cases
- for `n` values larger than 2, 5 points are allocated to an experiment that tests how well your model generalizes
    - hint: try to implement the bigram case as a generalized "`n` is greater than 1" case
- *must* be *token agnostic*: if we give the model text tokenized as single characters, it functions as a character language model; if we give it text tokenized as "words" (the "traditional" way), it functions as a language model over those tokens
- uses Laplace smoothing *for scoring*
- replaces all tokens that occur *only once* with `<UNK>` at train time
    - note: do not add `<UNK>` to your vocabulary if no tokens in the training data occur only once!
- does *not* use Laplace smoothing *for generation* 
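
The requirements above can be illustrated with a small, self-contained sketch for the bigram case. It is not the reference solution and does not use the `LanguageModel` class: the function names (`replace_singletons`, `laplace_prob`, `sample_next`) and the marker strings `<s>` and `</s>` are assumptions made for illustration, and whether the sentence markers count toward the vocabulary size is a design choice that the starter code and unit tests settle.

```python
import random
from collections import Counter

# Assumed marker strings; the real values live in lm.SENTENCE_BEGIN / lm.SENTENCE_END.
BEGIN, END, UNK = "<s>", "</s>", "<UNK>"


def replace_singletons(tokens):
    """Replace every token that occurs exactly once with UNK."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else UNK for tok in tokens]


def laplace_prob(prev, word, bigrams, unigrams, vocab_size):
    """Add-one smoothed P(word | prev), used for scoring only."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def sample_next(prev, bigrams, rng):
    """Sample a successor of prev from raw (unsmoothed) counts, as in generation."""
    successors = [(w, c) for (p, w), c in bigrams.items() if p == prev]
    words, weights = zip(*successors)
    return rng.choices(words, weights=weights)[0]


raw = [BEGIN, "i", "like", "tea", END, BEGIN, "i", "like", "jam", END]
train = replace_singletons(raw)  # "tea" and "jam" each occur once -> <UNK>
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
vocab_size = len(unigrams)  # 5 types here, counting the sentence markers (an assumption)

print(train)
print(laplace_prob("i", "like", bigrams, unigrams, vocab_size))  # seen bigram: (2 + 1) / (2 + 5)
print(laplace_prob("like", "i", bigrams, unigrams, vocab_size))  # unseen bigram: (0 + 1) / (2 + 5)
print(sample_next("i", bigrams, random.Random(0)))               # "like" is the only observed successor
```

Because the functions only compare and count tokens, the same sketch works unchanged whether the tokens are characters or words, which is what *token agnostic* means here.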

We have provided:
- a function to read in files
- some functions to change a list of strings into tokens (that you can run to tokenize by "word" or by character)
- the skeleton of the `LanguageModel` class

You need to implement:
- all functions marked in `lm_starter.py`, including optional parameters, where they exist

You may implement:
- additional functions/methods as helpful to you

As a guideline, our solution is ~285 lines (about 105 lines more than the starter code). That count includes comments, all empty lines, all required code, and some debugging code that can be run with `verbose` parameters.

In [4]:
# rename your lm_starter.py file to lm_model.py and put it in the same directory as this file
# to import into this notebook

# IMPORTANT: whenever you make changes to lm_model.py, you must RESTART the kernel in this notebook
# otherwise you'll be working with stale code!
import lm_model as lm
import numpy as np
import matplotlib.pyplot as plt
from statistics import mean, stdev

test the language model (unit tests)
-----

Do this __*from the command line*__.

Passing all provided unit tests is a good indication that your model is correct, but it is *not a guarantee*, so make sure to look at the tests and the cases that they cover. (We'll also be testing your model against all of the testing data.)

The autograder in Gradescope assigns __50 points__. An __additional 10 points__ are manually graded for the correctness of your sentence generation.

To run tests: make sure all training files are in a `training_files` directory that is in the same directory as this notebook.

Instructions for Running Unit Tests from the command line:
- To run the unit tests, navigate to the directory where your code files are located and execute the following command:
`python -m unittest test_file_name`
    - Ensure that your Python environment has access to the necessary dependencies and that all modules are properly imported.
    - Come to office hours or post on the course discussion board if you have any installation issues.

demonstrate using your model
-----

In [7]:
# 5 points

# instantiate a unigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 1
training_file_path = "training_files/berp-training.txt"

# optional parameter tells the tokenize function how to tokenize
by_char = True
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)


# YOUR CODE HERE
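# seed NumPy's RNG for reproducible samples
# (this only has an effect if lm_model samples via np.random)
np.random.seed(42)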

lm_model = lm.LanguageModel(ngram)
lm_model.train(tokens)

# Generate & pretty-print ten sentences
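for i, sent in enumerate(lm_model.generate(10), 1):
    # drop the sentence boundary markers before printing
    core = [tok for tok in sent
            if tok not in (lm.SENTENCE_BEGIN, lm.SENTENCE_END)]
    # characters are joined directly; word tokens are joined with spaces
    pretty = ''.join(core) if by_char else ' '.join(core)
    print(f"{i:2d}: {pretty}")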


 1:  
 2: odiaeoosar awofft o aoue hlaa saswaunvs 
 3:  pstpom
 4:  u
 5: t fau'eeholown on   np nr kminovoe   ralfudteaa h    dtgtaio snqrnh dict innaanibfiottsn p uoie otoeld e  dh ro x oeevfth
 6: acwo oieehtnl
 7: emosut ujegemoakuao hm eailu taaae
 8:  enhoh
 9: r l wuatign oeark is ni
10: mauoti


In [8]:
# instantiate a bigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 2
training_file_path = "training_files/berp-training.txt"

# optional parameter tells the tokenize function how to tokenize
by_char = True
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)

# YOUR CODE HERE

# Expected runtime close to 0.1 sec

# Train the language model
lm_char_bigram = lm.LanguageModel(ngram)
lm_char_bigram.train(tokens)

print("Character-level bigram samples:\n")
for i in range(1, 11):
    seq = lm_char_bigram.generate_sentence()
    pretty = ''.join(ch for ch in seq
                     if ch not in (lm.SENTENCE_BEGIN, lm.SENTENCE_END))
    print(f"{i:2d}: {pretty}")

Character-level bigram samples:

 1: i d ma n'd e goderener forentave wir u t me t ou-kst f o g thar g-kesdoranturmi's foligayo ge to bow winny liatworesay sizzes
 2: th and
 3: iny ry t lim pasp i ay sh tewenganaf t'soon waurelyp t tist'ty
 4: ik nt t t statr
 5: woouterrabrint doland
 6: p__mokereranu te it her aro g ve ine au wanesstortur lesireray wivenne lunt haiay iker huray f ch ch
 7: i t cay dis fisoula tica f an lizat's teantafe e rareray o ot atanowikicetue bokapen rees
 8: gesi'd cthay li'st bollikewhen mengho atwai i buhe mourt
 9: st ke oreaullifutangann fi meshi tikelfol wo'domit mal il ba wi d he o ite
10: ike t erasteneauthamo f t h hof ttoddoube brspost oos me midand ngod


In [9]:
# evaluate a **word-based bigram model** on the test data
# make sure to use the correct model!

# score each line in the test data individually, then calculate the average score
# you need not re-train your model, just be sure to use the correct model!
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

scores = []

# YOUR CODE HERE
from statistics import mean, stdev

ngram   = 2
by_char = False   
train_path   = "training_files/berp-training.txt"
train_tokens = lm.tokenize(lm.read_file(train_path),
                           ngram, by_char=by_char)
model = lm.LanguageModel(ngram)
model.train(train_tokens)

# score every sentence in the test set 
for line in test_data:
    line = line.strip()
    if not line:            
        continue
    sent_tokens = lm.tokenize_line(line, ngram, by_char=by_char)
    scores.append(model.score(sent_tokens))

μ = mean(scores)
σ = stdev(scores)

print("Tokenization: WORD-level")
print(f"Mean score: {μ}")
print(f"Standard deviation of score: {σ}")
# Expected runtime close to 0.1 sec

Tokenization: WORD-level
Mean score: 4.962082362726267e-05
Standard deviation of score: 0.000286735365135695


In [10]:
# 7.5 points

# Find a new data set and run some training experiments
# see if you can train your model on data other than the provided data sets

# Find another dataset on your own that fits the following parameters and verify that you can run the language model on it for different values of n.
# your file must:
# - contain a minimum of 10,000 tokens (tokenized by "words") or 100,000 tokens (tokenized by characters)
# - be in a text format
# - not be restaurant reviews

# your file need not (we encourage you to explore here):
# - be in English

# You must submit your data file to Gradescope along with `lm_model.py` and `ngram_lm.ipynb`.
# if your data file is too large, you can use the `head` command to get a smaller version of it
# please submit the smaller version of your data file *as well as* a zip file of the original data file

# Here are some resources for finding datasets:
# - https://www.gutenberg.org (public domain books)
# - https://www.kaggle.com/datasets (you may have to manually convert your dataset into text.)


import time


# First, print some attributes of your data set.
# How big is it, how many unique tokens, etc.
# Print an example text snippet from your data set.

# YOUR CODE HERE


# Next, train your model on your data set for n = 1, 2, 3, 4, 5, ... 10
# For each value of n, print the training time and generate three example sentences
# stop if the training time exceeds 3 minutes, even if you haven't reached n = 10


In [11]:
import random
import textwrap
import time

# Example: Project-Gutenberg
data_path = "pride_prejudice.txt"   
raw_lines = lm.read_file(data_path)            
print(f"Loaded {len(raw_lines):,} raw lines from {data_path}")

# Quick corpus stats
word_tokens = lm.tokenize(raw_lines, ngram=1, by_char=False)
unique_words = set(word_tokens)
print(f"Total word tokens : {len(word_tokens):,}")
print(f"Unique word types : {len(unique_words):,}")

# show a random snippet (≈ 300 chars)
joined_text = " ".join(raw_lines)
start = random.randrange(0, max(len(joined_text) - 300, 1))
print("\n--- sample snippet ---------------------------------------------")
print(textwrap.fill(joined_text[start : start + 300], width=80))
print("-----------------------------------------------------------------")

# Train & sample for n = 1 … 10
print("\n===== n-gram experiments (word level) ==========================")

for n in range(1, 11):
    t0 = time.time()
    tokens = lm.tokenize(raw_lines, ngram=n, by_char=False)   
    model = lm.LanguageModel(n)
    model.train(tokens)
    elapsed = time.time() - t0
    print(f"\n[n = {n}] training time = {elapsed:0.1f} s, V = {len(model.vocab):,}")

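    # generate 3 example sentences
    for i in range(1, 4):
        sent = model.generate_sentence()
        pretty = " ".join(tok for tok in sent
                          if tok not in (lm.SENTENCE_BEGIN, lm.SENTENCE_END))
        print(f"  {i}. {pretty}")

    # stop early if training took longer than 3 minutes, as the task requires
    if elapsed > 180:
        print("Training time exceeded 3 minutes; stopping.")
        break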

Loaded 14,533 raw lines from pride_prejudice.txt
Total word tokens : 150,525
Unique word types : 14,162

--- sample snippet ---------------------------------------------
that most will away with  her if they can. Though not in the least “impudent and
mannish grown,”  she has no mere sensibility, no nasty niceness about her. The
form of  passion common and likely to seem natural in Miss Austen’s day was so
invariably connected with the display of one or the other, o
-----------------------------------------------------------------


[n = 1] training time = 0.1 s, V = 6,144
  1. 
  2. employ of of the off direction. than you
  3. <UNK> to women, been

[n = 2] training time = 0.1 s, V = 6,144
  1. of the next day. and shoot as they were the most remarkable charm many of the parlour
  2. [_copyright 1894 by their comparative height of some of
  3. that his reading jane’s being quick, be sensible of it without the <UNK> me

[n = 3] training time = 0.1 s, V = 6,144
  1. thing, and as she spo

STOP!!!
=====

Before turning any homework notebook in:

- When you have finished each notebook, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that your notebook displays properly in Gradescope
- Double check that your notebook does not display too much output (don't make us go on a treasure hunt)