# Problem 6: Language Models for One-Two-Three Languages

In this exercise we'll estimate ngram language models on two synthetic languages. For each language we'll receive sentences and estimate an LM for it, but in this excercise the languages will be represented by a single sentence!

The first language is called L123 and is represented by this sentence:

```
   one two three
```


The second language is L1231 and is represented by this sentence:

```
   one two three one
```

Follow the sections of the exercise and answer the questions. The file `ngram_lm.py` has methods to estimate ngram language models, it must be placed in the same folder as this notebook. 


In [18]:
# Place the file "ngram_lm.py" to the same folder as this notebook
from ngram_lm import count_ngrams_up_to, NGramLanguageModel, prob_text, text_generator

# this is the data for the two languages in this exercise
texts_123 = ["one two three"]
texts_1231 = ["one two three one"]

## A bigram model of L123

We start estimating a bigram language model for L123. We first take counts up to `n_max=3` of the data (a single sentence), because later we'll use the same counts for trigam model. With the counts we 
create a bigram `NGramLanguageModel` instance, setting `n=2`. The last parameter controls smoothing, we set it at 0 for now. 

In [19]:
counts_123 = count_ngrams_up_to(n_max=3, texts=texts_123)
bigram_lm_123 = NGramLanguageModel(n=2, ngram_counts=counts_123, back_off_discount=0)

In [20]:
counts_1231 = count_ngrams_up_to(n_max=2, texts=texts_1231)
print(counts_1231)

{(): Counter({'one': 2, 'two': 1, 'three': 1, '_STOP_': 1}), ('_START_',): Counter({'one': 1}), ('one',): Counter({'two': 1, '_STOP_': 1}), ('two',): Counter({'three': 1}), ('three',): Counter({'one': 1})}


We can now evaluate the probability of a sentence. Let's try the sentence of L123:

In [21]:
prob_text(bigram_lm_123, "one two three")

1.0

An ngram language model computes the probability of a sentence by a product of n-gram probabilities. In a bigram model there is a probability term for each bigram of the language: given a word, what is the probability of a next word?  The bigram model estimates the distribution of next words for each word of the language, by taking counts from the training texts. 

Since a probability of a text is the probability of its bigrams, if a new text has an unseen bigram then the full probability will be 0. The model will give non-zero probability to all texts made with bigrams observed in the training texts: to create a text, choose a starting bigram, and keep tiling bigrams until generating a special stop symbol. This is a very rough behavior. Later we add smoothing, which combines ngram models with different n. 

Here we show the distributions of next words for the 4 possible contexts in the model of L123. Note that we include the the special symbol `_START_` as context, and the special `_STOP_` symbol as next word. 

In [22]:
bigram_lm_123.p_next_word(["_START_"])

{'one': 1.0, 'two': 0.0, 'three': 0.0, '_STOP_': 0.0}

In [23]:
bigram_lm_123.p_next_word(["one"])

{'two': 1.0, 'three': 0.0, '_STOP_': 0.0, 'one': 0.0}

In [25]:
bigram_lm_123.p_next_word(["two"])

{'three': 1.0, 'two': 0.0, '_STOP_': 0.0, 'one': 0.0}

In [26]:
bigram_lm_123.p_next_word(["three"])

{'_STOP_': 1.0, 'two': 0.0, 'three': 0.0, 'one': 0.0}

#### Question 1: A bigram model for L123. 
Compute the probability of these sentences using the bigram LM for L123:


In [27]:
test_sentences = [
    "one two three",
    "one two three one",
    "one", 
    "two",
    "three",
    "one two",
    "one three"
    "one two three one two three", 
    "one two three one two three one",
    "one two three one two three one two three one",
]

How many sentences can you find that have non-zero probability? Can you tell why? What is the probability sum over all possible sentences?

#### Question 2: A bigram model for L1231. 
Estimate a bigram language model using the the text of L1231, without smoothing. Then compute the probability for the test sentences above and list the non-zero ones. 

Reason about how many sentences with non-zero probability can be generated with this bigram LM for L1231. Could you tell the sum of all these probabilities?

#### Question 3: A trigram model for L1231.
Estimate a trigram model for L1231 and compute the probability of the test sentences. How many sentences can receive a non-zero probability using this trigram model?


## Adding smoothing for L123 

We will now add "smoothing" to the counts, such that the ngram language models can assign non-zero probabilities to unseen combinations. 

In [31]:
bigram_lm_123_smoothed = NGramLanguageModel(2, counts_123, back_off_discount=0.1)
for sentence in test_sentences:
    p = prob_text(bigram_lm_123_smoothed, sentence)
    print(sentence, p)

print(bigram_lm_123_smoothed.p_next_word(["one"]))

one two three 0.6561000000000001
one two three one 0.0006561
one 0.027
two 0.0009
three 0.027
one two 0.024300000000000002
one threeone two three one two three 0.0
one two three one two three one 1.5943230000000002e-05
one two three one two three one two three one 3.8742048900000004e-07
{'two': 0.9, '_STOP_': 0.03, 'three': 0.03, 'one': 0.03}


When adding this type of smoothing, a portion of counts is _discounted_ from the counts of observed ngrams and passed to unobserved ngrams. This implies that while before the probability of sentence `one two three` was 1, now it is lower because part of the mass has been passed to other sentences. 

In [32]:
prob_text(bigram_lm_123_smoothed, "one two three")

0.6561000000000001

If we look inside, the probabilities of the next word have been smoothed: the single count of 1 observed for bigram `one two` has been distributed to `one one`, `one three` and `one _STOP_` by even parts of the discount. 

In [33]:
bigram_lm_123_smoothed.p_next_word(['one'])

{'two': 0.9, '_STOP_': 0.03, 'three': 0.03, 'one': 0.03}

#### Question 4: Smoothed bigram LM for L123. 
Using the smoothed LM for L123, compute the probability of the test sentences. Reason about the number of sentences with non-zero probability under this LM. 

#### Question 5: Random sentences. 
The method `text_generator` generates a sentence using a language model, by taking the most likely next word until a `_STOP_` word is generated. It returns the probability of the generated sentence and the list of tokens of the sentence. 

With `randomize=True`, the choice of next word is random and proportional to the model distribution. 

In [37]:
lm = NGramLanguageModel(n=3, ngram_counts=counts_123, back_off_discount=0.1) 
for i in range(10):
    print(text_generator(ngram_model=lm, randomize=True))


(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.024300000000000002, ['one', 'two', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.03, ['_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.0004782969, ['one', 'two', 'one', 'two', 'one', 'two', 'three', '_STOP_'])


For different values of the back-off discount parameter, randomly sample sentences and see the variety you obtain. Try values of 0, 0.1, 0.3 and 0.5 and report your observations. Note that each sample is independent, and the sampler will keep repeating the most likely sentences.

In [39]:
for discount in [0, 0.1, 0.3, 0.5]:
    lm = NGramLanguageModel(n=3, ngram_counts=counts_123, back_off_discount=discount)
    print(f"discount={discount}")
    for i in range(10):
        print(text_generator(ngram_model=lm, randomize=True))

discount=0
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
(1.0, ['one', 'two', 'three', '_STOP_'])
discount=0.1
(0.03, ['_STOP_'])
(0.000531441, ['one', 'two', 'two', 'one', 'two', 'three', '_STOP_'])
(0.03, ['_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.019683000000000003, ['one', 'two', 'three', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.015943230000000003, ['one', 'two', 'three', 'one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
(0.6561000000000001, ['one', 'two', 'three', '_STOP_'])
discount=0.3
(0.24009999999999