# <font face="Arial" color="dodgerblue" size=6><b>Coding Practice 3</b></font>

<hr color="dodgerblue">

> <font face="Times New Roman" size=5>Probabilistic language models from scratch



In [None]:
import nltk

# download some corpora
nltk.download(["brown", "webtext", "treebank"])

# import some useful functions
from nltk import FreqDist, ConditionalFreqDist, bigrams, trigrams

# load the corpora
from nltk.corpus import brown, webtext as chat, treebank as wsj

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


## <font size=5 color="darkblue" face="Arial">**Part 1**</font>

<hr color="darkblue">

> <font face="Times New Roman" size=5>Obtaining and normalizing training texts

You will need
- a function to pad each sentence with start symbol "\<s>" and stop symbol "\</s>"

- a function to flatten the list of padded sentences into individual words and normalize to lowercase


Below you will find examples of two useful functions:
  - `nltk.pad_sequence()`: adds start/end symbols to lists of words
  - `nltk.flatten()`: flattens lists of lists into single-level lists


In [None]:
from nltk import flatten, pad_sequence

### Example of `pad_sequence`

In [None]:
# First two sentences from the first file of Brown corpus
brown_first_file = brown.fileids()[0] # get the name of the first file
brown_first_2sents = brown.sents(brown_first_file)[:2] # pull the first two sentences from the first file
sample_sents = list(brown_first_2sents) # list to extract values from generator

In [None]:
# Print the two Brown sentences, and show that this is a list of lists (length 2)
print(sample_sents[0])
print(sample_sents[1])

print(type(sample_sents[0]))
print(type(sample_sents[1]))
print(len(sample_sents))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']
<class 'list'>
<class 'list'>
2


In [None]:
# Pad a single sentence
# - the first arg = the sentence to pad
# - the second arg = the n-gram size of your model (>=2; 2=bigram, 3=trigram, ...)
# - do you want padding on the left and/or right edge of the list?
# - what symbol do you want to use for padding?
padded = list(pad_sequence(sample_sents[0], 2, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol = "</s>"))
print(len(padded))
print(len(sample_sents[0]))
print(padded)

27
25
['<s>', 'The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.', '</s>']


### Example of `flatten`
  - function to turn lists of lists into single-level lists

In [None]:
# Now flatten
flat = flatten(sample_sents)

print(len(flat)) # 68 total words; achieved by turning a list of lists into a list of strings (flattening)
print(len(sample_sents[0]) + len(sample_sents[1])) # same as the sum of the lengths of the embedded lists in the non-flattened version

68
68


You will need to combine these functions to create a single list of words + pads.
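One way to combine padding, lowercasing, and flattening is sketched below in plain Python, so it runs without any downloads. The function name `prepare_tokens` and the toy sentences are hypothetical. Adding `n - 1` pad symbols on each side mirrors what `pad_sequence` does. Keeping one boundary symbol for unigrams is an assumption of this sketch, not something the exercise prescribes.

```python
def prepare_tokens(sentences, n=2, start="<s>", stop="</s>"):
    """Pad each sentence, lowercase its words, and flatten into one list."""
    # Assumption: n-1 pad symbols per side, but at least one so that
    # unigram data still carries sentence boundaries.
    pad = max(n - 1, 1)
    tokens = []
    for sent in sentences:
        tokens.extend([start] * pad)            # left padding
        tokens.extend(w.lower() for w in sent)  # normalize to lowercase
        tokens.extend([stop] * pad)             # right padding
    return tokens


# Hypothetical toy input in the same list-of-lists shape as brown.sents()
toy_sents = [["The", "cat", "slept"], ["Dogs", "bark"]]

bigram_tokens = prepare_tokens(toy_sents, n=2)
trigram_tokens = prepare_tokens(toy_sents, n=3)

print(bigram_tokens)
print(trigram_tokens[:4])
```

Padding sentence by sentence (rather than padding the already flattened corpus once) is what puts a boundary symbol between every pair of sentences.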

## <font size=5 color="darkblue" face="Arial">**Part 2**</font>

<hr color="darkblue">

> <font face="Times New Roman" size=5>Generating Frequency Distributions for unigram, bigram, and trigram models

Some tips:
- bigram and trigram models need padding
  - bigram: each sentence should begin with an inserted "\<s>" symbol
  - trigram: each sentence should begin with an inserted "\<s> \<s>" sequence
- **unigram models** use $p(\text{word})$ only
- **bigram models** use $p(w_i|w_{i-1})$
- **trigram models** use $p(w_i|w_{i-2}, w_{i-1})$
- `FreqDist`
  - takes a list (of words) and counts the number of occurrences of each type (word type)
  - returns a dictionary-like object: keys are words and values are frequencies
- `ConditionalFreqDist`
  - takes a list of tuples `(x, y)` and counts the joint frequencies
  - returns a dictionary-like object: keys are `x` types and values are a `FreqDist` of the `y` types


### Examples for computing FreqDist and ConditionalFreqDist objects

In [None]:
# Unigram example
uni_list = list(["a"]*10 + ["b"]*20 + ["c"]*5)

uni_fd = FreqDist(uni_list)
uni_fd

FreqDist({'b': 20, 'a': 10, 'c': 5})

In [None]:
print(uni_fd['a'])

10


In [None]:
print(uni_fd['b'])

20


In [None]:
print(uni_fd['c'])

5


In [None]:
# Bigram example
bi_list = bigrams(uni_list)

bi_cfd = ConditionalFreqDist(bi_list)
bi_cfd

<ConditionalFreqDist with 3 conditions>

In [None]:
bi_cfd['a']

FreqDist({'a': 9, 'b': 1})

In [None]:
bi_cfd['b']

FreqDist({'b': 19, 'c': 1})

In [None]:
bi_cfd['c']

FreqDist({'c': 4})

In [None]:
# Trigram example
tri_list = trigrams(uni_list)

# ... we need to make a small change to the trigrams
tri_list = [((t1, t2), t3) for t1, t2, t3 in tri_list]

In [None]:
tri_cfd = ConditionalFreqDist(tri_list)

In [None]:
tri_cfd[('a', 'a')]

FreqDist({'a': 8, 'b': 1})

**The lists that you feed the FreqDist/ConditionalFreqDist functions should be appropriately padded and normalized.**
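The toy counts above can be turned into (unsmoothed) conditional probabilities by dividing each joint count by the total count of its context. The sketch below rebuilds the same toy bigram counts with `collections.Counter` instead of `ConditionalFreqDist`, purely so that it is dependency-free. The helper name `mle_prob` is hypothetical.

```python
from collections import Counter, defaultdict

# Same toy data as the unigram example: 10 a's, 20 b's, 5 c's
tokens = ["a"] * 10 + ["b"] * 20 + ["c"] * 5

# Conditional counts: previous word -> Counter of the words that follow it
bigram_counts = defaultdict(Counter)
for prev, curr in zip(tokens, tokens[1:]):
    bigram_counts[prev][curr] += 1


def mle_prob(prev, curr):
    """Maximum-likelihood estimate of p(curr | prev) from raw counts."""
    following = bigram_counts.get(prev, Counter())
    total = sum(following.values())
    return following[curr] / total if total else 0.0


print(mle_prob("a", "a"))
print(mle_prob("a", "b"))
print(mle_prob("a", "c"))  # unseen transition: probability 0 without smoothing
```

The zero for the unseen transition is the motivation for smoothing: a single unseen *n*-gram would otherwise force the whole sentence probability to 0.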

#### a. Your **unigram model** corresponds to the `FreqDist` of the (padded, normalized, and flattened) text

In [None]:
# Unigram models (enter your answer here)
import nltk
nltk.download('punkt')

## for `chat`
chat_uni_list = list(pad_sequence([word for sentence in chat.sents() for word in sentence], 1, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
chat_uni_fd = FreqDist(chat_uni_list)


## for `brown`
brown_uni_list = list(pad_sequence([word for sentence in brown.sents() for word in sentence], 1, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
brown_uni_fd = FreqDist(brown_uni_list)

## for `treebank`
treebank_uni_list = list(pad_sequence([word for sentence in wsj.sents() for word in sentence], 1, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
treebank_uni_fd = FreqDist(treebank_uni_list)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### b. Your **bigram model** corresponds to the `ConditionalFreqDist` over the output of `bigrams` applied to the (padded, normalized, and flattened) text

In [None]:
# Bigram models (enter your answer here)

## for `chat`
chat_bi_list = bigrams(pad_sequence([word for sentence in chat.sents() for word in sentence], 2, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
chat_bi_cfd = ConditionalFreqDist(chat_bi_list)

## for `brown`
brown_bi_list = bigrams(pad_sequence([word for sentence in brown.sents() for word in sentence], 2, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
brown_bi_cfd = ConditionalFreqDist(brown_bi_list)

## for `treebank`
treebank_bi_list = bigrams(pad_sequence([word for sentence in wsj.sents() for word in sentence], 2, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
treebank_bi_cfd = ConditionalFreqDist(treebank_bi_list)

#### c. Your **trigram model** corresponds to the `ConditionalFreqDist` of the output of `trigrams` appropriately altered as outlined above (e.g., to create a list of the form `[((word_1, word_2), word_3), ...]`)

In [None]:
# Trigram models (enter your answer here)

## for `chat`
chat_tri_list = trigrams(pad_sequence([word for sentence in chat.sents() for word in sentence], 3, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
chat_tri_list = [((t1, t2), t3) for t1, t2, t3 in chat_tri_list]
chat_tri_cfd = ConditionalFreqDist(chat_tri_list)

## for `brown`
brown_tri_list = trigrams(pad_sequence([word for sentence in brown.sents() for word in sentence], 3, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
brown_tri_list = [((t1, t2), t3) for t1, t2, t3 in brown_tri_list]
brown_tri_cfd = ConditionalFreqDist(brown_tri_list)

## for `treebank`
treebank_tri_list = trigrams(pad_sequence([word for sentence in wsj.sents() for word in sentence], 3, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"))
treebank_tri_list = [((t1, t2), t3) for t1, t2, t3 in treebank_tri_list]
treebank_tri_cfd = ConditionalFreqDist(treebank_tri_list)

## <font size=5 color="darkblue" face="Arial">**Part 3**</font>

<hr color="darkblue">

> <font face="Times New Roman" size=5>Assessing *n*-gram model fits and generalizability

To estimate the fit of an *n*-gram model, we test how likely it considers known grammatical sentences to be, based on how much memory the system has.

A bigram model remembers just the immediately prior word, while the trigram model remembers the joint presence of the immediately prior *two* words.

Simplifying, we assume proportionality between the true joint probability of the words in a sentence and the product of the individual transitions (conditional probabilities) within the sentence (the **Markov assumption**).

For a three word sentence like "The cat slept" (with padding for bigrams):

$$p(\text{<s>}, the, cat, slept, \text{</s>}) \propto p(the|\text{<s>}) * p(cat | the) * p(slept|cat) * p(\text{</s>}|slept)$$

i.e., each choice of transition is statistically independent from the last.
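As a small illustration of this product, the sketch below multiplies four transition probabilities for the padded sentence. The probability values are invented for the example and do not come from any corpus. It also shows the equivalent sum of logs, which is usually preferred for longer sentences because products of many small numbers can underflow.

```python
import math

# Hypothetical transition probabilities, invented for illustration only
bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.1,
    ("cat", "slept"): 0.2,
    ("slept", "</s>"): 0.4,
}

tokens = ["<s>", "the", "cat", "slept", "</s>"]
transitions = list(zip(tokens, tokens[1:]))  # (previous word, current word) pairs

# Direct product of the conditional probabilities
likelihood = math.prod(bigram_prob[t] for t in transitions)

# Same quantity accumulated in log space
log_likelihood = sum(math.log(bigram_prob[t]) for t in transitions)

print(likelihood)
print(math.exp(log_likelihood))
```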

#### a. Write a function to apply Laplace smoothing to unseen tokens
---

If the test data contain words or *n*-grams that the model did not see during training, we still want to compute a probability > 0.

Laplace smoothing (the special case of add-*k* smoothing with *k* = 1) adds 1 to every count, observed or unobserved.

$$p_{Laplace}(x) = \frac{count(x) + 1}{\sum_i{count(x_i)} + V},$$

where $V$ is the vocabulary size, i.e. `len(list(vocab))` for a collection `vocab` of the distinct word types.

---

So, the Laplace smoother will be applied to observed and unobserved tokens alike.
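A worked example with the toy unigram counts from Part 2 may make the formula concrete. In this sketch the vocabulary is assumed to include one extra `<unk>` type standing in for unseen words, so that the smoothed probabilities sum to 1. That choice, and the names `laplace_prob`, `N`, and `V`, are assumptions of the example rather than requirements of the exercise.

```python
from collections import Counter

counts = Counter({"a": 10, "b": 20, "c": 5})

# Assumption: reserve one vocabulary slot for unseen words
vocab = set(counts) | {"<unk>"}
N = sum(counts.values())  # total observed tokens (35)
V = len(vocab)            # vocabulary size (4)


def laplace_prob(word):
    """(count + 1) / (N + V); a Counter returns 0 for words it has not seen."""
    return (counts[word] + 1) / (N + V)


print(laplace_prob("a"))      # (10 + 1) / (35 + 4)
print(laplace_prob("<unk>"))  # (0 + 1) / (35 + 4)

total = sum(laplace_prob(w) for w in vocab)
print(total)
```

Note that the first argument is a raw count, not a relative frequency. Passing a value that is already normalized would shrink every smoothed probability toward `1 / (N + V)`.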

In [None]:
# Laplace function

def laplace(count, total_count, vocab_size):
    if total_count == 0:
        return 1 / (vocab_size+1)
    else:
        return (count + 1) / (total_count + vocab_size)

Using your Laplace smoothing function, compute the Markov likelihood and perplexity of the following sentences given your unigram, bigram, and trigram models.

**Extra challenge**:<br>
**To get the perplexity, raise each product to $-1/N$ where $N$ is the sample size**

$$ Perplexity = \left(\prod_{i=1}^{N}{p(w_i|...)}\right)^{-1/N}$$

- $N$ can be computed from FreqDist or ConditionalFreqDist objects as `.N()`

You can find more info here: https://www.nltk.org/api/nltk.probability.FreqDist.html
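The perplexity formula can be sketched in log space, which avoids underflow when many probabilities are multiplied. In this sketch $N$ is taken to be the number of probabilities being scored (an assumption of the example), and the input values are invented for illustration. The function name `perplexity` is hypothetical.

```python
import math


def perplexity(probs):
    """(product of probs) ** (-1/N), computed as exp(-mean(log p))."""
    n = len(probs)  # assumption: N = number of scored transitions
    return math.exp(-sum(math.log(p) for p in probs) / n)


# If every step has probability 1/4, the model is "as confused as"
# a uniform choice among 4 options, so the perplexity is about 4.
uniform = [0.25] * 4
print(perplexity(uniform))

# Hypothetical, uneven transition probabilities
mixed = [0.5, 0.1, 0.2, 0.4]
print(perplexity(mixed))
```

Lower perplexity means the model found the sentence less surprising, so values can be compared across models as long as they are scored on the same tokens.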

In [None]:
# The company declared bankruptcy
sentences = ["The company declared bankruptcy"]

## unigrams (enter responses below)
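def unigram(sentences, lst, fd):
    """Print the Laplace-smoothed unigram perplexity of each sentence."""
    for sentence in sentences:
        # Pad with boundary symbols and lowercase the words
        sentence_tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        uni_prob = 1
        for token in sentence_tokens:
            # args: token count, total training tokens, number of word types
            uni_prob *= laplace(fd[token], len(lst), len(fd))
        uni_perplexity = pow(uni_prob, -1 / len(sentence_tokens))
        print("Sentence:", sentence)
        print("Unigram Perplexity:", uni_perplexity)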

### Chat
unigram(sentences,chat_uni_list,chat_uni_fd)

### Brown
unigram(sentences,brown_uni_list,brown_uni_fd)

### Treebank
unigram(sentences,treebank_uni_list,treebank_uni_fd)



## bigram (enter responses below)
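def bigram(sentences, fd):
    """Print the smoothed bigram perplexity of each sentence."""
    for sentence in sentences:
        # Pad with boundary symbols and lowercase the words
        sentence_tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        bi_prob = 1
        for t1, t2 in zip(sentence_tokens, sentence_tokens[1:]):
            # args: relative frequency of t2 after t1, number of tokens
            # seen after t1, number of distinct types seen after t1
            bi_prob *= laplace(fd[t1].freq(t2), fd[t1].N(), len(fd[t1]))
        # One transition per adjacent pair, hence len - 1
        bi_perplexity = pow(bi_prob, -1 / (len(sentence_tokens) - 1))
        print("Sentence:", sentence)
        print("Bigram Perplexity:", bi_perplexity)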

### Chat
bigram(sentences,chat_bi_cfd)

### Brown
bigram(sentences,brown_bi_cfd)

### Treebank
bigram(sentences,treebank_bi_cfd)



## trigram (enter responses below)
def trigram(sentences, fd):
    for sentence in sentences:
        sentence_tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        tri_prob = 1
        # Pair each two-word context (t1, t2) with the word t3 that follows it
        for (t1, t2), t3 in zip(zip(sentence_tokens, sentence_tokens[1:]), sentence_tokens[2:]):
            tri_prob *= laplace(fd[(t1, t2)].freq(t3), fd[(t1, t2)].N(), len(fd[(t1, t2)]))
        # One transition per three-word window, hence len - 2
        tri_perplexity = pow(tri_prob, -1 / (len(sentence_tokens) - 2))

### Chat
trigram(sentences,chat_tri_cfd)
### Brown
trigram(sentences,brown_tri_cfd)

### Treebank
trigram(sentences,treebank_tri_cfd)


Sentence: The company declared bankruptcy
Unigram Perplexity: 55897.705070214455
Sentence: The company declared bankruptcy
Unigram Perplexity: 27061.909561003424
Sentence: The company declared bankruptcy
Unigram Perplexity: 5663.673674925953
Sentence: The company declared bankruptcy
Bigram Perplexity: 15.36967956951486
Sentence: The company declared bankruptcy
Bigram Perplexity: 149.6362040720315
Sentence: The company declared bankruptcy
Bigram Perplexity: 53.80531025243895
