# Language models

## What is a language model?

A **language model** learns to predict the probability of a sequence of words. 
Why do we need to learn the probability of words? 

Which sentence is more probable: 
"The small is cat" or "The cat is small"? Why?

## Why do we need language models?



- Machine Translation

        Les Grandes Espérances 
            - Great Expectations
            - Tall Expectations


- Spell Correction

        Choose the most probable word according to the context, using a language model

          - I love movies directed by Lynch.
          - I love movies directed by Lunch.



<img src="img/Lunch.png" width ="600">


- Speech Recognition
      - Choose the most probable utterance among ones that are phonetically similar
     
          Say all of these out loud, quickly:

          - wreck a nice beach
          - reckon ice beach
          - recognize speech

          Which one is more probable to occur in English?
     
        



## Types of language models

- **Statistical Language Models**
(traditional statistical techniques like N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words)

- **Neural Language Models**
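(neural networks that learn the probability distribution of words from data; newer and usually more accurate than statistical models)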

## N-Gram language model

Let’s understand N-grams with an example. Consider the following sentence:

“I enjoy reading books about data science”

A **1-gram (or unigram)** is a one-word sequence. For the above sentence, the unigrams would simply be: “I”, “enjoy”, “reading”, “books”, “about”, “data”, “science”.

A **2-gram (or bigram)** is a two-word sequence, like “I enjoy” or “enjoy reading”.
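As a minimal sketch of these definitions, the helper below slides a window of length `n` over a whitespace-tokenized sentence. The name `extract_ngrams` is ours (not part of any library), and plain `split()` tokenization is assumed.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples (hypothetical helper)."""
    # zip over n shifted copies of the token list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "I enjoy reading books about data science"
tokens = sentence.split()

unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)

print(unigrams[:3])  # first three unigrams
print(bigrams[:3])   # first three bigrams
```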




1 - **REFLECT AND REPLY**: What is a **3-gram (or trigram)**? Please give an example from the sentence above. 

### Write your answer here

An N-gram language model predicts the probability of a given N-gram occurring within any sequence of words in the language. With a good N-gram model, we can predict **p(w | h)**: the probability of seeing the word **w** given a history of previous words **h**, where the history contains n-1 words.

To build an N-Gram model we have to estimate this probability. 
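One common way to estimate this probability is by relative frequency: count how often the history is followed by the word, and divide by how often the history occurs as a context. The sketch below does this for bigrams on a tiny made-up corpus; the corpus and the function name `bigram_prob` are illustrative assumptions.

```python
from collections import Counter

# Toy corpus, invented for illustration only
toy_tokens = "the cat is small . the dog is small . the cat is here .".split()

bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
# Count each word as a history, i.e. only where it is followed by another word
history_counts = Counter(toy_tokens[:-1])

def bigram_prob(word, history):
    """Relative-frequency estimate of p(word | history) for a one-word history."""
    if history_counts[history] == 0:
        return 0.0
    return bigram_counts[(history, word)] / history_counts[history]

print(bigram_prob("cat", "the"))    # 2 of the 3 occurrences of "the" are followed by "cat"
print(bigram_prob("small", "is"))   # 2 of the 3 occurrences of "is" are followed by "small"
```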

2 - **REFLECT AND REPLY:** How many n-grams are in a sentence with **m** unigrams? (n>1)

### Write your answer here

**If you fancy:** watch [this video series](https://youtu.be/Saq1QagC8KY) by Dan Jurafsky on Language Modeling (4.1-4.8).


## Google NGRAMS 

Google Books Ngram Viewer is a tool that searches for n-grams in the books digitized by Google and plots their frequency over time:


https://books.google.com/ngrams

## Markov chain

How do we compute the probability?

1. Apply the chain rule of probability
2. Apply a very strong simplifying assumption that allows us to compute $p(w_1 \dots w_s)$ easily


The assumption that the probability of a word depends only on the previous word is called a **Markov assumption**.
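Written out for a sentence of words $w_1 \dots w_s$, the two steps look roughly as follows (a sketch of the standard derivation for the bigram case):

```latex
\begin{aligned}
p(w_1 \dots w_s)
  &= p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_s \mid w_1 \dots w_{s-1})
  && \text{(chain rule)} \\
  &\approx p(w_1) \prod_{i=2}^{s} p(w_i \mid w_{i-1})
  && \text{(Markov assumption)}
\end{aligned}
```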

In a Markov chain:
- each event in the sequence comes from a set of outcomes that depend on one another;
- each outcome determines which outcomes are likely to occur next;
- the most recent event contains all the information we need to predict the next event.

When we talk about language it means that:
- each word in a sentence comes from a set of words each depending on one another;
- each word determines which word is likely to be the next one in a sentence;
- the most recent word contains all the information we need to predict the next word.

### Markov Assumption for Bigram language model:
<img src="img/bigram.png" width ="300">

Let's do an exercise:


In [None]:
import nltk
nltk.download('gutenberg')
print(nltk.corpus.gutenberg.fileids())
text = nltk.corpus.gutenberg.open("shakespeare-hamlet.txt").read()

In [None]:
corpus = text.split()

3 - **CODEIT** Do you remember from the first session which NLTK function we can use instead of `split` to extract tokens from text?

In [None]:
import nltk
nltk.download('punkt') #run this line if you haven't before
# Insert the function to tokenize the text
corpus = nltk.tokenize.word_tokenize(text)

Let's get word pairs (to avoid filling the memory we will use a generator that yields word pairs for further processing without storing them):

In [None]:
def make_pairs(corpus):
    for i in range(len(corpus)-1):
        yield (corpus[i], corpus[i+1])
        
pairs = make_pairs(corpus)

In [None]:
pairs
#now pairs is a bigram generator that creates the next bigram each time it's called

Let's now create an empty dictionary and fill it with our word pairs:

In [None]:
word_dict = {}
for word_1, word_2 in pairs:
    if word_1 in word_dict:
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]

In [None]:
word_dict["King"]
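Because every follower is stored once per occurrence, picking uniformly from one of these lists is the same as sampling from the relative-frequency bigram distribution. The sketch below makes that explicit on a small hand-made dictionary shaped like `word_dict`; the toy entries and the helper name `follower_probs` are assumptions for illustration.

```python
from collections import Counter

# Hand-made stand-in for word_dict (not taken from Hamlet)
toy_word_dict = {"the": ["king", "queen", "king", "ghost"], "king": ["is", "is"]}

def follower_probs(followers):
    """Turn a list of observed followers into relative frequencies."""
    counts = Counter(followers)
    total = len(followers)
    return {word: count / total for word, count in counts.items()}

print(follower_probs(toy_word_dict["the"]))
```

Applied to the real dictionary, `follower_probs(word_dict["King"])` would give the estimated distribution over the words that follow "King" in the text.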

Now let's choose a random word to start off the chain, and specify the number of words the chain will simulate:

In [None]:
import numpy as np
first_word = np.random.choice(corpus)
chain = [first_word]
n_words = 30

As for the next words, they will be sampled randomly from the list of words which followed that first word in the texts we have supplied:


In [None]:
for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))

Last step: print out the generated text! 

In [None]:
generated_text = ' '.join(chain)
print(generated_text)

## Limitations of the n-gram model:

- A higher N gives the model more context, but the number of distinct n-grams to count and store grows quickly with N, which requires much more memory (RAM) and computation;
- N-grams are a sparse representation of language. Because the model is built from counts of word sequences seen in training, it assigns zero probability to every word or n-gram that does not occur in the training corpus.
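To make the sparsity point concrete, the sketch below counts how many of the possible bigrams over a vocabulary actually occur in a tiny invented corpus. Any bigram that never occurs gets a count of zero, and so a relative-frequency probability of zero, even if it is perfectly good English.

```python
from collections import Counter

# Tiny invented corpus, for illustration only
sparse_tokens = "the cat is small and the dog is big".split()
vocab = sorted(set(sparse_tokens))

observed = Counter(zip(sparse_tokens, sparse_tokens[1:]))
possible = len(vocab) ** 2  # every ordered pair of vocabulary words

print(f"{len(observed)} of {possible} possible bigrams were observed")

# "the cat" was seen; "the small" never was, so its count is zero
print(observed[("the", "cat")], observed[("the", "small")])
```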

## Train an Ngram Language model using nltk.lm library

The following code loads the text of Donald Trump's inaugural speech from the NLTK corpus library and builds a 3-gram language model by maximum likelihood estimation (MLE) on the training text.

It adds start-of-sentence and end-of-sentence tokens to the beginning and end of each sentence and then creates all n-grams with $n \le 3$.
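As a rough pure-Python sketch of what this preprocessing produces (not the NLTK implementation itself), the snippet below pads one tokenized sentence with `n-1` start and end symbols and lists every n-gram up to length `n`. The helper names are ours; in the notebook this work is done by `padded_everygram_pipeline`.

```python
def pad_sentence(tokens, n):
    """Add n-1 start and end symbols around a tokenized sentence."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

def all_ngrams_up_to(tokens, n):
    """Collect every contiguous k-gram for k = 1..n."""
    grams = []
    for k in range(1, n + 1):
        grams.extend(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return grams

padded = pad_sentence(["thank", "you"], n=3)
print(padded)
print(all_ngrams_up_to(padded, 3)[-4:])  # the trigrams of the padded sentence
```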

We kept 5 percent of the sentences in the corpus for computing **perplexity** to evaluate our language model intrinsically.
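Perplexity is the inverse probability of the test data, normalized by the number of predictions; lower is better. The sketch below computes it by hand from a list of made-up conditional probabilities, using base-2 logarithms. The function name `perplexity` and the numbers are assumptions for illustration, not output of the model trained here, and the sketch assumes every probability is greater than zero.

```python
import math

def perplexity(probabilities):
    """Perplexity = 2 ** (average negative log2 probability); all values must be > 0."""
    cross_entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy

# A model that is as uncertain as a uniform choice among 4 words at every step
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A more confident model gets a lower perplexity
print(perplexity([0.5, 0.5, 0.25, 0.5]))
```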

In [None]:
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import sent_tokenize, word_tokenize

from numpy import random
n = 3
from nltk.corpus import inaugural
nltk.download('inaugural')

text = nltk.corpus.inaugural.open("2017-Trump.txt").read()


tokenized_text = [list(map(str.lower, word_tokenize(sent))) for sent in sent_tokenize(text)]
random.shuffle(tokenized_text)
# separate into training (95%) and test (5%) without dropping a sentence in between
split = int(0.95 * len(tokenized_text))
tokenized_text_train = tokenized_text[:split]
tokenized_text_test = tokenized_text[split:]

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text_train)

from nltk.lm import MLE
model = MLE(n) # Let's train a 3-gram model; n was set to 3 above
model.fit(train_data, padded_sents)

4 - **REFLECT AND REPLY:** 

Could you explain why the text was shuffled before it was split into train and test sets? (Hint: take a look at the text)


This text source is quite short, and at the end of the speech there are several sentences in which the speaker says thanks. These last sentences all follow the same pattern, so without shuffling they would all land in the test set and the language model would never see them in the training data. It is therefore better to shuffle the sentences so that the training data contains patterns from both the beginning and the end of the document.

## Write your answer here


In [None]:
print(model.vocab.lookup(tokenized_text[0]))

The following code gets the counts of n-grams from the model.

In [None]:
print(model.counts['united']) # i.e. Count('united')
model.counts[['united']]['states'] # i.e. Count('states' | 'united')


5 - **CODEIT** Print the count of a trigram which may or may not occur in the text.

In [None]:
#Insert Code here 
model.counts[['united','states']]["of"]

6 - **CODEIT** The character sequence which represents the start of a sentence is `<s>` in the nltk language model library.

Write code to print the score (probability) of the pronoun "I" occurring at the beginning of a sentence.

**NOTE**: the text is lower-cased in the preprocessing steps. 

In [None]:
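#Insert code here
# With n=3 each sentence is padded with two '<s>' tokens, so the context ['<s>']
# also counts the '<s> <s>' pair; ['<s>', '<s>'] conditions on the sentence start only.
print(model.score('i',['<s>']))
print(model.score('i',['<s>', '<s>']))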

In [None]:
print(model.score('strong', ["make" ,"america"]))
print(model.score('states', ["united"]))


7 - **REFLECT AND REPLY** 

Write the probability that each of the two statements above represents in terms of $P(W_1|W_2,...)$

### write your answer here

The following function generates a sentence based on your language model

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
import random
detokenize = TreebankWordDetokenizer().detokenize

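def generate_sent(model, num_words, text_seed=""):
    # model.generate expects the seed as a list of tokens; a raw string such as
    # "<s>" would be split into single characters
    if text_seed == "":
        seed_tokens = ["<s>"]
        content = []
    else:
        seed_tokens = text_seed.split()
        content = list(seed_tokens)

    for token in model.generate(num_words, text_seed=seed_tokens, random_seed=random.Random()):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)

    return detokenize(content)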

In [None]:
print(generate_sent(model,10,"i"))
print(tokenized_text_test)

In [None]:
test_data, padded_sents_test = padded_everygram_pipeline(n, tokenized_text_test)
for i,test in enumerate(test_data):
    print("Perplexity for <"," ".join(tokenized_text_test[i]),"> is: ", model.perplexity(test))


8 - **REFLECT AND REPLY** What is the perplexity value of the language model on the test sentences? Why?

## write your answer here

9 - **CODEIT** Instead of the maximum likelihood estimation implemented in `nltk.lm.MLE`, train a language model using the Laplace model class implemented in `nltk.lm.Laplace`.
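For orientation, Laplace (add-one) smoothing adds 1 to every n-gram count and adds the vocabulary size to the denominator, so unseen n-grams get a small non-zero probability. The toy sketch below shows the bigram case on an invented corpus; it illustrates the formula and is not the `nltk.lm.Laplace` implementation.

```python
from collections import Counter

# Invented toy corpus, for illustration only
lap_tokens = "the cat is small . the dog is big .".split()
lap_vocab = set(lap_tokens)

pair_counts = Counter(zip(lap_tokens, lap_tokens[1:]))
context_counts = Counter(lap_tokens[:-1])

def laplace_prob(word, history):
    """Add-one estimate: (count(history, word) + 1) / (count(history) + |V|)."""
    return (pair_counts[(history, word)] + 1) / (context_counts[history] + len(lap_vocab))

print(laplace_prob("cat", "the"))    # seen bigram
print(laplace_prob("small", "the"))  # unseen bigram, still greater than zero
```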

In [None]:
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import sent_tokenize, word_tokenize

from numpy import random
n = 3
from nltk.corpus import inaugural
nltk.download('inaugural')

text = nltk.corpus.inaugural.open("2017-Trump.txt").read()



tokenized_text = [list(map(str.lower, word_tokenize(sent))) for sent in sent_tokenize(text)]
random.shuffle(tokenized_text)
# separate into training (95%) and test (5%) without dropping a sentence in between
split = int(0.95 * len(tokenized_text))
tokenized_text_train = tokenized_text[:split]
tokenized_text_test = tokenized_text[split:]

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text_train)

from nltk.lm import Laplace
model = Laplace(n) # Let's train a 3-gram model; n was set to 3 above
model.fit(train_data, padded_sents)


test_data, padded_sents_test = padded_everygram_pipeline(n, tokenized_text_test)
for i,test in enumerate(test_data):
    print("Perplexity for <"," ".join(tokenized_text_test[i]),"> is: ", model.perplexity(test))


10 - **REFLECT AND REPLY** Print the perplexities for the test sentences and reflect on how they differ from the perplexities of the MLE model.

### Write your answer here

## Neural Language Models



**GPT-2** is an unsupervised transformer language model and the successor to GPT. GPT-2 was first announced in February 2019, with only limited demonstrative versions initially released to the public. The full version of GPT-2 was not immediately released out of concern over potential misuse, including applications for writing fake news.

**GPT-3**: https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/

The following article was written by a GPT-3 model (Impressive or meh?!): https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3



------

11 - **Homework** The previous models used n=3, i.e. n-grams where $n \le 3$. Train a Laplace language model with n-grams where $n \le 4$. Then compare the perplexities of the two models and write your reflection on the difference.


The intuition is that when higher-order n-grams are considered, the language model should improve and thus perplexity should be lower. However, with higher-order n-grams the risk of overfitting also increases, meaning that the model learns very specific sequences from the training data and loses the ability to generalize.
 



**If you fancy**, you can train your language models on any text in any language.

However, make sure you use the same resource for both models whose perplexity you are comparing.