<a href="https://colab.research.google.com/github/rguntz/Projects_for_semester_application/blob/main/Copie_de_Tutorial1_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color=violet>A Simple Language Model</font>

In this notebook we build a very simple n-gram language model with ```nltk```, a Python library for natural language processing (NLP). The goal is to get familiar with string probabilities and text generation.



In [None]:
import nltk
from nltk import bigrams, trigrams, WhitespaceTokenizer
from nltk.probability import FreqDist
from nltk.corpus import gutenberg

## <font color=violet>Corpus</font>

We use a text from the **Gutenberg corpus**, a small selection of literary texts from Project Gutenberg that is available through ```nltk```.

In [None]:
# download the corpus data
nltk.download('gutenberg')
nltk.download('punkt')  # not required by the WhitespaceTokenizer used below
corpus = gutenberg.raw('carroll-alice.txt')  # raw text of Alice's Adventures in Wonderland

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
print(corpus[:391])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?


## <font color=violet>Tokenization</font>

Tokenization **splits the text into units** (e.g., words). We tokenize with a whitespace tokenizer, which splits the input at whitespace characters (spaces, tabs, and newlines).

In [None]:
text = "This is a sample sentence to tokenize."

tokenizer = WhitespaceTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'a', 'sample', 'sentence', 'to', 'tokenize.']

In [None]:
# tokenize the lowercased corpus
raw_tokens = tokenizer.tokenize(corpus.lower())

In [None]:
# append an <EOS> marker after every token that ends with '.', '?' or '!'
# this "separates" sentences from each other; punctuation stays attached to the token
tokens = []
for token in raw_tokens:
    tokens.append(token)
    if token.endswith(('?', '.', '!')):
        tokens.append('<EOS>')
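
To see what this preprocessing does on a small input, here is a minimal sketch. It assumes that Python's built-in `str.split()` is a fair stand-in for the whitespace tokenizer on this input; the helper name `add_eos` and the sample sentence are hypothetical and not part of the notebook.

```python
def add_eos(raw_tokens, eos="<EOS>"):
    """Append an end-of-sentence marker after tokens ending in '.', '?' or '!'."""
    tokens = []
    for token in raw_tokens:
        tokens.append(token)
        if token.endswith((".", "?", "!")):
            tokens.append(eos)
    return tokens

# Hypothetical sample text, lowercased and split at whitespace.
sample = "Alice was tired. 'Why not?' she said!"
sample_tokens = add_eos(sample.lower().split())
print(sample_tokens)
```

Two things are worth noticing in the result: punctuation stays glued to the word (`tired.` is a different token from `tired`), and a token such as `not?'` gets no marker because its last character is the closing quote.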

## <font color=violet>String Probabilities with n-grams</font>

We create a unigram model, a **frequency-based language model** that records how often each individual token appears in the text. We do the same for bigrams (pairs of consecutive tokens) and trigrams (triples of consecutive tokens).

In [None]:
# create unigrams, bigrams and trigrams
unigram_model = FreqDist(tokens)
bigram_model = FreqDist(bigrams(tokens))
trigram_model = FreqDist(trigrams(tokens))
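
The counting step can be illustrated on a toy token list. The sketch below uses `collections.Counter` and `zip` instead of `FreqDist`, `bigrams` and `trigrams`, so it runs without any downloads; the toy sentence and the `toy_*` names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized at whitespace.
toy_tokens = "the cat sat on the mat . the cat slept .".split()

# Sliding windows of length 1, 2 and 3 over the token list.
toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))
toy_trigrams = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))

print(toy_unigrams.most_common(1))
print(toy_bigrams.most_common(1))
print(len(toy_tokens), sum(toy_bigrams.values()), sum(toy_trigrams.values()))
```

A list of n tokens yields n - 1 bigrams and n - 2 trigrams, because each window starts one position later than the previous one.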

In [None]:
print("Top 5 most common words in the unigram model:")
for word, count in unigram_model.most_common(5):
    print(f"{word}: {count}")

Top 5 most common words in the unigram model:
the: 1603
<EOS>: 958
and: 766
to: 706
a: 614


In [None]:
print("Top 5 most common bigrams:")
for bigram, count in bigram_model.most_common(5):
    print(f"{bigram}: {count}")

Top 5 most common bigrams:
('said', 'the'): 207
('of', 'the'): 128
('in', 'a'): 97
('in', 'the'): 78
('and', 'the'): 77


In [None]:
print("Top 5 most common trigrams:")
for trigram, count in trigram_model.most_common(5):
    print(f"{trigram}: {count}")

Top 5 most common trigrams:
('*', '*', '*'): 54
('said', 'alice.', '<EOS>'): 33
('the', 'mock', 'turtle'): 31
('said', 'the', 'mock'): 19
('she', 'said', 'to'): 17


In [None]:
def unigram_probability(word):
    # relative frequency: count(word) / total number of tokens
    return unigram_model.freq(word)

# conditional probability P(word | prev_word) = count(prev_word, word) / count(prev_word)
def bigram_probability(prev_word, word):
    # no prediction across a sentence boundary or after an unseen word
    if (prev_word == '<EOS>') or (prev_word not in unigram_model):
        return 0
    return bigram_model[prev_word, word] / unigram_model[prev_word]

# conditional probability P(word | prev_word1, prev_word2)
def trigram_probability(prev_word1, prev_word2, word):
    # no prediction across a sentence boundary or after an unseen bigram
    if (prev_word2 == '<EOS>') or ((prev_word1, prev_word2) not in bigram_model):
        return 0
    return trigram_model[prev_word1, prev_word2, word] / bigram_model[(prev_word1, prev_word2)]
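
As a small sanity check of the count-ratio idea, the following sketch computes conditional bigram probabilities on a toy token list and verifies that they sum to 1 over all possible next words. It is a simplified stand-in built on `collections.Counter`, not the notebook's functions; the toy sentence and the `toy_*` names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized at whitespace.
toy_tokens = "the cat sat on the mat . the cat slept .".split()
toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

def toy_bigram_probability(prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    if toy_unigrams[prev_word] == 0:
        return 0.0
    return toy_bigrams[(prev_word, word)] / toy_unigrams[prev_word]

p_cat = toy_bigram_probability("the", "cat")  # "the" occurs 3 times, "the cat" twice
p_mat = toy_bigram_probability("the", "mat")  # "the mat" occurs once
total = sum(toy_bigram_probability("the", w) for w in toy_unigrams)
print(p_cat, p_mat, total)
```

Here "the" is followed by "cat" in two of its three occurrences and by "mat" in the remaining one, so the two estimates are 2/3 and 1/3.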

In [None]:
# examples
print(f"Unigram Probability of 'alice': {unigram_probability('alice'):.4f}")
print(f"Conditional probability p('said'|'alice'): {bigram_probability('alice', 'said'):.4f}")
print(f"Conditional probability p('king'|'the'): {bigram_probability('the', 'king'):.4f}")
print(f"Conditional probability p('said'|'the king'): {trigram_probability('the', 'king', 'said'):.4f}")

Unigram Probability of 'alice': 0.0081
Conditional probability p('said'|'alice'): 0.0407
Conditional probability p('king'|'the'): 0.0175
Conditional probability p('said'|'the king'): 0.1786


We now define a function that computes the probability of a sequence of words, forming all or part of a sentence, under the bigram model.

$$
p(w_1, w_2, \dots, w_n) = p(w_1) \prod_{i=2}^{n} p(w_i \mid w_{i-1})
$$

Note that we set $p(w_1) = 1$ for simplicity, so the result is the probability of the remaining words given the first word.

In [None]:
def string_probability(string, tokenizer):
    # probability of the string under the bigram model, with p(w_1) = 1
    tokens = tokenizer.tokenize(string.lower())  # the corpus was lowercased too
    prob = 1.0
    for i in range(len(tokens) - 1):
        prob *= bigram_probability(tokens[i], tokens[i+1])
    return prob
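
Because every factor is at most 1, the product shrinks quickly for longer strings. A common workaround is to add log probabilities instead of multiplying probabilities. The sketch below is one possible variant, not part of the notebook: `log_string_probability`, `toy_probs` and `toy_cond_prob` are hypothetical names, and the toy probabilities are made up rather than estimated from the corpus.

```python
import math

def log_string_probability(tokens, cond_prob):
    """Sum of log p(w_i | w_{i-1}) over consecutive token pairs, with p(w_1) = 1."""
    log_prob = 0.0
    for prev_word, word in zip(tokens, tokens[1:]):
        p = cond_prob(prev_word, word)
        if p == 0:
            return float("-inf")  # one unseen bigram makes the whole product zero
        log_prob += math.log(p)
    return log_prob

# Hypothetical toy probabilities, not taken from the corpus.
toy_probs = {("you", "are"): 0.5, ("are", "the"): 0.25}

def toy_cond_prob(prev_word, word):
    return toy_probs.get((prev_word, word), 0.0)

seen = log_string_probability(["you", "are", "the"], toy_cond_prob)
unseen = log_string_probability(["the", "are", "you"], toy_cond_prob)
print(math.exp(seen), unseen)
```

In the notebook, `bigram_probability` could be passed as `cond_prob` after tokenizing the input string.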

In [None]:
input_string = "you are"
print(f"Bigram Probability of '{input_string}': {string_probability(input_string, tokenizer):.6f}")
input_string = "you are the"
print(f"Bigram Probability of '{input_string}': {string_probability(input_string, tokenizer):.6f}")
input_string = "you are the king"
print(f"Bigram Probability of '{input_string}': {string_probability(input_string, tokenizer):.6f}")

# try an ungrammatical word order
input_string = "king the you"
print(f"Bigram Probability of '{input_string}': {string_probability(input_string, tokenizer):.6f}")

Bigram Probability of 'you are': 0.011364
Bigram Probability of 'you are the': 0.000284
Bigram Probability of 'you are the king': 0.000005
Bigram Probability of 'king the you': 0.000000
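
As a worked check of these values, each output is the previous one multiplied by one more conditional probability. The factor $p(\text{the} \mid \text{are}) \approx 0.0250$ is inferred here by dividing two rounded outputs, so treat it as approximate; $p(\text{king} \mid \text{the}) \approx 0.0175$ is the value printed earlier.

$$
\begin{aligned}
p(\text{you are}) &= p(\text{are} \mid \text{you}) \approx 0.011364 \\
p(\text{you are the}) &= p(\text{are} \mid \text{you}) \, p(\text{the} \mid \text{are}) \approx 0.011364 \times 0.0250 \approx 0.000284 \\
p(\text{you are the king}) &= p(\text{you are the}) \, p(\text{king} \mid \text{the}) \approx 0.000284 \times 0.0175 \approx 0.000005
\end{aligned}
$$

Since every factor is at most 1, the probability can only stay the same or shrink as the string gets longer.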


## <font color=violet>Text Generation</font>

We generate text using a **bigram language model**. The model follows the bigram assumption and hence chooses each word **depending** solely **on the previous word**. Given a starting word, it iteratively selects the most probable next word according to the bigram probabilities (greedy selection).

In [None]:
def generate_text(starting_word, length=5):
    generated_text = [starting_word]
    current_word = starting_word

    for _ in range(length - 1):
        # greedy choice: the word with the highest p(word | current_word)
        next_word = max(unigram_model, key=lambda word: bigram_probability(current_word, word))
        generated_text.append(next_word)
        current_word = next_word

    return ' '.join(generated_text)

In [None]:
print("Generated Text:", generate_text('you', length=6))

Generated Text: you know what i don't know


In [None]:
print("Generated Text:", generate_text('the', length=20))

Generated Text: the mock turtle in a little thing i don't know what i don't know what i don't know what i


**Task**: What is the fundamental issue with this model? *Hint:* Try generating longer text.
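
A possible starting point for experimenting with the task: instead of always taking the single most probable word, the sketch below draws the next word at random in proportion to its bigram count. This is a different selection strategy from the one used above, shown on a toy token list; `build_successors`, `sample_text` and the toy sentence are hypothetical and not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    successors = defaultdict(Counter)
    for prev_word, word in zip(tokens, tokens[1:]):
        successors[prev_word][word] += 1
    return successors

def sample_text(successors, starting_word, length=5, seed=None):
    """Generate text by sampling each next word in proportion to its bigram count."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    generated = [starting_word]
    current = starting_word
    for _ in range(length - 1):
        counts = successors.get(current)
        if not counts:
            break  # the current word was never followed by anything
        words = list(counts)
        weights = list(counts.values())
        current = rng.choices(words, weights=weights, k=1)[0]
        generated.append(current)
    return " ".join(generated)

# Hypothetical toy corpus, already tokenized at whitespace.
toy_tokens = "i know what i like and i like what i know".split()
toy_successors = build_successors(toy_tokens)
sampled = sample_text(toy_successors, "i", length=8, seed=0)
print(sampled)
```

Running it with different seeds and comparing the results with the output of `generate_text` is one way to see how the selection strategy affects the generated text.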