<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/02.nglm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/02.nglm.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# N-gram `Language Models (LMs)`

📝 SALP chapter 3

## 🤔 An intriguing question
- What will you `likely` fill in the blank below?
- **I dreamed ___________.**


## **Purpose of Language Models**
   - **Probability Assignment:** LMs assign a `probability` to `each possible next word, phrase, or sentence` based on `the context provided by the preceding words`.
     - computes $P(w|c)$, the `probability` of a word $w$ given some context $c$.
   - **Applications:** Used in various natural language processing (NLP) tasks like 
     - speech recognition, 
     - machine translation, 
     - text generation and autocomplete,
     - augmentative and alternative communication (AAC) systems, etc.

## **Types of Language Models**
   - **Unigram Model:**
     - **Description:** Assumes each word is `independent` of the previous words.
       - This assumption rarely holds in natural language.
     - **Example:** Probability of the sentence "I love NLP" is $P(\text{"I"}) \times P(\text{"love"}) \times P(\text{"NLP"})$.
     - **Limitation:** Ignores context, leading to unrealistic predictions.
   
   - **N-gram Model:**
     - **Description:** Considers the context of the previous $N-1$ words.
     - **Example:**
       - **Bigram Model (N=2):** $P(\text{"love"} | \text{"I"}) \times P(\text{"NLP"} | \text{"love"})$
       - **Trigram Model (N=3):** $P(\text{"NLP"} | \text{"I love"})$
     - **Usage:** 
       - N-gram statistics capture diverse linguistic phenomena, including: 
         - syntactic patterns, task-specific tendencies, and cultural preferences. 
       - These probabilities reflect: 
         - grammatical rules, domain-specific language use, and societal trends in word co-occurrences.
     - **Padding**: Longer n-grams need extra context markers at the start and end of each sentence.
       - e.g., with trigrams, two special markers `<s>` are placed before the first word so that `P("I" | <s> <s>)` can be computed.
     - **Limitation:** The model might struggle with `rare or unseen` word sequences due to `data sparsity`.

   - **Neural Language Models:**
     - **Recurrent Neural Networks (RNNs):**
       - **Description:** Uses a feedback loop to take into account all previous words, maintaining context across varying lengths.
       - **Example:** Predicts the next word in a sequence like "I love" by using hidden states that summarize previous words.
       - **Limitation:** Can suffer from issues like vanishing gradients, especially with long dependencies.
     
     - **Transformers:**
       - **Description:** Uses attention mechanisms to capture long-range dependencies without the sequential nature of RNNs.
       - **Example:** Models like GPT and BERT use transformers to predict the next word (GPT) or to fill in masked words in a sentence (BERT).
       - **Advantage:** Handles long-term dependencies more effectively and is highly parallelizable.
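
A minimal sketch of the padding step described under the n-gram model above, assuming whitespace-tokenized input. The helper names `pad_sentence` and `extract_ngrams` are illustrative, and adding a single end marker `</s>` is an assumption:

```python
def pad_sentence(tokens, n):
    """Add n-1 start markers and one end marker around a tokenized sentence."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"]


def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) of the padded sentence."""
    padded = pad_sentence(tokens, n)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


trigrams = extract_ngrams(["i", "love", "nlp"], 3)
for gram in trigrams:
    print(gram)
```

The first trigram is `('<s>', '<s>', 'i')`, which is exactly the context needed to estimate the probability of the first word under a trigram model.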

## **Probability of a sentence**
- A sentence $W$ is a sequence of words appearing together in a specific order: $w_1w_2w_3\cdots w_n$
- The probability of a sentence is the `joint probability` of its words: $P(W)=P(w_1,w_2,w_3,\cdots, w_n)$
  - $P(\text{"I love NLP"}) = P(\text{"I"} \cap \text{"love"} \cap \text{"NLP"})$
- $P(W)$ can be calculated with the `Chain Rule of Probability`:
  - $P(w_1, w_2, \dots, w_n) = P(w_1) \times P(w_2 | w_1) \times P(w_3 | w_1, w_2) \times \dots \times P(w_n | w_1, w_2, \dots, w_{n-1})$
  - 🤔 How to calculate each term on the right side? 
  - 🤔 How to simplify this formula?

### Estimating N-gram Probabilities with Maximum Likelihood Estimation (MLE)

- **Maximum Likelihood Estimation (MLE)** estimates the probability of an n-gram from its `frequency` in a corpus,
  - by dividing the count of that n-gram by the count of its (n-1)-gram prefix (the preceding context).
  - The resulting parameter set of MLE `maximizes the likelihood` of the training set $T$ given the model $M$ (i.e., $P(T|M)$).
- **Unigram Probability:** The probability of a single word (unigram) =
  -  the frequency $C(w_i)$ of that word $w_i$ ÷ the total number $N$ of words in the corpus.
  -  $\displaystyle P(w_i) = \frac{C(w_i)}{N}$
- **Bigram Probability:** The probability of a word given the previous word (bigram) = 
  - the frequency $C(w_{i-1}, w_i)$ of the bigram $(w_{i-1}, w_i)$ ÷ the frequency $C(w_{i-1})$ of the first word $w_{i-1}$ in the bigram.
  - $\displaystyle P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{∑_{w} C(w_{i-1}, w)} = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$
  -  This ratio is called a `relative frequency`
- **Trigram Probability:** The probability of a word given the previous two words (trigram) =
  - the frequency $C(w_{i-2}, w_{i-1}, w_i)$ of the trigram $(w_{i-2}, w_{i-1}, w_i)$ ÷ the frequency $C(w_{i-2}, w_{i-1})$ of the first two words $(w_{i-2}, w_{i-1})$ in the trigram.
  - $\displaystyle P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$


### 🍎 **Example**
Given a simple corpus:

```
corpus = ["I love NLP", "I love coding", "NLP is fun", "I love Python", "Python is great"]
```

For the bigram "I love", the bigram count $C("I", "love")$ is 3 in this corpus, and the unigram count $C("I")$ is 3 as well. The bigram probability $P("love" \mid "I")$ is therefore:

$\displaystyle P("love" \mid "I") = \frac{C("I", "love")}{C("I")} = \frac{3}{3} = 1.0$

### 💡 **A Simple Python Demo**

In [5]:
import nltk
from collections import Counter
nltk.download('punkt')

# Small corpus
corpus = [
    "I love NLP",
    "I love coding",
    "NLP is fun",
    "I love Python",
    "Python is great",
]

# Tokenize sentences
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]
print(f"{tokenized_corpus=}")

# Calculate unigram and bigram counts
unigram_counts = Counter([word for sentence in tokenized_corpus for word in sentence])
bigram_counts = Counter([(sentence[i], sentence[i+1]) for sentence in tokenized_corpus for i in range(len(sentence)-1)])
trigram_counts = Counter([(sentence[i], sentence[i+1], sentence[i+2]) for sentence in tokenized_corpus for i in range(len(sentence)-2)])


# Total number of unigrams in the corpus
total_unigrams = sum(unigram_counts.values())

# Function to calculate unigram probability
def unigram_probability(word):
    return unigram_counts[word] / total_unigrams

# Function to calculate bigram probability
def bigram_probability(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# Function to calculate trigram probability
def trigram_probability(w1, w2, w3):
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(f"{unigram_counts=}")
print(f"{bigram_counts=}")
print(f"{trigram_counts=}")

# Example of unigram probability
word = "love"
print(f"P({word}) = {unigram_probability(word)}")

# Example of bigram probability
w1, w2 = "i", "love"
print(f"P({w2} | {w1}) = {bigram_probability(w1, w2)}")

# Example of trigram probability
w1, w2, w3 = "i", "love", "nlp"
print(f"P({w3} | {w1,w2}) = {trigram_probability(w1, w2, w3)}")

tokenized_corpus=[['i', 'love', 'nlp'], ['i', 'love', 'coding'], ['nlp', 'is', 'fun'], ['i', 'love', 'python'], ['python', 'is', 'great']]
unigram_counts=Counter({'i': 3, 'love': 3, 'nlp': 2, 'is': 2, 'python': 2, 'coding': 1, 'fun': 1, 'great': 1})
bigram_counts=Counter({('i', 'love'): 3, ('love', 'nlp'): 1, ('love', 'coding'): 1, ('nlp', 'is'): 1, ('is', 'fun'): 1, ('love', 'python'): 1, ('python', 'is'): 1, ('is', 'great'): 1})
trigram_counts=Counter({('i', 'love', 'nlp'): 1, ('i', 'love', 'coding'): 1, ('nlp', 'is', 'fun'): 1, ('i', 'love', 'python'): 1, ('python', 'is', 'great'): 1})
P(love) = 0.2
P(love | i) = 1.0
P(nlp | ('i', 'love')) = 0.3333333333333333


[nltk_data] Downloading package punkt to /home/qingshan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Find the probability of a sentence using n-grams
- Calculate the product of the conditional probabilities of each word given its preceding $n-1$ words. 
- This process involves breaking down the sentence into a series of `n-grams` and 
  - then using their respective probabilities to compute the `overall sentence probability` $P(W)$.
- **Unigram Model:**
  - $P(W)$ = the product of the probabilities of each individual word.
  - $P(W) = P(w_1) \times P(w_2) \times \dots \times P(w_T)$
- **Bigram Model:** 
  - $P(W)$ = the product of the probabilities of each word given the previous word.
  - $P(W) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \dots \times P(w_T \mid w_{T-1})$
- **Trigram Model:** 
  - $P(W)$ = the product of the probabilities of each word given the previous two words.
  - $P(W) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \dots \times P(w_T \mid w_{T-2}, w_{T-1})$
- ⚠️ These are `approximations` to probabilities of sentences.
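
Why these are approximations: each formula replaces the full history in the chain rule with only the most recent words (the Markov assumption). A sketch for a general N-gram model, written in the notation used above:

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
\quad\Longrightarrow\quad
P(W) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
$$

For the first few words the context reaches before the start of the sentence, so it is either truncated (as in the formulas above, where $w_1$ is scored by its unigram probability) or filled with `<s>` padding markers.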

#### 🍎 Example
Let's assume a simple corpus and bigram model:

```
corpus = ["I love NLP", "I love coding", "NLP is fun", "I love Python", "Python is great"]
```

We want to calculate the probability of the sentence "I love NLP" using the bigram model.

### **Step-by-Step Calculation**
1. **Break down the sentence into bigrams:**
   - Bigrams: ("I", "love"), ("love", "NLP")

2. **Compute the sentence probability:**
   $P(W) = P("I") \times P("love" \mid "I") \times P("NLP" \mid "love")$

   Assume we have the following probabilities:
   - $P("I") = 0.2$
   - $P("love" \mid "I") = 1.0$
   - $P("NLP" \mid "love") = 0.33$

   Then the sentence probability:
   $P(W) = 0.2 \times 1.0 \times 0.33 \approx 0.066$

In [11]:
import nltk
from collections import Counter
nltk.download('punkt')

# Small corpus
corpus = [
    "I love NLP",
    "I love coding",
    "NLP is fun",
    "I love Python",
    "Python is great",
]

# Tokenize sentences
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]

# Calculate unigram and bigram counts
unigram_counts = Counter([word for sentence in tokenized_corpus for word in sentence])
bigram_counts = Counter([(sentence[i], sentence[i+1]) for sentence in tokenized_corpus for i in range(len(sentence)-1)])

# Total number of unigrams in the corpus
total_unigrams = sum(unigram_counts.values())

# Functions to calculate probabilities
def unigram_probability(word):
    return unigram_counts[word] / total_unigrams

def bigram_probability(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# Function to calculate sentence probability using bigrams
def sentence_probability(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    prob = unigram_probability(tokens[0])  # Start with unigram probability of the first word
    for i in range(len(tokens) - 1):
        prob *= bigram_probability(tokens[i], tokens[i+1])
    return prob

print(f"{unigram_counts=}")
print(f"{bigram_counts=}")

# Example sentence
sentence = "I love NLP"
prob = sentence_probability(sentence)
print(f"P('I') = {unigram_probability('i')}")
print(f"P('love' | 'I') = {bigram_probability('i', 'love')}")
print(f"P('NLP' | 'love') = {bigram_probability('love', 'nlp')}")
print(f"P('{sentence}') = {prob}")

unigram_counts=Counter({'i': 3, 'love': 3, 'nlp': 2, 'is': 2, 'python': 2, 'coding': 1, 'fun': 1, 'great': 1})
bigram_counts=Counter({('i', 'love'): 3, ('love', 'nlp'): 1, ('love', 'coding'): 1, ('nlp', 'is'): 1, ('is', 'fun'): 1, ('love', 'python'): 1, ('python', 'is'): 1, ('is', 'great'): 1})
P('I') = 0.2
P('love' | 'I') = 1.0
P('NLP' | 'love') = 0.3333333333333333
P('I love NLP') = 0.06666666666666667




## Log probability
- Log probability, widely used in NLP, is the logarithm of a probability value
  - If P is a probability, then log probability = ln(P)
    - e.g. For sentence "The quick brown fox" in a large corpus: 
      - Probability p: 0.0000001 
      - Log Probability ln(p): -16.11809565095832
      - In Python, the function `log` is used to calculate the natural log function `ln`
  - $\displaystyle \ln ∏_{i=1}^n P_i = ∑_{i=1}^n\ln P_i$ allows us to convert products of probabilities to sums of log probabilities.
- **Advantages:**
  - Numerical stability (avoid underflow) 
  - Computational efficiency (additions instead of multiplications)
  - Easier interpretation of very small probabilities
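
A small sketch of the underflow problem and how log probabilities avoid it. The per-word probability `1e-5` and the sentence length of 100 are made-up values chosen only to force underflow:

```python
import math

# Hypothetical per-word probabilities for a 100-word text
probs = [1e-5] * 100

# Multiplying directly: the true value 1e-500 is too small for a float
product = 1.0
for p in probs:
    product *= p

# Summing natural logs instead keeps the value in a comfortable range
log_prob = sum(math.log(p) for p in probs)

print(f"direct product: {product}")
print(f"sum of logs:    {log_prob:.2f}")
```

The direct product collapses to `0.0`, while the sum of logs stays a usable finite number (about -1151.29) that can still be compared across sentences.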

## **Evaluating Language Models**
- **Extrinsic Evaluation:**
  - Measures model performance `within an application`.
  - Determines if improvements help the specific task.
  - e.g., comparing two language models in speech recognition.
- **Intrinsic Evaluation:**
  - Measures model quality `independently of an application`.
  - Quick evaluation metric for potential improvements.


### **Importance of Evaluation Metrics**
- Running big NLP systems end-to-end can be costly.
- **Intrinsic Evaluation:**
  - Faster evaluation without running the entire application.
  - Example: `Perplexity` as a standard intrinsic metric.
- Used for n-gram models and sophisticated neural LMs.


### **Dataset Partitioning for Model Evaluation**
- **Three Data Sets Needed:**
  - **Training Set:** For learning model parameters.
  - **Development Set (Dev Set):** For tuning and evaluating model changes.
  - **Test Set:** For final unbiased evaluation of model performance.
- **Importance of Each Set:**
  - **Training Set:** Provides counts for probabilities in n-gram models.
  - **Test Set:** Reflects the target language for accurate evaluation.
  - **Dev Set:** Helps avoid overfitting to the test set.
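
One possible way to carve a corpus into the three sets is sketched below. The 80/10/10 proportions, the fixed seed, and the function name `split_corpus` are assumptions for illustration, not a prescribed recipe:

```python
import random


def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle sentences and split them into train, dev, and test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test


sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(sentences)
print(len(train), len(dev), len(test))
```

Splitting at the sentence (or document) level keeps the three sets disjoint, so no test sentence leaks into the counts gathered from the training set.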


### **Choosing Training and Test Sets**
- Test set should match the intended application of the model.
- Example: For chemistry lecture speech recognition, use chemistry lectures as the test set.
- **General Purpose Models:** Test set should be diverse and carefully selected.


### **Comparing Model Performance**
- **Comparing N-gram Models:**
  - Train both models on the training set.
  - Evaluate on the test set.
- **Higher Probability:** Indicates a better fit to the test set.
- Avoid training on the test set to prevent bias.


### **Avoiding Bias in Evaluation**
- **Training on the Test Set:**
  - Leads to artificially high probabilities.
  - Distorts evaluation metrics like perplexity.
- **Best Practice:**
  - Use the dev set for testing during model development.
  - Use the test set sparingly, only when the model is finalized.


### **Final Evaluation**
- **Dev Set:** Used for tuning the model.
- **Test Set:** Used for final evaluation.
- Ensure test set size is sufficient for statistical significance.
- Dev set should be similar to the test set for accurate evaluation.

## **Perplexity: A Measure of Language Model Quality**
- **Definition:**
  - Perplexity is a measurement of `how well a probability model predicts a sample`.
  - It is used to evaluate the performance of language models in NLP.
- **Key Idea:**
  - `Lower perplexity indicates a better model`, as it suggests the model is less "perplexed" by the data.


### **Mathematical Definition of Perplexity**
- **Formula for Perplexity:**
  - `Perplexity (PP)` of a model over a sequence $W$ of words $w_1, w_2, \dots, w_N$:
  - $\displaystyle \text{PP}(W) = P(W)^{-\frac{1}{N}} = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i | w_{1:i-1})}$
    - $=\displaystyle\sqrt[N]{∏_{i=1}^{N}\frac{1}{P(w_i)}}$ (for unigram model)
    - $=\displaystyle\sqrt[N]{∏_{i=1}^{N}\frac{1}{P(w_i|w_{i-1})}}$ (for bigram model)
    - $=\displaystyle\sqrt[N]{∏_{i=1}^{N}\frac{1}{P(w_i|w_{i-2},w_{i-1})}}$ (for trigram model)
- **Explanation:**
  - $P(W)$: Probability of the entire sequence.
  - $N$: Number of words in the sequence.
  - The lower the perplexity, the higher the probability the model assigns to the sequence.


### 🍎**Perplexity Example**
- **Given Sequence:** "I love NLP"
- **Assume:** Bigram Model with probabilities:
  - $P(I) = 0.1$
  - $P(love | I) = 0.3$
  - $P(NLP | love) = 0.5$
- **Calculate:**
  - $P(W) = 0.1 \times 0.3 \times 0.5 = 0.015$
  - $N = 3$
  - $\text{Perplexity} = (0.015)^{-\frac{1}{3}} \approx 4.05$
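
The calculation can be checked with a few lines of Python. This is a sketch that takes the per-word conditional probabilities as given; the function name `perplexity` is illustrative:

```python
import math


def perplexity(word_probs):
    """Perplexity from the conditional probability of each word in the sequence."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)


# P(I), P(love | I), P(NLP | love) from the example
pp = perplexity([0.1, 0.3, 0.5])
print(f"Perplexity: {pp:.2f}")  # Perplexity: 4.05
```

Working in log space here mirrors the $2^{-\frac{1}{N}\sum \log_2 P}$ form of the definition and avoids underflow on long sequences.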

### **Perplexity in Language Model Evaluation**
- **Intrinsic Metric:**
  - Perplexity is an intrinsic evaluation metric used to assess language model quality.
- **Comparison:**
  - Used to compare different language models on the same dataset.
- **Interpretation:**
  - A lower perplexity score indicates a better language model.

### **Perplexity and Language Models**
- **Perplexity** is influenced by both the text and the language model. 
  - Different n-gram models will have varying perplexities for the same text.
- **Purpose**: Perplexity allows us to compare the effectiveness of different n-gram models.
- 🍎 **Example**: 
  - Trained unigram, bigram, and trigram models on 38 million words from the Wall Street Journal (WSJ) with a 19,979 word vocabulary.
  - **Test Set**: Perplexity calculated on a 1.5 million word WSJ test set.
  - **Perplexity Results**:
    - Unigram: 962
    - Bigram: 170
    - Trigram: 109
  - **Interpretation**: Trigram models, which have more contextual information, are better predictors (lower perplexity) than unigrams and bigrams.

### **Important Considerations in Perplexity Evaluation**
- **Model Construction**: The n-gram model must be built without prior knowledge of the test set or its vocabulary to avoid artificially low perplexities.
- **Comparability**: Only comparable if models use identical vocabularies.
- **Intrinsic vs. Extrinsic Evaluation**:
  - **Intrinsic**: Perplexity improvement doesn't always translate to better performance in tasks like speech recognition.
  - **Extrinsic**: End-to-end evaluation on real tasks is recommended to confirm improvements.


### **Perplexity as Weighted Average Branching Factor**
- **Branching Factor**: The number of possible next words that can follow any given word in a language.
- **Perplexity Interpretation**: Perplexity can be seen as the weighted average of this branching factor.
  - For a language where all digits (0-9) are equally probable, the perplexity would equal 10.
  - $\displaystyle PP(W)=\sqrt[N]{∏_{i=1}^{N}\frac{1}{P(w_i)}}=\sqrt[10]{∏_{i=1}^{10}\frac{1}{1/10}} = \sqrt[10]{∏_{i=1}^{10}10} = 10$
- **Influence of Probability Distribution on Perplexity**
  - **Unequal Distribution**: If one digit (e.g., 0) is much more frequent, the perplexity decreases
    - as the next digit becomes more predictable.
  - 🍎 e.g.: With a test set dominated by the digit 0, the weighted branching factor is lower than 10
    - reflecting a decrease in perplexity due to predictability.
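
A sketch comparing the two digit languages with a unigram model. The skewed probabilities (0.91 for the digit 0, 0.01 for every other digit) and the ten-digit test strings are assumed values for illustration:

```python
import math


def perplexity(test_tokens, probs):
    """Unigram perplexity of a test sequence under the given distribution."""
    log_sum = sum(math.log2(probs[t]) for t in test_tokens)
    return 2 ** (-log_sum / len(test_tokens))


digits = [str(d) for d in range(10)]

# Every digit equally likely
uniform = {d: 0.1 for d in digits}

# Assumed skewed distribution: 0 is very frequent, the rest are rare
skewed = {d: 0.01 for d in digits}
skewed["0"] = 0.91

pp_uniform = perplexity(list("0123456789"), uniform)
pp_skewed = perplexity(list("0000030000"), skewed)

print(f"uniform: {pp_uniform:.2f}")
print(f"skewed:  {pp_skewed:.2f}")
```

The uniform case gives a perplexity of 10, the plain branching factor, while the test set dominated by 0 gives a value below 2 even though ten digits are still possible.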


### **Limitations of Perplexity**
- **Dependence on Dataset:**
  - Perplexity is sensitive to the domain and type of text in the dataset.
- **Doesn’t Measure Understanding:**
  - A model might have low perplexity but still lack true understanding of the language.
- **Bias Towards Shorter Sequences:**
  - Models may achieve low perplexity by favoring shorter sequences, which may not always be desirable.


### **Practical Application of Perplexity**
- **Use in Research:**
  - Perplexity is widely used in NLP research to gauge model improvements.
- **Example Application:**
  - Evaluating different versions of an n-gram model on a test corpus.
- **Tool for Development:**
  - Helps in tuning hyperparameters and refining models during development.

## **Sampling Sentences from a Language Model**
- **Definition**: 
  - Sampling from a language model involves generating sentences by choosing words based on their likelihood according to the model's distribution.
  - More likely sentences (according to the model) are sampled more frequently, while less likely ones are sampled less frequently.
- **Importance**: 
  - Sampling helps to visualize the kind of knowledge a language model holds by generating sentences it considers probable.


### Historical Context and Unigram Sampling
- **Historical Context**:
  - The idea of sampling from language models was first suggested by Shannon (1948) and Miller and Selfridge (1950).
- **Unigram Sampling**:
  - Imagine the words of a language spread out across a probability space from 0 to 1.
  - Each word occupies an interval proportional to its frequency.
  - **Process**:
    1. Choose a random number between 0 and 1.
    2. Identify the word whose interval includes this number.
    3. Continue generating words until the sentence-final token `</s>` is selected.
- **Visualization**:
  - A cumulative probability line where frequent words (e.g., "the", "of") occupy larger intervals, 
    - and rare words (e.g., "polyphonic") occupy smaller intervals.
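
The interval picture can be sketched directly with cumulative sums. The tiny vocabulary and its probabilities below are invented for illustration, and the `max_len` cap is an added safety assumption:

```python
import bisect
import itertools
import random

# Hypothetical unigram distribution, including the sentence-final token
unigram_probs = {"the": 0.5, "of": 0.3, "polyphonic": 0.05, "</s>": 0.15}

words = list(unigram_probs)
# Right edge of each word's interval on the line from 0 to 1
cumulative = list(itertools.accumulate(unigram_probs[w] for w in words))


def sample_word(rng):
    """Pick a random point and return the word whose interval contains it."""
    r = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, r)
    return words[min(index, len(words) - 1)]  # guard against rounding at the edge


def sample_sentence(rng, max_len=50):
    """Sample words until </s> is drawn (or max_len words are generated)."""
    sentence = []
    for _ in range(max_len):
        word = sample_word(rng)
        if word == "</s>":
            break
        sentence.append(word)
    return sentence


rng = random.Random(0)
print(" ".join(sample_sentence(rng)))
```

Frequent words own wide intervals and are hit often, while a rare word such as "polyphonic" owns a narrow one and appears only occasionally.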


### Bigram Sampling
- **Bigram Sampling Process**:
  - Start by generating a bigram that begins with `<s>` based on its probability.
  - The second word of the bigram becomes the first word of the next bigram.
  - Repeat the process, generating each subsequent word based on the previous one.
- **Key Concept**:
  - Sampling bigrams provides a more contextual generation compared to unigrams, 
    - as each word is dependent on the previous one.
- **Application**:
  - By sampling bigrams, one can visualize how the model predicts word sequences and generates more natural sentences compared to unigrams.
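
A minimal sketch of this process, starting from `<s>` and stopping at `</s>`. The three-sentence toy corpus, whitespace tokenization, and the `max_len` cap are assumptions made to keep the example self-contained:

```python
import random
from collections import Counter, defaultdict

# Toy corpus (already lowercased); real models need far more text
corpus = ["i love nlp", "i love coding", "nlp is fun"]

# followers[prev][nxt] = count of the bigram (prev, nxt)
followers = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1


def sample_sentence(rng, max_len=20):
    """Generate words one at a time, each conditioned on the previous word."""
    current, words = "<s>", []
    for _ in range(max_len):
        candidates = followers[current]
        current = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if current == "</s>":
            break
        words.append(current)
    return " ".join(words)


rng = random.Random(1)
print(sample_sentence(rng))
```

Because each word is drawn only from the followers of the previous word, every adjacent pair in the output is a bigram that occurred in the corpus.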


### **Sampling Strategies**
- **Greedy Sampling**:
  - Always select the word with the highest probability.
  - **Pros**: Produces highly probable sentences.
  - **Cons**: Can result in repetitive or deterministic text.
- **Random Sampling**:
  - Select words based on their probabilities, allowing less probable words to be chosen.
  - **Pros**: Produces more diverse sentences.
  - **Cons**: May result in less coherent text.
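
The two strategies differ only in how the next word is picked from the same distribution. A sketch, using a made-up distribution for the word following "love":

```python
import random

# Hypothetical P(w | "love")
next_word_probs = {"nlp": 0.5, "coding": 0.3, "python": 0.2}


def greedy_choice(probs):
    """Always return the single most probable word."""
    return max(probs, key=probs.get)


def random_choice(probs, rng):
    """Draw a word in proportion to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()))[0]


rng = random.Random(0)
greedy_picks = [greedy_choice(next_word_probs) for _ in range(5)]
random_picks = [random_choice(next_word_probs, rng) for _ in range(5)]
print("greedy:", greedy_picks)
print("random:", random_picks)
```

The greedy picks are identical on every call, which is where the repetitiveness comes from, whereas the random picks can include the less probable words.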


### 🍎 **Practical Example**

In [1]:
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
import random

# Ensure that the necessary NLTK data is downloaded
nltk.download('punkt')

# Sample text to be used for n-gram models
#  🛠️ try a large corpus such as Shakespeare
text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them."""

# Tokenize the text into words
tokens = word_tokenize(text.lower())

# Generate Unigrams, Bigrams, and Trigrams
unigrams = tokens
unigram_freq = Counter(unigrams)

bigrams = list(ngrams(tokens, 2))
bigram_freq = Counter(bigrams)

trigrams = list(ngrams(tokens, 3))
trigram_freq = Counter(trigrams)

# Function to generate text using unigrams
def generate_text_unigram(unigram_freq, num_words=20):
    words = list(unigram_freq.keys())
    generated_text = random.choices(words, weights=unigram_freq.values(), k=num_words)
    return ' '.join(generated_text)

# Function to generate text using bigrams
def generate_text_bigram(starting_word, bigram_freq, num_words=20):
    current_word = starting_word
    generated_text = [current_word]
    for _ in range(num_words - 1):
        possible_bigrams = [bigram for bigram in bigram_freq if bigram[0] == current_word]
        if possible_bigrams:
            next_word = random.choices(
                [bigram[1] for bigram in possible_bigrams],
                weights=[bigram_freq[bigram] for bigram in possible_bigrams]
            )[0]
            generated_text.append(next_word)
            current_word = next_word
        else:
            break
    return ' '.join(generated_text)

# Function to generate text using trigrams
def generate_text_trigram(starting_words, trigram_freq, num_words=20):
    current_words = starting_words
    generated_text = list(current_words)
    for _ in range(num_words - 2):
        possible_trigrams = [trigram for trigram in trigram_freq if trigram[:2] == current_words]
        if possible_trigrams:
            next_word = random.choices(
                [trigram[2] for trigram in possible_trigrams],
                weights=[trigram_freq[trigram] for trigram in possible_trigrams]
            )[0]
            generated_text.append(next_word)
            current_words = (current_words[1], next_word)
        else:
            break
    return ' '.join(generated_text)

# Generate text using unigrams
generated_unigram_text = generate_text_unigram(unigram_freq)
print("Unigram Generated Text:")
print(generated_unigram_text)

# Generate text using bigrams
starting_word = 'to'
generated_bigram_text = generate_text_bigram(starting_word, bigram_freq)
print("\nBigram Generated Text:")
print(generated_bigram_text)

# Generate text using trigrams
starting_words = ('to', 'be')
generated_trigram_text = generate_text_trigram(starting_words, trigram_freq)
print("\nTrigram Generated Text:")
print(generated_trigram_text)

Unigram Generated Text:
the take outrageous , mind them question against of of and 't question is opposing or in the to of

Bigram Generated Text:
to suffer the question : whether 't is the slings and arrows of outrageous fortune , or not to be

Trigram Generated Text:
to be , that is the question : whether 't is nobler in the mind to suffer the slings and




## **Generalization and Zeros**
- **N-gram Model Dependency**:
  - N-gram models rely heavily on the training corpus.
  - Higher N-values in N-grams lead to better modeling of the training corpus.
- **Implications**:
  - Probabilities encode specific facts from the training corpus.
  - As N increases, the model captures more contextual relationships.


### **Visualization of N-gram Models**
- **Sampling Method**:
  - Visualizes the increasing power of higher-order N-grams.
  - 🍎 Example: Sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare.
- **Key Observations**:
  - **Unigram**: No coherent word relation or punctuation.
  - **Bigram**: Some local word-to-word coherence.
  - **Trigram/4-gram**: Sentences resemble Shakespeare's style but may overfit the training corpus.


### **N-gram Sparsity Problem**
- **Corpus Size and Sparsity**:
  - Shakespeare's corpus: 884,647 words; 29,066 vocabulary.
  - **Sparsity Issue**: V² = 844 million possible bigrams, V⁴ = 7×10¹⁷ possible 4-grams.
  - **Overfitting Example**: "It cannot be but so" from Shakespeare's "King John."
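
The sparsity figures follow from simple arithmetic on the corpus statistics quoted above. The upper bound on observed bigrams (one per adjacent word pair) is an added observation, not a measured count:

```python
V = 29_066   # vocabulary size of the Shakespeare corpus
N = 884_647  # number of word tokens

possible_bigrams = V ** 2
possible_4grams = V ** 4

# A corpus of N tokens contains at most N - 1 bigram tokens
max_observed_bigrams = N - 1
max_coverage = max_observed_bigrams / possible_bigrams

print(f"possible bigrams: {possible_bigrams:,}")   # 844,832,356
print(f"possible 4-grams: {possible_4grams:.1e}")  # 7.1e+17
print(f"at most {max_coverage:.2%} of possible bigrams can appear")
```

Even in the best case, barely a tenth of a percent of the possible bigrams could have been seen, so almost every entry in the bigram table is a zero.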


### **Generalization across Genres**
- **Training on Different Corpora**:
  - Compare N-gram models trained on Shakespeare vs. Wall Street Journal (WSJ).
  - Both generate "English-like sentences" but with no overlap in content.
- **Genre Dependence**:
  - Training and test sets must match in genre to avoid poor model performance.


### **Matching Genres and Dialects**
- **Importance of Genre Matching**:
  - Use a training corpus relevant to the task's genre (e.g., legal documents for legal models).
- **Dialect and Variety**:
  - Consider appropriate dialects, especially for social media or spoken transcripts.
  - Example: African American English (AAE) and Nigerian Pidgin have unique features and n-gram patterns.


### **Dealing with Zeros**
- **Zero Probability N-grams**:
  - Some word sequences may be missing from the training corpus, leading to "zero probability n-grams."
  - Example: "denied the offer" may be unseen, resulting in zero probability estimation.
- **Consequences**:
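  - The model underestimates the probability of word sequences that can legitimately occur, and any test sentence containing a zero-probability n-gram gets probability 0, which makes perplexity undefined (division by zero).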


### **Solutions to Zeros**
- **Smoothing Algorithms**:
  - Redistribute probability mass to unseen events, mitigating zero probabilities.
- **Unknown Words**:
  - For unseen words, use a pseudo-word `<UNK>` in open vocabulary systems.
  - Convert out-of-vocabulary words to `<UNK>` during training to estimate their probabilities.
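
One common way to obtain `<UNK>` counts is to treat rare training words as unknown. The sketch below assumes a frequency threshold (`min_count=2`) for choosing the vocabulary; both the threshold and the helper name `replace_rare_words` are illustrative:

```python
from collections import Counter


def replace_rare_words(tokenized_sentences, min_count=2):
    """Keep words seen at least min_count times; map all others to <UNK>."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    converted = [
        [w if w in vocab else "<UNK>" for w in sent]
        for sent in tokenized_sentences
    ]
    return converted, vocab


train = [["i", "love", "nlp"], ["i", "love", "coding"], ["nlp", "is", "fun"]]
converted, vocab = replace_rare_words(train)
for sent in converted:
    print(sent)
```

After this step `<UNK>` is counted like any other word, and at test time every word outside `vocab` is mapped to `<UNK>` before its probability is looked up.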


### **Unknown Words and Perplexity**

- **Dealing with Unknown Words**:
  - Closed vocabulary systems assume that no unknown words can occur, for example because subword tokenization can represent any word.
  - Open vocabulary systems use `<UNK>` to handle unknown words.
- **Effect on Perplexity**:
  - Perplexity scores depend on the choice of vocabulary and `<UNK>` probabilities.
  - Comparisons of perplexity across models require the same vocabularies.

## **Smoothing in Language Models**
- **Definition:** `Smoothing`, or `discounting` techniques are used to handle the issue of zero probabilities in language models by assigning non-zero probabilities to unseen events
  - by transferring a bit of probability mass from some more frequent events to the unseen events 
- **Importance:** Essential for improving the performance of n-gram models by addressing data sparsity.
- **Methods:** 
  - Laplace (Add-One) Smoothing
  - Add-k Smoothing
  - Stupid Backoff
  - Kneser-Ney Smoothing


### **Laplace (Add-One) Smoothing**
- Add 1 to all the n-gram counts before normalizing them into probabilities
  - effective for small datasets.
- **Formula**
  - Smoothing unigram: $\displaystyle P(w_n)  = \dfrac{C(w_n) + 1}{Σ_w(C(w) + 1)} = \dfrac{C(w_n) + 1}{N + V}$
  - Smoothing bigram: $\displaystyle P(w_n | w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{Σ_w(C(w_{n-1}w) + 1)} = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$
    - $C(w_{n-1}w_n)$: Count of the bigram $w_{n-1}w_n$.
    - $C(w_{n-1})$: Count of the unigram $w_{n-1}$.
    - $V$: Size of the vocabulary.
- **Example:**
  - Suppose we have the bigram "I am" with a count of 2, "I" appears 3 times, and the vocabulary size $V = 5$.
  - Calculation: $\displaystyle P(\text{"am"}|\text{"I"}) = \frac{2 + 1}{3 + 5} = \frac{3}{8} = 0.375$.
- **Python Code Example:**

In [2]:
def laplace_smoothing(bigram_count, unigram_count, vocab_size):
    return (bigram_count + 1) / (unigram_count + vocab_size)

result = laplace_smoothing(2, 3, 5)
print(f"Laplace Smoothed Probability: {result:.3f}")

Laplace Smoothed Probability: 0.375


- The **adjusted (discounted) count** $C_{\text{adj}}$ can be turned into a probability, just like an MLE count, by normalizing by $N$:
  - Unigram:
    - $\displaystyle C_{\text{adj}}(w_n) = (C(w_n)+1)\frac{N}{N+V}$
    - It is easier to compare adjusted counts directly with the MLE counts than to compare probabilities.
    - Relative discount: $\displaystyle d_{C(w_n)}=\dfrac{C_{\text{adj}}(w_n)}{C(w_n)}$
  - Bigram:
    - $\displaystyle C_{\text{adj}}(w_{n-1}w_n) = \left(C(w_{n-1}w_n) + 1\right)\frac{C(w_{n-1})}{C(w_{n-1}) + V}$
- Add-one smoothing changes the counts drastically: in a large corpus most possible n-grams (n>1) are unseen, so a large share of the probability mass is moved from the seen n-grams to the many zeros.
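
A sketch of the adjusted bigram count and its relative discount, reusing the "I am" numbers from the Laplace example (bigram count 2, unigram count 3, $V = 5$). The function name is illustrative:

```python
def adjusted_bigram_count(bigram_count, unigram_count, vocab_size):
    """Add-one adjusted count: (C + 1) * C(prev) / (C(prev) + V)."""
    return (bigram_count + 1) * unigram_count / (unigram_count + vocab_size)


c = 2  # original count of the bigram "I am"
c_adj = adjusted_bigram_count(c, unigram_count=3, vocab_size=5)
discount = c_adj / c

print(f"adjusted count:    {c_adj}")     # 1.125
print(f"relative discount: {discount}")  # 0.5625
```

Dividing the adjusted count by the unigram count (1.125 / 3) recovers the smoothed probability 0.375, and the discount shows that the observed bigram keeps only about 56% of its original count.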

### **Add-k Smoothing**
- Generalizes Laplace, allows tuning of the smoothing parameter.
  - Add a fractional count $k (0<k<1)$ instead of 1
- **Formula:**
  - $\displaystyle P(w_n | w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV}$
    - $k$: Smoothing parameter (a positive value, usually a fraction between 0 and 1).
- **Example:**
  - Using the same example with $k = 0.5$, the calculation becomes $\displaystyle P(\text{"am"}|\text{"I"}) = \frac{2 + 0.5}{3 + 0.5 \times 5} = \frac{2.5}{5.5} \approx 0.455$.
- **Python Code Example:**

In [3]:
def add_k_smoothing(bigram_count, unigram_count, vocab_size, k=0.5):
    return (bigram_count + k) / (unigram_count + k * vocab_size)

result = add_k_smoothing(2, 3, 5, 0.5)
print(f"Add-k Smoothed Probability: {result:.3f}")

Add-k Smoothed Probability: 0.455


- Add-k smoothing, with $k$ optimized on a devset, is helpful for some tasks such as text classification.
- However, it still performs poorly for language modeling,
  - producing counts with poor variances and often inappropriate discounts.

## **Backoff and Interpolation in Language Models**
- **Backoff** uses a lower-order n-gram if a higher-order n-gram has insufficient evidence.
  - **Process:** $\displaystyle P(w_n|w_{n-2}w_{n-1}) \rightarrow P(w_n|w_{n-1}) \rightarrow P(w_n)$
    - Use trigram if available. 
    - Back off to bigram if trigram is absent.
    - Finally, use unigram if bigram is also absent.
  - **Key Point:** Backoff only occurs if the higher-order n-gram has zero counts.
- **Interpolation** combines probability estimates from multiple n-gram orders.
  - **Process:**
    - Linearly combine unigram, bigram, and trigram probabilities.
    - Use weighted average, with weights summing to 1.
  - **Key Point:** Interpolation always mixes all n-gram orders, even when the higher-order n-gram has a nonzero count.
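
The backoff chain can be sketched as a function that returns the estimate from the highest order with a nonzero count. This is a simplified illustration on an assumed toy corpus: it uses raw relative frequencies without discounting, so the results do not form a proper probability distribution. The name `backoff_estimate` is hypothetical.

```python
from collections import Counter

tokens = "i am sam sam i am i do not like green eggs and ham".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def backoff_estimate(w2, w1, w):
    """Return (order used, relative frequency) for w given the context (w2, w1)."""
    if tri[(w2, w1, w)] > 0:              # trigram seen: use it
        return 3, tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:                   # otherwise back off to the bigram
        return 2, bi[(w1, w)] / uni[w1]
    return 1, uni[w] / N                  # finally fall back to the unigram

print(backoff_estimate("i", "am", "sam"))      # trigram seen: order 3
print(backoff_estimate("sam", "i", "do"))      # trigram unseen, bigram seen: order 2
print(backoff_estimate("green", "eggs", "i"))  # only the unigram is available: order 1
```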


### **Simple Linear Interpolation**
- Combines different order n-grams by linearly interpolating them
- **Formula:**
- $\hat{P}(w_n |w_{n-2}w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n|w_{n-1}) + \lambda_3 P(w_n|w_{n-2}w_{n-1})$
  - $\lambda_1$, $\lambda_2$, and $\lambda_3$ weight the unigram, bigram, and trigram probabilities, respectively, and sum to 1.
  - The weights $\lambda$ are chosen to maximize the likelihood of a `held-out corpus` (an additional training corpus held out from the training data).
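
As a minimal sketch of the interpolation formula, the function below mixes three already-computed probabilities. The default weights are placeholders chosen for illustration, not values tuned on any held-out corpus.

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation: l1 * P(w) + l2 * P(w | w1) + l3 * P(w | w2 w1)."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

# Even with an unseen trigram (probability 0), the estimate stays nonzero.
p = interpolate(p_unigram=0.2, p_bigram=0.5, p_trigram=0.0)
print(f"Interpolated Probability: {p:.3f}")   # 0.1*0.2 + 0.3*0.5 + 0.6*0.0 = 0.170
```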


### **Conditional Interpolation**
- Conditions on the context by assigning heavier weights to more accurate n-grams
- **Formula:**
- $\hat{P}(w_n |w_{n-2}w_{n-1}) = \lambda_1(w_{n-2:n-1})P(w_n) + \lambda_2(w_{n-2:n-1})P(w_n|w_{n-1}) + \lambda_3(w_{n-2:n-1})P(w_n|w_{n-2}w_{n-1})$
  - Each weight $\lambda_i(w_{n-2:n-1})$ is a function of the preceding context rather than a single global constant.
  - Contexts with reliable counts can give more weight to the higher-order estimate, while sparse contexts lean on the lower orders.
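
A rough way to picture context-dependent weights is to bucket contexts by how often they were seen. The thresholds and weight values below are hypothetical, hand-picked for illustration only; in practice such weights would be fit on held-out data.

```python
def context_lambdas(context_count):
    """Hypothetical buckets: trust the trigram more when its context is frequent."""
    if context_count == 0:
        return (0.5, 0.5, 0.0)    # context never seen: ignore the trigram estimate
    if context_count < 5:
        return (0.2, 0.4, 0.4)    # sparse context: spread the weight
    return (0.1, 0.2, 0.7)        # frequent context: lean on the trigram

def conditional_interpolate(p_unigram, p_bigram, p_trigram, context_count):
    """Interpolate with weights chosen from the count of the context w_{n-2} w_{n-1}."""
    l1, l2, l3 = context_lambdas(context_count)
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

for count in (0, 2, 10):
    p = conditional_interpolate(0.2, 0.5, 0.6, context_count=count)
    print(f"context count {count:>2}: {p:.2f}")
```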

### Stupid Backoff
- A simple and efficient backoff algorithm without discounting, useful for very large datasets; its scores are not normalized, so they are not true probabilities.
- **Explanation:**
  - A heuristic used in place of more complex smoothing methods when scaling to large datasets.
  - $\displaystyle P_{\text{backoff}}(w_n | w_{n-1}) = 
    \begin{cases} 
    \dfrac{C(w_{n-1}w_n)}{C(w_{n-1})} & \text{if } C(w_{n-1}w_n) > 0 \\
    \alpha P(w_n) & \text{otherwise}
    \end{cases}$
  - $\alpha$ is a fixed backoff weight rather than a discount (typically $\alpha = 0.4$).
- **Python Code Example:**

In [4]:
def stupid_backoff(bigram_count, unigram_count, unigram_prob, alpha=0.4):
    if bigram_count > 0:
        return bigram_count / unigram_count
    else:
        return alpha * unigram_prob

# Example usage
result = stupid_backoff(0, 3, 0.2, 0.4)
print(f"Stupid Backoff Probability: {result:.3f}")

Stupid Backoff Probability: 0.080


### **Kneser-Ney Smoothing**
- Among the best-performing smoothing methods for n-gram models; it considers how many distinct contexts a word appears in, not just how often it occurs.
- **Explanation:**
  - It subtracts a fixed discount from each observed count and gives the freed probability mass to words according to how readily they continue new contexts.
- **Formula:**
  $\displaystyle P_{\text{KN}}(w_n | w_{n-1}) = \max\left(\dfrac{C(w_{n-1}w_n) - d}{C(w_{n-1})}, 0\right) + \lambda(w_{n-1})P_{\text{continuation}}(w_n)$
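  - $d$ is a fixed discount subtracted from every nonzero bigram count (for example $d = 0.75$).
  - $\lambda(w_{n-1}) = \dfrac{d}{C(w_{n-1})}\,\bigl|\{w : C(w_{n-1}w) > 0\}\bigr|$ is a normalizing constant that redistributes the discounted probability mass.
  - $P_{\text{continuation}}(w_n)$ is the continuation probability: the fraction of distinct bigram types that end in $w_n$, i.e., how likely $w_n$ is to appear as a continuation of a new context.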
- **Python Code Example:**

In [6]:
from collections import defaultdict

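def kneser_ney_smoothing(bigram_count, unigram_count, continuation_prob,
                         discount=0.75, num_followers=1):
    # num_followers: number of distinct word types seen after the context word.
    # The default of 1 is a simplification that reproduces the example below.
    adjusted_count = max(bigram_count - discount, 0)
    # lambda(w_{n-1}) = (d / C(w_{n-1})) * |{w : C(w_{n-1} w) > 0}|
    lam = (discount / unigram_count) * num_followers
    return adjusted_count / unigram_count + lam * continuation_prob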

# Example continuation probability
continuation_prob = defaultdict(float, {'am': 0.2, 'I': 0.1})

# Example usage
result = kneser_ney_smoothing(2, 3, continuation_prob['am'])
print(f"Kneser-Ney Smoothed Probability: {result:.3f}")

Kneser-Ney Smoothed Probability: 0.467
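
The example above takes the continuation probability as given. As a hedged sketch of where it could come from, the code below derives it from bigram counts on an assumed toy corpus and checks that the resulting interpolated Kneser-Ney bigram estimates sum to 1 for a seen context. The helper names are illustrative, and the context count is taken as the sum of bigram counts that start with the context word.

```python
from collections import Counter, defaultdict

tokens = "i am sam sam i am i do not like green eggs and ham".split()
vocab = set(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

preceders = defaultdict(set)   # w -> distinct words seen immediately before w
followers = defaultdict(set)   # v -> distinct words seen immediately after v
context_total = Counter()      # v -> sum over w of C(v w)
for (v, w), c in bigrams.items():
    preceders[w].add(v)
    followers[v].add(w)
    context_total[v] += c

num_bigram_types = len(bigrams)

def p_continuation(w):
    """Fraction of distinct bigram types that end in w."""
    return len(preceders[w]) / num_bigram_types

def p_kn(w, prev, d=0.75):
    """Interpolated Kneser-Ney bigram estimate; assumes prev was seen as a context."""
    total = context_total[prev]
    lam = d * len(followers[prev]) / total
    return max(bigrams[(prev, w)] - d, 0) / total + lam * p_continuation(w)

print(f"P_continuation('am') = {p_continuation('am'):.3f}")
print(f"P_KN('am' | 'i') = {p_kn('am', 'i'):.3f}")
print(f"Sum over vocabulary = {sum(p_kn(w, 'i') for w in vocab):.3f}")
```

The sum equals 1 because the mass removed by discounting, $d$ times the number of distinct followers, is exactly what $\lambda(w_{n-1})$ hands to the continuation distribution.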


## **Challenges and Solutions**
   - **Data Sparsity:** Limited data can lead to zero probabilities for unseen word sequences.
     - **Solution:** Techniques like smoothing, backoff models, or neural networks help in assigning non-zero probabilities to rare or unseen sequences.
   - **Long-Range Dependencies:** Traditional models like N-grams struggle with context far back in the sequence.
     - **Solution:** Advanced models like RNNs, LSTMs, and Transformers capture long-range dependencies more effectively.

## **Relation Between Perplexity and Entropy**
- **Perplexity:** A normalized measure of how well an n-gram model predicts a test set.
- **Entropy:** An information-theoretic concept that measures `the uncertainty or information content in a random variable`.
  - Higher entropy = More unpredictability.
  - Lower entropy = Less unpredictability.
- **Relationship:** Perplexity arises from the concept of `cross-entropy`, which is closely tied to entropy.

### Entropy Definition
- **Entropy (H):** Measures the average amount of information produced by a stochastic process.
- **Formula:**
  - $H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)$
    - $H(X)$: Entropy of random variable $X$.
    - $p(x)$: Probability of each outcome $x$.
    - Logarithm base 2 indicates information is measured in bits.  
    - $-\log_2 p(x)$: Amount of information (surprisal) received when event $x$ occurs.
    - $-p(x) \log_2 p(x)$: Weighted contribution of each event to total entropy.
  - Represents the `lower bound on the number of bits` needed to encode a piece of information in the optimal coding scheme.

### 🍎 **Entropy in a Fair Coin Toss**
- **Scenario:**
  - A fair coin has two outcomes: Heads (H) and Tails (T).
- **Calculation:**
  - $H(X) = -\left(\frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2}\right) = 1 \text{ bit}$
- **Interpretation:**
  - Each toss has an entropy of 1 bit, indicating maximum uncertainty.


### 🍎 **Entropy in a Biased Coin**
- **Scenario:**
  - A biased coin with $P(H) = 0.8$ and $P(T) = 0.2$.
- **Calculation:**
  - $H(X) = -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right) \approx 0.72 \text{ bits}$
- **Interpretation:**
  - Lower entropy reflects less uncertainty compared to a fair coin.
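
Both coin calculations can be checked with a few lines of Python. This is a minimal sketch; the function name `entropy` is chosen for the example.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])
biased = entropy([0.8, 0.2])
print(f"Fair coin:   {fair:.3f} bits")    # 1.000
print(f"Biased coin: {biased:.3f} bits")  # 0.722
```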

### Entropy of Sequences
- **Sequence Entropy:** The entropy of a sequence of words $W = (w_1, w_2, \dots, w_n)$ in language $L$:
  - $\displaystyle H(w_1, w_2, \dots, w_n) = -\sum_{w_{1:n} \in L} p(w_{1:n}) \log_2 p(w_{1:n})$
- **Entropy Rate:** Treating $L$ as a stochastic process that produces word sequences, the per-word entropy of $L$ is:
  - $\displaystyle H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \dots, w_n)$

### Cross-Entropy
- **Cross-Entropy:** Measures how well a model $m$ matches the true distribution $p$: the average number of bits needed to encode data drawn from $p$ using a code based on $m$.
- **Formula:**
  - $\displaystyle H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_{1:n} \in L} p(w_{1:n}) \log_2 m(w_{1:n})$
- **Upper Bound:** Cross-entropy is always greater than or equal to entropy, $H(p) \le H(p, m)$, with equality when $m = p$.
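
The upper-bound property is easy to see on a single discrete distribution, which is a simplification of the sequence-level formula. The sketch below reuses the biased coin as the true distribution $p$ and assumes a model $m$ that wrongly believes the coin is fair.

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) log2 m(x), in bits; assumes m(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.8, 0.2]   # true distribution (biased coin)
m = [0.5, 0.5]   # model that assumes a fair coin

H_p = cross_entropy(p, p)    # entropy of p itself
H_pm = cross_entropy(p, m)   # cost of encoding p with the wrong model
print(f"H(p)    = {H_p:.3f} bits")    # 0.722
print(f"H(p, m) = {H_pm:.3f} bits")   # 1.000
```

The gap between the two values is the price of using the wrong model: the closer $m$ is to $p$, the closer the cross-entropy gets to the entropy.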

### Perplexity and Cross-Entropy
- **Perplexity:** The exponentiated cross-entropy of a model on a test sequence $W$; the cross-entropy itself is approximated from a single, sufficiently long sample.
- **Approximation Formula:**
  - $\displaystyle H(W) = -\frac{1}{N} \log_2 P(w_1, w_2, \dots, w_N)$
- **Perplexity Formula:**
  - $\displaystyle \text{Perplexity} = 2^{H(W)} =\sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$
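
To connect the two formulas, the following sketch computes perplexity from the per-word probabilities a model assigns to a test sequence. The probabilities are made up for illustration; a real model would supply $P(w_i \mid \text{context})$ for each word.

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** H(W), with H(W) = -(1/N) * sum of log2 P(w_i | context)."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

# A model that assigns probability 1/4 to each of four test words
pp = perplexity([0.25, 0.25, 0.25, 0.25])
print(f"Perplexity: {pp:.1f}")   # 4.0
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 words at each step. Summing log probabilities instead of multiplying raw probabilities also avoids numerical underflow on long sequences.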

### Summary
- Entropy quantifies the uncertainty in predicting a sequence.
- Cross-entropy evaluates how well a model approximates the true distribution.
- Perplexity is a practical measure derived from cross-entropy to evaluate language models.
- **Large language models**, built on neural networks rather than n-grams, **solve two key problems of n-grams**:
  - The exponential growth of parameters with increasing n-gram order
  - The inability to generalize from training to test examples unless identical words are used. 
- Neural models project words into a `continuous space` where `words with similar contexts have similar representations`.