## NLP ##

## N-gram models ##



---

## 🔹 What is an n-gram language model?

An **n-gram language model** is a **probabilistic model** that predicts the next word in a sequence by looking at the previous *n-1* words.

* **n-gram** = a sequence of *n* consecutive words (or tokens) in text.
* The idea:

  $$
  P(w_k \mid w_1, w_2, \dots, w_{k-1}) \approx P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
  $$

  → instead of conditioning on the **entire history**, we only condition on the **last (n-1) words**.
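
To make the definition concrete, here is a minimal sketch of pulling n-grams out of a token list. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of any particular library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenization (an assumption; real pipelines use a tokenizer).
tokens = "I love machine learning".split()

bigrams = ngrams(tokens, 2)   # windows of 2 tokens
trigrams = ngrams(tokens, 3)  # windows of 3 tokens
```

A sequence shorter than *n* simply yields no n-grams, because the range of start positions is empty.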

---

## 🔹 Examples

Corpus:

```
I love machine learning
I love deep learning
```

### 1-gram (Unigram)

* Just single words: `I`, `love`, `machine`, `learning`, `deep`
* Probabilities: $P(w) = \frac{\text{count}(w)}{\text{total words}}$

  * e.g., `P(love) = 2/8`
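
The unigram estimate can be checked with a few lines of Python. This is a sketch on the toy corpus above; the variable names are illustrative and tokenization is a plain whitespace split.

```python
from collections import Counter

corpus = ["I love machine learning", "I love deep learning"]

# Flatten the corpus into one list of tokens.
tokens = [word for sentence in corpus for word in sentence.split()]

counts = Counter(tokens)
total = len(tokens)

# Maximum-likelihood unigram probability: count(w) / total words.
unigram_prob = {word: count / total for word, count in counts.items()}
```

Here `unigram_prob["love"]` comes out as 2/8, matching the value above.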

### 2-gram (Bigram)

* Pairs of words: `I love`, `love machine`, `machine learning`, `love deep`, `deep learning`
* Probabilities:

  $$
  P(w_k \mid w_{k-1}) = \frac{\text{count}(w_{k-1}, w_k)}{\text{count}(w_{k-1})}
  $$

  * e.g., `P(machine | love) = count("love machine") / count("love") = 1/2`
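
The same ratio of counts can be sketched in code. The function name `bigram_prob` is hypothetical, and bigrams are counted sentence by sentence so that no pair spans a sentence boundary (an assumption about how the corpus should be treated).

```python
from collections import Counter

corpus = ["I love machine learning", "I love deep learning"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    # Pair each word with its successor inside the same sentence.
    bigram_counts.update(zip(words, words[1:]))


def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen history: no estimate without smoothing
    return bigram_counts[(prev, word)] / unigram_counts[prev]
```

With this corpus, `bigram_prob("love", "machine")` gives 1/2, while any pair that never occurred, such as `("love", "learning")`, gets probability 0.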

### 3-gram (Trigram)

* Triplets of words: `I love machine`, `love machine learning`, `I love deep`, `love deep learning`
* Same formula but with 2-word history.
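
Written out in the same notation as the bigram case, the trigram estimate would look like this:

$$
P(w_k \mid w_{k-2}, w_{k-1}) = \frac{\text{count}(w_{k-2}, w_{k-1}, w_k)}{\text{count}(w_{k-2}, w_{k-1})}
$$

For the toy corpus, `P(machine | I love) = count("I love machine") / count("I love") = 1/2`.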

---

## 🔹 Why use n-grams?

✅ Captures local context of words.
✅ Simple and fast to train.
⚠️ But…

* **Data sparsity** → many valid n-grams never occur in the training data, so they get zero probability.
* **Memory explosion** → with a vocabulary of size V there are up to V^n possible n-grams, so large n means huge count tables.
* **Short context** → doesn't capture dependencies that reach further back than n-1 words.
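
A rough way to see both sparsity and the growth in table size is to compare how many n-grams are possible with how many actually appear. The sketch below does this for the toy corpus from the examples; `observed_ngrams` is a hypothetical helper.

```python
corpus = ["I love machine learning", "I love deep learning"]
sentences = [sentence.split() for sentence in corpus]
vocab = {word for words in sentences for word in words}


def observed_ngrams(n):
    """Distinct n-grams that actually occur within a sentence."""
    return {
        tuple(words[i:i + n])
        for words in sentences
        for i in range(len(words) - n + 1)
    }


for n in (1, 2, 3):
    possible = len(vocab) ** n        # V^n candidate n-grams
    seen = len(observed_ngrams(n))    # how many the corpus contains
    print(f"n={n}: {seen} of {possible} possible n-grams observed")
```

Even with only 5 distinct words, the corpus covers 5 of 25 possible bigrams and 4 of 125 possible trigrams, and the gap widens quickly as V and n grow.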

---

## 🔹 Example Usage

Suppose we want to predict the next word after **“I love”**:

* From training:

  * `P(machine | I love) = 0.5`
  * `P(deep | I love) = 0.5`

So the model predicts **either "machine" or "deep"** with equal probability.
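
This prediction can be reproduced with a small trigram counter. It is only a sketch on the toy corpus from the examples; `next_word_distribution` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

corpus = ["I love machine learning", "I love deep learning"]

# Map each 2-word history to a Counter of the words that followed it.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for first, second, third in zip(words, words[1:], words[2:]):
        next_word_counts[(first, second)][third] += 1


def next_word_distribution(history):
    """Return P(next word | last two words) as a dict, empty if unseen."""
    counts = next_word_counts.get(tuple(history), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


dist = next_word_distribution(["I", "love"])
```

For the history `["I", "love"]` the result assigns 0.5 to each of `machine` and `deep`. A history that never appeared in training returns an empty dict, which is the gap that smoothing and backoff are meant to fill.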

---

## 🔹 Improvements

* **Smoothing** (Laplace, Kneser-Ney) → give unseen n-grams a small non-zero probability.
* **Backoff / Interpolation** → combine smaller n-grams with larger ones.
* **Neural models** (like RNNs, Transformers) → reduce the data sparsity problem and capture longer context.
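
As an illustration of the simplest of these fixes, here is a sketch of Laplace (add-one) smoothing for bigrams on the toy corpus from the examples. The function name is hypothetical, and the denominator uses the number of times a word appears as a history, an assumption that keeps each conditional distribution summing to 1.

```python
from collections import Counter

corpus = ["I love machine learning", "I love deep learning"]

bigram_counts = Counter()
context_counts = Counter()  # times each word appears as a history
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    for prev, word in zip(words, words[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def laplace_bigram_prob(prev, word):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
```

With V = 5, `laplace_bigram_prob("love", "machine")` becomes (1 + 1) / (2 + 5) = 2/7, and the unseen pair `("love", "learning")` now gets 1/7 instead of 0. Add-one smoothing tends to move too much probability onto unseen events when the vocabulary is large, which is one reason methods such as Kneser-Ney are usually preferred in practice.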

---

