# Understanding n-Gram Language Models

## What is an n-Gram Language Model?
An **n-gram language model** predicts the next word in a sentence from the previous n - 1 words. It is used in speech recognition, autocomplete features, and many AI-based text generation systems.

For example, given the partial sentence **"I love"**, an n-gram model can rank candidate next words by probability, such as:
- "you" (most likely)
- "chocolate"
- "math"
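
As a minimal sketch of this ranking, the Python snippet below counts which words follow "i love" in a small invented corpus and converts the counts into relative frequencies. The corpus and the helper name `next_word_counts` are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "i love you",
    "i love you",
    "i love chocolate",
    "i love math",
    "you love math",
]

def next_word_counts(sentences, history):
    """Count the words that directly follow `history` (a tuple of words)."""
    k = len(history)
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - k):
            if tuple(words[i:i + k]) == history:
                counts[words[i + k]] += 1
    return counts

counts = next_word_counts(corpus, ("i", "love"))
total = sum(counts.values())

# Candidates sorted from most to least frequent, with relative frequencies.
ranked = [(word, count / total) for word, count in counts.most_common()]
print(ranked)  # "you" ranks first with probability 0.5 in this toy corpus
```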

## Probability in Language Models
Mathematically, we express this as:

**P(w | h)**

Where:
- **w** is the next word we want to predict.
- **h** is the history (previous words).
- **P(w | h)** is the probability of word **w** appearing after history **h**.

### Example:
If the history is **"I love"** and the candidate next word is **"math"**, the probability is estimated from counts in a training corpus:

P("math" | "I love") = (number of times "I love" is followed by "math") ÷ (total number of times "I love" appears)
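
The count ratio above can be sketched in a few lines of Python. The token sequence and the helper name `count_sequence` are hypothetical, chosen so that "i love" appears three times and is followed by "math" twice.

```python
def count_sequence(tokens, sequence):
    """Count how many times `sequence` occurs as a contiguous run in `tokens`."""
    n = len(sequence)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sequence)

# Invented token stream for illustration only.
tokens = "i love math and i love you but i love math more".split()

history = ["i", "love"]
word = "math"

history_count = count_sequence(tokens, history)            # times "i love" appears
followed_count = count_sequence(tokens, history + [word])  # times "i love math" appears

# Relative-frequency estimate of P("math" | "i love"); 0.0 if the history is unseen.
probability = followed_count / history_count if history_count else 0.0
print(probability)
```

Here the estimate is 2 ÷ 3. A history that never occurs in the corpus has no defined ratio, so the sketch returns 0.0 in that case as a simplifying assumption.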

## Why is Language Modeling Important?
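Language is **creative**: there are endless ways to form sentences, so a system cannot simply store every possible sentence in advance. Computers therefore need a way to **predict** the next word efficiently, and n-grams provide a simple yet powerful approach based on counting short word sequences.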

## Different Types of n-Gram Models

### 1. Unigram Model (n = 1)
- The probability of each word is independent of previous words.
- Example:
  P("I love math") = P("I") × P("love") × P("math")
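
A minimal sketch of the unigram calculation, assuming word probabilities are estimated as relative frequencies in a small invented token list. The function names are hypothetical.

```python
from collections import Counter
from math import prod

# Invented token list for illustration only.
tokens = "i love math i love you you love math".split()
unigram_counts = Counter(tokens)
total_tokens = len(tokens)

def unigram_prob(word):
    """Relative frequency of `word` in the toy corpus."""
    return unigram_counts[word] / total_tokens

def unigram_sentence_prob(sentence):
    """Product of the individual word probabilities; context is ignored."""
    return prod(unigram_prob(word) for word in sentence.split())

p = unigram_sentence_prob("i love math")
p_shuffled = unigram_sentence_prob("math love i")
print(p, p_shuffled)
```

Because each word is treated independently, the shuffled sentence receives the same probability as the original, which shows how much information a unigram model discards.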

### 2. Bigram Model (n = 2)
- The probability of a word depends only on the previous word.
- Example:
  P("I love math") = P("I") × P("love" | "I") × P("math" | "love")

### 3. Trigram Model (n = 3)
- The probability of a word depends on the two previous words.
- Example (the first two factors use shorter histories because fewer than two previous words are available):
  P("I love math") = P("I") × P("love" | "I") × P("math" | "I love")
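
The three model types differ only in how much history they keep, so they can be sketched with one function that takes the order n as a parameter. This is an illustrative sketch under stated assumptions: the corpus is invented, probabilities are plain count ratios with no smoothing, and the names `count_ngrams` and `ngram_sentence_prob` are hypothetical.

```python
from collections import Counter

# Invented corpus for illustration only.
sentences = ["i love math", "i love you", "you love math"]
tokenized = [s.split() for s in sentences]
total_words = sum(len(words) for words in tokenized)

def count_ngrams(max_n):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for words in tokenized:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

counts = count_ngrams(3)

def ngram_sentence_prob(sentence, n):
    """Multiply P(word | up to n - 1 previous words) over the sentence."""
    words = sentence.split()
    prob = 1.0
    for i, word in enumerate(words):
        # Keep at most n - 1 previous words; fewer at the start of the sentence.
        context = tuple(words[max(0, i - (n - 1)):i])
        # Simplification: the context count includes sentence-final occurrences.
        denominator = counts[context] if context else total_words
        if denominator == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        prob *= counts[context + (word,)] / denominator
    return prob

for n in (1, 2, 3):
    print(n, ngram_sentence_prob("i love math", n))
```

With n = 1 the context is always empty, which reproduces the unigram product; n = 2 and n = 3 reproduce the bigram and trigram factorizations shown above.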

## Markov Assumption
The **Markov assumption** states that a word's probability depends only on a fixed number of previous words.

- A **bigram model** assumes a word depends only on the **previous word** (first-order Markov model).
- A **trigram model** assumes a word depends only on the **two previous words** (second-order Markov model).

### Example:
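The figures below are illustrative. They compare the estimate for the next word under a limited history with the estimate under the full history:

- P(next word | "Jingle, Bells, Jingle") = 58% (Markov assumption: depends on limited history)
- P(next word | "Jingle, Bells, Jingle, Bells, Jingle") = 60% (using the full history)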
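
In code, the Markov assumption amounts to truncating the history before looking anything up. The helper below is a hypothetical sketch (the name `markov_context` is not from any library) that keeps only the last n - 1 words.

```python
def markov_context(history, n):
    """Return the last n - 1 words of `history`, the only part an n-gram model uses."""
    if n < 1:
        raise ValueError("n must be at least 1")
    k = n - 1
    # history[-0:] would return the whole list, so the unigram case is handled separately.
    return tuple(history[-k:]) if k > 0 else ()

short_history = ["jingle", "bells", "jingle"]
long_history = ["jingle", "bells", "jingle", "bells", "jingle"]

print(markov_context(long_history, 2))  # bigram model: one previous word
print(markov_context(long_history, 3))  # trigram model: two previous words

# Under a trigram model both histories reduce to the same context.
same_context = markov_context(short_history, 3) == markov_context(long_history, 3)
print(same_context)
```

Two histories that share their last n - 1 words are indistinguishable to the model, which is what keeps the number of probabilities to estimate manageable.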

## Joint Probability of a Sentence
The probability of an entire sentence is calculated using the chain rule of probability:

P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × ... × P(wn | w1, ..., wn-1)

### Example:
P("I love math") = P("I") × P("love" | "I") × P("math" | "I love")
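
For reference, the chain rule and the approximation an n-gram model makes under the Markov assumption can be written compactly as follows. The symbol N for the model order is introduced here as an assumption, to avoid a clash with n, which the formula above uses for sentence length.

```latex
P(w_1, w_2, \dots, w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```

For a three-word sentence and N = 3, the approximation is exact, which is why the example above matches the trigram example.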

## Summary
- **n-gram models** estimate the probability of the next word from the previous n - 1 words.
- **Unigram** models assume independence of words.
- **Bigram** models use the previous word for prediction.
- **Trigram** models use the previous two words for prediction.
- The **Markov assumption** simplifies computation by considering only a limited history.

These models form the basis of many AI applications such as **chatbots, speech recognition, and autocomplete systems**.