
### N-gram Models

N-gram models are probabilistic language models that predict the next word from the previous (N-1) words. For example, a bigram (2-gram) model conditions on the one preceding word, a trigram (3-gram) model conditions on the two preceding words, and so on.

### How do they work?
N-gram models estimate the probability of a word sequence from **conditional probabilities**. The core **formula** and related concepts are broken down below:

---

### **General N-gram Formula**

For an N-gram model, the **probability of a word sequence** is approximated by:

$$
P(w_1, w_2, ..., w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-(N-1)}, ..., w_{i-1})
$$

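This means the probability of a word $w_i$ depends only on the previous $N-1$ words. Here $n$ is the length of the sequence and $N$ is the order of the model; for the first few words, where fewer than $N-1$ predecessors exist, the context is truncated to the words that are available.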
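
As a sketch of where this approximation comes from: the chain rule of probability factorizes the joint probability exactly, and the N-gram model then shortens each conditioning history to the last $N-1$ words (a Markov assumption):

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, ..., w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-(N-1)}, ..., w_{i-1})
$$

The first equality always holds; only the second step is an approximation, and it becomes exact when $N-1$ is at least as long as every history.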

---

### For Specific N-values:

#### ➤ **Unigram Model (N = 1)**

Assumes words occur independently:

$$
P(w_1, w_2, w_3) = P(w_1) \cdot P(w_2) \cdot P(w_3)
$$

---

#### ➤ **Bigram Model (N = 2)**

Each word depends on the **previous one**:

$$
P(w_1, w_2, w_3) \approx P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_2)
$$

Where:

$$
P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}
$$
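
The count ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `bigram_probs` and the toy sentence are assumptions made for the example. The denominator counts a word only where it has a successor, so the estimates for each context sum to 1.

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood estimates of P(w_n | w_{n-1}) from a token list."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only in positions where a next word exists,
    # so the probabilities for a given context sum to 1.
    context_counts = Counter(tokens[:-1])
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in pair_counts.items()
    }

# Hypothetical toy corpus, used only for illustration
tokens = "the cat sat on the mat the cat slept".split()
probs = bigram_probs(tokens)

print(probs[("the", "cat")])  # "the" is followed by "cat" in 2 of its 3 occurrences
print(probs[("the", "mat")])  # and by "mat" in 1 of 3
```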

---

#### ➤ **Trigram Model (N = 3)**

Each word depends on the **previous two**:

$$
P(w_3 \mid w_1, w_2) = \frac{\text{Count}(w_1, w_2, w_3)}{\text{Count}(w_1, w_2)}
$$
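
The same counting idea extends to any N. The sketch below is one possible way to write it, assuming a hypothetical helper `ngram_probs` and the same kind of toy sentence; it keys the estimates by the context tuple of the previous N-1 words.

```python
from collections import Counter, defaultdict

def ngram_probs(tokens, n):
    """MLE estimates of P(word | previous n-1 words), keyed by context tuple."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # the previous n-1 words
        word = tokens[i + n - 1]              # the word being predicted
        followers[context][word] += 1
    return {
        context: {w: c / sum(counts.values()) for w, c in counts.items()}
        for context, counts in followers.items()
    }

# Hypothetical toy corpus, used only for illustration
tokens = "the cat sat on the mat the cat slept".split()

trigram = ngram_probs(tokens, 3)
print(trigram[("the", "cat")])  # distribution over words that follow "the cat"

unigram = ngram_probs(tokens, 1)  # with n = 1 the context is the empty tuple
print(unigram[()]["the"])
```

With `n=3` the context `("the", "cat")` is followed once by `"sat"` and once by `"slept"`, so each receives probability 0.5.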

---

### Example (Bigram):

The bigram probability calculation is illustrated below using the sentence `"Ashi has a cat Doma"`:

---

### Sentence Breakdown:

```
"Ashi has a cat Doma"
```

| Word 1 | Word 2 | Bigram      |
| ------ | ------ | ----------- |
| Ashi   | has    | (Ashi, has) |
| has    | a      | (has, a)    |
| a      | cat    | (a, cat)    |
| cat    | Doma   | (cat, Doma) |

---

### Bigram Frequencies (from a larger corpus):

| Bigram                 | Count                                      |
| ---------------------- | ------------------------------------------ |
| (has, a)               | 2                                          |
| (has, cat)             | 1                                          |
| (has, food)            | 0                                          |
| Count("has")           | 3 (sum of all bigrams starting with "has") |

---

### Bigram Probability Formula:

$$
P(\text{"a"} \mid \text{"has"}) = \frac{\text{Count}(\text{"has"}, \text{"a"})}{\text{Count}(\text{"has"})}
$$

$$
= \frac{2}{3} \approx 0.67
$$

---

### Visual Summary:

```
"has" → 
     ├── "a" (2 times)        → P = 2/3
     └── "cat" (1 time)       → P = 1/3
```

This tells us:

> If the model sees the word `"has"`, it's **more likely** (67%) to predict `"a"` next than `"cat"` (33%).
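
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `has_counts` is an illustrative name that simply holds the counts from the frequency table.

```python
# Counts of words following "has", copied from the frequency table
# (has_counts is an illustrative name, not part of any library)
has_counts = {"a": 2, "cat": 1, "food": 0}

# Count("has") as a context: the sum over all bigrams starting with "has"
total = sum(has_counts.values())

probs = {word: count / total for word, count in has_counts.items()}
for word, p in probs.items():
    print(f'P("{word}" | "has") = {has_counts[word]}/{total} = {p:.2f}')
```

Note that `"food"` receives probability 0 because the bigram `(has, food)` was never observed.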

---

### Smoothing (Optional but Important)

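In real-world text, many valid N-grams never appear in the training data, so their counts are zero. Because the sequence probability is a product, a single zero-probability N-gram makes the probability of the whole sequence 0. **Smoothing techniques** avoid this by reserving some probability mass for unseen N-grams: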

* **Add-One (Laplace) Smoothing**

$$
P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n) + 1}{\text{Count}(w_{n-1}) + V}
$$

Where $V$ = vocabulary size.

* **Good-Turing Smoothing**
* **Kneser-Ney Smoothing** (more advanced; widely used in practical N-gram systems)
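
To make the add-one formula concrete, here is a small sketch applied to the `"has"` counts from the bigram example. The four-word vocabulary (so $V = 4$) and the helper name `laplace_bigram_prob` are assumptions made for illustration only.

```python
def laplace_bigram_prob(pair_count, context_count, vocab_size):
    """Add-one (Laplace) estimate of P(w_n | w_{n-1})."""
    return (pair_count + 1) / (context_count + vocab_size)

# Assumed toy vocabulary, V = 4 (not taken from a real corpus)
vocab = ["has", "a", "cat", "food"]

# Counts of each vocabulary word following "has"; unseen pairs are 0
has_counts = {"has": 0, "a": 2, "cat": 1, "food": 0}
context_count = sum(has_counts.values())  # Count("has") = 3

smoothed = {
    word: laplace_bigram_prob(has_counts[word], context_count, len(vocab))
    for word in vocab
}
for word, p in smoothed.items():
    print(f'P("{word}" | "has") = {p:.2f}')
```

Under these assumptions the unseen bigram `(has, food)` rises from 0 to 1/7, while `(has, a)` drops from 2/3 to 3/7, and the smoothed probabilities still sum to 1 over the vocabulary.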

---




### Advantages 

- Simple and interpretable

- Easy to implement

- Works reasonably well on small datasets or for specific domains

### Limitations:

- Struggles with long-range dependencies (only conditions on a fixed window of the previous N-1 words)

- Suffers from data sparsity: many N-grams may never appear in training data, requiring smoothing techniques

- Does not capture semantic meaning well; it relies only on frequency counts

### Neural Network Models

Neural network models for language (e.g., RNNs, LSTMs, Transformers) are deep learning models that learn word representations and patterns in text data through multiple layers of nonlinear transformations.

### How do they work?
Instead of relying on fixed counts, these models learn dense vector embeddings of words that capture semantic and syntactic properties. For example, a Transformer model can attend to all words in a sentence, capturing complex relationships and context over long distances.
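
As a rough illustration of what "attending to all words" means, the sketch below implements single-head scaled dot-product self-attention in NumPy. It is a simplified assumption-laden example: the random matrices stand in for learned embeddings and weights, there is no masking or multi-head logic, and the names `self_attention`, `Wq`, `Wk`, and `Wv` are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every position is scored against every other position
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Random stand-ins for learned word embeddings and projection weights
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # one output vector per word
print(weights.sum(axis=1))   # each word's attention weights sum to 1
```

Each row of `weights` is a distribution over all positions in the sequence, which is why the context is not limited to a fixed window as in an N-gram model.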

### Advantages

- Capture long-range context and dependencies well

- Learn meaningful word embeddings that capture semantic similarity

- Handle complex patterns and generalize better to unseen text

- State-of-the-art performance in many NLP tasks

### Limitations:

- Require large amounts of data and computational power

- Less interpretable than N-gram models

- Can be complex to train and tune

### An N-gram language model using the nltk library


```python
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases
# look for 'punkt_tab', older ones for 'punkt'
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Sample text
text = "Ashi has a cat Doma. Doma is very naughty. All cats are not like Doma. Ashi still loves her."

# Tokenize text (punctuation such as '.' becomes its own token)
tokens = word_tokenize(text)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))

# Count frequency of bigrams
bigram_freq = Counter(bigrams)

print("Bigram Frequencies:")
for bg, freq in bigram_freq.items():
    print(f"{bg}: {freq}")

# Estimate P(word2 | word1) where word1 is 'Doma'
word1 = 'Doma'
following_words = {bg[1]: freq for bg, freq in bigram_freq.items() if bg[0] == word1}

# Normalize by the number of bigrams that start with word1
total_count = sum(following_words.values())
probabilities = {word: count / total_count for word, count in following_words.items()}

print(f"\nProbabilities of words following '{word1}':")
for word, prob in probabilities.items():
    print(f"P({word} | {word1}) = {prob:.2f}")
```



```
Bigram Frequencies:
('Ashi', 'has'): 1
('has', 'a'): 1
('a', 'cat'): 1
('cat', 'Doma'): 1
('Doma', '.'): 2
('.', 'Doma'): 1
('Doma', 'is'): 1
('is', 'very'): 1
('very', 'naughty'): 1
('naughty', '.'): 1
('.', 'All'): 1
('All', 'cats'): 1
('cats', 'are'): 1
('are', 'not'): 1
('not', 'like'): 1
('like', 'Doma'): 1
('.', 'Ashi'): 1
('Ashi', 'still'): 1
('still', 'loves'): 1
('loves', 'her'): 1
('her', '.'): 1

Probabilities of words following 'Doma':
P(. | Doma) = 0.67
P(is | Doma) = 0.33
```


### Simple Neural Network Language Model with TensorFlow

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Sample corpus
corpus = [
    "Ashi has a cat Doma",
    "Doma is very naughty",
    "All cats are not like Doma",
    "Ashi still loves her"
]

# Tokenization (index 0 is reserved for padding, hence the +1)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Create input sequences: every prefix of each sentence with at least two words
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_seq = token_list[:i+1]
        input_sequences.append(n_gram_seq)

# Padding (zeros are added on the left so the target word stays last)
max_seq_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

# Split X (all words but the last) and y (the last word)
X = input_sequences[:, :-1]
y = input_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

# Build model
# (input_length is omitted: it is deprecated in Keras 3 and the
# sequence length is inferred from the data)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 10),
    tf.keras.layers.SimpleRNN(50),
    tf.keras.layers.Dense(total_words, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model.fit(X, y, epochs=200, verbose=0)

# Predict next word
seed_text = "Ashi has a"
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_seq_len - 1, padding='pre')

predicted_probs = model.predict(token_list, verbose=0)
predicted_index = np.argmax(predicted_probs, axis=1)[0]
# Index 0 is the padding slot and has no word, so fall back to a placeholder
predicted_word = tokenizer.index_word.get(predicted_index, "<unknown>")

print(f"Given seed text: '{seed_text}'")
print(f"Predicted next word: '{predicted_word}'")
```


```
Given seed text: 'Ashi has a'
Predicted next word: 'cat'
```
