In [None]:
# This notebook contains two exercises demonstrating fundamental concepts in Natural Language Processing:

# N-grams: An exercise to generate trigrams (3-grams) from a given text using the NLTK library.
# Bigram Language Model: An exercise to train a simple bigram language model on a text corpus and calculate the probability of a specific bigram.

## N-grams and Bigram Language Model

### What are N-grams?

N-grams are contiguous sequences of N items (words, characters, or phonemes) from a given sample of text or speech. They are widely used in NLP for various tasks such as language modelling, spelling correction, text prediction, and machine translation.

*   **Unigram (N=1)**: A single word. E.g., "Natural", "Language", "Processing"
*   **Bigram (N=2)**: A sequence of two words. E.g., "Natural Language", "Language Processing"
*   **Trigram (N=3)**: A sequence of three words. E.g., "Natural Language Processing"

**How they work:** N-grams capture the local context of words. By analyzing the frequency of these sequences, we can understand common word patterns and relationships in a text.
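
As a rough illustration of the sliding-window idea, here is a minimal sketch in plain Python. The helper name `make_ngrams` is hypothetical (it is not part of NLTK), and whitespace splitting stands in for a real tokenizer.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # Build n staggered views of the token list; zip stops at the shortest one,
    # so only complete windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

# Assumption: simple whitespace tokenization is good enough for this example
tokens = "Natural Language Processing with Python".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text with `k` tokens yields `k - n + 1` n-grams, so the five tokens here give four bigrams and three trigrams.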

### What is a Bigram Language Model?

A Bigram Language Model is a probabilistic language model that estimates the probability of a word given the immediately preceding word. It assumes that the probability of a word depends only on the previous word, rather than on the entire history of words in the sentence (this is known as the Markov assumption).

**Formula:**

The probability of a word $W_i$ given the previous word $W_{i-1}$ is calculated as:

$$P(W_i | W_{i-1}) = \frac{\text{Count}(W_{i-1}, W_i)}{\text{Count}(W_{i-1})}$$

Where:
*   $\text{Count}(W_{i-1}, W_i)$ is the number of times the bigram $(W_{i-1}, W_i)$ appears in the corpus.
*   $\text{Count}(W_{i-1})$ is the number of times the unigram $(W_{i-1})$ appears in the corpus.

**How it works:** To train a bigram language model, count the occurrences of every bigram and every unigram in a text corpus, then apply the formula above to any bigram of interest. The same approach extends to trigrams, 4-grams, and beyond to capture more context, but data sparsity worsens as N grows: the number of possible N-grams grows exponentially with N, so most longer sequences never appear in a finite corpus.
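
The counting procedure can be sketched directly with `collections.Counter`. This is a minimal sketch on a made-up toy corpus; the corpus, the whitespace tokenization, and the helper name `bigram_probability` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace for brevity
sentences = [
    "the cat sat".split(),
    "the cat ran".split(),
    "the dog sat".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for tokens in sentences:
    unigram_counts.update(tokens)                  # Count(w)
    bigram_counts.update(zip(tokens, tokens[1:]))  # Count(w_prev, w)

def bigram_probability(prev_word, word):
    """Estimate P(word | prev_word) as Count(prev_word, word) / Count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen context: no estimate is available, report 0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

p_cat_given_the = bigram_probability("the", "cat")  # 2 of the 3 "the" are followed by "cat"
p_sat_given_cat = bigram_probability("cat", "sat")  # 1 of the 2 "cat" is followed by "sat"
print(p_cat_given_the, p_sat_given_cat)
```

Returning 0 for unseen words or bigrams is the simplest choice; it is also where data sparsity shows up, since any sentence containing an unseen bigram would receive probability 0.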

## Exercise 1: N-grams

**Task**: Generate trigrams (3-grams) from the following text: "Natural Language Processing with Python"

In [None]:
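import nltk
from nltk import ngrams

# Download the tokenizer data that word_tokenize needs
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')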

# Sample text
text = "Natural Language Processing with Python"

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Generate trigrams; ngrams() returns a one-shot generator,
# so store the result in a list to allow reuse
trigrams = list(ngrams(tokens, 3))

print("Trigrams:")
for trigram in trigrams:
    print(trigram)

### Challenge 1: Trigram Frequency

**Task**: Modify Exercise 1 to count the frequency of each unique trigram generated from the text and print the trigrams along with their counts.

**Expected Output:**
```
Trigram Frequencies:
('Natural', 'Language', 'Processing'): 1
('Language', 'Processing', 'with'): 1
('Processing', 'with', 'Python'): 1
```

**Hint**: Use Python's `collections.Counter` to efficiently count the occurrences of unique n-grams within a given text.
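
To show how `Counter` behaves without giving away the challenge, here is a small sketch that counts bigrams in an unrelated phrase. The phrase and the whitespace tokenization are assumptions chosen only so that one bigram repeats.

```python
from collections import Counter

# Unrelated example phrase, tokenized by whitespace
tokens = "to be or not to be".split()

# Counter accepts any iterable of hashable items, including tuples
bigram_counts = Counter(zip(tokens, tokens[1:]))

# most_common() lists items from most to least frequent
for bigram, count in bigram_counts.most_common():
    print(f"{bigram}: {count}")
# The first line printed is: ('to', 'be'): 2
```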

## Exercise 2: Bigram Language Model

**Task**: Train a bigram language model on the following text corpus and calculate the probability of the bigram ("Language", "Processing"):

```python
corpus = [
    "Natural Language Processing is fascinating.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]
```

In [None]:
from collections import defaultdict

import nltk
from nltk import ngrams

# Download the tokenizer data that word_tokenize needs
nltk.download('punkt')
nltk.download('punkt_tab')

# Sample text corpus
corpus = [
    "Natural Language Processing is fascinating.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]

# Tokenize the text into words
tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

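# Train a bigram model: count bigrams, then normalize counts into probabilities
def train_bigram_model(tokenized_corpus):
    model = defaultdict(lambda: defaultdict(int))
    # Count how often each word w2 follows each word w1
    for sentence in tokenized_corpus:
        for w1, w2 in ngrams(sentence, 2):
            model[w1][w2] += 1

    # Convert counts to conditional probabilities P(w2 | w1).
    # The denominator is the number of bigrams that start with w1, which
    # equals Count(w1) for any word that never ends a sentence.
    for w1 in model:
        total_count = sum(model[w1].values())
        for w2 in model[w1]:
            model[w1][w2] /= total_count
    return model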

# Train the bigram model
bigram_model = train_bigram_model(tokenized_corpus)

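# Function to get the probability of a bigram
def get_bigram_probability(bigram_model, w1, w2):
    # Use .get so that looking up an unseen word does not insert
    # new entries into the defaultdict; unseen bigrams return 0.0
    return bigram_model.get(w1, {}).get(w2, 0.0)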

print("Bigram Probability (Processing | Language):")
print(get_bigram_probability(bigram_model, 'Language', 'Processing'))

### Challenge 2: Bigram Probability Calculation

**Task**: Calculate the probability of the bigram ('models', 'are') using the trained bigram language model.

**Expected Output:**
```
Bigram Probability ('models', 'are'):
1.0
```