# N-gram model

## Part 1: Data Preparation & N-gram Implementation
### Imports and Dataset Setup
We manually load the corpus from the provided PDF and prepare it for tokenization.

In [3]:
# Imports and Setup
import math
import random
from collections import Counter, defaultdict
import nltk

# Download necessary NLTK data (required for Colab)
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

print("Setup complete.")

Setup complete.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [4]:
# Define Dataset
# Text extracted from GenAI-Lab-Week4.pdf (Page 2)
corpus_text = """
Artificial intelligence is transforming modern society.
It is used in healthcare finance education and transportation.
Machine learning allows systems to improve automatically with experience.
Data plays a critical role in training intelligent systems.
Large datasets help models learn complex patterns.
Deep learning uses multi layer neural networks.
Neural networks are inspired by biological neurons.
Each neuron processes input and produces an output.
Training a neural network requires optimization techniques.
Gradient descent minimizes the loss function.
Natural language processing helps computers understand human language.
Text generation is a key task in nlp.
Language models predict the next word or character.
Recurrent neural networks handle sequential data.
LSTM and GRU models address long term dependency problems.
However rnn based models are slow for long sequences.
Transformer models changed the field of nlp.
They rely on self attention mechanisms.
Attention allows the model to focus on relevant context.
Transformers process data in parallel.
This makes training faster and more efficient.
Modern language models are based on transformers.
Education is being improved using artificial intelligence.
Intelligent tutoring systems personalize learning.
Automated grading saves time for teachers.
Online education platforms use recommendation systems.
Technology enhances the quality of learning experiences.
Ethical considerations are important in artificial intelligence.
Fairness transparency and accountability must be ensured.
AI systems should be designed responsibly.
Data privacy and security are major concerns.
Researchers continue to improve ai safety.
Text generation models can create stories poems and articles.
They are used in chatbots virtual assistants and content creation.
Generated text should be meaningful and coherent.
Evaluation of text generation is challenging.
Human judgement is often required.
Continuous learning is essential in the field of ai.
Research and innovation drive technological progress.
Students should build strong foundations in mathematics.
Programming skills are important for ai engineers.
Practical experimentation enhances understanding.
"""

print("Dataset loaded successfully.")

Dataset loaded successfully.


### Preprocessing
We convert the text to lowercase and then tokenize it with NLTK's `word_tokenize`, which standardizes the input. Punctuation such as `.` is kept as a separate token.

In [5]:
# Preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    return tokens

# Process the corpus
tokens = preprocess_text(corpus_text)

print(f"Total tokens: {len(tokens)}")
print(f"Sample tokens: {tokens[:10]}")

Total tokens: 340
Sample tokens: ['artificial', 'intelligence', 'is', 'transforming', 'modern', 'society', '.', 'it', 'is', 'used']


### Build the N-gram Model
This section counts N-grams and converts the counts into conditional probabilities. It mirrors the logic in your ngram_notebook.ipynb but adds weighted random sampling (`generate_text`). Sampling usually generates better text than always picking the most probable next word (greedy decoding), which tends to get stuck in loops.
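As a minimal sketch of the count-then-normalize step, the snippet below estimates P(next word | context) on a toy token list. The names `conditional_probs`, `toy_tokens`, and `toy_probs` are made up for this illustration and are not part of the notebook.

```python
from collections import Counter, defaultdict

def conditional_probs(tokens, n):
    """Estimate P(word | previous n-1 words) from raw n-gram counts."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        # Split each n-gram into its context (first n-1 tokens) and target.
        *context, target = tokens[i:i + n]
        counts[tuple(context)][target] += 1
    # Normalize the counts of each context so they sum to 1.
    return {
        ctx: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for ctx, nxt in counts.items()
    }

# Toy data (assumption for illustration): 'a' is followed by 'b' twice and 'c' once.
toy_tokens = "a b a c a b".split()
toy_probs = conditional_probs(toy_tokens, 2)
for ctx, dist in toy_probs.items():
    print(ctx, dist)
```

Here the context `('a',)` gets probability 2/3 for `b` and 1/3 for `c`, because `a` is followed by `b` twice and by `c` once.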

In [6]:
# N-gram Model Functions

def count_ngrams(tokens, n):
    """Counts n-grams in the token list."""
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    return Counter(ngrams)

def train_ngram_model(tokens, n):
    """
    Builds a probabilistic N-gram model.
    Returns a nested dictionary: { context_tuple: { next_word: probability } }
    """
    ngrams = count_ngrams(tokens, n)
    context_counts = defaultdict(Counter)

    # Count frequencies of next_words given a context
    for ngram, count in ngrams.items():
        context = ngram[:-1]
        target = ngram[-1]
        context_counts[context][target] += count

    # Calculate probabilities
    model = defaultdict(dict)
    for context, next_words in context_counts.items():
        total_count = sum(next_words.values())
        for word, count in next_words.items():
            model[context][word] = count / total_count

    return model

def generate_text(model, n, seed_text, max_length=50):
    """
    Generates text using the trained model via weighted random sampling.
    """
    # Preprocess seed
    seed_tokens = preprocess_text(seed_text)

    # Require at least n-1 seed tokens to form the initial context
    if len(seed_tokens) < n-1:
        print(f"Error: Seed text must have at least {n-1} words.")
        return ""

    # Initialize context
    current_context = tuple(seed_tokens[-(n-1):])
    result = list(current_context)

    for _ in range(max_length):
        # Check if context exists in model
        if current_context not in model:
            break

        # Get possible next words and their probabilities
        possible_words = list(model[current_context].keys())
        probabilities = list(model[current_context].values())

        # Sample the next word based on probability
        next_word = random.choices(possible_words, weights=probabilities, k=1)[0]

        # Append to result
        result.append(next_word)

        # Update context (slide window)
        current_context = tuple(result[-(n-1):])

        # Stop if we generate a period (optional, makes output cleaner)
        if next_word == '.':
            break

    return " ".join(result)
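
The difference between greedy decoding and weighted sampling can be sketched on a hand-written toy bigram model. Everything here (`toy_model`, `generate`, `greedy`, `sample`) is a hypothetical illustration, not the notebook's own code.

```python
import random

# Hand-written toy model (assumption): after 'a', 'b' is slightly more likely than '.'.
toy_model = {
    ("a",): {"b": 0.6, ".": 0.4},
    ("b",): {"a": 1.0},
}

def generate(model, start, pick, max_length=10):
    """Generate tokens from a bigram model using the given selection rule."""
    result = [start]
    for _ in range(max_length):
        dist = model.get((result[-1],))
        if dist is None:
            break
        result.append(pick(dist))
        if result[-1] == ".":
            break
    return " ".join(result)

def greedy(dist):
    # Always take the single most probable word.
    return max(dist, key=dist.get)

def sample(dist):
    # Draw one word in proportion to its probability.
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

greedy_out = generate(toy_model, "a", greedy)
sampled_out = generate(toy_model, "a", sample)
print("greedy: ", greedy_out)
print("sampled:", sampled_out)
```

Greedy decoding alternates between `a` and `b` until `max_length` is reached, because `b` always beats `.` after `a`. Sampling picks `.` 40% of the time after each `a`, so it usually ends the sentence much earlier.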

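### Train and Generate
Now we train a trigram model ($N=3$) and generate text. With a corpus this small, many two-word contexts have only one observed continuation, so trigram output tends to reproduce training sentences. Set N to 2 for a bigram model if the output is too repetitive.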
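One rough way to see how N affects variety is to count how many contexts have more than one observed continuation. The sketch below does this on three sentences from the corpus, split on whitespace rather than with `word_tokenize`. The helper `context_stats` and the variable `sample_tokens` are assumptions made for this illustration.

```python
from collections import defaultdict

# Three corpus sentences, already lowercased, with '.' as its own token.
sample_tokens = (
    "deep learning uses multi layer neural networks . "
    "neural networks are inspired by biological neurons . "
    "recurrent neural networks handle sequential data ."
).split()

def context_stats(tokens, n):
    """Return (distinct contexts, contexts with more than one continuation)."""
    followers = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        followers[tuple(tokens[i:i + n - 1])].add(tokens[i + n - 1])
    branching = sum(1 for nxt in followers.values() if len(nxt) > 1)
    return len(followers), branching

for n in (2, 3):
    total, branching = context_stats(sample_tokens, n)
    print(f"N={n}: {branching} of {total} contexts offer a choice")
```

Contexts with a single continuation force the generator down one path, so a lower share of branching contexts means output that stays closer to the training text.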

In [7]:
# Execution

# Parameters
N = 3  # Trigram model (looks at previous 2 words to predict the 3rd)

# Train the model
ngram_model = train_ngram_model(tokens, N)
print(f"Model trained with N={N}. Vocabulary size: {len(set(tokens))}")

# --- Generation Examples ---

# Example 1: Seed "artificial intelligence"
seed1 = "artificial intelligence"
output1 = generate_text(ngram_model, N, seed1)
print(f"\nInput Seed: '{seed1}'")
print(f"Generated: {output1}")

# Example 2: Seed "neural networks"
seed2 = "neural networks"
output2 = generate_text(ngram_model, N, seed2)
print(f"\nInput Seed: '{seed2}'")
print(f"Generated: {output2}")

# Example 3: Seed "deep learning"
seed3 = "deep learning"
output3 = generate_text(ngram_model, N, seed3)
print(f"\nInput Seed: '{seed3}'")
print(f"Generated: {output3}")

Model trained with N=3. Vocabulary size: 195

Input Seed: 'artificial intelligence'
Generated: artificial intelligence is transforming modern society .

Input Seed: 'neural networks'
Generated: neural networks .

Input Seed: 'deep learning'
Generated: deep learning uses multi layer neural networks are inspired by biological neurons .
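
The second and third outputs differ because the context `('neural', 'networks')` has several possible continuations and N-grams are counted across sentence boundaries. The sketch below inspects that context on three sentences from the corpus. It splits on whitespace instead of calling `word_tokenize`, and `next_word_distribution` and `demo_tokens` are hypothetical names used only here.

```python
from collections import Counter

# Three corpus sentences containing "neural networks", '.' as its own token.
demo_tokens = (
    "deep learning uses multi layer neural networks . "
    "neural networks are inspired by biological neurons . "
    "recurrent neural networks handle sequential data ."
).split()

def next_word_distribution(tokens, context):
    """Relative frequency of each word that follows `context` in `tokens`."""
    n = len(context) + 1
    followers = Counter(
        tokens[i + n - 1]
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n - 1]) == context
    )
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

dist = next_word_distribution(demo_tokens, ("neural", "networks"))
print(dist)
```

In this sample, `.`, `are`, and `handle` each follow the context once, so each is drawn with probability 1/3. Drawing `.` ends generation immediately, while drawing `are` continues into a different training sentence.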
