In [20]:
from datasets import load_dataset, Dataset
import numpy as np
import re
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import defaultdict, Counter
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling, set_seed
import math
import random

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

I used Ag News Dataset in my work. Source of the dataset -  AG's News Corpus. There are short news article in English from 4 categories: world, sport, business, science/technology.

I took 5000 texts from dataset to make calculations faster. The average length of text is 45.58 tokens

In [17]:
dataset = load_dataset("ag_news", split="train[:5000]")

texts = [x["text"] for x in dataset]

# Text Example
print(f"Number of texts: {len(texts)}")
print("Text Example:")
print(texts[0])

Number of texts: 5000
Text Example:
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


In [18]:
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

tokenized = [word_tokenize(text.lower()) for text in texts]

all_tokens = [token for text in tokenized for token in text]

lengths = [len(toks) for toks in tokenized]

avg_len = np.mean(lengths)

print(f"Average length of text: {avg_len:.2f} tokens")

Average length of text: 45.58 tokens


## Training a Trigram Language Model

Then I trained a statistical n-gram language model using trigrams (n=3). The model is based on frequency counts of word sequences from a tokenized text corpus. For each pair of consecutive words `(w1, w2)`, it collects statistics about which words `w3` tend to follow, and how frequently.

### Chosen Hyperparameters

- **n = 3 (Trigram model):** I use a context window of two words to predict the third. This balances complexity and data sparsity: bigram models (n=2) may be too simple, while higher-order models (n > 3) require much more data to be effective.
- **Tokenization (implicit):** The model relies on pre-tokenized input (`all_tokens`).

This trigram model serves as a basic foundation for more advanced language modeling tasks such as text generation or autocomplete.


In [4]:
trigrams = list(ngrams(all_tokens, 3))

model = defaultdict(list)
for w1, w2, w3 in trigrams:
    model[(w1, w2)].append(w3)


prob_model = defaultdict(Counter)
for w1, w2, w3 in trigrams:
    prob_model[(w1, w2)][w3] += 1

## LSTM Language Model Training

In this section I implemented a language model using an LSTM (Long Short-Term Memory) neural network. The model is trained to predict the next word in a sequence given the previous words, based on a sliding window of fixed size over tokenized text data.

### Model Description

The model processes sequences of 4 input words and predicts the 5th word. Each word is first converted to an index using a vocabulary mapping, and then passed through an embedding layer followed by an LSTM and a linear output layer. The final prediction is a probability distribution over the vocabulary.

### Architecture

- **Embedding Layer:** Converts word indices into dense vectors of fixed size (100 dimensions).
- **LSTM Layer:** Processes sequences of embeddings with a hidden state size of 128. The `batch_first=True` argument allows input tensors to be shaped as (batch, sequence, features).
- **Fully Connected (Linear) Layer:** Maps the LSTM's last hidden state to vocabulary size logits, used for classification via cross-entropy loss.

### Chosen Hyperparameters

- **Sequence length:** 5 tokens (4 input tokens + 1 target token).
- **Embedding dimension:** 100 — balances representational capacity with training speed.
- **Hidden size of LSTM:** 128 — a common default that offers sufficient modeling capacity for moderately complex tasks.
- **Batch size:** 64 — chosen for stable training and efficiency on typical hardware.
- **Learning rate:** 0.001 — standard for the Adam optimizer.
- **Loss function:** CrossEntropyLoss — suitable for multi-class classification over vocabulary.
- **Epochs:** 3 — enough for demonstration or early convergence on small datasets.

### Vocabulary

- Special tokens include:
  - `<PAD>` (index 0) — for padding sequences if necessary.
  - `<UNK>` (index 1) — for unknown or out-of-vocabulary words.

In [None]:
vocab = Counter(all_tokens)
vocab = {word: i+2 for i, (word, _) in enumerate(vocab.items())}
vocab["<PAD>"] = 0
vocab["<UNK>"] = 1
inv_vocab = {i: w for w, i in vocab.items()}
vocab_size = len(vocab)

sequences = []
seq_length = 5

for tokens in tokenized:
    indexed = [vocab.get(w, 1) for w in tokens]
    for i in range(len(indexed) - seq_length):
        seq = indexed[i:i+seq_length]
        sequences.append(seq)

In [6]:
class TextDataset(Dataset):
    def __init__(self, sequences):
        self.x = [torch.tensor(seq[:-1]) for seq in sequences]
        self.y = [torch.tensor(seq[-1]) for seq in sequences]

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

dataset = TextDataset(sequences)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

In [7]:
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        output, _ = self.lstm(x)
        output = self.fc(output[:, -1, :])
        return output

model_lstm = LSTMModel(vocab_size=vocab_size, embedding_dim=100, hidden_dim=128)

In [8]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Trainable parameters: {count_parameters(model_lstm):,}")

Trainable parameters: 4,364,336


In [9]:
optimizer = torch.optim.Adam(model_lstm.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

model_lstm.train()
for epoch in range(3):
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()
        output = model_lstm(x_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()
    print(f"Loss: {loss.item():.4f}")

Loss: 4.8697
Loss: 4.3741
Loss: 4.7703


## Fine-Tuning a Pretrained Language Model (DistilGPT2)

Then I fine-tuned a pretrained causal language model — specifically `distilgpt2` — using a custom text corpus. The model is part of the Hugging Face `transformers` library and was originally trained in a self-supervised manner to predict the next token in a sequence (causal language modeling).

### Model Description: DistilGPT2

- **Architecture:** DistilGPT2 is a distilled version of OpenAI's GPT-2 model. It uses the same Transformer decoder architecture as GPT-2 but with fewer layers and parameters.
- **Pretraining Data:** DistilGPT2 was pretrained on a large subset of OpenWebText, which is a filtered and cleaned version of web pages linked from Reddit with high karma scores.
- **Number of Layers:** 6 Transformer decoder blocks (compared to 12 in GPT2-base).
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Total Parameters:** ~82 million
- **Tokenizer:** Byte-Pair Encoding (BPE) tokenizer derived from GPT-2’s original tokenizer.
- **Special Tokens:** The padding token is set manually to the end-of-sequence (EOS) token, as the original GPT-2 models do not use padding by default.

### Tokenization

- Tokenization is performed using the pretrained `distilgpt2` tokenizer with:
  - `max_length = 128` — truncates longer texts to a fixed maximum length.
  - `truncation=True` — ensures that inputs longer than the limit are truncated instead of causing errors.
- The dataset is then tokenized and formatted for language modeling (causal, not masked) using `DataCollatorForLanguageModeling`.

### Training Configuration

- **Epochs:** 1 — fine-tuning is kept minimal for demonstration purposes or due to compute constraints.
- **Batch size:** 2 (per device) — small batch size likely due to memory limitations.
- **Save steps:** 500 — checkpoints are saved periodically.
- **Logging steps:** 100 — training progress is logged frequently.
- **No MLM (Masked Language Modeling):** Causal LM is used (`mlm=False`), appropriate for GPT-style models.

In [None]:
with open("ag_news_small.txt", "w", encoding="utf-8") as f:
    for line in texts:
        f.write(line.strip().replace("\n", " ") + "\n")

model_name = "distilgpt2"
tokenizer_gpt = AutoTokenizer.from_pretrained(model_name)
tokenizer_gpt.pad_token = tokenizer_gpt.eos_token
model_gpt = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "ag_news_small.txt"})

def tokenize_function(example):
    return tokenizer_gpt(example["text"], truncation=True, max_length=128)

tokenized_dataset = dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"])

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_gpt, mlm=False
)

training_args = TrainingArguments(
    output_dir="./finetuned_gpt2",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100
)

trainer = Trainer(
    model=model_gpt,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

comet_ml is installed but the Comet API Key is not configured. Please set the `COMET_API_KEY` environment variable to enable Comet logging. Check out the documentation for other ways of configuring it: https://www.comet.com/docs/v2/guides/experiment-management/configure-sdk/#set-the-api-key
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
100,4.2368
200,4.1532
300,3.9214
400,3.8961
500,3.8814
600,3.915
700,3.7337
800,3.7534
900,3.7232
1000,3.7246


TrainOutput(global_step=2500, training_loss=3.7405161865234375, metrics={'train_runtime': 4381.4381, 'train_samples_per_second': 1.141, 'train_steps_per_second': 0.571, 'total_flos': 82843828568064.0, 'train_loss': 3.7405161865234375, 'epoch': 1.0})

Протестируем модели

In [11]:
correct_phrases = [
    "The stock market rose sharply today",
    "A new study shows promising results",
    "Scientists discovered a new planet",
    "The president gave a speech in Washington",
    "The football team won the championship"
]

incorrect_phrases = [
    "Banana flies reading homework loudly",
    "The rocket sandwiches eleven purple dreams",
    "Computers dance above the microwave",
    "Yesterday the dog painted democracy slowly",
    "Sky jellyphones defeat politics under cheese"
]

In [12]:
def ngram_perplexity_laplace(phrase, prob_model, vocab, n=3):
    tokens = word_tokenize(phrase.lower())
    trigrams = list(ngrams(tokens, n))
    log_prob = 0
    V = len(vocab)
    N = len(trigrams)
    for gram in trigrams:
        context = gram[:-1]
        word = gram[-1]
        next_words = prob_model.get(context, {})
        count = next_words.get(word, 0)
        total = sum(next_words.values())
        prob = (count + 1) / (total + V)
        log_prob += math.log(prob)
    return math.exp(-log_prob / N) if N > 0 else float('inf')


def lstm_perplexity(phrase, model, vocab, device='cpu'):
    model.eval()
    tokens = word_tokenize(phrase.lower())
    token_ids = [vocab.get(t, 1) for t in tokens]
    N = len(token_ids) - 1
    if N <= 0:
        return float('inf')
    loss_fn = nn.CrossEntropyLoss()
    with torch.no_grad():
        losses = []
        for i in range(N):
            input_seq = torch.tensor(token_ids[i:i+4]).unsqueeze(0)  # (1, 4)
            target = torch.tensor([token_ids[i+4]]) if i+4 < len(token_ids) else None
            if target is None:
                break
            output = model(input_seq)
            loss = loss_fn(output, target)
            losses.append(loss.item())
        mean_loss = sum(losses) / len(losses) if losses else float('inf')
        return math.exp(mean_loss)
def calculate_perplexity(model, tokenizer, sentence):
    encodings = tokenizer(sentence, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids = encodings.input_ids.to(device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
    return math.exp(loss.item())

In [13]:
print("== Perplexity (N-gram + Laplace) ==")
for phrase in correct_phrases:
    print(f"[CORRECT] {phrase} -> {ngram_perplexity_laplace(phrase, prob_model, vocab):.2f}")
for phrase in incorrect_phrases:
    print(f"[INCORRECT] {phrase} -> {ngram_perplexity_laplace(phrase, prob_model, vocab):.2f}")

print("\n== Perplexity (LSTM) ==")
for phrase in correct_phrases:
    print(f"[CORRECT] {phrase} -> {lstm_perplexity(phrase, model_lstm, vocab):.2f}")
for phrase in incorrect_phrases:
    print(f"[INCORRECT] {phrase} -> {lstm_perplexity(phrase, model_lstm, vocab):.2f}")

print("== Perplexity (Fine-tuned GPT-2) ==")
for phrase in correct_phrases:
    ppl = calculate_perplexity(model_gpt, tokenizer_gpt, phrase)
    print(f"[CORRECT] {phrase} -> {ppl:.2f}")
for phrase in incorrect_phrases:
    ppl = calculate_perplexity(model_gpt, tokenizer_gpt, phrase)
    print(f"[INCORRECT] {phrase} -> {ppl:.2f}")

== Perplexity (N-gram + Laplace) ==
[CORRECT] The stock market rose sharply today -> 10436.07
[CORRECT] A new study shows promising results -> 13142.02
[CORRECT] Scientists discovered a new planet -> 12896.19
[CORRECT] The president gave a speech in Washington -> 18545.40
[CORRECT] The football team won the championship -> 18552.50
[INCORRECT] Banana flies reading homework loudly -> 18544.00
[INCORRECT] The rocket sandwiches eleven purple dreams -> 18544.25
[INCORRECT] Computers dance above the microwave -> 18545.33
[INCORRECT] Yesterday the dog painted democracy slowly -> 18544.00
[INCORRECT] Sky jellyphones defeat politics under cheese -> 18544.00

== Perplexity (LSTM) ==
[CORRECT] The stock market rose sharply today -> 1018.98
[CORRECT] A new study shows promising results -> 26866.00
[CORRECT] Scientists discovered a new planet -> 17457.32
[CORRECT] The president gave a speech in Washington -> 756.26
[CORRECT] The football team won the championship -> 202.27
[INCORRECT] Banana flies

## Evaluating Language Models with Perplexity

In this section, I evaluated the quality of three different language models — N-gram with Laplace smoothing, LSTM, and fine-tuned GPT-2 — using the perplexity metric. I used a set of 5 grammatically and semantically correct phrases and 5 incorrect or nonsensical phrases to compare how well each model distinguishes coherent language from incoherent sequences.

---

### What is Perplexity?

Perplexity is a widely used metric for evaluating language models. It measures how well a probability model predicts a sample.

- Lower perplexity indicates that the model assigns higher probability to the test sequence — implying better performance.
- Higher perplexity means the model finds the sequence surprising or unlikely.

---

### Results Overview

#### Correct Phrases

| Sentence                                         | N-gram    | LSTM        | GPT-2         |
|--------------------------------------------------|-----------|-------------|---------------|
| The stock market rose sharply today              | 10436.07  | 1018.98     | 133.71        |
| A new study shows promising results              | 13142.02  | 26866.00    | 292.93        |
| Scientists discovered a new planet               | 12896.19  | 17457.32    | 84.61         |
| The president gave a speech in Washington        | 18545.40  | 756.26      | 54.49         |
| The football team won the championship           | 18552.50  | 202.27      | 158.59        |

#### Incorrect Phrases

| Sentence                                              | N-gram    | LSTM          | GPT-2         |
|-------------------------------------------------------|-----------|---------------|---------------|
| Banana flies reading homework loudly                  | 18544.00  | 130M+         | 26,844.95      |
| The rocket sandwiches eleven purple dreams            | 18544.25  | 115,185.68    | 123,535.66     |
| Computers dance above the microwave                   | 18545.33  | 35M+          | 1,566.19       |
| Yesterday the dog painted democracy slowly            | 18544.00  | 622,641.87    | 15,522.77      |
| Sky jellyphones defeat politics under cheese          | 18544.00  | 15,137.29     | 45,892.39      |

---

### Interpretation and Comparison

#### N-gram Model
- Shows little variation in perplexity between correct and incorrect sentences.
- This indicates the model lacks sensitivity to semantic coherence — it is primarily counting word co-occurrences without deeper understanding.
- High perplexity values overall suggest limited predictive power due to data sparsity and lack of context modeling.

#### LSTM Model
- Shows strong differentiation: perplexity is much lower on correct phrases than incorrect ones.
- However, the perplexity values for some incorrect sentences are extremely high (e.g., 130M), indicating instability or sensitivity to out-of-distribution inputs.
- Suggests the model can learn sequential dependencies but may overfit or behave erratically outside training distribution.

#### Fine-tuned GPT-2
- Achieves the lowest perplexity on correct phrases and consistently high perplexity on incorrect ones, but without extreme outliers.
- Demonstrates the best balance of fluency recognition and stability.
- Fine-tuning allows the model to better capture domain-specific language and semantics.

---

### Conclusion

Perplexity is a valuable tool for evaluating how well a language model understands and predicts text. In this comparison:
- GPT-2 (fine-tuned) offers the most reliable and accurate judgments about sentence plausibility.
- LSTM also distinguishes well but is less stable.
- N-gram model fails to differentiate between plausible and nonsensical text due to its shallow nature.

These findings emphasize the importance of using deep, pretrained models and domain-specific fine-tuning for high-quality language understanding.

In [14]:
start_phrases = [
    "I think",
    "The government",
    "Scientists believe",
    "He said that",
    "Blue dog",
    "Today we",
    "The weather",
    "This research",
    "According to",
    "She goes to"
]

In [15]:
set_seed(42)

def generate_with_gpt2(model, tokenizer, prompts, max_new_tokens=20):
    gpt2_outputs = {}
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, top_k=50)
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        gpt2_outputs[prompt] = text
    return gpt2_outputs

def generate_with_lstm(model, tokenizer, start_texts, idx_to_word, word_to_idx, max_len=20):
    model.eval()
    results = []

    for prompt in start_texts:
        words = tokenizer(prompt.lower())
        input_seq = [word_to_idx.get(w, word_to_idx["<UNK>"]) for w in words[-3:]]
        generated = words.copy()

        for _ in range(max_len):
            x = torch.tensor(input_seq).unsqueeze(0)
            with torch.no_grad():
                output = model(x)
                predicted_id = output.argmax(dim=-1).item()

            predicted_word = idx_to_word.get(predicted_id, "<UNK>")
            generated.append(predicted_word)

            input_seq = input_seq[1:] + [predicted_id]

        results.append(" ".join(generated))

    return results

def generate_with_ngram(ngram_model, tokenizer, prompts, max_len=20):
    ngram_outputs = {}
    for prompt in prompts:
        tokens = tokenizer(prompt.lower())
        generated = tokens.copy()
        for _ in range(max_len):
            context = tuple(generated[-2:])
            candidates = [w for w in ngram_model if w[:-1] == context]
            if not candidates:
                context = tuple(generated[-1:])
                candidates = [w for w in ngram_model if w[:-1] == context]
            if not candidates:
                break
            next_word = random.choice(candidates)[-1]
            generated.append(next_word)
        ngram_outputs[prompt] = " ".join(generated)
    return ngram_outputs

ngram_gen = generate_with_ngram(prob_model, word_tokenize, start_phrases)
lstm_gen = generate_with_lstm(model_lstm, word_tokenize, start_phrases, inv_vocab, vocab)
gpt2_gen = generate_with_gpt2(model_gpt, tokenizer_gpt, start_phrases)

print(ngram_gen)
print(lstm_gen)
print(gpt2_gen)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'I think': 'i think possible the same subs , law . hatem abdel kader , as tiny ear bones indicates that consumers and silicon', 'The government': 'the government partially agreed to\\be on easing fears about 28,000 batteries the race ( r66 billion punta gorda , pressured and derail', 'Scientists believe': 'scientists believe was apparently suffering a comet ... nicky hilton . murdoch was further cement its auto-update servers . bird flu season', 'He said that': "he said that four members on event at new messagelabs report ( computerworld 's official baghdad opened the waiter appears they washed out", 'Blue dog': 'blue dog new semiconductor starts at bungalow billiards in ohio . europe is near silvio berlusconi met survivors and eolas patent infringement', 'Today we': 'today we look phishy ? # 225 ; can # 133 ; emulation with eight-run eighth inning and do things up 129', 'The weather': 'the weather pushed an oval west darfur troop cut news site shows a villa , sc hurricane estimate on

## Text Generation from Different Starting Phrases: Model Comparison

In this section, I generated continuations for a set of 10 different starting phrases using three language models.

These phrases serve as prompts to test how coherently and contextually each model continues text. The goal is to evaluate fluency, coherence, grammaticality, and relevance of generated outputs. 

---

### N-gram Model

- **Characteristics**:
  - Disjointed and fragmented structure.
  - No clear grammatical or semantic coherence.
  - Often mixes unrelated noun phrases without proper connectors.
- **Interpretation**: N-gram models fail to maintain context across more than 2–3 tokens. They lack long-range dependencies and often produce incoherent, unnatural sequences.

---

### LSTM Model

- **Characteristics**:
  - Better sentence structure and fluency than n-gram.
  - Tendency to repeat common training patterns (e.g., news-like phrasing from Reuters data).
  - Overuses clichés and boilerplate text.
- **Interpretation**: LSTM models capture some level of global structure but are limited by training data size and sequence length. Outputs are more fluent but can be generic or repetitive.

---

### Fine-tuned GPT-2

- **Characteristics**:
  - Fluent and grammatically correct.
  - Capable of producing topic-consistent and human-like continuations.
  - Sometimes repeats or drifts slightly, but retains context more effectively.
- **Interpretation**: GPT-2 excels in generating coherent, plausible, and grammatically accurate text. Fine-tuning helps it stay more relevant to the style and vocabulary of the target domain.

---

### Conclusion

- The N-gram model is highly limited for generative tasks and produces mostly incoherent phrases.
- The LSTM model performs moderately better, generating readable but often templated outputs.
- The fine-tuned GPT-2 model consistently produces the most fluent, meaningful, and context-aware completions, confirming the strength of Transformer-based architectures in language generation.


## Conclusions

### Which Model Performed Best?

Among the three evaluated models — N-gram, LSTM, and fine-tuned GPT-2 — the fine-tuned GPT-2 demonstrated the best overall performance in terms of:

- Text generation quality: GPT-2 outputs were significantly more coherent, fluent, and contextually relevant.
- Perplexity: GPT-2 consistently assigned lower perplexity to grammatically and semantically correct sentences, and much higher perplexity to nonsensical ones — indicating a strong ability to distinguish between good and bad language.

So, GPT-2 is clearly the best in terms of output quality and perplexity accuracy, though at a cost of higher computational resources.

---

### How Can Results Be Improved?

- **N-gram**:
  - I can use higher-order n-grams (e.g., 4-grams or 5-grams).
  - Also I can try Kneser-Ney smoothing instead of basic Laplace.
  - Integrate back-off models to improve handling of sparse data.

- **LSTM**:
  - Increase training data volume and diversity.
  - Try to use bidirectional LSTM or stacked LSTMs for deeper representations.
  - Add regularization techniques (dropout, weight decay) to prevent overfitting.

- **GPT-2**:
  - Fine-tune for more epochs or with more data for stronger domain adaptation.
  - Try to use larger GPT models (e.g., GPT-2 medium or large) if resources allow.
  - Experiment with prompt engineering to guide generation behavior.

---

### Difficulties Encountered

- **N-gram limitations**: Very sensitive to vocabulary size and sparsity. Unable to generalize or maintain long-term context.
- **LSTM issues**:
  - Requires careful tuning of sequence length, batch size, and learning rate.
  - Sometimes produces repetitive or generic outputs.
- **GPT-2 challenges**:
  - High memory and compute requirements.
  - Fine-tuning requires careful tokenization and format consistency.
  - Susceptible to overfitting on small datasets or generating biased content if training data isn't well-balanced.

---

### Final Thoughts

Each model serves a different purpose. While simpler models like N-grams are easy to implement and fast to train, modern Transformer-based models like GPT-2 offer vastly superior performance for both perplexity evaluation and text generation — making them the preferred choice for high-quality NLP applications, especially when computational resources are available.