# Demystifying N-Grams in Natural Language Processing

### Introduction

Natural language processing (NLP) sits at the intersection of computer science and linguistics, enabling machines to interpret human language. A fundamental concept in NLP is the n-gram, a contiguous sequence of n items from a given sample of text or speech.

### Understanding N-Grams

N-grams are the building blocks of text- and speech-based language models. An n-gram language model predicts each word from the n - 1 words that precede it. An n-gram can be a single word (unigram), a pair of consecutive words (bigram), or a sequence of three or more consecutive words (trigram and beyond). The choice of n is a trade-off: a larger n captures more context, but the model then needs more training data because most long n-grams occur rarely or not at all.
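To make the definitions concrete, here is a minimal sketch that extracts unigrams, bigrams, and trigrams from one sentence. The helper name `extract_ngrams` and the example sentence are illustrative assumptions, and tokenization is a plain whitespace split.

```python
def extract_ngrams(text, n):
    """Return the contiguous n-token tuples in a whitespace-tokenized text."""
    tokens = text.split()
    # Slide a window of length n over the tokens, one position at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the cat sat on the mat"  # illustrative sentence
unigrams = extract_ngrams(sentence, 1)
bigrams = extract_ngrams(sentence, 2)
trigrams = extract_ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of six tokens yields six unigrams, five bigrams, and four trigrams: each increase in n removes one window position.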

### Why N-Grams?
* Simplicity and Efficiency: N-grams are straightforward to implement and can effectively capture the local context within text data.
* Versatility: They are used in various NLP tasks, including text classification, sentiment analysis, and machine translation.
* Predictive Modeling: N-gram counts can be used to estimate the likelihood of the next item in a sequence, which makes them useful for language modeling and text generation.
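As a rough illustration of the predictive-modeling point, the sketch below estimates the probability of the next word from bigram counts by dividing each count by the total for the preceding word. The function name `bigram_probabilities` and the three sentences are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Map each word to a dict of next-word probabilities, P(next | word)."""
    follower_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Count how often each word directly follows the previous one.
        for prev, nxt in zip(tokens, tokens[1:]):
            follower_counts[prev][nxt] += 1
    # Normalize each word's follower counts so they sum to 1.
    return {
        prev: {word: count / sum(counts.values()) for word, count in counts.items()}
        for prev, counts in follower_counts.items()
    }


probs = bigram_probabilities(["the cat sat", "the cat ran", "the dog sat"])
print(probs["the"])  # "cat" follows "the" in 2 of 3 cases, "dog" in 1 of 3
print(probs["cat"])  # "sat" and "ran" each follow "cat" once
```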

### Implementing an N-Gram Language Model in Python
To illustrate the concept of n-grams, let's dive into a simple Python implementation of an N-Gram Language Model. This model will be capable of generating new text based on a given seed text.

In [65]:
import random


class NGramLanguageModel:
    def __init__(self, n):
        self.n = n
        # Maps each n-gram (a tuple of n tokens) to its occurrence count.
        self.ngrams = {}
        # n - 1 start tokens pad each sentence so its first word has a full context.
        self.start_tokens = ['<start>'] * (n - 1)

    def train(self, corpus):
        """Count every n-gram in the corpus.

        Each sentence is split on whitespace, padded with the start tokens,
        and terminated with an '<end>' token. A window of length n then slides
        over the tokens, and the count of each n-gram in self.ngrams is incremented.
        """
        for sentence in corpus:
            tokens = self.start_tokens + sentence.split() + ['<end>']
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                self.ngrams[ngram] = self.ngrams.get(ngram, 0) + 1

    def generate_text(self, seed_text, length=10):
        """Return up to `length` generated words that continue seed_text.

        The seed itself is not included in the returned string.
        """
        seed_tokens = seed_text.split()
        # Pad only when the seed is shorter than the n - 1 words of context.
        pad_count = max(0, self.n - 1 - len(seed_tokens))
        generated_text = ['<start>'] * pad_count + seed_tokens
        seed_length = len(generated_text)

        for _ in range(length):
            # The context is the last n - 1 tokens (an empty tuple when n == 1).
            current_ngram = tuple(generated_text[len(generated_text) - (self.n - 1):])
            # Collect every word observed after this context during training.
            next_words = [ngram[-1] for ngram in self.ngrams if ngram[:-1] == current_ngram]
            if not next_words:
                break
            # Pick uniformly among the distinct continuations seen in training.
            next_word = random.choice(next_words)
            if next_word == '<end>':
                break
            generated_text.append(next_word)

        return ' '.join(generated_text[seed_length:])



The <b>NGramLanguageModel</b> class can be trained on a corpus of text and then used to generate new text sequences from a seed text. It illustrates how an n-gram model uses the preceding words as context to predict the next one.

### How It Works
* Initialization: The model is initialized with a specific n-gram size (n).
* Training: It takes a corpus of sentences, splits them into tokens, and builds a dictionary of n-grams with their occurrence counts.
* Text Generation: Given a seed text, the model generates text by predicting the next word based on the current n-gram context.
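Note that `generate_text` chooses uniformly among the distinct words seen after the current context, so the stored counts do not influence generation. One possible variant, sketched below, samples the next word in proportion to its count. The function `sample_next_word` and the trigram counts are hypothetical and are not part of the class above; the dictionary simply has the same shape as `ngrams`.

```python
import random


def sample_next_word(ngram_counts, context, rng=random):
    """Sample a continuation of `context` in proportion to its training count.

    `ngram_counts` maps n-gram tuples to counts. Returns None when the
    context was never seen.
    """
    candidates = [
        (ngram[-1], count)
        for ngram, count in ngram_counts.items()
        if ngram[:-1] == context
    ]
    if not candidates:
        return None
    words, weights = zip(*candidates)
    # random.choices draws with replacement using the given relative weights.
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical trigram counts, for illustration only.
counts = {
    ("this", "is", "a"): 3,
    ("this", "is", "the"): 1,
    ("is", "a", "test"): 2,
}
print(sample_next_word(counts, ("this", "is")))   # "a" roughly three times as often as "the"
print(sample_next_word(counts, ("no", "match")))  # None: unseen context
```

With uniform choice, a continuation seen once is as likely as one seen a hundred times; weighting by count makes the generated text follow the training distribution more closely.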

In [66]:
# Toy corpus
toy_corpus = [
    "This is a simple example.",
    "The example demonstrates an N-gram language model.",
    "N-grams are used in natural language processing.",
    "This is a toy corpus for language modeling."
]

# Change n-gram order here
n = 3 
model = NGramLanguageModel(n)  
model.train(toy_corpus)

# Seed text 1 (the output can vary between runs because the next word is chosen at random)
seed_text = "This"
generated_text = model.generate_text(seed_text, length=3)
print("Seed text:", seed_text)
print("Generated text:", generated_text)


Seed text: This
Generated text: is a simple


In [68]:
# Seed text 2: same seed, one more generated word
seed_text = "This"
generated_text = model.generate_text(seed_text, length=4)
print("Seed text:", seed_text)
print("Generated text:", generated_text)

Seed text: This
Generated text: is a simple example.


### Disadvantages

* Limited Context: N-grams can only capture limited context, often missing longer dependencies in text that are crucial for understanding sentence structure and meaning.
* Sparsity Problem: As the value of n increases, the number of possible n-grams grows exponentially (V^n for a vocabulary of V words), so most of them never appear in the training data and the counts become sparse.
* Overfitting to Training Data: N-gram models can overfit to the training data, making them less effective at generalizing to unseen text or capturing the variability in language.
* Inability to Handle Out-of-Vocabulary Words: N-gram models struggle with words not seen during training, limiting their ability to deal with new or rare words.
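
The sparsity problem can be seen even on a tiny corpus. The sketch below compares the number of distinct n-grams actually observed with the number that are possible for the corpus vocabulary. The helper `ngram_coverage` and the two sentences are assumptions for illustration, and start and end tokens are left out to keep the counts easy to check by hand.

```python
def ngram_coverage(sentences, n):
    """Return (observed, possible): distinct n-grams seen vs. V**n possible."""
    vocabulary = set()
    observed = set()
    for sentence in sentences:
        tokens = sentence.split()
        vocabulary.update(tokens)
        observed.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # With V distinct words there are V**n possible sequences of length n.
    return len(observed), len(vocabulary) ** n


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
for n in (1, 2, 3):
    observed, possible = ngram_coverage(corpus, n)
    print(f"n={n}: {observed} of {possible} possible n-grams observed")
```

With a vocabulary of 7 words, the possible n-grams grow from 7 to 49 to 343, while the observed ones stay below 10, so the fraction of the space covered by the data shrinks quickly as n grows.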