In [2]:
import nltk
from nltk import bigrams
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.tokenize import word_tokenize

# Download necessary resources
nltk.download('punkt_tab')

# Sample text (you can replace this with any corpus or large text)
text = "This is a sample sentence for N-gram model. The N-gram model predicts probabilities."

# Tokenize the text
tokens = word_tokenize(text.lower())  # Convert to lowercase to standardize

# Generate bigrams (you can change this to ngrams for other values of N)
bigrams_list = list(bigrams(tokens))

# Create frequency distribution for unigrams and bigrams
unigram_freq = FreqDist(tokens)
bigram_freq = FreqDist(bigrams_list)

# Create a conditional frequency distribution for bigrams
cfdist = ConditionalFreqDist(bigrams_list)

# Calculate the probability of a given sequence of words
def calculate_bigram_probability(sequence):
    sequence_tokens = word_tokenize(sequence.lower())  # Tokenize the sequence
    probability = 1.0

    for i in range(1, len(sequence_tokens)):
        previous_word = sequence_tokens[i - 1]
        current_word = sequence_tokens[i]

        # Calculate the conditional probability P(current_word | previous_word)
        count_previous_word = unigram_freq[previous_word]
        count_bigram = bigram_freq[(previous_word, current_word)]

        # If the previous word is not in the unigram distribution, the probability is 0
        if count_previous_word == 0:
            return 0.0

        # Compute the probability for the bigram
        probability *= count_bigram / count_previous_word

    return probability

# Test the function with a sequence of words
sequence = "the n-gram"
probability = calculate_bigram_probability(sequence)

print(f"Probability of the sequence '{sequence}': {probability}")


Probability of the sequence 'the n-gram': 1.0


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


To calculate the probability of a sequence of words using an N-gram language model, we need to follow these steps:

    Tokenization: Break the text into words.

    N-gram Construction: Construct N-grams from the tokens. An N-gram is a contiguous sequence of N words.

    Count N-grams: Count the occurrences of each N-gram in the training corpus.

    Calculate Probability: The probability of a sequence of words is computed using the formula:

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

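For a sequence of two words, this chain rule reduces to:
$$P(w_1, w_2) = P(w_1) \cdot P(w_2 \mid w_1)$$

A bigram model (N=2) handles longer sequences by conditioning each word only on the word immediately before it. Leaving out the initial term P(w_1), as the code above does, gives:
$$P(w_1, w_2, \dots, w_n) \approx \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$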

Where $P(w_i \mid w_{i-1})$ is the conditional probability of the current word given the previous one, computed as:
$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}$$
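
As a quick check of the count ratio above, here is a minimal sketch that computes it by hand on the same sample text. It uses only the standard library; the regex tokenizer is an assumption standing in for NLTK's word_tokenize, and `bigram_mle` is a hypothetical helper name, not part of the program above.

```python
import re
from collections import Counter

text = "This is a sample sentence for N-gram model. The N-gram model predicts probabilities."

# Assumption: a simple regex tokenizer stands in for NLTK's word_tokenize.
# It keeps hyphenated words together and splits punctuation into its own token.
tokens = re.findall(r"[\w-]+|[^\w\s]", text.lower())

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_mle(previous_word, current_word):
    """Return Count(previous, current) / Count(previous), or 0.0 if previous is unseen."""
    if unigram_counts[previous_word] == 0:
        return 0.0
    return bigram_counts[(previous_word, current_word)] / unigram_counts[previous_word]

# "model" occurs twice and is followed by "predicts" once.
print(bigram_mle("model", "predicts"))  # 0.5
# "the" occurs once and is followed by "n-gram" that one time.
print(bigram_mle("the", "n-gram"))      # 1.0
```

On a text this small most contexts occur only once, so many conditional probabilities come out as exactly 0 or 1.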

🧠 Breakdown of the Code:

    nltk.download('punkt_tab'): Downloads the tokenizer data that word_tokenize needs to split text into words.

    word_tokenize(text.lower()): Tokenizes the input text and converts it to lowercase to standardize the words.

    bigrams(tokens): Generates bigrams (pairs of adjacent tokens). For other values of N, use ngrams(tokens, n) from nltk.util instead.

    FreqDist and ConditionalFreqDist:

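        FreqDist counts how often each unigram and each bigram occurs.

        ConditionalFreqDist stores bigram counts grouped by the first word, so cfdist[w1][w2] is the number of times w2 follows w1. The program builds cfdist but calculate_bigram_probability does not use it; it reads the counts from the two FreqDist objects instead.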

    calculate_bigram_probability(sequence):

        Tokenizes the input sequence.

        Computes the conditional probability of each word given its previous word.

        Multiplies these conditional probabilities to get the overall probability. The probability of the first word, P(w1), is not included, and a single unseen bigram makes the whole product 0.

🧠 Example Output:

If you test with the sequence "the n-gram" from the sample text, the output is:

Probability of the sequence 'the n-gram': 1.0

The result is 1.0 because "the" occurs once in the sample text and is followed by "n-gram" that one time, so Count(the, n-gram) / Count(the) = 1 / 1.
💡 Notes:

    N-gram size: The above example uses bigrams (N=2). To extend it to trigrams (N=3), 4-grams, etc., generate the tuples with ngrams(tokens, n) instead of bigrams(tokens) and condition each word on the previous N-1 words rather than on one.

    Smoothing: In real applications, smoothing techniques like Laplace smoothing are used to handle zero probabilities (e.g., when a particular bigram or trigram doesn't exist in the training corpus).

    Corpus Size: The program uses a small sample text. For better results, you should train the model on a larger corpus to get more accurate probabilities.
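
To make the smoothing note concrete, here is a minimal sketch of Laplace (add-one) smoothing on the same sample text. It is an illustration under stated assumptions rather than part of the program above: the regex tokenizer stands in for word_tokenize, and `laplace_bigram_probability` is a hypothetical helper name.

```python
import re
from collections import Counter

text = "This is a sample sentence for N-gram model. The N-gram model predicts probabilities."

# Assumption: a simple regex tokenizer stands in for NLTK's word_tokenize.
tokens = re.findall(r"[\w-]+|[^\w\s]", text.lower())

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocabulary_size = len(unigram_counts)  # number of distinct tokens, V

def laplace_bigram_probability(previous_word, current_word):
    """Add-one estimate: (Count(prev, cur) + 1) / (Count(prev) + V)."""
    numerator = bigram_counts[(previous_word, current_word)] + 1
    denominator = unigram_counts[previous_word] + vocabulary_size
    return numerator / denominator

# A bigram seen in the text and one that never occurs.
print(laplace_bigram_probability("model", "predicts"))
print(laplace_bigram_probability("model", "sample"))
```

With add-one smoothing the unseen bigram gets a small non-zero probability, so a sequence containing it no longer collapses to 0. The cost is that the estimates for seen bigrams shrink, which is very noticeable on a corpus this small.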

🎓 Tip for Viva:

If asked about N-grams, you can say:

    "N-gram models estimate the probability of a word based on the previous N-1 words. For example, a bigram model predicts the probability of a word given the previous word, and a trigram model predicts based on the previous two words."
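
The definition above can be sketched for any N. The following is a minimal standard-library illustration, not the program from the notebook: `ngram_tuples` and `ngram_probability` are hypothetical helper names, the regex tokenizer is an assumption standing in for word_tokenize, and the denominator counts the n-grams that start with the given context rather than using a separate lower-order count.

```python
import re
from collections import Counter

text = "This is a sample sentence for N-gram model. The N-gram model predicts probabilities."

# Assumption: a simple regex tokenizer stands in for NLTK's word_tokenize.
tokens = re.findall(r"[\w-]+|[^\w\s]", text.lower())

def ngram_tuples(tokens, n):
    """Return all contiguous n-token tuples, in order of appearance."""
    return list(zip(*(tokens[i:] for i in range(n))))

def ngram_probability(tokens, context, word):
    """Estimate P(word | context), where context is a tuple of the previous N-1 words."""
    n = len(context) + 1
    ngram_counts = Counter(ngram_tuples(tokens, n))
    # Total number of n-grams that begin with this context.
    context_total = sum(
        count for gram, count in ngram_counts.items() if gram[:-1] == context
    )
    if context_total == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_total

# Bigram (N=2): one word of context.
print(ngram_probability(tokens, ("the",), "n-gram"))             # 1.0
# Trigram (N=3): two words of context.
print(ngram_probability(tokens, ("n-gram", "model"), "predicts"))  # 0.5
```

The context ("n-gram", "model") appears twice in the sample text, followed once by "." and once by "predicts", which is where the 0.5 comes from.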