<a href="https://colab.research.google.com/github/sakeththelu/NLP/blob/main/4082_nlp_assignment_8_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Build, evaluate, and compare N-gram language models (Unigram, Bigram, Trigram) using the "nlp_dataset.csv" dataset. The task involves text preprocessing, constructing N-gram models, applying Add-one (Laplace) smoothing, calculating sentence probabilities, and computing perplexity to assess model performance.

## Import Libraries


In [1]:
import pandas as pd # Used for data manipulation and analysis
import nltk # Natural Language Toolkit for NLP tasks
from nltk.tokenize import word_tokenize # For splitting text into words (tokens)
from nltk.probability import FreqDist # For calculating frequency distributions of words/n-grams
import collections # Provides useful container data types
from collections import defaultdict, Counter # defaultdict for creating dictionaries with default values, Counter for counting hashable objects
import re # For regular expression operations, useful in text cleaning
import math # For mathematical operations, particularly 'log' for probability calculations

# Download necessary NLTK data if not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

print("Libraries imported and NLTK data downloaded successfully.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Libraries imported and NLTK data downloaded successfully.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Read CSV File

In [2]:
df = pd.read_csv('nlp_dataset.csv')
print("Dataset loaded successfully.")
display(df.head())

Dataset loaded successfully.


Unnamed: 0,text
0,"In recent decades, the relationship between te..."


In [3]:
def preprocess_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text) # Remove special characters and numbers
    return text

df['processed_text'] = df['text'].apply(preprocess_text)
print("Text preprocessing (lowercase and special character removal) completed.")
display(df.head())

Text preprocessing (lowercase and special character removal) completed.


Unnamed: 0,text,processed_text
0,"In recent decades, the relationship between te...",in recent decades the relationship between tec...
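
As a quick check of the cleaning step above, the sketch below reapplies the same lowercase-and-regex logic to a made-up sentence (the sample string is an assumption, not taken from `nlp_dataset.csv`). One side effect worth knowing: matched characters are deleted rather than replaced by a space, so hyphenated words are fused together.

```python
import re

def preprocess_text(text):
    # Same logic as the notebook cell: lowercase, then drop everything
    # except the letters a-z and whitespace.
    text = text.lower()
    return re.sub(r'[^a-z\s]', '', text)

# Hypothetical sample sentence, used only for illustration.
sample = "In 2024, human-computer interaction changed!"
cleaned = preprocess_text(sample)
sample_tokens = cleaned.split()  # simple whitespace split for the demo
print(sample_tokens)
```

If fused words such as `humancomputer` are unwanted, replacing the matched characters with a space instead of an empty string would be one option, though that changes the token counts reported in this notebook.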


In [5]:
# Recent NLTK releases look up the 'punkt_tab' resource when word_tokenize is called.
# The imports from the first cell are still in scope, so only the extra download is needed here.
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

print("Libraries imported and NLTK data downloaded successfully.")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...


Libraries imported and NLTK data downloaded successfully.


[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [6]:
df['tokens'] = df['processed_text'].apply(word_tokenize)
print("Text tokenization completed.")
display(df.head())

Text tokenization completed.


Unnamed: 0,text,processed_text,tokens
0,"In recent decades, the relationship between te...",in recent decades the relationship between tec...,"[in, recent, decades, the, relationship, betwe..."


## Load and Clean Dataset

### Subtask:
Remove rows whose processed text or token list is empty, then display a sample of the cleaned data and describe the dataset.


**Reasoning**:
I need to remove rows with empty 'processed_text' and empty 'tokens' lists to clean the dataset as per the instructions. This will ensure that only valid text entries are used for further analysis.



In [7]:
initial_rows = len(df)
# Remove rows where 'processed_text' is empty or contains only whitespace
df = df[df['processed_text'].str.strip().astype(bool)]

# Remove rows where the 'tokens' column is an empty list
df = df[df['tokens'].apply(lambda x: len(x) > 0)]

print(f"Removed {initial_rows - len(df)} rows due to empty processed text or token lists.")
print(f"New shape of the DataFrame: {df.shape}")
print("Displaying the first few rows of the cleaned DataFrame:")
display(df.head())

Removed 0 rows due to empty processed text or token lists.
New shape of the DataFrame: (1, 3)
Displaying the first few rows of the cleaned DataFrame:


Unnamed: 0,text,processed_text,tokens
0,"In recent decades, the relationship between te...",in recent decades the relationship between tec...,"[in, recent, decades, the, relationship, betwe..."


**Reasoning**:
The subtask requires providing a description of the dataset after cleaning. This includes displaying its shape, information about its columns, and a general summary.



In [8]:
print("Description of the cleaned dataset:\n")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}\n")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nSummary statistics for token length:")
df['token_length'] = df['tokens'].apply(len)
display(df['token_length'].describe())

Description of the cleaned dataset:

Number of rows: 1
Number of columns: 3

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   text            1 non-null      object
 1   processed_text  1 non-null      object
 2   tokens          1 non-null      object
dtypes: object(3)
memory usage: 156.0+ bytes

Summary statistics for token length:


Unnamed: 0,token_length
count,1.0
mean,1181.0
std,
min,1181.0
25%,1181.0
50%,1181.0
75%,1181.0
max,1181.0


## Preprocess Text


In [9]:
from nltk.corpus import stopwords

# Get English stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords_from_tokens(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

# Apply stopword removal
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords_from_tokens)

print("Stopwords imported and `remove_stopwords_from_tokens` function applied.")
display(df.head())

Stopwords imported and `remove_stopwords_from_tokens` function applied.


Unnamed: 0,text,processed_text,tokens,token_length,filtered_tokens
0,"In recent decades, the relationship between te...",in recent decades the relationship between tec...,"[in, recent, decades, the, relationship, betwe...",1181,"[recent, decades, relationship, technology, hu..."


**Reasoning**:
As per the instructions, I will now define a function to add start `<s>` and end `</s>` tokens to the `filtered_tokens` list and apply it to the DataFrame, updating the `filtered_tokens` column. Afterwards, I will display the first few rows of the DataFrame.



In [10]:
def add_sentence_markers(tokens):
    return ['<s>'] + tokens + ['</s>']

# Apply sentence marker addition to the filtered tokens
df['filtered_tokens'] = df['filtered_tokens'].apply(add_sentence_markers)

print("Sentence start and end markers added to 'filtered_tokens'.")
display(df.head())

Sentence start and end markers added to 'filtered_tokens'.


Unnamed: 0,text,processed_text,tokens,token_length,filtered_tokens
0,"In recent decades, the relationship between te...",in recent decades the relationship between tec...,"[in, recent, decades, the, relationship, betwe...",1181,"[<s>, recent, decades, relationship, technolog..."


## Build N-Gram Models

### Subtask:
Construct Unigram, Bigram, and Trigram models from the preprocessed text, and display tables showing word counts and conditional probabilities for each model.
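
Before the full cell, here is a minimal sketch of the maximum-likelihood estimate it computes for bigrams, run on a toy token list. The tokens are invented for illustration, and plain `collections.Counter` stands in for NLTK's `FreqDist`.

```python
from collections import Counter

# Toy token stream with sentence markers (illustrative only).
toy_tokens = ['<s>', 'digital', 'tools', 'digital', 'systems', '</s>']

toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

def mle_bigram_prob(w1, w2):
    # P(w2 | w1) = Count(w1, w2) / Count(w1)
    return toy_bigrams[(w1, w2)] / toy_unigrams[w1]

p_seen = mle_bigram_prob('digital', 'tools')      # 'digital' occurs twice, once followed by 'tools'
p_unseen = mle_bigram_prob('digital', 'privacy')  # unseen bigram: probability 0 without smoothing
print(p_seen, p_unseen)
```

The zero for the unseen bigram is the reason smoothing is applied later in the notebook: a single unseen pair would otherwise drive the probability of a whole sentence to zero.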


In [12]:
import itertools

# 1. Flatten the filtered_tokens column into a single list
all_tokens = list(itertools.chain.from_iterable(df['filtered_tokens']))
print(f"Total number of tokens after flattening: {len(all_tokens)}")

# 2. Build Unigram Model
# 2a. Calculate the frequency distribution of individual words (unigrams)
unigram_counts = FreqDist(all_tokens)

# 2b. Total token count, used as the unigram denominator ('<s>' and '</s>' are included)
total_words = len(all_tokens)

# 2c. Compute the probability for each unigram
unigram_probabilities = {word: count / total_words for word, count in unigram_counts.items()}

# 2d. Create a Pandas DataFrame showing the unigram counts and probabilities
unigram_df = pd.DataFrame({
    'Word': list(unigram_counts.keys()),
    'Count': list(unigram_counts.values()),
    'Probability': list(unigram_probabilities.values())
}).sort_values(by='Count', ascending=False)

print("\n--- Unigram Model ---")
print(f"Number of unique unigrams: {len(unigram_counts)}")
print("First 10 entries of Unigram Model:")
display(unigram_df.head(10))

# 3. Build Bigram Model
# 3a. Generate bigrams from all_tokens
bigrams = list(nltk.bigrams(all_tokens))

# 3b. Calculate the frequency distribution of these bigrams
bigram_counts = FreqDist(bigrams)

# 3c. For each bigram (w1, w2), calculate its conditional probability P(w2 | w1)
bigram_probabilities = defaultdict(float)
for (w1, w2), count in bigram_counts.items():
    if unigram_counts[w1] > 0: # Avoid division by zero
        bigram_probabilities[(w1, w2)] = count / unigram_counts[w1]

# 3d. Create a Pandas DataFrame showing the bigram counts and conditional probabilities
bigram_df = pd.DataFrame({
    'Bigram': [f'{w1} {w2}' for w1, w2 in bigram_counts.keys()],
    'Count': list(bigram_counts.values()),
    'Conditional Probability': [bigram_probabilities[bigram] for bigram in bigram_counts.keys()]
}).sort_values(by='Count', ascending=False)

print("\n--- Bigram Model ---")
print(f"Number of unique bigrams: {len(bigram_counts)}")
print("First 10 entries of Bigram Model:")
display(bigram_df.head(10))

# 4. Build Trigram Model
# 4a. Generate trigrams from all_tokens
trigrams = list(nltk.trigrams(all_tokens))

# 4b. Calculate the frequency distribution of these trigrams
trigram_counts = FreqDist(trigrams)

# 4c. Calculate the frequency distribution of the bigram prefixes (w1, w2)
bigram_prefix_counts = FreqDist([(w1, w2) for w1, w2, w3 in trigrams])

# 4d. For each trigram (w1, w2, w3), calculate its conditional probability P(w3 | w1, w2)
trigram_probabilities = defaultdict(float)
for (w1, w2, w3), count in trigram_counts.items():
    if bigram_prefix_counts[(w1, w2)] > 0: # Avoid division by zero; the context is the (w1, w2) prefix
        trigram_probabilities[(w1, w2, w3)] = count / bigram_prefix_counts[(w1, w2)]

# 4e. Create a Pandas DataFrame showing the trigram counts and conditional probabilities
trigram_df = pd.DataFrame({
    'Trigram': [f'{w1} {w2} {w3}' for w1, w2, w3 in trigram_counts.keys()],
    'Count': list(trigram_counts.values()),
    'Conditional Probability': [trigram_probabilities[trigram] for trigram in trigram_counts.keys()]
}).sort_values(by='Count', ascending=False)

print("\n--- Trigram Model ---")
print(f"Number of unique trigrams: {len(trigram_counts)}")
print("First 10 entries of Trigram Model:")
display(trigram_df.head(10))


Total number of tokens after flattening: 891

--- Unigram Model ---
Number of unique unigrams: 602
First 10 entries of Unigram Model:


Unnamed: 0,Word,Count,Probability
89,may,13,0.01459
12,digital,13,0.01459
108,technological,12,0.013468
13,systems,10,0.011223
34,social,8,0.008979
278,innovation,8,0.008979
44,also,8,0.008979
4,technology,7,0.007856
5,human,7,0.007856
153,critical,6,0.006734



--- Bigram Model ---
Number of unique bigrams: 879
First 10 entries of Bigram Model:


Unnamed: 0,Bigram,Count,Conditional Probability
156,digital tools,2,0.153846
183,artificial intelligence,2,1.0
94,may arise,2,0.153846
70,media platforms,2,1.0
69,social media,2,0.25
14,influence individuals,2,0.4
13,systems influence,2,0.2
146,potentially improving,2,0.5
503,critical evaluation,2,0.333333
554,potential risks,2,1.0



--- Trigram Model ---
Number of unique trigrams: 888
First 10 entries of Trigram Model:


Unnamed: 0,Trigram,Count,Conditional Probability
69,social media platforms,2,1.0
555,demands careful consideration,1,1.0
585,text expanding boundaries,1,1.0
586,expanding boundaries traditional,1,1.0
587,boundaries traditional authorship,1,1.0
588,traditional authorship developments,1,1.0
589,authorship developments raise,1,1.0
590,developments raise philosophical,1,1.0
591,raise philosophical questions,1,1.0
592,philosophical questions originality,1,1.0


## Apply Smoothing


In [13]:
V = len(unigram_counts) # Vocabulary size
print(f"Vocabulary size (V): {V}")

def get_smoothed_unigram_probability(word, unigram_counts, total_words, V):
    return (unigram_counts.get(word, 0) + 1) / (total_words + V)

# Create a DataFrame for smoothed unigram probabilities
unigram_smoothed_df = pd.DataFrame({
    'Word': list(unigram_counts.keys()),
    'Count': list(unigram_counts.values()),
    'Smoothed Probability': [get_smoothed_unigram_probability(word, unigram_counts, total_words, V) for word in unigram_counts.keys()]
}).sort_values(by='Count', ascending=False)

print("\n--- Smoothed Unigram Model ---")
print("First 10 entries of Smoothed Unigram Model:")
display(unigram_smoothed_df.head(10))

Vocabulary size (V): 602

--- Smoothed Unigram Model ---
First 10 entries of Smoothed Unigram Model:


Unnamed: 0,Word,Count,Smoothed Probability
89,may,13,0.009377
12,digital,13,0.009377
108,technological,12,0.008707
13,systems,10,0.007368
34,social,8,0.006028
278,innovation,8,0.006028
44,also,8,0.006028
4,technology,7,0.005358
5,human,7,0.005358
153,critical,6,0.004689


**Reasoning**:
Now I will implement Add-one smoothing for bigram probabilities using the previously calculated unigram counts and the vocabulary size (V). I will define a function for smoothed bigram probabilities, apply it, and then store the results in a new DataFrame `bigram_smoothed_df`, displaying its head.



In [14]:
def get_smoothed_bigram_probability(w1, w2, bigram_counts, unigram_counts, V):
    # Count(w1, w2) + 1
    numerator = bigram_counts.get((w1, w2), 0) + 1
    # Count(w1) + V
    denominator = unigram_counts.get(w1, 0) + V
    return numerator / denominator

# Create a DataFrame for smoothed bigram probabilities
bigram_smoothed_probabilities = [
    get_smoothed_bigram_probability(w1, w2, bigram_counts, unigram_counts, V)
    for (w1, w2) in bigram_counts.keys()
]

bigram_smoothed_df = pd.DataFrame({
    'Bigram': [f'{w1} {w2}' for w1, w2 in bigram_counts.keys()],
    'Count': list(bigram_counts.values()),
    'Smoothed Probability': bigram_smoothed_probabilities
}).sort_values(by='Count', ascending=False)

print("\n--- Smoothed Bigram Model ---")
print("First 10 entries of Smoothed Bigram Model:")
display(bigram_smoothed_df.head(10))


--- Smoothed Bigram Model ---
First 10 entries of Smoothed Bigram Model:


Unnamed: 0,Bigram,Count,Smoothed Probability
156,digital tools,2,0.004878
183,artificial intelligence,2,0.004967
94,may arise,2,0.004878
70,media platforms,2,0.004967
69,social media,2,0.004918
14,influence individuals,2,0.004942
13,systems influence,2,0.004902
146,potentially improving,2,0.00495
503,critical evaluation,2,0.004934
554,potential risks,2,0.004967
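
The smoothed values in the table above can be reproduced by hand from the counts already shown in this notebook. The sketch below plugs them into the add-one formula `(Count(w1, w2) + 1) / (Count(w1) + V)`. The helper name `add_one_bigram` is introduced here for illustration only, and `Count("artificial") = 2` is inferred from the unsmoothed probability of 1.0 rather than read from a table.

```python
V = 602  # vocabulary size reported by the notebook

def add_one_bigram(bigram_count, context_count, vocab_size=V):
    # Add-one (Laplace) estimate: (Count(w1, w2) + 1) / (Count(w1) + V)
    return (bigram_count + 1) / (context_count + vocab_size)

# "digital tools": Count = 2, Count("digital") = 13
p_digital_tools = add_one_bigram(2, 13)
# "artificial intelligence": Count = 2; an unsmoothed probability of 1.0
# implies Count("artificial") = 2
p_artificial_intelligence = add_one_bigram(2, 2)

print(round(p_digital_tools, 6), round(p_artificial_intelligence, 6))
```

Note how "artificial intelligence" falls from an unsmoothed 1.0 to roughly 0.005. When V is much larger than the counts, add-one smoothing moves most of the probability mass to unseen events.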


In [15]:
def get_smoothed_trigram_probability(w1, w2, w3, trigram_counts, bigram_prefix_counts, V):
    # Count(w1, w2, w3) + 1
    numerator = trigram_counts.get((w1, w2, w3), 0) + 1
    # Count(w1, w2) + V
    denominator = bigram_prefix_counts.get((w1, w2), 0) + V
    return numerator / denominator

# Create a DataFrame for smoothed trigram probabilities
trigram_smoothed_probabilities = [
    get_smoothed_trigram_probability(w1, w2, w3, trigram_counts, bigram_prefix_counts, V)
    for (w1, w2, w3) in trigram_counts.keys()
]

trigram_smoothed_df = pd.DataFrame({
    'Trigram': [f'{w1} {w2} {w3}' for w1, w2, w3 in trigram_counts.keys()],
    'Count': list(trigram_counts.values()),
    'Smoothed Probability': trigram_smoothed_probabilities
}).sort_values(by='Count', ascending=False)

print("\n--- Smoothed Trigram Model ---")
print("First 10 entries of Smoothed Trigram Model:")
display(trigram_smoothed_df.head(10))


--- Smoothed Trigram Model ---
First 10 entries of Smoothed Trigram Model:


Unnamed: 0,Trigram,Count,Smoothed Probability
69,social media platforms,2,0.004967
555,demands careful consideration,1,0.003317
585,text expanding boundaries,1,0.003317
586,expanding boundaries traditional,1,0.003317
587,boundaries traditional authorship,1,0.003317
588,traditional authorship developments,1,0.003317
589,authorship developments raise,1,0.003317
590,developments raise philosophical,1,0.003317
591,raise philosophical questions,1,0.003317
592,philosophical questions originality,1,0.003317


## Calculate Sentence Probability


In [16]:
example_sentences = [
    "Technology reshapes human behavior in digital systems.",
    "Artificial intelligence may arise challenges for privacy.",
    "Social media platforms often present ethical dilemmas.",
    "The future of education will be influenced by technological advancements.",
    "Digital tools are essential for modern communication skills."
]

print("Example sentences defined.")

Example sentences defined.


In [17]:
def preprocess_sentence(sentence, stop_words):
    # Convert to lowercase
    text = sentence.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize words
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    # Pad with two '<s>' tokens and one '</s>' so that the first trigram,
    # P(w1 | <s>, <s>), has a full context. The unigram and bigram functions
    # receive the same padded list.
    return ['<s>', '<s>'] + filtered_tokens + ['</s>']

print("Sentence preprocessing helper function 'preprocess_sentence' defined.")

Sentence preprocessing helper function 'preprocess_sentence' defined.


In [18]:
def calculate_unigram_sentence_probability(tokens, unigram_counts, total_words, V):
    log_prob_sum = 0.0
    # `tokens` comes from preprocess_sentence: ['<s>', '<s>'] + words + ['</s>'].
    # A unigram model scores every token independently, so all tokens are included,
    # which means '<s>' contributes two factors to the product.
    # Log-probabilities are summed to avoid numerical underflow.

    for word in tokens:
        prob = get_smoothed_unigram_probability(word, unigram_counts, total_words, V)
        if prob > 0:
            log_prob_sum += math.log(prob)
        else:
            # This case should ideally not happen with smoothing, but as a safeguard
            return 0.0 # Return 0 probability if any token has 0 probability

    return math.exp(log_prob_sum)

print("Function 'calculate_unigram_sentence_probability' defined.")

Function 'calculate_unigram_sentence_probability' defined.


In [19]:
def calculate_bigram_sentence_probability(tokens, bigram_counts, unigram_counts, V):
    log_prob_sum = 0.0
    # `tokens` comes from preprocess_sentence: ['<s>', '<s>', w1, ..., wn, '</s>'].
    # Every adjacent pair is scored: ('<s>', '<s>'), ('<s>', w1), (w1, w2), ..., (wn, '</s>').
    # The leading ('<s>', '<s>') pair exists only because of the trigram padding; a
    # conventional bigram model would start at P(w1 | <s>). It never occurs in the
    # training tokens, so it contributes one extra smoothed factor to every sentence.

    for i in range(len(tokens) - 1):
        w1, w2 = tokens[i], tokens[i+1]
        prob = get_smoothed_bigram_probability(w1, w2, bigram_counts, unigram_counts, V)
        if prob > 0:
            log_prob_sum += math.log(prob)
        else:
            # This case should ideally not happen with smoothing, but as a safeguard
            return 0.0 # Return 0 probability if any bigram has 0 probability

    return math.exp(log_prob_sum)

print("Function 'calculate_bigram_sentence_probability' defined.")

Function 'calculate_bigram_sentence_probability' defined.


In [20]:
def calculate_trigram_sentence_probability(tokens, trigram_counts, bigram_prefix_counts, V):
    log_prob_sum = 0.0
    # `tokens` comes from preprocess_sentence: ['<s>', '<s>', w1, ..., wn, '</s>'].
    # Trigrams run from ('<s>', '<s>', w1) and ('<s>', w1, w2) through (w_{n-1}, wn, '</s>'),
    # so the loop stops at len(tokens) - 2 to keep tokens[i+2] in range.
    for i in range(len(tokens) - 2):
        w1, w2, w3 = tokens[i], tokens[i+1], tokens[i+2]
        prob = get_smoothed_trigram_probability(w1, w2, w3, trigram_counts, bigram_prefix_counts, V)
        if prob > 0:
            log_prob_sum += math.log(prob)
        else:
            # This case should ideally not happen with smoothing, but as a safeguard
            return 0.0 # Return 0 probability if any trigram has 0 probability

    return math.exp(log_prob_sum)

print("Function 'calculate_trigram_sentence_probability' defined.")

Function 'calculate_trigram_sentence_probability' defined.
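
The three functions above accumulate log-probabilities and only exponentiate at the end. The sketch below illustrates why, assuming a hypothetical sequence of 200 tokens each scored at 1/602 (numbers chosen purely for illustration): the direct product underflows to zero in double precision, while the log sum stays finite.

```python
import math

p = 1 / 602        # illustrative per-token probability
n_tokens = 200     # hypothetical sequence length

direct_product = 1.0
log_sum = 0.0
for _ in range(n_tokens):
    direct_product *= p       # shrinks until it underflows
    log_sum += math.log(p)    # stays well within float range

print(direct_product)  # underflows to 0.0
print(log_sum)
```

The final `math.exp(log_prob_sum)` in the notebook's functions would underflow in the same way for a long enough input, so for longer texts it is safer to keep and compare the log-probabilities themselves.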


**Reasoning**:
Now I need to preprocess each example sentence, calculate its probability using the smoothed Unigram, Bigram, and Trigram models, and then store and display these results in a DataFrame.



In [21]:
results = []

for sentence in example_sentences:
    processed_tokens = preprocess_sentence(sentence, stop_words)

    # Calculate Unigram probability
    unigram_prob = calculate_unigram_sentence_probability(processed_tokens, unigram_counts, total_words, V)

    # Calculate Bigram probability
    bigram_prob = calculate_bigram_sentence_probability(processed_tokens, bigram_counts, unigram_counts, V)

    # Calculate Trigram probability
    trigram_prob = calculate_trigram_sentence_probability(processed_tokens, trigram_counts, bigram_prefix_counts, V)

    results.append({
        'Sentence': sentence,
        'Unigram Probability': unigram_prob,
        'Bigram Probability': bigram_prob,
        'Trigram Probability': trigram_prob
    })

probabilities_df = pd.DataFrame(results)

print("Sentence probabilities calculated for all models:")
display(probabilities_df)

Sentence probabilities calculated for all models:


Unnamed: 0,Sentence,Unigram Probability,Bigram Probability,Trigram Probability
0,Technology reshapes human behavior in digital ...,6.417615e-24,3.251038e-22,3.472668e-20
1,Artificial intelligence may arise challenges f...,8.751293e-25,4.998398e-22,3.466919e-20
2,Social media platforms often present ethical d...,2.747601e-28,8.343622e-25,1.7277000000000002e-22
3,The future of education will be influenced by ...,1.5165520000000002e-22,6.753338e-20,2.097492e-17
4,Digital tools are essential for modern communi...,2.187823e-24,1.6578840000000002e-22,3.478437e-20
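
One caveat when reading the table above: the three sentence-probability functions multiply a different number of factors for the same padded token list, so the columns are not directly comparable. The sketch below counts the factors for a list shaped like the output of `preprocess_sentence` (the words are placeholders).

```python
# Shape returned by preprocess_sentence: two '<s>', the content words, one '</s>'.
padded = ['<s>', '<s>', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6', '</s>']

n_unigram_factors = len(padded)       # one factor per token
n_bigram_factors = len(padded) - 1    # one factor per adjacent pair
n_trigram_factors = len(padded) - 2   # one factor per adjacent triple

print(n_unigram_factors, n_bigram_factors, n_trigram_factors)
```

Every factor is below 1, so a product with fewer factors tends to be larger regardless of model quality. That likely accounts for part of the trigram column's apparent advantage. Perplexity, computed later in the notebook, normalizes by a fixed token count and is the safer basis for comparing the models.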


## Calculate Perplexity


In [22]:
def calculate_perplexity(log_prob_sum, num_tokens):
    if num_tokens == 0:
        return float('inf')
    return math.exp(-log_prob_sum / num_tokens)

print("Function 'calculate_perplexity' defined.")

Function 'calculate_perplexity' defined.
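
`calculate_perplexity` implements `PP = exp(-(1/N) * sum(log p_i))`, which is the inverse geometric mean of the per-token probabilities. The sketch below checks that identity on made-up probabilities; the function body is copied from the cell above so the example runs on its own.

```python
import math

def calculate_perplexity(log_prob_sum, num_tokens):
    # Same definition as the notebook cell above.
    if num_tokens == 0:
        return float('inf')
    return math.exp(-log_prob_sum / num_tokens)

# Made-up per-token probabilities for a four-token sequence.
toy_probs = [0.5, 0.25, 0.25, 0.5]
toy_log_sum = sum(math.log(p) for p in toy_probs)
toy_perplexity = calculate_perplexity(toy_log_sum, len(toy_probs))

# Inverse geometric mean: (1 / (0.5 * 0.25 * 0.25 * 0.5)) ** (1/4) = 64 ** 0.25
print(toy_perplexity)
```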


In [23]:
perplexity_results = []

for sentence in example_sentences:
    # preprocess_sentence returns ['<s>', '<s>'] + words + ['</s>'].
    processed_tokens_for_models = preprocess_sentence(sentence, stop_words)

    # Perplexity is normalized over the predicted tokens only: the content words plus
    # the final '</s>'. Dropping the two leading '<s>' markers with [2:] gives that list.
    tokens_for_perplexity = processed_tokens_for_models[2:]
    num_tokens = len(tokens_for_perplexity)

    # Initialize log probability sums
    log_prob_sum_unigram = 0.0
    log_prob_sum_bigram = 0.0
    log_prob_sum_trigram = 0.0

    # Calculate log probability sum for Unigram model
    for word in tokens_for_perplexity:
        prob = get_smoothed_unigram_probability(word, unigram_counts, total_words, V)
        if prob > 0:
            log_prob_sum_unigram += math.log(prob)
        else:
            log_prob_sum_unigram = float('-inf') # Handle zero probability case
            break

    # Calculate log probability sum for Bigram model
    # Sequence for bigrams: ['<s>'] + actual words + ['</s>']
    bigram_sequence = ['<s>'] + tokens_for_perplexity
    for i in range(len(bigram_sequence) - 1):
        w1, w2 = bigram_sequence[i], bigram_sequence[i+1]
        prob = get_smoothed_bigram_probability(w1, w2, bigram_counts, unigram_counts, V)
        if prob > 0:
            log_prob_sum_bigram += math.log(prob)
        else:
            log_prob_sum_bigram = float('-inf') # Handle zero probability case
            break

    # Calculate log probability sum for Trigram model
    # Sequence for trigrams: ['<s>', '<s>'] + actual words + ['</s>']
    trigram_sequence = ['<s>', '<s>'] + tokens_for_perplexity
    for i in range(len(trigram_sequence) - 2):
        w1, w2, w3 = trigram_sequence[i], trigram_sequence[i+1], trigram_sequence[i+2]
        prob = get_smoothed_trigram_probability(w1, w2, w3, trigram_counts, bigram_prefix_counts, V)
        if prob > 0:
            log_prob_sum_trigram += math.log(prob)
        else:
            log_prob_sum_trigram = float('-inf') # Handle zero probability case
            break

    # Calculate perplexity for each model
    unigram_perplexity = calculate_perplexity(log_prob_sum_unigram, num_tokens)
    bigram_perplexity = calculate_perplexity(log_prob_sum_bigram, num_tokens)
    trigram_perplexity = calculate_perplexity(log_prob_sum_trigram, num_tokens)

    perplexity_results.append({
        'Sentence': sentence,
        'Unigram Perplexity': unigram_perplexity,
        'Bigram Perplexity': bigram_perplexity,
        'Trigram Perplexity': trigram_perplexity,
        'Num Tokens': num_tokens
    })

perplexity_df = pd.DataFrame(perplexity_results)

print("Perplexity calculated for all models and sentences:")
display(perplexity_df)

Perplexity calculated for all models and sentences:


Unnamed: 0,Sentence,Unigram Perplexity,Bigram Perplexity,Trigram Perplexity,Num Tokens
0,Technology reshapes human behavior in digital ...,310.718186,470.460261,602.428132,7
1,Artificial intelligence may arise challenges f...,413.029586,442.421251,602.570752,7
2,Social media platforms often present ethical d...,533.179365,459.505812,525.189998,8
3,The future of education will be influenced by ...,477.361255,539.130661,602.166551,6
4,Digital tools are essential for modern communi...,362.353608,517.969348,602.285308,7


### Interpretation of Perplexity Results

Perplexity measures how well a probability model predicts a sample. It is the exponentiated average negative log-probability per token, so a lower score means the model assigns higher probability to the tokens it is asked to predict. The values for the Unigram, Bigram, and Trigram models on each example sentence are:

1.  **"Technology reshapes human behavior in digital systems."**
    *   Unigram Perplexity: 310.72
    *   Bigram Perplexity: 470.46
    *   Trigram Perplexity: 602.43
    For this sentence, the Unigram model performs the best (lowest perplexity).

2.  **"Artificial intelligence may arise challenges for privacy."**
    *   Unigram Perplexity: 413.03
    *   Bigram Perplexity: 442.42
    *   Trigram Perplexity: 602.57
    Again, the Unigram model shows the lowest perplexity, indicating it's the best predictor for this sentence among the three.

3.  **"Social media platforms often present ethical dilemmas."**
    *   Unigram Perplexity: 533.18
    *   Bigram Perplexity: 459.51
    *   Trigram Perplexity: 525.19
    In this case, the Bigram model has the lowest perplexity, suggesting it better captures the sequence of words for this sentence.

4.  **"The future of education will be influenced by technological advancements."**
    *   Unigram Perplexity: 477.36
    *   Bigram Perplexity: 539.13
    *   Trigram Perplexity: 602.17
    The Unigram model performs best for this sentence as well.

5.  **"Digital tools are essential for modern communication skills."**
    *   Unigram Perplexity: 362.35
    *   Bigram Perplexity: 517.97
    *   Trigram Perplexity: 602.29
    Here, the Unigram model again has the lowest perplexity.

Across these test sentences, the Unigram model has the lowest perplexity in four of five cases. This is counter-intuitive, since higher-order N-gram models are normally expected to perform better by capturing more context, but it is consistent with a very small training corpus combined with Add-one smoothing. With only 891 tokens and a vocabulary of 602 words, most of the bigrams and trigrams needed for the test sentences are unlikely to appear in the training data. Add-one smoothing removes the zero probabilities, but for an unseen N-gram whose context count is close to zero it assigns a probability close to $1/V$, which corresponds to a perplexity close to the vocabulary size $V$. The Trigram perplexities of about 602 match this baseline, so the Trigram model behaves almost like a uniform distribution over the vocabulary, while the Unigram model still benefits from the observed word frequencies.
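
The link between Add-one smoothing and a perplexity near the vocabulary size can be checked with a little arithmetic. The sketch below is illustrative only: `add_one_prob` and `perplexity` are hypothetical helpers written for this example (they are not the notebook's own `calculate_perplexity`), and natural logarithms are assumed. Only the vocabulary size of 602 and the token count of 7 come from the results in this notebook.

```python
import math

V = 602          # vocabulary size reported in this notebook
num_tokens = 7   # token count of the first test sentence

def add_one_prob(ngram_count, context_count, vocab_size):
    """Add-one (Laplace) estimate: (count + 1) / (context count + V)."""
    return (ngram_count + 1) / (context_count + vocab_size)

def perplexity(log_prob_sum, n):
    """exp of the negative average log-probability per token (natural log assumed)."""
    return math.exp(-log_prob_sum / n)

# Case 1: every N-gram and every context is unseen, so each term is 1 / V.
log_sum_unseen = num_tokens * math.log(add_one_prob(0, 0, V))

# Case 2: each context was seen once, but never followed by the test word.
log_sum_ctx_seen = num_tokens * math.log(add_one_prob(0, 1, V))

print(round(perplexity(log_sum_unseen, num_tokens), 2))    # 602.0
print(round(perplexity(log_sum_ctx_seen, num_tokens), 2))  # 603.0
```

Under these assumptions, a sentence made entirely of unseen N-grams lands at a perplexity between roughly 602 and 603, which is the range the Trigram column shows for four of the five sentences.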

## Summary

### Q&A
*   **Performance and Characteristics of N-gram Models**:
    *   The Unigram model, which considers words independently, exhibited the lowest perplexity for 4 out of 5 test sentences. This suggests it was the most robust predictor for these specific sentences given the limited training data.
    *   The Bigram model showed the lowest perplexity for one sentence ("Social media platforms often present ethical dilemmas."), indicating it captured useful two-word sequences for that particular context.
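    *   The Trigram model had the highest perplexity for 4 of 5 test sentences; for the remaining sentence ("Social media platforms often present ethical dilemmas.") the Unigram model was highest. The Trigram model's longer context most likely hurt it because of data sparsity: with so few training trigrams, most trigrams in the test sentences are probably unseen, so Add-one smoothing dominates their estimates.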
    *   In terms of characteristics, higher-order models (Bigram, Trigram) aim to capture more linguistic context but are more susceptible to data sparsity with small datasets, leading to less reliable probability estimates and higher perplexity, even with basic smoothing. The Unigram model, despite its lack of context, proved more stable in this scenario.

*   **Perplexity Comparison and Interpretation**:
    A lower perplexity score indicates a better model, as it means the model is more confident and accurate in predicting the next word.
    *   For the sentence "Technology reshapes human behavior in digital systems.", Unigram (310.72) performed best.
    *   For "Artificial intelligence may arise challenges for privacy.", Unigram (413.03) performed best.
    *   For "Social media platforms often present ethical dilemmas.", Bigram (459.51) performed best.
    *   For "The future of education will be influenced by technological advancements.", Unigram (477.36) performed best.
    *   For "Digital tools are essential for modern communication skills.", Unigram (362.35) performed best.
    The Unigram model generally showed lower perplexity for these specific test sentences compared to the Bigram and Trigram models. This counter-intuitive result is likely due to the small size of the training corpus, leading to significant data sparsity for higher-order N-grams. Even with Add-one smoothing, the lack of sufficient training examples for specific bigram and trigram sequences makes the simpler Unigram model appear more effective in predicting these particular sentences.
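
The per-sentence comparison can be tallied mechanically. The following is a minimal sketch in plain Python; the values are copied from the perplexity results table, and the variable names (`results`, `best`, `worst`) are introduced here for illustration.

```python
from collections import Counter

# Perplexity values copied from the results table, one dict per test sentence.
results = [
    {"Unigram": 310.718186, "Bigram": 470.460261, "Trigram": 602.428132},
    {"Unigram": 413.029586, "Bigram": 442.421251, "Trigram": 602.570752},
    {"Unigram": 533.179365, "Bigram": 459.505812, "Trigram": 525.189998},
    {"Unigram": 477.361255, "Bigram": 539.130661, "Trigram": 602.166551},
    {"Unigram": 362.353608, "Bigram": 517.969348, "Trigram": 602.285308},
]

# For each sentence, pick the model with the lowest and the highest perplexity.
best = Counter(min(row, key=row.get) for row in results)
worst = Counter(max(row, key=row.get) for row in results)

print("Lowest perplexity:", dict(best))
print("Highest perplexity:", dict(worst))
```

If the notebook's `perplexity_df` is still in memory, the same tally should be obtainable with `perplexity_df[cols].idxmin(axis=1).value_counts()`, where `cols` is the list of the three perplexity column names.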

### Data Analysis Key Findings
*   The initial dataset containing 1 row was successfully loaded and processed. After preprocessing, the text was converted to lowercase, special characters and numbers were removed, stopwords were filtered, and special `<s>` and `</s>` markers were added to sentences.
*   The final preprocessed data resulted in 891 total tokens and a vocabulary size of 602 unique words.
*   N-gram models were successfully constructed:
    *   Unigram model identified 602 unique unigrams.
    *   Bigram model identified 879 unique bigrams.
    *   Trigram model identified 888 unique trigrams.
*   Add-one (Laplace) smoothing was successfully implemented and applied to all N-gram models, addressing the zero-probability problem for unseen N-grams.
*   Sentence probabilities were calculated for five example sentences using the smoothed models. As expected for longer sentences, these probabilities were extremely small, ranging from approximately $10^{-20}$ to $10^{-28}$.
*   Perplexity was calculated for the same five example sentences across all three smoothed N-gram models.
    *   The Unigram model achieved the lowest perplexity for 4 out of 5 sentences (e.g., 310.72 for "Technology reshapes human behavior in digital systems."), indicating it was the most effective predictor for these sentences.
    *   The Bigram model achieved the lowest perplexity for one sentence (459.51 for "Social media platforms often present ethical dilemmas.").
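    *   The Trigram model showed the highest perplexity for 4 out of 5 sentences (e.g., 602.43 for the first sentence); for "Social media platforms often present ethical dilemmas." the Unigram model was highest (533.18 vs. 525.19 for Trigram). This suggests that the Trigram model's higher-order context suffered most from data sparsity within the given dataset.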
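
The counts above (879 unique bigrams and 888 unique trigrams from 891 tokens) suggest that nearly every higher-order N-gram occurs only once. The sketch below shows one way such counts and the share of single-occurrence N-grams could be computed. The `ngram_counts` helper and the toy token list are hypothetical and are not the notebook's corpus or code.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy tokens with sentence markers, not the notebook's data.
tokens = "<s> digital tools shape digital culture </s>".split()

for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    # Share of n-gram types that were observed exactly once.
    singleton_share = sum(1 for c in counts.values() if c == 1) / len(counts)
    print(f"n={n}: {len(counts)} types, singleton share {singleton_share:.2f}")
```

A singleton share close to 1 for bigrams and trigrams means the model has almost no repeated evidence for any context, which is the situation in which smoothing determines most of the probability estimates.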

### Insights or Next Steps
*   The unexpectedly lower perplexity of the Unigram model for most test sentences highlights how strongly training data size affects higher-order N-gram models. When the corpus is small, data sparsity severely limits their ability to capture word-sequence patterns, and Add-one smoothing does not compensate for it.
*   To accurately compare the inherent benefits of Bigram and Trigram models in capturing linguistic context, it would be beneficial to train these models on a significantly larger and more diverse text corpus. Additionally, exploring more advanced smoothing techniques beyond Add-one, such as Kneser-Ney or Witten-Bell smoothing, could further mitigate the data sparsity issue and improve the performance of higher-order N-gram models.
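
As a rough illustration of why the choice of smoothing matters here, the sketch below uses additive (Lidstone, or add-k) smoothing. This is a simpler technique than the Kneser-Ney and Witten-Bell methods named above and is shown only because it makes the effect easy to see. The `add_k_prob` helper and the chosen `k` values are assumptions for this example; the vocabulary size of 602 is taken from the findings.

```python
def add_k_prob(ngram_count, context_count, vocab_size, k=1.0):
    """Additive (Lidstone) smoothing; k=1.0 reduces to Add-one smoothing."""
    return (ngram_count + k) / (context_count + k * vocab_size)

V = 602  # vocabulary size reported in the findings

# Scenario: a context seen once in training, followed once by a given word.
for k in (1.0, 0.1, 0.01):
    seen = add_k_prob(1, 1, V, k)      # the bigram that was observed
    unseen = add_k_prob(0, 1, V, k)    # any bigram that was not observed
    print(f"k={k}: seen={seen:.4f}, unseen={unseen:.6f}")
```

With `k=1.0`, the observed bigram receives a probability of only 2/603, because almost all probability mass is spread over the 601 unseen continuations. Smaller `k` values keep more mass on what was actually observed, which is the same goal the more advanced methods pursue in a more principled way.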
