#Q1: Probabilistic N-Gram Language Model(50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1.Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2.Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3.Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4.Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length.
Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5.Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [1]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict, Counter

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string

# Function to preprocess text
def preprocess_text(text):
    # Fill in: Implement text preprocessing steps like lowercasing, removing punctuation, etc.
    # You may use NLTK or other libraries for this.

    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)

    return tokens

# print("testing preprocessing:")
# text_example = "The Quick Brown Fox , is Jumping Over The rabbit."
# preprocessed_text = preprocess_text(text_example)
# print(preprocessed_text)

# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Fill in: Implement code to build an n-gram model from the given corpus.
    # You may use NLTK's word_tokenize function.

    n_gram_probabilities = defaultdict(dict)

    # Generate all tokens from the corpus
    # allTokens = [token for doc in corpus for token in doc]

    # Iterate over each document in the corpus to build the n-gram model
    for tokens in corpus:
        n_grams = list(ngrams(tokens, n))
        n_minus_1_grams = list(ngrams(tokens, n-1))

        n_gram_counts = Counter(n_grams)
        n_minus_1_gram_counts = Counter(n_minus_1_grams)

        # Calculate probabilities without Laplace smoothing
        for n_gram in n_gram_counts:
            prefix = n_gram[:-1]
            word = n_gram[-1]
            # Calculate probability without adding vocabulary size for smoothing
            probability = n_gram_counts[n_gram] / n_minus_1_gram_counts[prefix]
            n_gram_probabilities[prefix][word] = probability
    return n_gram_probabilities

def generate_text(model, seed_text, n, probability_threshold=0.02, min_length=8):
    generated_text = preprocess_text(seed_text)
    current_length = len(generated_text)

    while current_length < min_length:
        # Get the last n-1 words as the current context
        context = tuple(generated_text[-(n-1):])

        # Check if the context exists in the model
        if context in model:
            possible_words = list(model[context].items())
            # Sort words by their probability in descending order
            possible_words.sort(key=lambda x: x[1], reverse=True)
            # Filter out words based on the probability threshold
            possible_words = [word for word in possible_words if word[1] >= probability_threshold]
            if not possible_words:
                break  # If no words meet the threshold, stop the generation

            # Randomly select a word from the filtered list based on their probabilities
            next_word = random.choices([word[0] for word in possible_words], [word[1] for word in possible_words])[0]
            generated_text.append(next_word)
            current_length += 1
        else:
            break  # If the context is not found in the model, stop the generation

    return ' '.join(generated_text)

In [3]:

# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 3  # You may change this value

# print(preprocessed_corpus[:4])
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)
# print (probabilistic_ngram_model)
seed_text = "Inflation is"

# for n_value in range (2, 10):
print("n is: ", n_value)
# Build the probabilistic n-gram model
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=8)
print(f"Generated Text: {generated_text}")
seed_text = generated_text




n is:  3
Generated Text: inflation is expected to know their ships final


In [10]:
def evaluate_ngram_performance(corpus, ngram_values, probability_thresholds, minimum_lengths):
    max_text_length = -1  # Reflects the maximum length of generated text
    optimal_params = None  # To hold the best configuration

    for ngram_size in ngram_values:
        for threshold in probability_thresholds:
            for min_length in minimum_lengths:
                model = build_probabilistic_ngram_model(corpus, ngram_size)
                total_text_length = 0  # Sum of lengths of generated texts

                # Seed texts for generating new text to evaluate the model
                for seed_text in ["Inflation is", "The weather"]:
                    generated_text = generate_text(model, seed_text, ngram_size, threshold, min_length)
                    print(f"Generated Text: {generated_text}")
                    total_text_length += len(generated_text.split())

                if total_text_length > max_text_length:
                    max_text_length = total_text_length
                    optimal_params = (ngram_size, threshold, min_length)

    return optimal_params

best_n, best_threshold, best_min_length = evaluate_ngram_performance(
    preprocessed_corpus,
    ngram_values=[2,3,4,5],
    probability_thresholds=[0.02, 0.05, 0.1, 0.5, 1],
    minimum_lengths=[20, 10, 8, 5]
)

print(f"Best Parameters: n={best_n}, threshold={best_threshold}, min_length={best_min_length}")


Generated Text: inflation is clear import levies affect earnings endusers can stay firm waterman acquisition ltnew milford savings ltfambo 3rd term significance
Generated Text: the weather slack from july 21 will retire in santos the 43dlrashare offer 25 this ceiling by greek industry achieved
Generated Text: inflation is reached reasonably well enable its biannual opec boosted
Generated Text: the weather manufacturing base larger new airline has provoked us
Generated Text: inflation is 3734 us remained firm linked with
Generated Text: the weather delaying approval calls report qualified due
Generated Text: inflation is isthmus pay 40
Generated Text: the weather despite suppositions to
Generated Text: inflation is renewing its flexible pricing period yincludes ill was instead with 2519 in defense electronics gtx statement later alfaro
Generated Text: the weather manufacturing base was muted as real run until their side effect as taiwan the 3491030 new acting to
Generated Text: inflatio

***faced challenges:***
* passing the right compatible data types to functions.
* text was not being generated with some probability_threshold like 1.
* probability calculation can be improved by using Laplace smoothing.

#Q2: Sentiment Analysis with Naive Bayes Classifier(50 Points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1.**Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `extract_features` function.

2.**Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3.**Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [23]:
import random
import math
import string
from collections import defaultdict, Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens


In [25]:

class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

        # adding feature count and class count to the class definition because of usage
        self.feature_counts = defaultdict(lambda: defaultdict(float))
        self.class_counts = defaultdict(float)

    def train(self, training_data):
        total_docs = len(training_data)
        for tokens, class_ in training_data:
            features = get_features(tokens)
            self.class_counts[class_] += 1
            for feature in features:
                self.feature_counts[feature][class_] += 1

        # calculate class probabilities and using logarithm for easier calculation
        for c in self.classes:
            self.class_probs[c] = log(self.class_counts[c] / total_docs)

        # feature probabilities
        for feature, class_counts in self.feature_counts.items():
            for c in self.classes:
                # apply Laplace smoothing assuming binary featurs (0 or 1)
                self.feature_probs[feature][c] = log((class_counts[c] + 1) / (self.class_counts[c] + 2))

    def classify(self, features):
        max_class = None
        max_prob = float('-inf')
        # calculate the log probabilities for each class based on the features
        for class_ in self.classes:
            prob = self.class_probs[class_]
            for feature in get_features(features):
                prob += self.feature_probs[feature].get(class_, log(1 / (self.class_counts[class_] + 2)))
            if prob > max_prob:
                max_prob = prob
                max_class = class_
        return max_class


In [26]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(data)

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Test the classifier on the testing set
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

Train Accuracy: 0.895
Test Accuracy: 0.6975


#Submission Instructions:


1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

3.Clearly present the results of your parameter tuning in the notebook.

4.Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.