# Q1: Probabilistic N-Gram Language Model (50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1. Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)
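
As a rough illustration of the preprocessing meant here, the sketch below lowercases the text, strips punctuation, and appends an end-of-sentence marker to each sentence. It is only a sketch: `toy_preprocess` is a hypothetical name, and it assumes a naive regex sentence splitter in place of NLTK's `sent_tokenize` so that it runs without any downloads.

```python
import re
import string

def toy_preprocess(text):
    """Lowercase, split into sentences naively, strip punctuation, append <eos>."""
    text = text.lower()
    # Assumption: split after '.', '!' or '?' followed by whitespace.
    # NLTK's sent_tokenize handles abbreviations and numbers far better.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    translator = str.maketrans("", "", string.punctuation)
    cleaned = []
    for sentence in sentences:
        # Remove punctuation, then collapse any leftover runs of whitespace
        words = sentence.translate(translator).split()
        cleaned.append(" ".join(words) + " <eos>")
    return " ".join(cleaned)

print(toy_preprocess("Inflation is rising. Prices rose 3%!"))
```

Note that stripping punctuation also removes decimal points and percent signs, which is why figures in Reuters text come out as bare digit strings.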


**2. Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.
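
One common way to make an N-Gram model "probabilistic" is the maximum-likelihood estimate P(word | context) = count(context + word) / count(context). The sketch below shows that estimate on a toy corpus; `count_ngrams` and `conditional_prob` are hypothetical helper names, and the `<bos>`/`<eos>` padding is an assumption rather than a requirement of the task.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every k-gram for k = 1..n, padding each sentence with <bos>/<eos>."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<bos>"] + tokens + ["<eos>"]
        for k in range(1, n + 1):
            for i in range(len(padded) - k + 1):
                counts[tuple(padded[i:i + k])] += 1
    return counts

def conditional_prob(counts, context, word):
    """MLE estimate: count(context + word) / count(context), or 0.0 for an unseen context."""
    context = tuple(context)
    denom = counts.get(context, 0)
    if denom == 0:
        return 0.0
    return counts.get(context + (word,), 0) / denom

sentences = [["prices", "rose", "sharply"], ["prices", "fell"]]
counts = count_ngrams(sentences, 2)
print(conditional_prob(counts, ["prices"], "rose"))   # 0.5
print(conditional_prob(counts, ["unseen"], "rose"))   # 0.0
```

Storing the lower-order counts alongside the n-grams is what makes it possible to fall back to a shorter context when a longer one was never observed.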


**3. Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.
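
A literal reading of the stopping rule can be sketched as follows. This is a simplified, deterministic illustration: it assumes a hand-written bigram probability table (`toy_probs`) and always picks the most likely next word, whereas a real generator would normally sample. The name `toy_generate` is hypothetical.

```python
def toy_generate(probs, seed_tokens, probability_threshold=0.1, min_length=10):
    """Greedy generation that stops on either condition named in the task."""
    tokens = list(seed_tokens)
    while len(tokens) < min_length:           # condition 1: min_length reached
        candidates = probs.get(tokens[-1], {})
        if not candidates:                    # no known continuation at all
            break
        word, prob = max(candidates.items(), key=lambda item: item[1])
        if prob < probability_threshold:      # condition 2: best probability too low
            break
        tokens.append(word)
    return tokens

# Hypothetical bigram probabilities, for illustration only
toy_probs = {
    "inflation": {"is": 0.9, "was": 0.1},
    "is": {"rising": 0.6, "falling": 0.4},
    "rising": {"fast": 0.05, "slowly": 0.04},
}

print(toy_generate(toy_probs, ["inflation"], probability_threshold=0.1, min_length=10))
print(toy_generate(toy_probs, ["inflation"], probability_threshold=0.1, min_length=2))
```

The first call stops because every continuation of "rising" is below the threshold; the second stops because the length limit is hit first. Other combinations of the two conditions are possible, for example only allowing the threshold to end generation once `min_length` tokens exist.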


**4. Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.
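
One way to organize such experiments is a small grid sweep that records every parameter combination next to the text it produced. The sketch below assumes a callback `generate_fn(n, threshold, min_length)` that returns a string; in the notebook this would wrap the real generator, but here a stand-in (`fake_generate`, hypothetical) is used so that the example runs on its own.

```python
import itertools

def sweep(generate_fn, n_values, thresholds, min_lengths):
    """Run generate_fn for every parameter combination and collect the results."""
    results = []
    for n, thr, ml in itertools.product(n_values, thresholds, min_lengths):
        text = generate_fn(n, thr, ml)
        results.append({
            "n_value": n,
            "probability_threshold": thr,
            "min_length": ml,
            "num_tokens": len(text.split()),
            "text": text,
        })
    return results

def fake_generate(n, thr, ml):
    # Stand-in for the real generator: returns exactly ml placeholder tokens
    return " ".join(["word"] * ml)

rows = sweep(fake_generate, n_values=[2, 3], thresholds=[0.01, 0.1], min_lengths=[5, 10])
for row in rows:
    print(row["n_value"], row["probability_threshold"], row["min_length"], row["num_tokens"])
```

Because generation is random, running each combination several times (or fixing `random.seed`) gives a fairer comparison than a single sample per setting.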


**5. Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [1]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [2]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [3]:
import re

def split_sentences(text):
    sentences = re.split(r"<eos>", text)
    return sentences

In [27]:
# Function to preprocess text
def preprocess_text(text):
    # Fill in: Implement text preprocessing steps like lowercasing, removing punctuation, etc.
    # You may use NLTK or other libraries for this.

    # Step 1: Lowercasing
    text = text.lower()

    # Step 2: Sentence tokenizing
    sentences = sent_tokenize(text)

    # Step 3: Removing punctuations
    new_sentences = []
    translator = str.maketrans("", "", string.punctuation)
    for sentence in sentences:
      new_sentence = sentence.translate(translator)
      new_sentence = new_sentence + " <eos>"
      new_sentences.append(new_sentence)

    return " ".join(new_sentences)

# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Count every k-gram for k = 1..n, so lower orders are available for back-off.
    model = {}
    vocab = ["<eos>"]   # ordered vocabulary, consumed by the generator
    seen = {"<eos>"}    # set mirror of vocab for O(1) membership tests
    for text in corpus:
      for sentence in split_sentences(text):
        tokens = word_tokenize(sentence)
        if not tokens: # skip the empty piece left after the final <eos>
          continue
        for token in tokens:
          if token not in seen:
            seen.add(token)
            vocab.append(token)
        tokens = ["<bos>"] + tokens + ["<eos>"]
        for k in range(1, n + 1):
          for ngram in ngrams(tokens, k):
            model[ngram] = model.get(ngram, 0) + 1
    return model, vocab

def generate_next_word(key, model, vocab, probability_threshold):
  # Initialise the flags up front so they are defined on every path, including the unigram branch
  under_thresh = False # True when the best candidate's probability is below the threshold
  zero_prob = False # True when the context has no observed continuation; the caller then backs off to a lower order (n-1, n-2, ..., 1)
  prob_tokens = {} # probability of each vocabulary token being the next word
  if len(key) == 0: # back-off has reached unigrams
    cnt = sum(model[(token,)] for token in vocab)
    for token in vocab:
      prob_tokens[token] = model[(token,)] / cnt
    # Apply the same threshold check at the unigram level
    under_thresh = max(prob_tokens.values()) < probability_threshold
  else:
    context = tuple(key)
    cnt = model.get(context) # number of occurrences of the known words (the context)
    if cnt is None: # the context itself never occurred in the training set
      zero_prob = True
    else:
      for token in vocab:
        # occurrences of this token immediately after the context; 0 if never seen
        new_cnt = model.get(context + (token,), 0)
        prob_tokens[token] = new_cnt / cnt # probability of this token given the previous n-1 words
      max_prob = max(prob_tokens.values())
      if max_prob < probability_threshold: # the caller may stop the generation
        under_thresh = True
        if max_prob == 0: # all probabilities are zero, so back off to a lower order
          zero_prob = True
  next_word = ""
  if not zero_prob:
    # Roulette-wheel sampling: a larger probability means a larger chance of being picked
    next_word = random.choices(list(prob_tokens.keys()), weights=list(prob_tokens.values()), k=1)[0]
  return next_word, under_thresh, zero_prob


# Function to generate text using the probabilistic n-gram model with stop criteria
def generate_text(model, vocab, seed_text, n, probability_threshold=0.1, min_length=10):
    generated_text = seed_text
    # Lowercase and tokenize the seed text
    seed_tokens = word_tokenize(seed_text.lower())
    tokens_cnt = len(seed_tokens)
    # Left-pad with <bos> so the context holds n-1 tokens even for a short seed
    seed_tokens = ["<bos>"] + seed_tokens
    seed_tokens = ["<bos>"] * (n - len(seed_tokens) - 1) + seed_tokens
    # The key (context) is the last n-1 tokens
    start_index = len(seed_tokens) - n + 1
    key = seed_tokens[start_index:]
    # Use >= so that a seed already min_length tokens or longer can still terminate
    min_length_reached = tokens_cnt >= min_length
    while True:
      next_word, under_thresh, zero_prob = generate_next_word(key, model, vocab, probability_threshold)
      if zero_prob: # Back off: try (n-1)-gram, (n-2)-gram, ..., unigram
        start_index += 1
        key = seed_tokens[start_index:] # shorter context for the lower-order model
        continue
      # Stop once min_length is reached and either the best probability is under the threshold or <eos> was drawn
      if min_length_reached and (under_thresh or next_word == "<eos>"):
        break
      # Update the token count and check whether min_length has been reached
      tokens_cnt += 1
      if tokens_cnt >= min_length:
        min_length_reached = True
      # Append the generated word to the output
      generated_text = generated_text + " " + next_word
      seed_tokens.append(next_word)
      # Reset the context to the last n-1 tokens for the next prediction
      start_index = len(seed_tokens) - n + 1
      key = seed_tokens[start_index:]

    return generated_text


In [5]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 5  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model, vocab = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [None]:
# n_value=2, probability_threshold=0.01, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.01, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is forecast initially apply after a spokeswoman was the change in 1986 and quotas which would range


In [None]:
# n_value=2, probability_threshold=0.05, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.05, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is fraught with 195000 dlrs of the buyout or


In [None]:
# n_value=2, probability_threshold=0.1, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.1, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is expected economic advisors convinced of about 63 cts payable may 31 net 23 mln


In [None]:
# n_value=3, probability_threshold=0.05, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.05, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is likely to tell you when youve hit someone


In [None]:
# n_value=3, probability_threshold=0.01, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.01, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is rising with trade unions actu had threatened to take economic and political issues he said


In [None]:
# n_value=3, probability_threshold=0.1, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.1, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is expected to be exchanged for 406 shares of


In [None]:
# n_value=4, probability_threshold=0.01, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.01, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is running at an annual rate of 347 mln units the national association of realtors nar said


In [None]:
# n_value=4, probability_threshold=0.05, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.05, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is expected to remain above 40 billion dlrs is about the same as for common stock


In [None]:
# n_value=4, probability_threshold=0.1, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.1, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is not such a constructive factor as this time last year


In [28]:
# n_value=5, probability_threshold=0.01, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.01, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is not such a constructive factor as this time last year and 68 billion on october 5 he declined to elaborate


In [31]:
# n_value=5, probability_threshold=0.05, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.05, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is running at an annual rate of 25 pct or slightly above is not convinced that growth will pick up in future


In [33]:
# n_value=5, probability_threshold=0.1, min_length=10
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, vocab, seed_text, n_value, probability_threshold=0.1, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: Inflation is expected to be completed within 60 days at which time the mcdowell board would be restructured to include interpharm management


# Q2: Sentiment Analysis with Naive Bayes Classifier (50 points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1. **Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the feature-extraction function (`get_features` in the code below).

2. **Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3. **Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.
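
For orientation, a multinomial Naive Bayes classifier scores each class as log P(class) plus the sum of log P(token | class) over the document's tokens, with add-alpha (Laplace) smoothing so that unseen token/class pairs do not get zero probability. The sketch below works this through on four toy documents; `train_nb`, `predict_nb`, and the toy data are hypothetical and independent of the provided class.

```python
import math
from collections import Counter

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns log priors and smoothed log likelihoods."""
    labels = sorted({label for _, label in docs})
    vocab = sorted({tok for tokens, _ in docs for tok in tokens})
    word_counts = {c: Counter() for c in labels}
    doc_counts = Counter()
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in labels}
    log_like = {}
    for c in labels:
        # Add-alpha smoothing: every vocabulary word gets alpha extra counts
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_like

def predict_nb(tokens, log_prior, log_like):
    """Return the class with the highest log posterior; unknown tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
    return max(scores, key=scores.get)

docs = [
    (["great", "fun", "film"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "film"], "neg"),
    (["dull", "boring", "plot"], "neg"),
]
log_prior, log_like = train_nb(docs)
print(predict_nb(["great", "film"], log_prior, log_like))   # pos
print(predict_nb(["boring", "plot"], log_prior, log_like))  # neg
```

With a 7-word vocabulary and 5 tokens per class, the smoothed estimate for "great" in the positive class is (2 + 1) / (5 + 7) = 0.25, against (0 + 1) / (5 + 7) in the negative class, which is what tips the first prediction.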


In [66]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [67]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [68]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [69]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.class_counts = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))
        self.feature_counts = defaultdict(lambda: defaultdict(float))

    def train(self, training_data, alpha):
        # Implement training here
        # You should use get_features function to extract useful tokens from
        # dataset and use them to train the classifier.
        # Calculate class probabilities
        total_docs = len(training_data)
        for c in self.classes:
            docs_in_class = sum(1 for doc, label in training_data if label == c)
            self.class_probs[c] = docs_in_class / total_docs

        # Calculate feature probabilities
        for doc, label in training_data:
            features = get_features(doc)
            for feature in features:
                self.feature_counts[feature][label] += 1

        # Normalize feature counts into probabilities with Laplace (add-alpha) smoothing
        for c in self.classes:
            total_features_in_class = sum(self.feature_counts[feature][c] for feature in self.feature_counts)
            for feature in self.feature_counts:
                # Store log-likelihoods: summing logs instead of multiplying many small probabilities avoids floating-point underflow.
                self.feature_probs[feature][c] = math.log((self.feature_counts[feature][c] + alpha) / (total_features_in_class + alpha * len(self.feature_counts)))

    def classify(self, features):
      # Implement classification here
      # Compute posterior probabilities
      scores = {}
      for c in self.classes:
          score = math.log(self.class_probs[c]) # Probability of that class
          for feature in features:
              if feature in self.feature_probs:
                  score += self.feature_probs[feature][c] # Probability of that token or word to appear in this class (sentiment)
          scores[c] = score

      # Predict the class with the highest score
      return max(scores, key=scores.get)

In [70]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set, alpha = 1)

def calculate_accuracy(dataset, dataset_type):
    # Evaluate the trained classifier on the given dataset (train or test split)
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")



In [71]:
calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

Train Accuracy: 0.96875
Test Accuracy: 0.79


### Some examples of wrong predictions

In [73]:
for example in test_set[30:50]:
  tokens, true_sentiment = example
  features = get_features(tokens)
  predicted_sentiment = classifier.classify(features)
  if predicted_sentiment != true_sentiment:
    text = " ".join(tokens)
    print(f"predicted: {predicted_sentiment}, ground_truth: {true_sentiment}, {text}")

predicted: neg, ground_truth: pos, > from the commercials , this looks like a mild - mannered neil simonesque tale with mary tyler moore baring her bra touted as the highlight . instead it turns out to be a hilarious film running in high gear from beginning to end . the concept is deceptively pedestrian . an adult adopted son is looking for his biological parents and encounters eccentric characters along the way . the movie demonstrates just how far a good script and actors can take a mundane idea . the son and his wife take off on the search accompanied by a woman from the agency who located his parents . following one dead end lead after another , each funnier than the previous , they eventually end up in new mexico with his real biological parents : alan alda and lily tomlin . it ' s difficult to condense the mile - a - minute plot . seemingly hundreds of scenes jump on top of each other without giving you a chance to recover from the last one . without giving too much away , one of

# Submission Instructions:


1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

3. Clearly present the results of your parameter tuning in the notebook.

4. Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.