#Q1: Probabilistic N-Gram Language Model (50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1.Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2.Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3.Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4.Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5.Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.
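
For reference, the model asked for in task 2 is usually the maximum-likelihood estimate below. The notation is introduced here for illustration and is not part of the original task text: $w_i$ is the $i$-th token and $C(\cdot)$ is the number of times a token sequence occurs in the corpus.

$$
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{\sum_{w'} C(w_{i-n+1}, \dots, w_{i-1}, w')}
$$

The denominator sums the counts of every word observed after the same history, so the probabilities for one history add up to 1.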

In [7]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
# Function to preprocess text
def preprocess_text(text):
    # Fill in: Implement text preprocessing steps like lowercasing, removing punctuation, etc.
    # You may use NLTK or other libraries for this.

    # Strip leading and trailing whitespace
    text = text.strip()

    # Collapse runs of whitespace into a single space
    text = " ".join(text.split())

    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])

    # Tokenize
    tokens = nltk.word_tokenize(text)

    return tokens

# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Fill in: Implement code to build an n-gram model from the given corpus.
    # You may use NLTK's word_tokenize function.

    # Nested dictionary: history (tuple of n-1 tokens) -> {next word: count}
    # The counts are converted to probabilities in place further down
    ngram_probabilities = {}

    # The corpus is expected to be a list of already tokenized documents,
    # so no further tokenization is needed here

    # Count the occurrences of each n-gram
    for text in corpus:
        for gram in ngrams(text, n):
            # Get the previous n-1 words and the current word
            history = gram[:-1]
            word = gram[-1]

            # Create an empty dictionary for this history on first sight
            if history not in ngram_probabilities:
                ngram_probabilities[history] = {}

            # Update the counts
            if word not in ngram_probabilities[history]:
                ngram_probabilities[history][word] = 0
            ngram_probabilities[history][word] += 1

    # Calculate the probabilities
    # Each count is divided by the total count of its history
    for history in ngram_probabilities.keys():
        ngram_counts = sum(ngram_probabilities[history].values())
        # Calculate the probability of each word given the history
        for word in ngram_probabilities[history].keys():
            ngram_probabilities[history][word] /= ngram_counts

    return ngram_probabilities

# Function to generate text using the probabilistic n-gram model with stop criteria
def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    # Fill in: Implement code to generate text given a seed text and the n-gram model.
    # Use the model to predict the next words and generate a sequence.

    # Tokenize the seed with the same preprocessing that was applied to the corpus
    current_text = preprocess_text(seed_text)
    # Count the seed tokens after preprocessing so the length check below is consistent
    current_length = len(current_text)
    # The context is the last n-1 tokens (an empty context for a unigram model)
    current_tokens = current_text[-(n-1):] if n > 1 else []
    # Initialize the generated text
    generated_text = current_text[:]

    # Stop once min_length new tokens have been generated
    while len(generated_text) - current_length < min_length:
        # Look up the next-word distribution; stop if the context was never seen
        next_word_probabilities = model.get(tuple(current_tokens))
        if not next_word_probabilities:
            break

        # Keep only words whose probability reaches the threshold
        candidate_words = [word for word, prob in next_word_probabilities.items() if prob >= probability_threshold]

        # Stop if no word is probable enough
        if not candidate_words:
            break

        # Greedy choice: select the candidate with the highest probability
        next_word = max(candidate_words, key=lambda word: next_word_probabilities[word])

        # Append the next word and slide the context window
        generated_text.append(next_word)
        current_tokens = generated_text[-(n-1):] if n > 1 else []

    return ' '.join(generated_text)


In [13]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Experiments (these values were chosen based on recommendations from friends and some websites)
n_values = [2, 3, 4]
threshold_values = [0.01, 0.02, 0.03, 0.05]
min_length_values = [5, 10, 15, 20]
# Test the text generator
seed_text = "Inflation is"

for n in n_values:
    probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n)
    for threshold in threshold_values:
        for length in min_length_values:
            generated_text = generate_text(probabilistic_ngram_model, seed_text, n, threshold, length)
            print(f"for n={n} and threshold={threshold} and min-length={length} => {generated_text}")

for n=2 and threshold=0.01 and min-length=5 => inflation is expected to the company said
for n=2 and threshold=0.01 and min-length=10 => inflation is expected to the company said the company said the company
for n=2 and threshold=0.01 and min-length=15 => inflation is expected to the company said the company said the company said the company said the
for n=2 and threshold=0.01 and min-length=20 => inflation is expected to the company said the company said the company said the company said the company said the company said
for n=2 and threshold=0.02 and min-length=5 => inflation is expected to the company said
for n=2 and threshold=0.02 and min-length=10 => inflation is expected to the company said the company said the company
for n=2 and threshold=0.02 and min-length=15 => inflation is expected to the company said the company said the company said the company said the
for n=2 and threshold=0.02 and min-length=20 => inflation is expected to the company said the company said the company 

In [18]:
# for the second seed_text
seed_text2 = "The recession is"

for n in n_values:
    probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n)
    for threshold in threshold_values:
        for length in min_length_values:
            generated_text = generate_text(probabilistic_ngram_model, seed_text2, n, threshold, length)
            print(f"for n={n} and threshold={threshold} and min-length={length} => {generated_text}")

for n=2 and threshold=0.01 and min-length=5 => the recession is expected to the company said
for n=2 and threshold=0.01 and min-length=10 => the recession is expected to the company said the company said the company
for n=2 and threshold=0.01 and min-length=15 => the recession is expected to the company said the company said the company said the company said the
for n=2 and threshold=0.01 and min-length=20 => the recession is expected to the company said the company said the company said the company said the company said the company said
for n=2 and threshold=0.02 and min-length=5 => the recession is expected to the company said
for n=2 and threshold=0.02 and min-length=10 => the recession is expected to the company said the company said the company
for n=2 and threshold=0.02 and min-length=15 => the recession is expected to the company said the company said the company said the company said the
for n=2 and threshold=0.02 and min-length=20 => the recession is expected to the company sa

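Trade-offs:
- Increasing n_value gives the model more context for each prediction, but requires more data and computational resources because many longer histories occur only a few times.
- probability_threshold controls how readily generation stops. Because generate_text always picks the most probable candidate, the threshold does not add diversity by itself; it ends generation when even the best next word falls below it. Fine-tuning may be necessary to strike the right balance.
- min_length sets how many words are generated after the seed unless the threshold stops generation earlier. Longer output gives a greedy generator more room to fall into a loop, so length has to be balanced against repetition.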

Parameters' impact:
1. n_value:
- Increasing n_value makes each prediction depend on more context, but requires a larger dataset. On a fixed corpus the counts become sparse, and the model tends to reproduce training phrases, which is a form of overfitting.
- Decreasing n_value simplifies the model but may lead to less accurate predictions.
2. probability_threshold:
- Higher thresholds make word selection more conservative and stop generation earlier, leading to shorter text.
- Lower thresholds let generation continue through less certain contexts, which may give longer but less coherent text. In the bigram outputs shown above, thresholds 0.01 and 0.02 give the same text for min-length 5, 10 and 15.
3. min_length:
- Increasing min_length ensures the generated text reaches a certain length, but with greedy selection it leads to more repetition (for example "the company said the company said").
- Decreasing min_length allows for shorter text but may result in incomplete or less meaningful output.
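
The data requirement mentioned for n_value can be made concrete with a small sketch. It counts, for a toy token sequence, how many distinct histories exist and what share of them has only one observed successor. The helper name `context_stats` and the toy sentence are assumptions made for illustration; they are not part of the assignment code or the Reuters data.

```python
from collections import defaultdict

def context_stats(tokens, n):
    """Return (number of distinct histories, share of histories with a single successor)."""
    successors = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        history = tuple(tokens[i:i + n - 1])
        successors[history].add(tokens[i + n - 1])
    single = sum(1 for words in successors.values() if len(words) == 1)
    return len(successors), single / len(successors)

# Toy token sequence (hypothetical, not taken from Reuters)
toy_tokens = "the company said the bank said the company expects the bank expects growth".split()

for n in (2, 3, 4):
    histories, single_share = context_stats(toy_tokens, n)
    print(f"n={n}: {histories} distinct histories, {single_share:.0%} with a single successor")
```

On this toy sequence the share of histories with a single successor grows as n grows, which is the sparsity effect described above: the model has more contexts to estimate and fewer observations per context.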

Result:
As the output shows, the higher the min-length, the worse the result. In my experiments for this homework, 5 was the best value for min-length, because higher values did not add meaningful continuations and only caused repetition.
Both 0.01 and 0.02 are appropriate threshold values, and the best value for n is 3, because higher values cause problems during text generation. For example, with n=4 the two-word seed "Inflation is" is shorter than the three-word history the model expects, so no continuation can be found.
Overall, the right values for these parameters depend on the context and the vocabulary. As noted above, there is a trade-off to consider when choosing them.
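
The repetition noted in the results comes from always choosing the single most probable next word. One possible improvement, sketched here under assumptions, is to sample the next word in proportion to its probability among the candidates that pass the threshold. The function `sample_next_word` and the toy distribution are hypothetical and are not part of the submitted code.

```python
import random

def sample_next_word(next_word_probabilities, probability_threshold=0.01, rng=random):
    """Sample one word among candidates at or above the threshold, weighted by probability.

    Returns None when no candidate passes the threshold.
    """
    candidates = {w: p for w, p in next_word_probabilities.items() if p >= probability_threshold}
    if not candidates:
        return None
    words = list(candidates)
    weights = [candidates[w] for w in words]
    # random.choices draws with replacement according to the given weights
    return rng.choices(words, weights=weights, k=1)[0]

# Toy next-word distribution (hypothetical values, not estimated from Reuters)
toy_distribution = {"company": 0.5, "bank": 0.3, "dollar": 0.195, "zloty": 0.005}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_word(toy_distribution, 0.01, rng) for _ in range(1000)]
print({word: samples.count(word) for word in toy_distribution})
```

With this choice the threshold starts to matter for diversity as well: a lower threshold admits more candidates into the draw, while a higher one restricts sampling to the most likely words.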


#Q2: Sentiment Analysis with Naive Bayes Classifier (50 Points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1.**Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `extract_features` function.

2.**Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3.**Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [11]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [None]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

    def train(self, training_data):
        # Implement training here
        # You should use get_features function to extract useful tokens from
        # dataset and use them to train the classifier.

        # Initialize dictionaries to store counts of classes and features
        class_counts = defaultdict(int)
        feature_counts = defaultdict(lambda: defaultdict(int))

        # Count occurrences of features in each class
        for tokens, sentiment in training_data:
            # Increment the count of the current class
            class_counts[sentiment] += 1
            # Get preprocessed features from the tokens
            features = get_features(tokens)
            # Increment the count of each feature in the current class
            for feature in features:
                feature_counts[sentiment][feature] += 1

        # Calculate class probabilities
        total_examples = len(training_data)
        for sentiment, count in class_counts.items():
            # Calculate the probability of each class
            self.class_probs[sentiment] = count / total_examples

        # Calculate feature probabilities
        for sentiment, counts in feature_counts.items():
            # Calculate the probability of each feature given the class
            total_features = sum(counts.values())
            for feature, count in counts.items():
                self.feature_probs[sentiment][feature] = count / total_features

    def classify(self, features):
        # Implement classification here

        # Initialize scores for each class
        scores = {sentiment: 0 for sentiment in self.classes}

        # Calculate score for each class
        for sentiment in self.classes:
            # Initialize score with log of prior probability of the class
            score = math.log(self.class_probs[sentiment])
            # Calculate the score for each feature
            for feature in features:
                # Use the estimated probability if the feature was seen with this class
                if feature in self.feature_probs[sentiment]:
                    score += math.log(self.feature_probs[sentiment][feature])
                else:
                    # Unseen feature: fall back to a fixed small probability (1e-10) to avoid log(0)
                    score += math.log(1e-10)
            # Assign the calculated score to the current class
            scores[sentiment] = score

        # Return the class with maximum score
        return max(scores, key=scores.get)


In [None]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Evaluate the classifier on the given dataset and print every misclassified example
    correct_predictions = 0
    false_predictions = []
    for tokens, true_sentiment in dataset:
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1
        else:
            # Keep the failed example and print it with its full text
            false_predictions.append((tokens, true_sentiment, predicted_sentiment))
            text = ' '.join(tokens)
            print(f"True Sentiment: {true_sentiment}, Predicted Sentiment: {predicted_sentiment}, Text: {text}")

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

True Sentiment: neg, Predicted Sentiment: pos, Text: shakespeare in love is quite possibly the most enjoyable period piece ever made for the silver screen . it is both humorous and romantic in a very unique blend that can successfully entertain any audience for the nearly 2 and and a half hours that it occupies . that is , however , not to say it is a good film , a quality production or anything of the sort . shakespeare in love is an incredibly cheap illusion that truly pans out to be very little quality or original work . the finest sign of this may be the plot , in looking back , there seems to be little more than a thin , predictable plot that is only carried by the portrayal of people that we revere in our history books . philip henslowe ( geoffrey rush ) owns 1 of the 2 theatres in london . it is at the peak of the royal theatre era , and queen elizabeth ( the recently damed judi dench , by , appropriately enough , queen elizabeth ii ) is very much a fan . however , to directly q

Discussing the result:
Train Accuracy: 0.99625, Test Accuracy: 0.7625

The results indicate that the Naive Bayes classifier achieved high accuracy on the training set (approximately 99.63%) but substantially lower accuracy on the test set (approximately 76.25%), a gap of about 23 percentage points. Let's discuss these results:

1. Training Accuracy:
   - The high training accuracy suggests that the classifier has effectively learned from the training data. It has likely captured the patterns and features that distinguish between different sentiments in the training set. However, achieving almost perfect accuracy on the training data could also indicate overfitting, where the model has memorized the training examples instead of learning generalizable patterns.

2. Testing Accuracy:
   - The testing accuracy is clearly lower than the training accuracy, but it is still well above the roughly 50% that random guessing would achieve on this balanced two-class dataset. The classifier therefore generalizes to unseen data to some extent, but it performs markedly worse on new reviews than on the data it was trained on.

3. Possible Reasons for Performance Discrepancy:
   - Overfitting: The performance gap between training and test accuracies suggests potential overfitting. The classifier might have memorized noise or specific patterns in the training data that do not generalize well to unseen data. For Naive Bayes, the closest analogues to regularization are stronger smoothing and pruning of rare features; increasing the size and diversity of the training data could also be beneficial.
   
   - Data Quality: The quality of the training and test data could impact the performance of the classifier. If the data is noisy, inconsistent, or contains biases, it can affect the classifier's ability to learn and generalize. Data preprocessing and cleaning techniques may help mitigate these issues.

   - Model Complexity: The complexity of the Naive Bayes classifier could also influence its performance. While Naive Bayes is a relatively simple and efficient algorithm, it may struggle with capturing complex relationships and nuances in the data. Using more sophisticated models or feature engineering techniques could potentially improve performance.

4. Solutions:
   - Fine-tuning hyperparameters, such as smoothing parameters in the Naive Bayes classifier, could help optimize performance.
   - Experimenting with different feature extraction techniques or model architectures might lead to better results.
   - Cross-validation or bootstrapping could provide more robust estimates of the classifier's performance and help identify sources of variability.
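
As an illustration of the smoothing suggestion above: the classifier in this notebook substitutes a fixed probability of 1e-10 for features it has not seen with a class. A common alternative is add-alpha (Laplace) smoothing, sketched below under assumptions. The function name `smoothed_log_probs`, the parameter `alpha`, and the toy counts are hypothetical and not part of the submitted classifier.

```python
import math
from collections import Counter

def smoothed_log_probs(feature_counts, vocabulary, alpha=1.0):
    """Add-alpha (Laplace) estimate of log P(feature | class) for every vocabulary item."""
    total = sum(feature_counts.values())
    denominator = total + alpha * len(vocabulary)
    return {
        feature: math.log((feature_counts.get(feature, 0) + alpha) / denominator)
        for feature in vocabulary
    }

# Toy per-class feature counts (hypothetical, not taken from movie_reviews)
pos_counts = Counter({"great": 3, "plot": 1})
neg_counts = Counter({"bad": 2, "plot": 2})
vocabulary = set(pos_counts) | set(neg_counts)

pos_log_probs = smoothed_log_probs(pos_counts, vocabulary)
neg_log_probs = smoothed_log_probs(neg_counts, vocabulary)

# "bad" never occurs in the positive class, yet it still receives a moderate probability
print(math.exp(pos_log_probs["bad"]))
```

Here `alpha` is the hyperparameter to tune: larger values pull all estimates toward a uniform distribution over the vocabulary, which reduces the influence of words that are rare in the training data.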

Here are some typical kinds of misclassification, illustrated with example sentences:
1. False Positives (Negative Reviews Misclassified as Positive):
- "The movie was terrible, with poor acting and a predictable plot, but I loved the soundtrack."
=> The classifier may have focused too much on the presence of positive words like "loved" and "soundtrack" while ignoring the negative sentiment expressed in the rest of the sentence.
2. False Negatives (Positive Reviews Misclassified as Negative):
- "Although the film had some flaws, the cinematography was stunning and the acting was superb."
=> The classifier may have been misled by the presence of negative words like "flaws" while failing to recognize the overall positive sentiment expressed in the sentence.
3. Ambiguous Sentiments:
- "The movie was okay, but it didn't live up to my expectations."
=> The classifier may struggle with sentences containing mixed or ambiguous sentiments, where positive and negative expressions coexist. It may not accurately weigh the overall sentiment of the sentence.

The dataset contains samples similar to these, and they lead to misclassification. For example, the following review has True Sentiment: pos but Predicted Sentiment: neg, which is a case of ambiguous sentiment:
"i had been expecting more of this movie than the less than thrilling twister . twister was good but had no real plot and no one to simpithize with . but twister had amazing effects and i was hoping so would volcano volcano starts with tommy lee jones at emo ... also . . . there were interesting side stories that made the plot more interesting . so . . . it was good ! !"
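
To understand why a particular review is misclassified, it can help to look at how much each token shifts the score toward one class. The sketch below computes a per-token log-ratio between the two classes, using the same 1e-10 fallback for unseen features as the classifier above. The function `feature_contributions` and the toy probabilities are assumptions made for illustration.

```python
import math

def feature_contributions(features, pos_probs, neg_probs, unseen=1e-10):
    """Return (feature, log-ratio) pairs sorted from most negative to most positive evidence."""
    contributions = {}
    for feature in features:
        log_ratio = (math.log(pos_probs.get(feature, unseen))
                     - math.log(neg_probs.get(feature, unseen)))
        # Repeated tokens add up, exactly as they do in the classifier's score
        contributions[feature] = contributions.get(feature, 0.0) + log_ratio
    return sorted(contributions.items(), key=lambda item: item[1])

# Toy per-class probabilities (hypothetical values for illustration)
pos_probs = {"good": 0.04, "plot": 0.02, "amaz": 0.03}
neg_probs = {"good": 0.02, "plot": 0.03, "amaz": 0.01, "rareword": 0.001}

review = ["good", "plot", "rareword", "amaz", "good"]
for feature, value in feature_contributions(review, pos_probs, neg_probs):
    print(f"{feature:>8}: {value:+.2f}")
```

In this toy case a single token that was seen with only one class outweighs all the other evidence, because the 1e-10 fallback produces a very large log-ratio. Checking the real misclassified reviews in this way would show whether the same effect is at work there.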


#Submission Instructions:


1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

3.Clearly present the results of your parameter tuning in the notebook.

4.Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.