# Problem Set 3 - Implementing Traditional NLP Techniques

### DSME 6635: Artificial Intelligence for Business Research (Spring 2024)

### Due at 12:30PM, Tuesday, February 6, 2024

Please first copy the CoLab file to your own Google Drive. Finish the questions below and submit the **CoLab link** to your solutions in [this Google Sheet](https://docs.google.com/spreadsheets/d/1nOE-saTptG73WMCONDB1Z3pt-jHhmDA_1OHpQVHqQ1M/edit#gid=306873990). This problem set is worth a total of 8 points. Please name your solution file as follows:

- `Member1LastName_Member1FirstName-Member2LastName_Member2FirstName_PS3.ipynb` (e.g., `Cao_Leo-Zhang_Renyu_PS3.ipynb`)

In this problem set, we will implement a set of traditional NLP techniques. We will start with pre-processing of text. We will then build a dictionary-based method for sentiment classification. Finally, we will build a Naive Bayes classifier for sentiment classification.


## 1. Pre-processing of Text

In this question, you will implement functions that help us preprocess text. Remember that before conducting any NLP task, we first need to convert sentences into words, a process called **tokenization**. In this exercise, you will implement a tokenization function to preprocess tweets. The function will take a tweet and do the following:

1. The function will use [**`word_tokenize` from NLTK**](https://www.nltk.org/api/nltk.tokenize.html) to tokenize the documents into words. (We will use `TweetTokenizer` since it can help us preserve the hashtag #.)
2. The function will then get rid of all non-words (punctuation, emoji, etc.), which are not useful for sentiment classification.
3. The function will stem all the remaining words using the [NLTK PorterStemmer](https://www.nltk.org/howto/stem.html).
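Step 2 above (dropping non-word tokens) can be illustrated with a minimal, standard-library sketch. It is not the NLTK-based solution: `split_into_tokens` and `keep_word_tokens` are hypothetical helper names, the regex tokenizer is only a stand-in for an NLTK tokenizer, and treating "word" as "purely alphabetic" is an assumption (it also drops tokens containing digits or `#`). Stemming is left out.

```python
import re

def split_into_tokens(text):
    """Toy tokenizer (stand-in for NLTK): runs of word characters,
    or runs of punctuation/symbol characters."""
    return re.findall(r"\w+|[^\w\s]+", text)

def keep_word_tokens(tokens):
    """Keep only purely alphabetic tokens (assumed definition of a 'word')."""
    return [token for token in tokens if token.isalpha()]

raw_tokens = split_into_tokens("programming is so fun :)")
word_tokens = keep_word_tokens(raw_tokens)
print(raw_tokens)   # ['programming', 'is', 'so', 'fun', ':)']
print(word_tokens)  # ['programming', 'is', 'so', 'fun']
```

In the actual exercise, each surviving token would additionally be passed through the stemmer, which is why the unit tests expect `'program'` rather than `'programming'`.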




In [None]:
import nltk
nltk.download('twitter_samples')
nltk.download('punkt')

In [None]:
from nltk.corpus import twitter_samples
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer  # explicit import instead of a wildcard

def preprocess_tweet(tweet):
    """
    Preprocess a tweet: tokenize it, drop non-word tokens, and stem the rest.
    Input:
        tweet: a string representing a raw tweet
    Output:
        tokens: a list of words from the tweet after pre-processing
    """
    tokens = None

    ### BEGIN SOLUTION
    ### END SOLUTION

    return tokens


In [None]:
assert preprocess_tweet('this is a test!') == ['thi', 'is', 'a', 'test']
assert preprocess_tweet('work worked working') == ['work', 'work', 'work']
assert preprocess_tweet('programming is so fun :)') == ['program', 'is', 'so', 'fun']

## 2. Dictionary-Based Sentiment Analysis

In this question, you will implement the dictionary-based sentiment classifier. The first function you build will complete the following tasks:

1. You will go through the training tweets and preprocess each of them (sentence -> a list of words).

2. You will then go through the words and labels and build a dictionary where each key is a unique word and the value is a list of two numbers: how many times the word appears in positive tweets and in negative tweets, respectively. Example: if 'good' appears 2 times in positive tweets and 0 times in negative tweets, the corresponding dictionary item is `{'good': [2, 0]}`.
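The counting in step 2 can be sketched on toy data as follows. This is only an illustration on already-tokenized tweets; `count_by_label` and the toy token lists are hypothetical and not part of the assignment's interface.

```python
def count_by_label(pos_token_lists, neg_token_lists):
    """Map each word to [count in positive tweets, count in negative tweets]."""
    counts = {}
    # label_index 0 -> positive column, 1 -> negative column
    for label_index, token_lists in enumerate((pos_token_lists, neg_token_lists)):
        for tokens in token_lists:
            for word in tokens:
                counts.setdefault(word, [0, 0])[label_index] += 1
    return counts

toy_pos = [["good", "day"], ["good", "fun"]]  # toy pre-processed positive tweets
toy_neg = [["bad", "day"]]                    # toy pre-processed negative tweets
toy_counts = count_by_label(toy_pos, toy_neg)
print(toy_counts["good"])  # [2, 0]
print(toy_counts["day"])   # [1, 1]
```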

In [None]:
def build_dictionary(pos_tweet, neg_tweet):
    """
    This function builds a sentiment frequency dictionary from data
    Input:
        pos_tweet: a list of positive tweets
        neg_tweet: a list of negative tweets
    Output:
        sentiment_dictionary: a dictionary where the key is the word and the value is
        how many times a word appears in positive and negative tweets.
    """
    pos_processed = []
    neg_processed = []
    sentiment_dictionary = {}

    ### BEGIN SOLUTION
    ### END SOLUTION

    return sentiment_dictionary

In [None]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
train_prob = 0.8
num_train_data = int(len(all_positive_tweets)*train_prob)

test_pos = all_positive_tweets[num_train_data:]
train_pos = all_positive_tweets[:num_train_data]
test_neg = all_negative_tweets[num_train_data:]
train_neg = all_negative_tweets[:num_train_data]

sentiment_dictionary = build_dictionary(train_pos, train_neg)

assert sentiment_dictionary['top'] == [29, 4]
assert sentiment_dictionary['hate'] == [9, 45]

In the following function (`process_tweets`), you will take in a tweet and the sentiment dictionary and carry out the following two steps:

1. You will preprocess the tweet (sentence -> list of words).

2. You will then generate a vector of **three** numbers representing the tweet: `[1, the sum of words' positive values, the sum of words' negative values]`, where the first element is always 1.
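A minimal sketch of the three-number representation on toy data is shown below. The helper `tokens_to_features` and the table `toy_table` are hypothetical, and two choices are assumptions rather than part of the specification: words missing from the dictionary contribute 0, and every token occurrence is counted.

```python
def tokens_to_features(frequency_table, tokens):
    """Return [1, sum of positive counts, sum of negative counts] for the tokens."""
    features = [1, 0, 0]  # the leading 1 acts as a constant (bias) term
    for token in tokens:
        # Assumption: unseen words add nothing to either sum.
        pos_count, neg_count = frequency_table.get(token, (0, 0))
        features[1] += pos_count
        features[2] += neg_count
    return features

toy_table = {"good": [2, 0], "day": [1, 1], "bad": [0, 1]}
print(tokens_to_features(toy_table, ["good", "day"]))    # [1, 3, 1]
print(tokens_to_features(toy_table, ["bad", "unseen"]))  # [1, 0, 1]
```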

    

In [None]:
def process_tweets(dictionary,tweet):
    """
    The function to process a tweet into three values
    Input:
        dictionary: a dictionary where the key is the word and the value is
        how many times a word appears in positive and negative tweets.
        tweet: the tweet in a string
    Output:
        returned_values: a list of three values, where the first is 1, the
        second is the sum of positive values of words, the third is the sum
        of negative values of words.
    """
    returned_values = [1, 0, 0]

    ### BEGIN SOLUTION
    ### END SOLUTION

    return returned_values

In [None]:
assert process_tweets(sentiment_dictionary, 'him me') == [1, 306, 632]
assert process_tweets(sentiment_dictionary, train_pos[1]) == [1, 4486, 3361]

The following function, `lr_sentiment()`, takes the dictionary and the positive and negative data and builds a logistic regression classifier based on the positive and negative frequencies. You will then report its accuracy on the testing data. The logistic regression can be implemented with sklearn's `LogisticRegression`.

In particular, you need to complete the following steps:

1. Use `process_tweets()` to process each tweet in the positive and negative training samples.

2. Build a logistic regression classifier where the features are the processed tweets (each tweet has 3 numbers), and the label is 1 if the tweet is positive and 0 otherwise. You will use the [sklearn package to build this logistic regressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (<font color='red'>Please use random_state = 0 and solver='liblinear' for the logistic regression</font> to pass the unit test.)

3. Evaluate your classifier on the testing data. Report the proportion of tweets that you classify correctly, separately for the positive and the negative testing tweets (i.e., one accuracy for the positive data and one for the negative data). You should also report the overall accuracy on the training data (positive and negative combined).
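The three accuracies in step 3 can be computed as sketched below. The predictions are made-up toy values rather than model output, and `accuracy` is a hypothetical helper; the point is only how the per-class and combined figures relate.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Toy predictions (1 = positive, 0 = negative); not real classifier output.
pos_predictions = [1, 1, 0, 1]     # predictions for 4 truly positive tweets
neg_predictions = [0, 0, 1, 0, 0]  # predictions for 5 truly negative tweets

pos_acc = accuracy(pos_predictions, [1] * len(pos_predictions))
neg_acc = accuracy(neg_predictions, [0] * len(neg_predictions))
overall_acc = accuracy(pos_predictions + neg_predictions,
                       [1] * len(pos_predictions) + [0] * len(neg_predictions))
print(pos_acc, neg_acc, overall_acc)
```

Here 3 of 4 positive and 4 of 5 negative examples are correct, so the combined accuracy is 7 of 9. The combined figure is a weighted average of the two per-class figures, not their simple mean.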


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

def lr_sentiment(dictionary, train_pos, train_neg, test_pos, test_neg):
    """
    This function builds a logistic regression classifier to classify tweets based on
    frequency tables.
    Input:
        train_pos, train_neg: positive and negative training tweets.
        test_pos, test_neg: positive and negative testing tweets.
        dictionary: a dictionary where the key is the word and the value is
        how many times a word appears in positive and negative tweets.
    Output:
        lr_model: the variable storing the logistic regression classifier.
        training_accuracy: the overall accuracy of the model over all training data.
        pos_accuracy: the accuracy of classifying tweets in test_pos.
        neg_accuracy: the accuracy of classifying tweets in test_neg.
    """

    lr_model = None
    pos_accuracy = neg_accuracy = training_accuracy = 0

    ### BEGIN SOLUTION
    ### END SOLUTION

    return lr_model, training_accuracy, pos_accuracy, neg_accuracy


In [None]:
lr_model, training_accuracy, pos_accuracy, neg_accuracy = lr_sentiment(sentiment_dictionary, train_pos,
                                                    train_neg, test_pos, test_neg)
assert np.isclose(training_accuracy, 0.691375)
assert np.isclose(pos_accuracy, 0.622)

## 3. Naive Bayes

In this question, you will build a Naive Bayes classifier to do sentiment classification on the same data. You are going to leverage [**NLTK's Naive Bayes Classifier**](https://www.nltk.org/_modules/nltk/classify/naivebayes.html). In particular, the function you build should complete the following steps:

1. Pre-process the training and testing data. You need to use the pre-processing function developed in Question 1 to turn each tweet into a list of words.

2. Translate each list of words into a dictionary. The keys of the dictionary are the unique words in the tweet, and each value is the number of times that word appears in the tweet. Then create a tuple where the first element is this dictionary and the second element is the label of the tweet. For a positive tweet, the label should be 'pos'; for a negative tweet, the label should be 'neg'.

3. You will then use the pre-processed data to train the Naive Bayes classifier from `nltk`.

4. You will then use the Naive Bayes classifier you trained to classify each tweet in the positive and negative testing tweets and report the positive and negative testing accuracies.

**Note: This problem set may take a while to run since you are building a dictionary (the length of the dictionary = the number of unique words in the tweet) for each training and testing data point.**
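The `(dictionary, label)` tuple described in step 2 can be sketched with the standard library as follows. `to_labeled_featureset` is a hypothetical helper and the token list is a toy example of already pre-processed words.

```python
from collections import Counter

def to_labeled_featureset(tokens, label):
    """Pair a word -> count dictionary for one tweet with its label."""
    return (dict(Counter(tokens)), label)

example = to_labeled_featureset(["so", "happi", "so", "fun"], "pos")
print(example)  # ({'so': 2, 'happi': 1, 'fun': 1}, 'pos')
```

A list of such tuples, one per training tweet, is the kind of labeled input that NLTK's Naive Bayes classifier is trained on.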


In [None]:
def nb_sentiment(train_pos, train_neg, test_pos, test_neg):
    """
    This function builds a Naive Bayes classifier to classify tweets.
    Input:
        train_pos, train_neg: positive and negative training tweets.
        test_pos, test_neg: positive and negative testing tweets.
    Output:
        nb_model: the variable storing the naive bayes classifier.
        pos_accuracy: the accuracy of classifying tweets in test_pos.
        neg_accuracy: the accuracy of classifying tweets in test_neg.
    """
    nb_model = None
    pos_accuracy = neg_accuracy = 0

    ### BEGIN SOLUTION
    ### END SOLUTION

    return nb_model, pos_accuracy, neg_accuracy

In [None]:
nb_model, pos_accuracy, neg_accuracy = nb_sentiment(train_pos, train_neg, test_pos, test_neg)
assert np.isclose(pos_accuracy, 0.782)
assert np.isclose(neg_accuracy, 0.772)
assert nb_model.labels() == ['pos', 'neg']
assert nb_model.classify({'I':1, 'hate':1, 'the':1, 'world':1}) == 'neg'

## End of Problem Set 3.