-------------------------------------------------------
# ** TWITTER SENTIMENT ANALYSIS **
-------------------------------------------------------
## Description
I will be using the natural language toolkit (NLTK).

In this project I will perform sentiment analysis on tweets using an approach that is based on naive Bayes. Given a tweet, I will decide if it has a positive sentiment or a negative one. In particular, we will

* Train a naive Bayes model on a sentiment analysis task
* Test our model
* Predict on your own tweet

This project contains both the mathematical model and the code needed to perform these tasks. The goal is to teach users how to perform sentiment analysis from scratch. I will take a step-by-step approach and include explanations as needed.

### Sentiment analysis using naive Bayes

Naive Bayes can be used for sentiment analysis. It takes a short time to train and also has a short prediction time.

#### How do you train a naive Bayes classifier for sentiment analysis of tweets?
- The first part of training a naive Bayes classifier is to identify the number of classes that we have. In our case, there are only two: positive and negative tweets.
- We will create a probability for each class.<br>
$P(T_{pos})$ will be the probability that the tweet is positive.<br>
$P(T_{neg})$ will be the probability that the tweet is negative.

Assume that $T$ is the total number of tweets, $T_{pos}$ is the total number of positive tweets, and $T_{neg}$ is the total number of negative tweets. Then

$$P(T_{pos}) = \frac{T_{pos}}{T},$$

$$P(T_{neg}) = \frac{T_{neg}}{T}.$$
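
As a quick illustration of these two formulas, here is a minimal sketch on a made-up list of labels. The `labels` list is hypothetical and is not the project data; 1 marks a positive tweet and 0 a negative one.

```python
# Hypothetical labels for five tweets: 1 = positive, 0 = negative
labels = [1, 0, 1, 1, 0]

T = len(labels)        # total number of tweets
T_pos = sum(labels)    # number of positive tweets
T_neg = T - T_pos      # number of negative tweets

p_pos = T_pos / T      # P(T_pos)
p_neg = T_neg / T      # P(T_neg)

print(p_pos, p_neg)
```

The two probabilities always sum to 1, because every tweet belongs to exactly one of the two classes.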


#### Prior and logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? 

The prior is the ratio of the probabilities $\frac{P(T_{pos})}{P(T_{neg})}$.
We can take the log of the prior to rescale it, and we will call this the logprior:

$$\text{logprior} = \log \left( \frac{P(T_{pos})}{P(T_{neg})} \right) = \log \left( \frac{T_{pos}}{T_{neg}} \right).$$

Note that $\log(\frac{a}{b})$ is the same as $\log(a) - \log(b)$. Hence, the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(T_{pos})) - \log (P(T_{neg})) = \log (T_{pos}) - \log (T_{neg}).$$
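
The equivalence of the two forms can be checked numerically. The sketch below uses deliberately unbalanced, made-up counts (they are assumptions for illustration, not the project data).

```python
import math

# Hypothetical, deliberately unbalanced counts (not the project data)
T_pos, T_neg = 3000, 1000
T = T_pos + T_neg

# Form 1: log of the ratio of the class probabilities
logprior_ratio = math.log((T_pos / T) / (T_neg / T))

# Form 2: difference of the logs of the raw counts
logprior_diff = math.log(T_pos) - math.log(T_neg)

print(logprior_ratio, logprior_diff)  # both equal log(3), up to rounding
```

The total $T$ cancels in the ratio, which is why only the two class counts are needed.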

#### Positive and negative probability of a word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we will use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. That is, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We will use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}. $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}. $$

We add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.
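
To see why the smoothing matters, consider a word that never appears in the negative class. The counts in this sketch are invented for illustration only.

```python
# Hypothetical counts for one word (not taken from the project data)
freq_pos, freq_neg = 3, 0   # times the word appears in each class
N_pos, N_neg = 20, 25       # total number of words in each class
V = 10                      # number of unique words in the vocabulary

# Without smoothing the negative probability is exactly zero,
# so its logarithm would be undefined
p_neg_unsmoothed = freq_neg / N_neg

# Additive smoothing keeps both probabilities strictly positive
p_w_pos = (freq_pos + 1) / (N_pos + V)
p_w_neg = (freq_neg + 1) / (N_neg + V)

print(p_neg_unsmoothed, p_w_pos, p_w_neg)
```

Adding $V$ to the denominator compensates for the extra 1 added to each of the $V$ words, so the smoothed probabilities over the vocabulary still sum to 1 within a class.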

#### Log-likelihood
To compute the log-likelihood of that very same word, we can implement the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right).$$

#### Sentiment score
To compute the sentiment score of a tweet we simply add its logprior and the sum of log-likelihoods of its words:
$$ \text{score} = \text{logprior} + \sum_{i=1}^m (\text{loglikelihood}_i),$$

where $m$ is the number of words in the tweet.

- If the score > 0, then the sentiment is positive.
- If the score = 0, then the sentiment is neutral.
- If the score < 0, then the sentiment is negative.
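
The scoring rule can be sketched on a toy example. The log-likelihood values and the helper names `score_words` and `to_sentiment` below are hypothetical and exist only for this illustration; words missing from the dictionary contribute nothing to the score.

```python
# Hypothetical log-likelihoods for three stemmed words (illustrative values only)
toy_loglikelihood = {"happi": 2.0, "sad": -1.5, "day": 0.1}
toy_logprior = 0.0  # balanced classes

def score_words(words):
    """Add the logprior to the log-likelihood of every known word."""
    return toy_logprior + sum(toy_loglikelihood.get(w, 0.0) for w in words)

def to_sentiment(score):
    """Map a score to a sentiment using the three rules above."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for words in (["happi", "day"], ["sad", "day"], ["unknown"]):
    s = score_words(words)
    print(words, round(s, 2), to_sentiment(s))
```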

## Objectives

a. Implement a helper function that finds the frequency of word-label pairs.

b. Train the model using naive Bayes. 

c. Predict the sentiment score. 

## Code
### Preliminaries

In [1]:
# This project requires the NLTK library. Make sure it is installed!
# Do not modify!
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
import re
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from os import getcwd

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/rileykrisch/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rileykrisch/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

In [3]:
# Get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# Split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# Avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

### Preprocessing

The first step is to process the data to make it useful to our model. We will do it in several steps.
- First we will remove noise from the data. That is, we will remove words that do not tell us much about the content. For example, stopwords such as "I", "you", "are", and "is" give us no information on the sentiment.
- We will remove hyperlinks, stock market tickers, retweet markers, Twitter handles, and the `#` sign of hashtags (the hashtag word itself is kept). These carry little information on the sentiment.
- We will remove all the punctuation from a tweet, because we want to treat a word the same way with or without punctuation. We do not need to consider "happy", "happy?", "happy!", "happy," and "happy." as different words.
- Finally, we want to use stemming to only keep track of one variation of each word. That is, we will treat "motivation", "motivated", and "motivate" similarly by grouping them within the same stem of "motiv-".

The function `process_tweet` does this for us.
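
Before reading the full function, the regular-expression steps can be tried on their own with only the standard library. The sample tweet below is made up, and this sketch deliberately leaves out tokenizing, stopword removal, and stemming, which `process_tweet` handles with NLTK.

```python
import re

# Hypothetical sample tweet used only for this illustration
sample = "RT @user Loving the new $GE results! #happy http://example.com/page"

cleaned = re.sub(r'\$\w*', '', sample)                 # drop stock tickers like $GE
cleaned = re.sub(r'^RT[\s]+', '', cleaned)             # drop the leading retweet marker
cleaned = re.sub(r'https?://[^\s\n\r]+', '', cleaned)  # drop hyperlinks
cleaned = re.sub(r'#', '', cleaned)                    # drop only the hash sign

print(cleaned.split())
```

The handle `@user` and the punctuation in `results!` survive these four steps; in `process_tweet` they are dealt with later by the tokenizer and the punctuation filter.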

In [4]:
def process_tweet(tweet):
    '''
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    '''
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    #tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [5]:
custom_tweet = "RT @Twitter @chapagain Hello there! Have a fantastic day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

['hello', 'fantast', 'day', ':)', 'good', 'morn']


### Implement a helper function

Next we will implement `count_tweets`, a helper function that takes in a dictionary (initially empty), a list of tweets, and a list of labels (1 or 0), one per tweet. The label is 1 for positive tweets and 0 for negative tweets.

The function returns a dictionary where the keys are a tuple (word, label) and the values are the corresponding frequency. That is, the dictionary should tell us the number of times that word and label tuple appears in the collection of tweets. 

For example: given a list of tweets `["i am rather excited", "you are rather happy"]` and the label 1, the function will return a dictionary that contains the following key-value pairs:

{
    ("rather", 1): 2,
    ("happi", 1) : 1, 
    ("excit", 1) : 1
}

- Notice how for each word in the given string, the same label 1 is assigned to each word.
- Notice how the words "i" and "am" are not saved, because they were removed by process_tweet.
- Notice how the word "rather" appears twice in the list of tweets, and so its count value is 2.

Your task is to create a function `count_tweets` that takes a list of tweets as input, cleans them, and returns a dictionary.
- The key in the dictionary is a tuple containing the stemmed word and its class label, e.g. ("happi",1).
- The value is the number of times this word appears in the given collection of tweets (an integer).
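
The counting idea can also be sketched with `collections.Counter` from the standard library. In this sketch `simple_tokenize` is a hypothetical stand-in for `process_tweet`: it uses a tiny hand-written stopword list and does no stemming, so the keys are "excited" and "happy" rather than the stems "excit" and "happi".

```python
from collections import Counter

STOPWORDS = {"i", "am", "you", "are"}  # tiny hypothetical stopword list

def simple_tokenize(tweet):
    """Hypothetical stand-in for process_tweet: lowercase, split, drop stopwords."""
    return [w for w in tweet.lower().split() if w not in STOPWORDS]

def count_pairs(tweets, ys):
    """Count how often each (word, label) pair occurs."""
    counts = Counter()
    for tweet, y in zip(tweets, ys):
        counts.update((word, y) for word in simple_tokenize(tweet))
    return dict(counts)

pairs = count_pairs(["i am rather excited", "you are rather happy"], [1, 1])
print(pairs)
```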

In [6]:
def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each word-label pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''
    # Your code
    ### START CODE HERE ###
    for tweet, label in zip(tweets, ys):
        words = process_tweet(tweet)      
        
        for word in words:
            pair = (word, label)  
            if pair in result:
                result[pair] += 1 
            else:
                result[pair] = 1  
    
    ### END CODE HERE ###
    

    return result

In [7]:
# Test your function
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

**Expected Output**: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

Build the `freqs` dictionary for later use. Given your `count_tweets` function, you can compute a dictionary called `freqs` that contains all the frequencies. We will use this dictionary in several parts of this project.

In [8]:
freqs = count_tweets({}, train_x, train_y)

### Train the model using naive Bayes

Given a `freqs` dictionary, `train_x` (a list of tweets), and `train_y` (a list of labels for each tweet), implement a naive Bayes classifier.

##### Calculate $V$
- You can compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).

##### Calculate $freq_{pos}$ and $freq_{neg}$
- Using your `freqs` dictionary, you can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

##### Calculate $N_{pos}$, and $N_{neg}$
- Using `freqs` dictionary, you can also compute the total number of positive words and total number of negative words $N_{pos}$ and $N_{neg}$.

##### Calculate $T$, $T_{pos}$, $T_{neg}$
- Using the `train_y` input list of labels, calculate the number of documents (tweets) $T$, as well as the number of positive documents (tweets) $T_{pos}$ and number of negative documents (tweets) $T_{neg}$.
- Calculate the probability that a document (tweet) is positive $P(T_{pos})$, and the probability that a document (tweet) is negative $P(T_{neg}).$

##### Calculate the logprior
- Compute the logprior as $\log(T_{pos}) - \log(T_{neg}).$

##### Calculate log-likelihood
- Finally, you can iterate over each word in the vocabulary, use the provided `lookup` function to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
- Compute the positive probability of each word $P(W_{pos})$, negative probability of each word $P(W_{neg})$ using the equations below.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}, $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}.$$


- You can then compute the loglikelihood: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.

**Note:** We will use a dictionary to store the log-likelihoods for each word. The key is the word, the value is the log-likelihood of that word.
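
The steps above can be worked through by hand on a tiny example. The `toy_freqs` dictionary below is invented for illustration and has the same shape as `freqs`; the sketch uses only the standard library.

```python
import math

# Hypothetical (word, label) counts, in the same shape as `freqs`
toy_freqs = {("happi", 1): 3, ("day", 1): 1, ("sad", 0): 2, ("day", 0): 1}

vocab = {word for word, _ in toy_freqs}   # unique words
V = len(vocab)                            # vocabulary size: 3
N_pos = sum(n for (_, label), n in toy_freqs.items() if label == 1)  # 4
N_neg = sum(n for (_, label), n in toy_freqs.items() if label == 0)  # 3

toy_loglikelihood = {}
for word in vocab:
    freq_pos = toy_freqs.get((word, 1), 0)
    freq_neg = toy_freqs.get((word, 0), 0)
    p_w_pos = (freq_pos + 1) / (N_pos + V)   # smoothed positive probability
    p_w_neg = (freq_neg + 1) / (N_neg + V)   # smoothed negative probability
    toy_loglikelihood[word] = math.log(p_w_pos / p_w_neg)

for word in sorted(toy_loglikelihood):
    print(word, round(toy_loglikelihood[word], 3))
```

For "happi", the smoothed probabilities are $4/7$ and $1/6$, so its log-likelihood is $\log(24/7)$, a clearly positive value. The word "day" appears once in each class and ends up close to zero.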

In [9]:
def lookup(freqs, word, label):
    '''
    Input:
        freqs: a dictionary with the frequency of each pair (or tuple)
        word: the word to look up
        label: the label corresponding to the word
    Output:
        n: the number of times the word with its corresponding label appears.
    '''
    n = 0  # freqs.get((word, label), 0)

    pair = (word, label)
    if (pair in freqs):
        n = freqs[pair]

    return n

In [10]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior, log(T_pos) - log(T_neg)
        loglikelihood: a dictionary mapping each word to its log likelihood, log(P(W_pos) / P(W_neg))
    '''
    loglikelihood = {}
    logprior = 0


    # calculate V, the number of unique words in the vocabulary
    # (a set keeps each word once, so the loop below visits every word a single time)
    vocab = {word for word, label in freqs}

    V = len(vocab)

    ### START YOUR CODE HERE ###
    T_pos = sum(train_y) 
    T_neg = len(train_y) - T_pos
    
    N_pos = sum([freq for (word, label), freq in freqs.items() if label == 1])
    N_neg = sum([freq for (word, label), freq in freqs.items() if label == 0])
    
    logprior = np.log(T_pos / T_neg)
    ### END CODE HERE ###

    
    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = lookup(freqs, word, 1.0)
        freq_neg = lookup(freqs, word, 0.0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos+1)/(N_pos+V)
        p_w_neg = (freq_neg+1)/(N_neg+V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos)-np.log(p_w_neg)
        


    return logprior, loglikelihood

Call the `train_naive_bayes` function to calculate the log prior and log likelihood values, then display the log prior (a single number) and the number of words in the log likelihood dictionary.

In [11]:
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
9161


**Expected Output**:<br>
0.0<br>
9161

### Predict the sentiment score

The next step is to implement the `naive_bayes_predict` function to make predictions on tweets.

The function can be described as follows:
* The function takes in the `tweet`, `logprior`, `loglikelihood`.
* It returns the sentiment score of the tweet: a log-odds value that is positive when the tweet is more likely to be positive and negative when it is more likely to be negative.
* For each tweet, sum up loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment of that tweet.

$$ \text{score} = \text{logprior} + \sum_{i=1}^m (\text{loglikelihood}_i).$$

Note we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets).  This means that the ratio of positive to negative tweets is equal to 1. Hence, the logprior is equal to 0. This implies that we are just adding zero to the sum of log-likelihoods. However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.
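
The effect of a non-zero logprior can be seen in a small sketch. The value of `likelihood_sum` and the unbalanced counts are made up for illustration.

```python
import math

# Hypothetical sum of word log-likelihoods for one tweet (illustrative value)
likelihood_sum = 0.5

# Balanced training set: the logprior is zero and does not move the score
logprior_balanced = math.log(4000 / 4000)
score_balanced = logprior_balanced + likelihood_sum

# Hypothetical unbalanced training set with three times as many negative tweets
logprior_unbalanced = math.log(1000 / 3000)
score_unbalanced = logprior_unbalanced + likelihood_sum

print(score_balanced)    # 0.5, classified as positive
print(score_unbalanced)  # about -0.60, classified as negative
```

With the same words, the prediction flips once the prior favors the negative class strongly enough, which is why the logprior should stay in the formula.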

In [12]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        score: the sum of all the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    # Your code
    ### START CODE HERE ###
    score = logprior
    words = process_tweet(tweet)

    for word in words:
        if word in loglikelihood:
            score = score + loglikelihood[word]

    ### END CODE HERE ###

    return score

In [13]:
# Test your function
my_tweet = 'She smiled.'
score = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', score)

The expected output is 1.5574928203010936


**Expected Output**:
- The expected output is around 1.55. This value implies that the sentiment is positive.

### Finding the accuracy

The implementation of `test_naive_bayes` below checks the accuracy of your predictions.
* The function takes in your `test_x`, `test_y`, `logprior`, and `loglikelihood`.
* It returns the accuracy of your model.
* It first uses the `naive_bayes_predict` function to make a prediction for each tweet in `test_x`, then compares the predictions with `test_y`.

The accuracy of the model can be computed as $$\frac{\text{number of tweets classified correctly}}{\text{total number of tweets}}.$$ 
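
As a small illustration of this formula, the sketch below thresholds a few made-up scores at zero and counts the matches. Both lists are hypothetical; a score of exactly zero falls into the negative class here, which mirrors the `> 0` test used in `test_naive_bayes`.

```python
# Hypothetical true labels and predicted scores for five tweets
y_true = [1, 1, 0, 0, 1]
scores = [2.1, -0.3, -1.2, 0.0, 0.7]

# Predict 1 when the score is strictly positive, otherwise 0
y_pred = [1 if s > 0 else 0 for s in scores]

# Accuracy = correctly classified tweets / total tweets
correct = sum(p == t for p, t in zip(y_pred, y_true))
accuracy = correct / len(y_true)

print(y_pred, accuracy)
```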

In [14]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood, naive_bayes_predict=naive_bayes_predict):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy
    """
    accuracy = 0  # return this properly

    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1.0
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0.0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute values of the differences between y_hats and test_y
    # (y_hats is converted to an array so the subtraction is element-wise)
    error = np.mean(np.abs(np.array(y_hats) - test_y))

    # Accuracy is 1 minus the error
    accuracy = 1 - error

    return accuracy

In [15]:
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

Naive Bayes accuracy = 0.9955


**Expected Output**:

`Naive Bayes accuracy = 0.9955`

In [16]:
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')

I am happy -> 2.14
I am bad -> -1.31
this movie should have been great. -> 2.12
great -> 2.13
great great -> 4.26
great great great -> 6.39
great great great great -> 8.52


**Expected Output**:

- I am happy -> 2.14
- I am bad -> -1.31
- this movie should have been great. -> 2.12
- great -> 2.13
- great great -> 4.26
- great great great -> 6.39
- great great great great -> 8.52

In [17]:
# Feel free to check the sentiment of your own tweet below
my_tweet = 'You are doing great!'
naive_bayes_predict(my_tweet, logprior, loglikelihood)

np.float64(2.1289430658835196)