# Sentiment analysis of tweets with Naive Bayes  
Author: Xuebing-2023
In this notebook, I will:
1. preprocess the tweets
2. train the Naive Bayes model
3. predict the sentiment of a new sentence
4. test the Naive Bayes model
5. do some error analysis

In [85]:
import nltk
import string #for punctuation
import re
import os
import numpy as np

## download the data

In [2]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/zhaosisi/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [17]:
# get the positive and negative tweets
all_positive_tweets = nltk.corpus.twitter_samples.strings('positive_tweets.json')
all_negative_tweets = nltk.corpus.twitter_samples.strings('negative_tweets.json')
print('all_positive_tweets: ', len(all_positive_tweets), '\nall_negative_tweets: ', len(all_negative_tweets))

all_positive_tweets:  5000 
all_negative_tweets:  5000


In [24]:
# split the data into train and test sets using an 80/20 split (4000/1000 tweets per class)
train_x = all_positive_tweets[:4000] + all_negative_tweets[:4000]
test_x = all_positive_tweets[4000:] + all_negative_tweets[4000:]

train_y = np.append(np.ones(4000),np.zeros(4000))
test_y = np.append(np.ones(1000), np.zeros(1000))
print(len(train_x), len(train_y), len(test_x), len(test_y))

8000 8000 2000 2000


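## preprocessing the sentence
1. remove stock tickers, the retweet marker "RT", URLs, and the `#` sign
2. tokenize, lowercase, and strip @handles
3. remove stop words and punctuation
4. stemming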
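
The regex-based cleaning steps can be tried in isolation with only the standard library. This is a minimal sketch, not the notebook's method: `strip_noise` is a hypothetical helper name, and the handle pattern `@\w+` is an assumption made for illustration, whereas the notebook relies on `TweetTokenizer(strip_handles=True)`. Stop-word removal and stemming are left out because they need NLTK data.

```python
import re

def strip_noise(tweet):
    """Hypothetical helper: regex cleaning only (no stop words, no stemming)."""
    tweet = re.sub(r'\$\w*', '', tweet)                # stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # old-style retweet marker
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # drop '#', keep the hashtag word
    tweet = re.sub(r'@\w+', '', tweet)                 # assumption: simple handle pattern
    return ' '.join(tweet.lower().split())             # lowercase, collapse whitespace

cleaned = strip_noise(
    "RT @Twitter @chapagain Hello There! Have a great day. :) "
    "#good #morning http://chapagain.com.np"
)
print(cleaned)
```

Punctuation and stop words such as "there" and "have" are still present here; in the notebook they are removed after tokenization.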

In [45]:
def process_tweet(tweet):
    '''
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    '''
    stemmer = nltk.stem.PorterStemmer()
    # the stop word list requires a one-time nltk.download('stopwords')
    stopwords_english = nltk.corpus.stopwords.words('english')
    tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove only the hash sign '#', keeping the hashtag word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [47]:
#example:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print(process_tweet(custom_tweet))

['hello', 'great', 'day', ':)', 'good', 'morn']


## build the vocabulary of frequencies
1. use the whole training set, train_x and train_y, to build a dictionary
2. the dictionary maps each (word, sentiment) pair to its count: {(word, sentiment): count}
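
The same (word, sentiment) counting can be sketched with `collections.Counter`. This is an alternative shown for illustration, not the notebook's implementation: `count_pairs` is a hypothetical name, and the input is assumed to be already tokenized and stemmed so the example runs without NLTK.

```python
from collections import Counter

def count_pairs(tokenized_tweets, labels):
    """Hypothetical helper: count (word, label) pairs over pre-tokenized tweets."""
    counts = Counter()
    for words, y in zip(tokenized_tweets, labels):
        # each word in the tweet contributes one count to its (word, label) key
        counts.update((word, y) for word in words)
    return counts

# assumed output of preprocessing for five short tweets
toy_tokens = [['happi'], ['trick'], ['sad'], ['tire'], ['tire']]
toy_labels = [1, 0, 0, 0, 0]
toy_freqs = count_pairs(toy_tokens, toy_labels)
print(dict(toy_freqs))
```

A `Counter` returns 0 for missing keys, which plays the same role as `vocabulary.get(pair, 0)`.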

In [50]:
def count_tweets(tweets, y_sentiment):
    '''
    Input:
        tweets: a list of tweets
        y_sentiment: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        vocabulary: a dictionary mapping each pair to its frequency
    '''
    vocabulary = {}
    for tweet,y in zip(tweets, y_sentiment):
        words = process_tweet(tweet)
        for word in words:
            # define the key, which is the word and label tuple
            pair = (word, y)
            vocabulary[pair] = vocabulary.get(pair,0) + 1
    return vocabulary

In [51]:
#example:
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(tweets, ys)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

In [54]:
# build the frequency dictionary over the whole training set for later use
freqs = count_tweets(train_x, train_y)
len(freqs)

11430

## train the naive bayes
1. logprior is $\log(D_{pos}) - \log(D_{neg})$
    - $D_{pos}$ is the number of positive training samples (tweets)
    - $D_{neg}$ is the number of negative training samples (tweets)
2. loglikelihood of a word: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$
    - $P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}$, $P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}$ (add-one, i.e. Laplace, smoothing)
    - $freq_{pos}, freq_{neg}$ are the counts of the word in positive/negative tweets
    - $N_{pos}, N_{neg}$ are the total counts of all words in positive/negative tweets
    - $V$ is the number of unique words in the vocabulary
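
A small worked example may make the formulas concrete. The sketch below assumes a toy frequency dictionary (the one from the `count_tweets` example) with one positive and four negative tweets; `smoothed_llr` is a hypothetical helper name, and plain `math` is used instead of NumPy to keep it self-contained.

```python
import math

# toy frequencies: one positive tweet, four negative tweets (assumed data)
toy_freqs = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
D_pos, D_neg = 1, 4

logprior = math.log(D_pos) - math.log(D_neg)

V = len({word for word, _ in toy_freqs})                         # unique words
N_pos = sum(c for (_, y), c in toy_freqs.items() if y == 1)      # total positive counts
N_neg = sum(c for (_, y), c in toy_freqs.items() if y == 0)      # total negative counts

def smoothed_llr(word):
    """Hypothetical helper: log likelihood ratio with add-one smoothing."""
    p_w_pos = (toy_freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (toy_freqs.get((word, 0), 0) + 1) / (N_neg + V)
    return math.log(p_w_pos) - math.log(p_w_neg)

print(V, N_pos, N_neg)
print(smoothed_llr('happi'), smoothed_llr('tire'))
```

Here $V = 4$, $N_{pos} = 1$, $N_{neg} = 4$, so for "happi" the ratio is $\frac{2/5}{1/8} = 3.2$ (a positive log value), while "tire" gets $\frac{1/5}{3/8}$ (a negative log value).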
    

In [56]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, sentiment) to how often the word appears
        train_x: a list of tweets
        train_y: a numpy array of labels corresponding to the tweets (0 or 1)
    Output:
        logprior: the log prior
        loglikelihood: a dictionary mapping each word to its log likelihood ratio
    '''
    loglikelihood = {}
    logprior = 0.0
   
    # Calculate D_pos, D_neg, the number of documents
    D_pos = train_y.sum()
    D_neg = len(train_y) - D_pos
    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    
    
    # calculate V, the number of unique words in the vocabulary
    unique_vocab = set(pair[0] for pair in freqs)
    V = len(unique_vocab) 
    # calculate N_pos and N_neg, the total word counts per class, from freqs
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # positive_sentiment
        if pair[1] > 0:
            N_pos += freqs[pair]
        # else, the label is negative
        else:
            N_neg += freqs[pair]

    for word in unique_vocab:
        # get the positive and negative frequency of the word
        freq_pos = freqs.get((word,1.0),0)
        freq_neg = freqs.get((word,0.0),0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1)/(N_pos + V)
        p_w_neg = (freq_neg + 1)/(N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos) - np.log(p_w_neg)  # equals np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood

In [57]:
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
9161


## predict with the naive bayes model
1. compute a score that indicates whether the tweet belongs to the positive or negative class.
2. $p = logprior + \sum_{i=1}^{N} loglikelihood_i$, where $N$ is the number of words in the processed tweet
3. For each tweet, sum the loglikelihoods of its words and add the logprior; a score above 0 is predicted positive
In [58]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    # process the tweet to get a list of words
    words = process_tweet(tweet)

    # start from the log prior, then add each word's log likelihood
    p = logprior

    for word in words:
        # words not seen in training contribute 0
        p += loglikelihood.get(word, 0)

    return p

In [63]:
#example: 
my_tweet = 'my dog looks very happy'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is 1.967074372229086


## test the model
1. using only the logprior and the loglikelihood dictionary, predict a label for each tweet in test_x and compare it with test_y
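
The accuracy calculation itself is independent of the model, so it can be sketched on a few made-up scores. `accuracy_from_scores` is a hypothetical name and the numbers are illustrative only.

```python
def accuracy_from_scores(scores, labels):
    """Hypothetical helper: threshold scores at 0 and compare with the true labels."""
    predictions = [1 if s > 0 else 0 for s in scores]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return correct / len(labels)

# made-up scores and labels, for illustration only
toy_scores = [1.9, -0.4, 0.3, -2.1]
toy_labels = [1, 0, 0, 0]
acc = accuracy_from_scores(toy_scores, toy_labels)
print(acc)
```

The third tweet has a positive score but a negative label, so three of the four predictions are correct.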

In [74]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: a list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
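    y_hats = []
    for tweet in test_x:
        p = naive_bayes_predict(tweet, logprior, loglikelihood)
        # predict positive (1) when the score is above zero, otherwise negative (0)
        y_hats.append(1 if p > 0 else 0)
    # convert to arrays so the comparison is element-wise even if test_y is a plain list
    error = np.sum(np.asarray(y_hats) != np.asarray(test_y))
    accuracy = 1 - error / len(test_y)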

    return accuracy

In [75]:
accuracy= test_naive_bayes(test_x, test_y, logprior, loglikelihood)
print("Naive Bayes accuracy = %0.4f" %(accuracy))

Naive Bayes accuracy = 0.9955


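## Error analysis
1. for the misclassified data points, print the true label, the predicted label, and the tweet to find out what went wrong
    - humans are good at reading adversarial language (sarcasm, irony, and euphemisms); the model is not
    - several of the misclassified tweets contain spaced emoticons such as `: )` or `: (`, which may not be treated as a single emoticon token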
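
One way to dig into a single misclassified tweet is to list how much each word contributes to the score. The sketch below is an assumption-laden illustration: `explain_score` is a hypothetical helper, the log likelihood values are made up, and the token list assumes a spaced emoticon was split into `:` and `(`.

```python
def explain_score(words, logprior, loglikelihood):
    """Hypothetical helper: return the total score and per-word contributions."""
    contributions = [(w, loglikelihood.get(w, 0.0)) for w in words]
    # show the most influential words first
    contributions.sort(key=lambda pair: abs(pair[1]), reverse=True)
    total = logprior + sum(value for _, value in contributions)
    return total, contributions

# made-up values: the joined emoticon ':(' is strongly negative,
# but its separate pieces are not in the dictionary
toy_loglikelihood = {'fun': 0.8, 'david': 0.1, ':(': -2.5}
total, contributions = explain_score(['fun', 'david', ':', '('], 0.0, toy_loglikelihood)
print(total, contributions)
```

Under these assumptions the negative emoticon contributes nothing, so the mildly positive words push the score above 0.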

In [84]:
print('Truth Predicted Tweet')
for x,y in zip(test_x, test_y):
    p = naive_bayes_predict(x, logprior, loglikelihood)
    y_hat_i = 1.0 if p>0 else 0.0
    if y != y_hat_i: 
        print(y, y_hat_i, x)

Truth Predicted Tweet
1.0 0.0 @jaredNOTsubway @iluvmariah @Bravotv Then that truly is a LATERAL move! Now, we all know the Queen Bee is UPWARD BOUND : ) #MovingOnUp
1.0 0.0 A new report talks about how we burn more calories in the cold, because we work harder to warm up. Feel any better about the weather? :p
1.0 0.0 Harry and niall and -94 (when harry was born) ik it's stupid and i wanna change it :D https://t.co/gHAt8ZDAfF
1.0 0.0 off to the park to get some sunlight : )
1.0 0.0 @msarosh Uff Itna Miss karhy thy ap :p
0.0 1.0 @rcdlccom hello, any info about possible interest in Jonathas ?? He is close to join Betis :( greatings
0.0 1.0 @phenomyoutube u probs had more fun with david than me : (
0.0 1.0 pats jay : (
0.0 1.0 Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
