<a href="https://colab.research.google.com/github/thedatadj/natural-language-processing/blob/main/sentiment-analysis/naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table>
    <tr>
        <td>
            <b>Model</b>
        </td>
        <td>
            Naive Bayes
        </td>
    </tr>
    <tr>
        <td>
            <b>Task</b>
        </td>
        <td>
            Classify a tweet as having a positive sentiment or a negative sentiment.
        </td>
    </tr>
    <tr>
        <td>
            <b>Main library</b>
        </td>
        <td>
            NLTK
        </td>
    </tr>
    <tr>
        <td>
            <b>Dataset</b>
        </td>
        <td>
            twitter_samples from NLTK datasets.
        </td>
    </tr>
    <tr>
        <td>
            <b>Based on</b>
        </td>
        <td>
            An assignment from the Natural Language Processing Specialization on Coursera.
        </td>
    </tr>
</table>

# Data Loading

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Twitter dataset
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

Load the positive and negative tweets as lists of strings.

In [None]:

all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_positive_tweets[:3]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!']

Split each set into a training portion (the first 4,000 tweets) and a test portion (the remainder).

In [None]:
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

Build the label arrays: 1 for positive tweets and 0 for negative tweets. The array lengths come from the splits themselves, so nothing is hard-coded.

In [None]:
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

# Data Preprocessing

In [None]:
# String manipulation
import re

**Remove stock tickers, retweet markers, hyperlinks, and hash signs from the tweets.**

In [None]:
train_x0 = []
for tweet in train_x:
    tweet = re.sub(r"\$\w*", "", tweet)
    tweet = re.sub(r"^RT[\s]+", "", tweet)
    tweet = re.sub(r"https?://[^\s\n\r]+", "", tweet)
    tweet = re.sub(r"#", "", tweet)
    train_x0.append(tweet)
train_x = train_x0
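
To see what each substitution does, here is a small sketch that applies the same four patterns to a made-up tweet. The helper name `strip_tweet_noise` and the sample string are hypothetical and exist only for illustration.

```python
import re

def strip_tweet_noise(tweet):
    """Apply the same four substitutions used in the cleaning loop."""
    tweet = re.sub(r"\$\w*", "", tweet)                # stock tickers such as $AAPL
    tweet = re.sub(r"^RT[\s]+", "", tweet)             # leading retweet marker
    tweet = re.sub(r"https?://[^\s\n\r]+", "", tweet)  # hyperlinks
    tweet = re.sub(r"#", "", tweet)                    # hash sign only; the tag text stays
    return tweet

# Hypothetical sample tweet
sample = "RT @user: Loving #NLP $AAPL https://example.com/x :)"
cleaned = strip_tweet_noise(sample)
print(cleaned.split())
```

Note that the handle `@user:` survives this step; in the notebook it is removed later by `TweetTokenizer(strip_handles=True)`.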

**Tokenize the tweets, remove stopwords and punctuation, and stem the remaining tokens.**


In [None]:
from nltk.tokenize import TweetTokenizer

In [None]:
tokenizer = TweetTokenizer(preserve_case=False,
                           strip_handles=True,
                           reduce_len=True)

In [None]:
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
nltk.download('stopwords')
stopwords_list = stopwords.words("english")
train_x0 = []
for tweet in train_x:
    tweet_tokens = tokenizer.tokenize(tweet)
    tweet_clean = []
    for token in tweet_tokens:
        if token not in stopwords_list and token not in string.punctuation:
            token_stem = stemmer.stem(token)
            tweet_clean.append(token_stem)
    train_x0.append(tweet_clean)
train_x = train_x0

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Example of a tweet after preprocessing.

In [None]:
train_x[0]

['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']

# Dictionary of frequencies


Pair each preprocessed tweet with its label.

In [None]:
# Pair each tweet with its label
data = list(zip(train_x, train_y))
data[0]

(['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)'], 1.0)

Create a dictionary mapping (word, class) to its frequency.

In [None]:
dicti = {}
for tweet, label in data:
    for token in tweet:
        if (token, label) not in dicti:
            dicti[(token, label)] = 0
        dicti[(token, label)] += 1

In [None]:
dicti[('top', 1)]

30
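
The same counting can also be written with `collections.Counter` from the standard library. This is an alternative sketch on hypothetical toy data, not the cell the notebook uses. It also shows why looking up `('top', 1)` works even though the labels are stored as floats: `1` and `1.0` compare equal and hash identically.

```python
from collections import Counter

# Hypothetical toy data: (preprocessed tokens, label)
toy_data = [
    (["happi", "day", ":)"], 1.0),
    (["sad", "day", ":("], 0.0),
    (["happi", ":)"], 1.0),
]

# Count every (token, label) pair in one pass
freqs = Counter((token, label) for tweet, label in toy_data for token in tweet)

print(freqs[("happi", 1)])  # integer 1 matches the float label 1.0
print(freqs[("happi", 0)])  # a missing pair counts as 0
```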

# Training

Next, estimate for each word in the vocabulary `vocab` the probability of seeing that word in each class. For example, for the word "hi" in the positive class:

$P(\text{hi} \mid 1) = \frac{\text{Number of times "hi" appears in the positive class}}{\text{Total number of words in the positive class}}$
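
The formula above assigns probability zero to any word that never occurs in a class, which would make a ratio of the two class probabilities undefined. The training cell further down avoids this with add-one (Laplace) smoothing. In the notation below, which is introduced here for illustration, $\text{freq}(w, c)$ is the count stored in `dicti`, $N_c$ is the total word count of class $c$ (`totalpos` or `totalneg`), $V$ is the vocabulary size (`nv`), and $\lambda(w)$ is the value stored per word:

$$P(w \mid c) = \frac{\text{freq}(w, c) + 1}{N_c + V}, \qquad \lambda(w) = \log \frac{P(w \mid 1)}{P(w \mid 0)}$$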

In [None]:
vocab = list({word for word, _ in dicti})
vocab[:10]

['st',
 'beeti',
 '🍹',
 'owli',
 'edward',
 'tropic',
 'reaali',
 'supernatur',
 'poorli',
 'tard']

In [None]:
nv = len(vocab)
nv

9161

Compute:
* Total number of positive words
* Total number of negative words

In [None]:
# Total number of positive words
totalpos = 0

# Total number of negative words
totalneg = 0

for (word, label), frequency in dicti.items():
    if label == 1:
        totalpos += frequency
    else:
        totalneg += frequency

In [None]:
print("Total number of positive words =", totalpos)
print("Total number of negative words =", totalneg)

Total number of positive words = 27543
Total number of negative words = 27137


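Now compute, for each word, the log of the ratio between its smoothed positive and negative probabilities. The `+ 1` in each numerator and the `+ nv` in each denominator are the add-one smoothing terms.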

In [None]:
prediction = {}
for word in vocab:
    posfrequency = dicti.get((word, 1), 0)
    negfrequency = dicti.get((word, 0), 0)
    posprobability = (posfrequency + 1)/(totalpos + nv)
    negprobability = (negfrequency + 1)/(totalneg + nv)
    prediction[word] = np.log(posprobability/negprobability)

In [None]:
len(prediction)

9161
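
As a quick numeric check of that calculation, the sketch below evaluates the same expression on hypothetical counts. The function name `log_likelihood_ratio` and all the numbers are made up for illustration.

```python
import math

def log_likelihood_ratio(pos_freq, neg_freq, total_pos, total_neg, vocab_size):
    """log(P(w|pos) / P(w|neg)) with add-one (Laplace) smoothing."""
    p_pos = (pos_freq + 1) / (total_pos + vocab_size)
    p_neg = (neg_freq + 1) / (total_neg + vocab_size)
    return math.log(p_pos / p_neg)

# Hypothetical counts: a word seen 9 times in positive tweets and never in negative ones
score_pos = log_likelihood_ratio(9, 0, total_pos=100, total_neg=100, vocab_size=50)

# A word seen equally often in both classes carries no signal
score_neutral = log_likelihood_ratio(4, 4, total_pos=100, total_neg=100, vocab_size=50)

print(score_pos, score_neutral)
```

With equal class totals the denominators cancel, so the first score reduces to `log(10 / 1)` and the second to `log(1)`, which is 0.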

# Prediction
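A function that takes a tweet and returns a sentiment score: the sum of the log ratios of the words it shares with the training vocabulary. A score above 0 indicates positive sentiment and a score below 0 indicates negative sentiment. The log prior is left out because the training set contains the same number of positive and negative tweets, which makes it 0.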

In [None]:
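def predict(tweet):
    # Apply the same cleaning that was used on the training tweets
    tweet = re.sub(r"\$\w*", "", tweet)
    tweet = re.sub(r"^RT[\s]+", "", tweet)
    tweet = re.sub(r"https?://[^\s\n\r]+", "", tweet)
    tweet = re.sub(r"#", "", tweet)

    # Tokenize, drop stopwords and punctuation, and stem
    tweet_tokens = tokenizer.tokenize(tweet)
    tweet_clean = []
    for token in tweet_tokens:
        if token not in stopwords_list and token not in string.punctuation:
            token_stem = stemmer.stem(token)
            tweet_clean.append(token_stem)

    # Sum the log ratios of the words seen during training
    p = 0
    for token in tweet_clean:
        if token in prediction:
            p += prediction[token]

    return p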

In [None]:
predict("She smiled.")

1.557492820301094

In [None]:
predict("He laughed.")

-0.1652737774400095
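
The notebook builds `test_x` and `test_y` but never scores them. One way to do that, sketched under the assumption that a score above 0 means "positive", is shown below. The helper `accuracy` and the stand-in scorer `toy_score` are hypothetical; the stand-in only exists so the block runs on its own.

```python
import numpy as np

def accuracy(score_fn, tweets, labels):
    """Fraction of tweets whose thresholded score matches the label."""
    predicted = np.array([1.0 if score_fn(t) > 0 else 0.0 for t in tweets])
    return float(np.mean(predicted == np.asarray(labels)))

# Stand-in scorer so this block is self-contained; in the notebook,
# pass `predict` together with `test_x` and `test_y` instead.
def toy_score(tweet):
    return 1.0 if ":)" in tweet else -1.0

# Hypothetical toy data; the third label is deliberately at odds with the scorer
toy_tweets = ["great day :)", "so tired :(", "fine :)", "meh"]
toy_labels = [1.0, 0.0, 0.0, 0.0]

toy_accuracy = accuracy(toy_score, toy_tweets, toy_labels)
print(toy_accuracy)
```

In the notebook itself the call would be `accuracy(predict, test_x, test_y)`, since `predict` accepts raw tweet strings.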