# <center> <font size = 24 color = 'steelblue'> <b> Twitter data sentiment analysis using NLTK - Naive Bayes

# <a id= 'p0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Objective](#p1)<br>
[2. Solution](#p2)<br>
>[2.1. Import necessary packages](#p2.1)<br>
>[2.2. Data acquisition](#p2.2)<br>
>[2.3. Data cleaning](#p2.3)<br>
>[2.4 Data exploration](#p2.4)<br>
>[2.5 Model development](#p2.5)<br>
>[2.6 Prediction for a single tweet input using the classifier](#p2.6)<br>
>[2.7 Prediction for the test-set](#p2.7)
    

##### <a id = 'p1'>
<font size = 10 color = 'midnightblue'> <b> **Objective**
    

<div class="alert alert-block alert-info">
<font size = 4> 

- The objective of this project is to train a Naive Bayes classifier on labeled Twitter data and use the model for sentiment prediction.
- The model is trained on the Twitter sample data from the NLTK corpus.
- The Twitter sample corpus provides 3 files:
    - positive_tweets: tweets labeled as positive sentiment,
    - negative_tweets: tweets labeled as negative sentiment,
    - tweets.20150430-223406: unlabeled tweets.

##### <a id = 'p2'>
<font size = 10 color = 'midnightblue'> <b> Solution

<a id = 'p2.1'>
    
## <font size = 6 color = pwdrblue> **Import necessary packages**

In [None]:
# wordcloud is added because it is imported in the next cell
%pip install advertools vaderSentiment textblob wordcloud

In [None]:
import nltk

# for data cleaning :
import re
import advertools as adv # handling pictorial emojis
from string import punctuation

# Exploration and Visualization
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# for model development
import random
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Evaluation:
from sklearn.metrics import classification_report


<font size = 5 color = seagreen><b> Downloading necessary nltk corpus

In [None]:
nltk.download("twitter_samples")
nltk.download('words')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('stopwords')
nltk.download('wordnet')

[top](#p0)

<a id = 'p2.2'>
    
## <font size = 6 color = pwdrblue><b> Data Acquisition

In [None]:
nltk.corpus.twitter_samples.fileids()

In [None]:
positive = [tweet for tweet in nltk.corpus.twitter_samples.strings('positive_tweets.json')]
negative = [tweet for tweet in nltk.corpus.twitter_samples.strings('negative_tweets.json')]

<font size = 5 color = seagreen><b> Create a collective labelled data set by labeling each tweet

In [None]:
labeled_tweets = [(p, 'pos') for p in positive] + [(n, 'neg') for n in negative]

<font size = 5 color = seagreen><b> Shuffle data to get random train and test samples.

In [None]:
random.shuffle(labeled_tweets)

<font size = 5 color = seagreen><b>  Consider first 1500 data rows as test and remaining as train.

In [None]:
test  = labeled_tweets[:1500]
train = labeled_tweets[1500:]
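
Because `random.shuffle` draws from the global random state, the split above changes on every run. The following is a minimal sketch of a reproducible split, assuming a fixed seed is acceptable. The helper name `split_data`, the seed value `42` and the toy data are illustrative and not part of the original notebook.

```python
import random

def split_data(rows, n_test, seed=42):
    """Shuffle a copy of rows with a fixed seed and split off the first n_test rows."""
    rows = list(rows)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)    # local generator, global state unchanged
    return rows[:n_test], rows[n_test:]

# Toy stand-in for labeled_tweets
toy = [(f"tweet {i}", "pos" if i % 2 else "neg") for i in range(10)]
toy_test, toy_train = split_data(toy, n_test=3)
print(len(toy_test), len(toy_train))
```

Calling `split_data` twice with the same seed returns the same partition, which makes accuracy figures comparable between runs.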

[top](#p0)

<a id = 'p2.3'>
    
## <font size = 6 color = pwdrblue><b> Data Cleaning

<div class="alert alert-block alert-success">
<font size = 4> 
    
**Implement basic data cleaning steps on this data such as:**
  * Remove stopwords
  * Remove punctuation
  * Remove hyperlinks/urls
  * Remove mentions i.e @abc_reader
  * Remove any other additional symbols

<font size = 5 color = seagreen><b> Define separate functions for each of the steps.

In [None]:
# Function for separating text and pictorial emojis
def handling_emoji(tweet):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    text =  emoji_pattern.sub(r'', tweet)
    # extract_emoji expects a list of texts and returns one emoji list per text
    emoji_list = adv.extract_emoji([tweet])['emoji'][0]
    emojis = ' '.join(emoji_list)
    return text + ' ' + emojis

In [None]:
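# Function to remove punctuation using regex
def removePunct(tweet):
    # Match a letter followed by one or more punctuation characters;
    # re.escape keeps characters such as ']' and '\' literal inside the class.
    pat = re.compile('[A-Za-z][{}]+'.format(re.escape(punctuation)))
    for match in pat.findall(tweet):
        # Drop every occurrence of the last punctuation character of the match
        tweet = tweet.replace(match[-1], '')
    return tweet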
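
The punctuation step is easy to misread: it looks for a letter followed by a run of punctuation, then deletes the last character of that run everywhere in the tweet. Punctuation that does not follow a letter, such as a standalone emoticon, is left alone. The sketch below reproduces that logic in a self-contained form. The name `strip_punct_after_letters` and the sample text are illustrative, and the use of `re.escape` is an assumption made here to keep the character class safe.

```python
import re
from string import punctuation

def strip_punct_after_letters(tweet):
    """Remove punctuation characters that directly follow a letter."""
    pat = re.compile('[A-Za-z][{}]+'.format(re.escape(punctuation)))
    for match in pat.findall(tweet):
        # The last character of the match is removed from the whole tweet
        tweet = tweet.replace(match[-1], '')
    return tweet

sample = "Great day!!! :) Thanks, team."
print(strip_punct_after_letters(sample))
```

Keeping emoticons such as `:)` matters for this corpus, since they can carry a strong sentiment signal.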

In [None]:
# Function to remove hyperlink/urls:
def removeLinks(tweet):
    pat = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(pat,'', tweet)

In [None]:
# Function to remove mentions using regex:
def removeMentions(tweet):
    return re.sub(r'@[A-Za-z0-9_]+', '',tweet)

In [None]:
# Function to remove stopwords from the text:
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests
def removeStpWrds(tokens):
    return [word.lower() for word in tokens if word.lower() not in stopwords]

In [None]:
# Function to remove symbols like # and words like "RT":
def removeSymbols(tweet):
    tweet = re.sub(r'#', '', tweet) # remove the "#" symbol, keep the hashtag text
    tweet = re.sub(r'\bRT\s+', '', tweet) # remove the retweet marker "RT" (whole word only)
    return tweet

<font size = 5 color = seagreen><b> Create a common function for data cleaning implementing all these steps.

In [None]:
def data_cleaning(tweet):
    tweet  = handling_emoji(tweet)
    tweet  = removeLinks(tweet)
    tweet  = removeMentions(tweet)
    tweet  = removeSymbols(tweet) # remove "#" and "RT" before lowercasing, so "RT" can still match
    tweet  = re.sub('[0-9]+', '', tweet) # remove digits
    tokens = removePunct(tweet).split()
    tokens = removeStpWrds(tokens) # also lowercases the tokens
    tokens = [w for w in tokens if w not in punctuation]
    return ' '.join(tokens) # re-join with single spaces
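
To make the effect of the pipeline concrete, here is a simplified, standard-library-only sketch of the same sequence of steps applied to one made-up tweet. It is not the notebook's function: the stopword list `STOPWORDS` is a tiny hypothetical stand-in for NLTK's English list, the URL pattern is simplified, and emoji handling is omitted.

```python
import re
from string import punctuation

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "is", "a", "to", "and", "in", "for"}
PUNCT = set(punctuation)  # single punctuation characters

def clean_tweet_sketch(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet)      # drop URLs (simplified pattern)
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)    # drop mentions
    tweet = re.sub(r"\bRT\s+", "", tweet)           # drop the retweet marker
    tweet = tweet.replace("#", "")                  # keep hashtag text, drop '#'
    tweet = re.sub(r"[0-9]+", "", tweet)            # drop digits
    tokens = [t.lower() for t in tweet.split()]
    # Remove stopwords and tokens that are a single punctuation character
    tokens = [t for t in tokens if t not in STOPWORDS and t not in PUNCT]
    return " ".join(tokens)

raw = "RT @abc_reader: Loving the new update in 2024 :) #happy https://t.co/xyz"
print(clean_tweet_sketch(raw))
```

The emoticon survives while the mention, link, digits, stopwords and the stray colon are removed.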

<font size = 5 color = seagreen><b> Apply the data-cleaning function on the train data set.

In [None]:
train_clean = []
empty_twt_count = 0  # tweets that are empty after cleaning
for tweet, lab in train:
    tweet = data_cleaning(tweet)
    if len(tweet) > 0:
        train_clean.append((tweet, lab))
    else:
        empty_twt_count += 1
print(train_clean[:5])

<div class="alert alert-block alert-info">
<font size = 4> 
    
**Note:**
- Additionally, a spell check can be performed to remove irrelevant or misspelled words.
- Here we assume that misspelled words are not repeated often and hence will not affect the analysis significantly.
- The data cleaning stage of a text analysis may include many other steps, depending on the quality of the text data and the objective of the analysis.


[top](#p0)

<a id = 'p2.4'>
    
## <font size = 6 color = pwdrblue><b> Explore the data<br>

<font size =5 color = seagreen><b>  Study distribution of pos to neg tweets in train data

In [None]:
pd.Series([label for tweet, label in train]).value_counts().plot.pie(cmap = 'Set3', autopct = "%.2f %%")
plt.ylabel('')
plt.show()

<font size =5 color = seagreen><b>  Extract positive and negative tweets from train data for frequency analysis.

In [None]:
pos_clean = [p for p, l in train_clean if l == 'pos' ]
neg_clean = [n for n, l in train_clean if l == 'neg' ]

<font size =5 color = seagreen><b>  Visually analyze the positive and negative tweets using wordcloud.


In [None]:
positive_words_freq = nltk.FreqDist(" ".join(pos_clean).split())
wordcloud_pos = WordCloud(width = 800, height = 600,
                      background_color = "white", colormap = 'viridis',
                     max_words = 50)
wc_pos = wordcloud_pos.generate_from_frequencies(frequencies=positive_words_freq)

In [None]:
negative_words_freq = nltk.FreqDist(" ".join(neg_clean).split())
wordcloud_neg = WordCloud(width = 800, height = 600,
                      background_color = "white", colormap = 'Accent',
                     max_words = 50)
wc_neg = wordcloud_neg.generate_from_frequencies(frequencies=negative_words_freq)


In [None]:
f,ax = plt.subplots(1,2, figsize = (20,8))
ax[0].set_title("Positive Tweets", size = 30, pad = 22, weight = 'bold')
ax[0].imshow(wordcloud_pos, interpolation="bilinear")
ax[1].set_title("Negative Tweets", size = 30, pad = 22, weight = 'bold')
ax[1].imshow(wordcloud_neg, interpolation="bilinear")
ax[0].axis("off")
ax[1].axis("off")
plt.show()
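
A word cloud gives a visual impression only. The exact counts behind it can be listed with `most_common`, which `nltk.FreqDist` shares with `collections.Counter`. The sketch below uses `Counter` and a made-up list of cleaned tweets so that it runs without the corpus. The helper name `top_words` is illustrative.

```python
from collections import Counter

def top_words(tweets, n=3):
    """Return the n most frequent whitespace-separated tokens across tweets."""
    counts = Counter(" ".join(tweets).split())
    return counts.most_common(n)

# Toy stand-in for pos_clean
toy_pos = ["great day", "great job team", "happy day"]
print(top_words(toy_pos, n=2))
```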

<font size =5 color = seagreen><b> Create a vocabulary using these tweets to generate features.

<div class="alert alert-block alert-success">
<font size = 4> 

**Use only words in train data.**

In [None]:
strings = [w for w, l in train_clean]
words = ' '.join(strings).split()

In [None]:
print(f"Total number of words = {len(words)}")

In [None]:
unique_words = set(words)
print(f"Total no. of unique words = {len(unique_words)}")

<div class="alert alert-block alert-success">
<font size = 4> 

**Use this vocabulary to generate features manually:**
- Taking the simplest case, we generate features based on the presence or absence of each vocabulary word in the labeled tweet.
- This yields a sparse binary feature matrix with labels.
- For compatibility with the NLTK Naive Bayes classifier, we need a feature dictionary for each tweet.

In [None]:
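def feature_extraction(tweet):
    # Compare against whole tokens; `w in tweet` on a string would also match substrings
    tokens = set(tweet.split())
    return {f'contains_{w}' : int(w in tokens) for w in unique_words}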
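
As a toy illustration of presence/absence features, the sketch below builds the feature dictionary for one tweet from a hypothetical three-word vocabulary. It checks membership against the set of tokens rather than the raw string, because a substring test would report `art` as present in `party`. The names `extract_features_sketch` and `vocab` are illustrative.

```python
def extract_features_sketch(tweet, vocabulary):
    """Map each vocabulary word to 1 if it occurs as a token in the tweet, else 0."""
    tokens = set(tweet.split())
    return {f"contains_{w}": int(w in tokens) for w in sorted(vocabulary)}

vocab = {"art", "party", "sad"}
tweet = "great party tonight"

print(extract_features_sketch(tweet, vocab))
print("substring check for 'art':", "art" in tweet)  # True, although 'art' is not a token
```

Each tweet gets one entry per vocabulary word, so the dictionaries grow with the vocabulary even though most values are 0.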

In [None]:
train_features = [(feature_extraction(t), l) for t,l in train_clean]


<a id = 'p2.5'>
    
## <font size = 6 color = pwdrblue><b> Model development

<font size =5 color = seagreen><b>  Train the model

In [None]:
clf = nltk.NaiveBayesClassifier.train(train_features)

<font size =5 color = seagreen><b> Check accuracy on train data

In [None]:
acc_train = nltk.classify.accuracy(clf, train_features)
print("Accuracy on train data : ", acc_train)

<font size =5 color = seagreen><b>  Get the most informative features to understand the model.

In [None]:
clf.show_most_informative_features(n = 15)
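
To show roughly what a Naive Bayes classifier computes from presence/absence features, here is a small from-scratch sketch in plain Python. It is not NLTK's implementation, which differs in details such as smoothing. The add-one smoothing, the function names `train_nb` and `predict_nb`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (set_of_tokens, label). Returns label counts, word counts, vocabulary."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)  # label -> number of tweets containing each word
    vocab = set()
    for tokens, label in samples:
        vocab |= tokens
        word_counts[label].update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(tokens, label_counts, word_counts, vocab):
    """Return the label with the highest log posterior for the given token set."""
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        log_prob = math.log(n_label / total)  # log prior
        for w in vocab:
            # Add-one smoothed estimate of P(word present | label)
            p = (word_counts[label][w] + 1) / (n_label + 2)
            log_prob += math.log(p if w in tokens else 1 - p)
        scores[label] = log_prob
    return max(scores, key=scores.get)

toy = [({"love", "great"}, "pos"), ({"happy", "great"}, "pos"),
       ({"sad", "bad"}, "neg"), ({"hate", "bad"}, "neg")]
model = train_nb(toy)
print(predict_nb({"great", "day"}, *model))
```

Words outside the training vocabulary, such as `day` here, contribute nothing to the score, which mirrors how the feature dictionaries are built from training words only.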

[top](#p0)

<a id = 'p2.6'>
    
## <font size = 6 color = pwdrblue><b> Prediction for a single tweet input using the classifier

<font size =5 color = seagreen><b>   Use this model to classify a new tweet.

In [None]:
sample_tweet = "Absolutely loving the new features in the latest update! The user interface is sleek, and the performance is top-notch. Great job, @ProductTeam! 👏 #HappyCustomer #ProductLove"
print(sample_tweet)

<font size =5 color = seagreen><b>   Clean the tweet and extract features:

In [None]:
clean_tweet = data_cleaning(sample_tweet)

In [None]:
feature_set = feature_extraction(clean_tweet)

In [None]:
clf.classify(feature_set)

<font size =5 color = seagreen><b>   Get sentiment score using the textblob and vadersentiment packages without employing any data cleaning process

In [None]:
def getAnalysis(score):
    if score <= 0:
        return 'Negative'
    else:
        return 'Positive'
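
Note that `getAnalysis` maps a score of exactly 0, which usually means no sentiment evidence was found, to `'Negative'`. That is needed here because the corpus has only two labels. Where a neutral class is acceptable, a three-way mapping is a common alternative. The VADER documentation suggests cut-offs of ±0.05 on the compound score. The sketch below applies that idea. The function name `get_analysis_3way` and the default threshold are illustrative, and this variant is not used in the evaluation.

```python
def get_analysis_3way(score, threshold=0.05):
    """Map a polarity or compound score to Positive, Negative or Neutral."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

for s in (0.6, 0.0, -0.3):
    print(s, get_analysis_3way(s))
```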

<font size =5 color = seagreen><b>   Using TextBlob

In [None]:
# create a text blob object of the text
blob = TextBlob(sample_tweet)

# get the sentiment object (polarity and subjectivity)
sentiment_obj = blob.sentiment

# get the sentiment based on polarity value
sentiment = getAnalysis(sentiment_obj.polarity)

In [None]:
# Display results
print("Text: ", sample_tweet)
print("Sentiment object : ", sentiment_obj)
print("Sentiment : ",sentiment)

<font size =5 color = seagreen><b>   Using vader

In [None]:
# create an analyzer object
analyzer = SentimentIntensityAnalyzer()

# obtain the polarity scores
vs = analyzer.polarity_scores(sample_tweet)

# display results
print("{} \n{}".format(sample_tweet, str(vs)))
print("Sentiment : ",getAnalysis(vs['compound']))

[top](#p0)

<a id = 'p2.7'>
    
## <font size = 6 color = pwdrblue><b>  Prediction for the test-set

<font size =5 color = seagreen><b> Implement cleaning and feature extraction steps for test data

In [None]:
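test_set = []
test_feature_set = []
test_labels = []
empty_tweet = 0

for tweet, lab in test:
    tweet_clean = data_cleaning(tweet)
    if len(tweet_clean) == 0:
        # Nothing survived cleaning; the tweet is still classified (all features 0)
        # so that test_labels stays aligned with the raw tweets in `test`.
        empty_tweet += 1
    features = feature_extraction(tweet_clean)
    test_set.append((features, lab))
    test_feature_set.append(features)
    test_labels.append(lab)
print('Count of cleaned test tweets :', len(test_set))
print('Count of tweets empty after cleaning :', empty_tweet)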

<font size =5 color = seagreen><b>Classify using trained classifier model.

In [None]:
test_pred_nb = clf.classify_many(test_feature_set)

<font size =5 color = seagreen><b> Evaluate the model by getting classification report

In [None]:
confusion_mat_nb = pd.crosstab(index = test_labels, columns = test_pred_nb )

In [None]:
cs_nb = classification_report(y_true = test_labels, y_pred = test_pred_nb)
print(cs_nb)
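
The classification report lists precision and recall per class. As a reminder of what those numbers mean, the sketch below computes them by hand for one class from two label lists. The helper name `precision_recall` and the toy labels are illustrative and are not results from this notebook.

```python
from collections import Counter

def precision_recall(y_true, y_pred, positive="pos"):
    """Compute precision and recall for one class from paired label lists."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(positive, positive)]
    fp = sum(c for (t, p), c in pairs.items() if p == positive and t != positive)
    fn = sum(c for (t, p), c in pairs.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision answers "of the tweets predicted positive, how many were positive", while recall answers "of the positive tweets, how many were found".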

<font size =5 color = seagreen><b> Get predictions and accuracies for the test set using the TextBlob and VADER sentiment models as well.

In [None]:
# prediction using TextBlob
def classify_tb(tweet):
    # create a text blob object of the text
    blob = TextBlob(tweet)

    # get the sentiment based on polarity value
    return getAnalysis(blob.sentiment.polarity)

In [None]:
test_pred_tb = [classify_tb(tweet) for tweet, lab in test]

In [None]:
test_pred_tb = list(map(lambda x: x.lower()[:3], test_pred_tb))

In [None]:
confusion_mat_tb = pd.crosstab(index = test_labels, columns = test_pred_tb )
cf_tb = classification_report(y_true = test_labels, y_pred = test_pred_tb )

In [None]:
# prediction using Vader
def usingVader(tweet):
    # obtain the polarity scores
    vs = analyzer.polarity_scores(tweet)
    sentiment = getAnalysis(vs['compound'])
    return sentiment

In [None]:
test_pred_vs = [usingVader(tweet) for tweet, lab in test]

In [None]:
test_pred_vs = list(map(lambda x: x.lower()[:3], test_pred_vs))

In [None]:
confusion_mat_vs = pd.crosstab(index = test_labels, columns = test_pred_vs )
cf_vs = classification_report(y_true = test_labels, y_pred = test_pred_vs )

<font size =5 color = seagreen><b> Display all results and compare:

In [None]:
print("\nNaive Bayes :\n\n", cs_nb)
print("\nText Blob :\n\n", cf_tb)
print("\nVader :\n\n",cf_vs)


<div class="alert alert-block alert-success">
<font size = 4> 

**Conclusion:**
* The manually trained classifier (Naive Bayes) achieves a very high accuracy.


[top](#p0)