# Tutorial 7 (Part I)

## Introduction
Whether you are looking at Twitter, Goodreads, or Amazon, there is scarcely any digital space that is not overrun with people's opinions. In the modern world, it is vital for businesses to dig into these opinions and learn what customers think of their goods or services. However, the sheer volume of this data makes it virtually impossible to assess manually. This is where sentiment analysis, yet another application of data science, enters the picture. In this post, we'll examine what sentiment analysis entails and some of the Python implementations that are available.

## What is Sentiment Analysis?
Sentiment analysis is a use case of natural language processing (NLP) and falls under the heading of text classification. Simply put, sentiment analysis involves classifying a text into categories such as happy, sad, or neutral. The ultimate goal is to determine the underlying tone, emotion, or sentiment of a document. Another name for this is opinion mining.

## VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analyzer whose lexicon is tuned for social media text. Just like TextBlob, its usage in Python is pretty simple.

In [1]:
import warnings
warnings.filterwarnings('ignore') # We can suppress the warnings

In [2]:
pip install vaderSentiment

Collecting vaderSentiment
  Obtaining dependency information for vaderSentiment from https://files.pythonhosted.org/packages/76/fc/310e16254683c1ed35eeb97386986d6c00bc29df17ce280aed64d55537e9/vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Using cached vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Load the libraries
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Create and initialise an object
sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between writing style and plot."
text_2 =  "The pizza tastes terrible."

sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)

print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}


In [4]:
# Install and import nltk
!pip install nltk
import nltk

# Download the lexicon
nltk.download("vader_lexicon")



[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\miqba\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [5]:
# Import the analyzer class from NLTK
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create an instance of SentimentIntensityAnalyzer
sent_analyzer = SentimentIntensityAnalyzer()

# Example
sentence = "VADER is pretty good at identifying the underlying sentiment of a text!"
print(sent_analyzer.polarity_scores(sentence))

{'neg': 0.0, 'neu': 0.585, 'pos': 0.415, 'compound': 0.75}


The compound score summarizes the overall sentiment and is usually interpreted as follows:

* a positive sentiment, compound ≥ 0.05.
* a negative sentiment, compound ≤ -0.05.
* a neutral sentiment, compound strictly between -0.05 and 0.05.

The previous result shows that the sentence does not contain any negative information (neg = 0). It has some neutral and positive tones (neu = 0.585 and pos = 0.415). The overall sentiment is positive, because the compound score (0.75) is above 0.05.
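To give a feel for where the compound value comes from, here is a minimal sketch of the normalization VADER applies to the summed word valences, x / sqrt(x² + α) with α = 15. The function name `normalize_valence` and the two example valences (2.7 for "perfect", -2.1 for "terrible") are assumptions made for illustration. They were picked because they reproduce the compound scores printed earlier for text_1 and text_2, where only one word carries sentiment.

```python
import math

def normalize_valence(total_valence, alpha=15):
    """Squash an unbounded valence sum into the open range (-1, 1)."""
    return total_valence / math.sqrt(total_valence ** 2 + alpha)

# Assumed single-word valences (illustrative, see the note above)
print(round(normalize_valence(2.7), 4))   # 0.5719
print(round(normalize_valence(-2.1), 4))  # -0.4767
```

Because of the square root in the denominator, the score moves quickly for the first sentiment-bearing words and then flattens as it approaches ±1.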

In [6]:
# What about this sentence with repeated exclamations and capitalization?
sentence_ = "VADER is a REALLY AMAZING library!!!!"
print(sent_analyzer.polarity_scores(sentence_))

{'neg': 0.0, 'neu': 0.373, 'pos': 0.627, 'compound': 0.8284}


As you can see from this example, the compound jumped to 0.83, so by this measure the sentence is more positive than the previous one (0.75).

In [7]:
# A last example with negative sentiment
negative_sent = "I do HATE those fake news on internet!!😡"
print(sent_analyzer.polarity_scores(negative_sent))

{'neg': 0.619, 'neu': 0.381, 'pos': 0.0, 'compound': -0.8449}


From this last sentence, we can see that the sentence does not contain any positive information (pos = 0). It has some neutral and negative tones (neu = 0.381 and neg = 0.619). The overall sentiment is negative, because the compound score (-0.8449) is below -0.05. Removing the exclamation marks will make the sentiment less negative.

## VADER on Large Dataset
We are going to use a license-free tweets dataset available from the Sentiment140 website to see how well VADER does. Along the way, we will define a helper function that returns the polarity label (positive, negative, or neutral) directly instead of the dictionary output.

In [8]:
import pandas as pd

# Read the data set
data_url = "https://raw.githubusercontent.com/keitazoumana/VADER_sentiment-Analysis/main/data/testdata.manual.2009.06.14.csv"
sentiment_data = pd.read_csv(data_url)

sentiment_data.head(5)

Unnamed: 0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right."
0,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
1,4,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
2,4,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
3,4,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...
4,4,8,Mon May 11 03:22:00 UTC 2009,kindle2,GeorgeVHulme,@richardebaker no. it is too big. I'm quite ha...


We are only interested in two columns. (The column names look odd because the file has no header row, so pandas used the first row as the header.)

* '4', corresponding to the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive).
* '@stellargi..right', corresponding to the actual tweet.

Let's format the dataset for better clarity.

In [9]:
def format_data(data):
    """Return a two-column frame: the tweet text and a readable polarity label."""

    last_col = str(data.columns[-1])
    first_col = str(data.columns[0])

    # Rename on a new frame instead of mutating the caller's DataFrame
    data = data.rename(columns={last_col: 'tweet_text', first_col: 'polarity'})

    # Change 0, 2, 4 to negative, neutral and positive
    labels = {0: 'negative', 2: 'neutral', 4: 'positive'}
    data['polarity'] = data['polarity'].map(labels)

    # Keep only the two columns; copy so new columns can be added safely later
    return data[['tweet_text', 'polarity']].copy()

In [10]:
data = format_data(sentiment_data)
data.head(3)

Unnamed: 0,tweet_text,polarity
0,Reading my kindle2... Love it... Lee childs i...,positive
1,"Ok, first assesment of the #kindle2 ...it fuck...",positive
2,@kenburbary You'll love your Kindle2. I've had...,positive


In [11]:
def format_output(output_dict):
    # Map the compound score to a label using the thresholds given earlier
    polarity = "neutral"

    if output_dict['compound'] >= 0.05:
        polarity = "positive"
    elif output_dict['compound'] <= -0.05:
        polarity = "negative"

    return polarity

def predict_sentiment(text):
    # Score the text with VADER and return only the polarity label
    output_dict = sent_analyzer.polarity_scores(text)
    return format_output(output_dict)

In [12]:
data["vader_prediction"] = data["tweet_text"].apply(predict_sentiment)

In [13]:
data.sample(5)

Unnamed: 0,tweet_text,polarity,vader_prediction
383,is upset about the whole GM thing. life as i k...,negative,negative
299,How do you use the twitter API?... http://bit....,neutral,neutral
53,annoying new trend on the internets: people p...,negative,negative
411,First dentist appointment [in years] on Wednes...,neutral,neutral
427,eating breakfast and then school,neutral,neutral


## VADER Performance on the Dataset
From the original polarity column and VADER's predictions we can compute the overall accuracy and a per-class classification report (precision, recall, and F1 score).
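As a reminder of what the per-class numbers mean, the following is a small sketch that computes precision, recall, and F1 by hand from raw counts. The helper name `per_class_metrics` and the toy labels are hypothetical and are not taken from the tweets dataset; scikit-learn's report should give the same per-class values for the same inputs.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels for illustration only
y_true = ["positive", "negative", "neutral", "positive", "negative", "positive"]
y_pred = ["positive", "neutral", "neutral", "positive", "negative", "negative"]

for label in ("negative", "neutral", "positive"):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label:>8}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

Precision answers "of the tweets predicted as this class, how many were right?", while recall answers "of the tweets that truly belong to this class, how many were found?".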

In [14]:
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(data['polarity'], data['vader_prediction'])

print("Accuracy: {}\n".format(accuracy))

# Show the classification report
print(classification_report(data['polarity'], data['vader_prediction']))

Accuracy: 0.716297786720322

              precision    recall  f1-score   support

    negative       0.84      0.64      0.72       177
     neutral       0.67      0.70      0.68       139
    positive       0.67      0.81      0.73       181

    accuracy                           0.72       497
   macro avg       0.73      0.71      0.71       497
weighted avg       0.73      0.72      0.72       497



The model seems to be doing a good job: with three classes, a uniform random guess would be right only about a third of the time (accuracy ≈ 0.33), whereas VADER reaches 0.72. The same observation can be made from the F1 scores of each polarity. Before diving into building machine learning models, it may be worth taking VADER as your baseline for such a task.
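For a three-class problem, two simple reference points are a uniform random guess and always predicting the most frequent class. The sketch below derives both from the support counts in the classification report; the variable names are illustrative.

```python
# Support counts copied from the classification report above
support = {"negative": 177, "neutral": 139, "positive": 181}
total = sum(support.values())

# Expected accuracy when each class is guessed with equal probability
uniform_random_accuracy = 1 / len(support)

# Accuracy obtained by always predicting the most frequent class
majority_label = max(support, key=support.get)
majority_accuracy = support[majority_label] / total

print(f"Uniform random guess: {uniform_random_accuracy:.3f}")
print(f"Always '{majority_label}': {majority_accuracy:.3f}")
```

Both reference points sit well below the accuracy VADER achieved on this dataset.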

# Bag of Words Vectorization-Based Models
In the previous approaches, such as VADER, we simply used existing Python libraries to perform sentiment analysis. Now we will discuss an approach in which we train our own model for the task. The steps involved in performing sentiment analysis using the Bag of Words Vectorization method are as follows:

* Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stopword removal, and stemming/lemmatization).
* Create a Bag of Words for the pre-processed text data using the <b>Count Vectorization</b> or <b>TF-IDF Vectorization</b> approach.
* Train a suitable classification model on the processed data for sentiment classification.
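The first two steps can be sketched in plain Python to show what a count-based Bag of Words looks like before any library is involved. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hand-picked set, stemming/lemmatization is skipped, and the helper names `preprocess` and `bag_of_words` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "is", "was", "and", "of"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [preprocess(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["The plot was great and the writing was great", "The pizza is terrible"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['great', 'pizza', 'plot', 'terrible', 'writing']
print(vectors)  # [[2, 0, 1, 0, 1], [0, 1, 0, 1, 0]]
```

Each document becomes a vector with one entry per vocabulary word, and word order is discarded, which is where the name "bag" comes from.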

## Code for Sentiment Analysis using Bag of Words Vectorization Approach:

To build a sentiment analysis model using the BoW vectorization approach we need a labeled dataset. The dataset used for this demonstration was obtained from Kaggle. We simply use sklearn's CountVectorizer to create the BoW. We then train a Multinomial Naive Bayes classifier, which reaches an accuracy score of about 0.69 on the held-out test set.

In [15]:
# Load the dataset
import pandas as pd
df = pd.read_csv('Finance_data.csv')

# Display the dataframe
df.head()

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral


In [16]:
# Pre-processing and Bag of Words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

# Tokenize on runs of letters and digits, dropping punctuation
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(df['Sentence'])

text_counts

<5842x11143 sparse matrix of type '<class 'numpy.int64'>'
	with 70957 stored elements in Compressed Sparse Row format>

In [17]:
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df['Sentiment'], test_size=0.25, random_state=5)

In [18]:
text_counts.shape, X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((5842, 11143), (4381, 11143), (1461, 11143), (4381,), (1461,))

In [19]:
# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
# Use a distinct name so the accuracy_score function imported earlier is not shadowed
mnb_accuracy = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ", mnb_accuracy)

Accuracy Score:  0.6851471594798083


The trained classifier can be used to predict the sentiment of any given text input.
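To make the train-then-predict flow concrete without any dependencies, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is an illustration of the idea, not scikit-learn's implementation; the helper names `train_nb` and `predict_nb` and the toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_nb(text, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words unseen during training are ignored
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data for illustration only
docs = ["profit rose sharply", "sales rose", "profit fell", "shares fell sharply"]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)

print(predict_nb("profit rose", *model))  # positive
print(predict_nb("shares fell", *model))  # negative
```

With the objects from the cells above, the corresponding call would be `MNB.predict(cv.transform([new_text]))`: the new text has to go through the already-fitted vectorizer (`transform`, not `fit_transform`) so that it is mapped onto the same vocabulary the classifier was trained on.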

## References:
* https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/
* https://colab.research.google.com/drive/1_Y7LhR6t0Czsk3UOS3BC7quKDFnULlZG?usp=sharing#scrollTo=Mq87u2brC9L0