## Sentiment Analysis 

In today's section, we will cover how to carry out an unsupervised, lexicon-based sentiment analysis. 

Before applying text analysis tools (sentiment analysis, topic modeling, etc.), we usually need to collect the text first, for example through web scraping or a Twitter API query. For a more comprehensive view of possible text analyses, see Duke University's research guide, Introduction to Text Analysis: https://guides.library.duke.edu/text_analysis

Some important resources to get you started on web scraping using Python's Beautiful Soup:
- https://opensource.com/article/21/9/web-scraping-python-beautiful-soup
- https://gitlab.com/ayush-sharma/example-assets/-/blob/fd7d2dfbfa3ca34103402993b35a61cbe943bcf3/programming/beautiful-soup/fetch.py
- https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
- https://stackoverflow.com/a/40629823 [A useful Stack Overflow answer on how to scrape multiple URLs instead of just one, as demonstrated in the examples above]

The Twitter API used to be open and free for academic research. Unfortunately, the Twitter API is currently down or about to start charging for queries.

Sentiment analysis, a popular Natural Language Processing (NLP) task, aims to classify text based on the sentiment, mood, or emotional implications of the words it contains. The classification can be by polarity (positive, negative, or neutral), but it can also target specific emotions (such as happy, sad, or angry).

Sentiment analysis can be carried out using various NLP algorithms.

A useful tutorial that applies both a lexicon-based (unsupervised) algorithm and supervised ML techniques to sentiment analysis, and also covers some of the necessary pre-processing steps: https://towardsdatascience.com/nlp-sentiment-analysis-for-beginners-e7897f976897

Another tutorial that covers unsupervised, lexicon-based techniques, then evaluates and compares their performance on movie review data: https://github.com/mohammed97ashraf/Sentiment-Analysis-Using-Unsupervised-Lexical-Models

Out of the techniques covered in the aforementioned tutorials, we will cover VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and simple rule-based model for sentiment analysis.

VADER efficiently handles the vocabulary, abbreviations, capitalization, repeated punctuation, and emoticons (😢 , 😃 , 😭 , etc.) that people commonly use on social media platforms to express sentiment, which makes it a great fit for **sentiment analysis of social media text.**

- No training needed; VADER is ready to use to assess the sentiment of any given text.
- The results of VADER include four scores: neg (negative), neu (neutral), pos (positive), and compound.
- neg, neu, and pos are the proportions of the text that fall into each category, so they should sum to 1 or approximately 1.
- compound is computed by summing the valence (polarity) scores of the words in the text that appear in the lexicon, adjusted according to VADER's rules. It captures the *degree* of the sentiment, not just its direction.
    - The sum is then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).
- We can use the compound score to determine the overall sentiment. A good rule of thumb:
    - positive sentiment: compound ≥ 0.05
    - negative sentiment: compound ≤ -0.05
    - neutral sentiment: -0.05 < compound < 0.05
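
The normalization and the rule of thumb above can be sketched in a few lines. This is a minimal sketch, assuming the normalization used in the reference vaderSentiment implementation, x / sqrt(x² + α) with α = 15. The function names `normalize_compound` and `label_from_compound` are illustrative and not part of any library.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded sum of valence scores into [-1, 1].

    Assumption: mirrors the x / sqrt(x^2 + alpha) normalization of the
    reference vaderSentiment implementation, with alpha = 15.
    """
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, score))

def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a label using the rule of thumb above."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical raw valence sums, chosen only to show the shape of the mapping.
for raw in (-6.0, -0.1, 0.0, 0.1, 6.0):
    compound = normalize_compound(raw)
    print(f"raw sum {raw:+.1f} -> compound {compound:+.4f} -> {label_from_compound(compound)}")
```

Small raw sums stay inside the neutral band, while large sums approach but never exceed -1 or +1.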

In [3]:
# Importing libraries

import nltk # Natural Language Toolkit, documentation: https://www.nltk.org/
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Download VADER lexicon

nltk.download("vader_lexicon")

# Import the analyzer class that uses the lexicon

from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\noura\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


A senti-lexicon is, in very simple terms, a dictionary of words and their associated sentiment. 

The VADER Lexicon documentation can be found here: https://github.com/cjhutto/vaderSentiment

Documentation for nltk.sentiment.vader: https://www.nltk.org/api/nltk.sentiment.vader.html#module-nltk.sentiment.vader
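
To make the idea of a senti-lexicon concrete, here is a minimal sketch of a dictionary lookup. The words and valence values in `toy_lexicon` are made up for illustration and are not taken from the actual VADER lexicon; `raw_valence` is a hypothetical helper, and VADER itself layers additional rules (for capitalization, punctuation, negation, and so on) on top of this kind of lookup.

```python
# Toy senti-lexicon: the words and valence values are made up for
# illustration and are NOT taken from the actual VADER lexicon.
toy_lexicon = {
    "great": 3.0,
    "love": 3.0,
    "good": 2.0,
    "bad": -2.5,
    "hate": -3.0,
}

def raw_valence(text, lexicon):
    """Sum the valence of every known word; unknown words count as 0."""
    words = text.lower().split()
    # Strip trailing/leading punctuation so "great!" matches "great".
    return sum(lexicon.get(word.strip(".,!?"), 0.0) for word in words)

print(raw_valence("I love this, it is great!", toy_lexicon))
print(raw_valence("I hate bad coffee.", toy_lexicon))
print(raw_valence("The meeting is at noon.", toy_lexicon))
```

A text with no lexicon words gets a raw valence of zero, which is why purely factual sentences tend to come out as neutral.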

In [4]:
# Creating an instance of the SentimentIntensityAnalyzer imported above

SentimentAnalyzer = SentimentIntensityAnalyzer()

An important method of the SentimentIntensityAnalyzer:

**SentimentIntensityAnalyzer.polarity_scores()** returns the polarity scores (neg, neu, pos, and compound) of a given text

In [8]:
# Some examples to demonstrate our SentimentAnalyzer:

sentence1 = "VADER is great at identifying the sentiment of a social media post!"

print("Sentence 1 has polarity scores of", SentimentAnalyzer.polarity_scores(sentence1))

sentence2 =  "VADER is a REALLY AMAZING library!!!!"

print("Sentence 2 has polarity scores of", SentimentAnalyzer.polarity_scores(sentence2))

sentence3= "I HATE fake news on the internet, so frustrating!!"

print("Sentence 3 has polarity scores of", SentimentAnalyzer.polarity_scores(sentence3))

Sentence 1 has polarity scores of {'neg': 0.0, 'neu': 0.695, 'pos': 0.305, 'compound': 0.6588}
Sentence 2 has polarity scores of {'neg': 0.0, 'neu': 0.373, 'pos': 0.627, 'compound': 0.8284}
Sentence 3 has polarity scores of {'neg': 0.703, 'neu': 0.297, 'pos': 0.0, 'compound': -0.9163}


In [9]:
# Next, apply VADER to a license-free tweet dataset available at http://help.sentiment140.com/for-students

data_url = "https://raw.githubusercontent.com/keitazoumana/VADER_sentiment-Analysis/main/data/testdata.manual.2009.06.14.csv"

sentiment_data = pd.read_csv(data_url)


In [10]:
sentiment_data.head()

| | 4 | 3 | Mon May 11 03:17:40 UTC 2009 | kindle2 | tpryan | @stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right. |
|---|---|---|---|---|---|---|
| 0 | 4 | 4 | Mon May 11 03:18:03 UTC 2009 | kindle2 | vcu451 | Reading my kindle2... Love it... Lee childs i... |
| 1 | 4 | 5 | Mon May 11 03:18:54 UTC 2009 | kindle2 | chadfu | Ok, first assesment of the #kindle2 ...it fuck... |
| 2 | 4 | 6 | Mon May 11 03:19:04 UTC 2009 | kindle2 | SIX15 | @kenburbary You'll love your Kindle2. I've had... |
| 3 | 4 | 7 | Mon May 11 03:21:41 UTC 2009 | kindle2 | yamarama | @mikefish Fair enough. But i have the Kindle2... |
| 4 | 4 | 8 | Mon May 11 03:22:00 UTC 2009 | kindle2 | GeorgeVHulme | @richardebaker no. it is too big. I'm quite ha... |


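We only care about two columns: the last one, which holds the actual tweet, and the first one (labeled 4 above), which holds the polarity of the tweet. In this dataset, polarity is encoded as 0 for negative, 2 for neutral, and 4 for positive. The column labels above look like data because the file has no header row, so pandas used the first tweet as the column names.

The dataset is annotated with the polarity of each tweet so that it can be used to validate different algorithms.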
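
When a CSV file has no header row, as the `head()` output above suggests is the case here, one option is to say so explicitly and supply column names. Below is a minimal sketch that uses two inline rows instead of the download. The first row is a shortened version of the first tweet shown above, the second row is made up, and the column names are my own labels for the six fields rather than names defined by the dataset.

```python
import io
import pandas as pd

# Two inline rows in the same six-field layout as the downloaded file.
# Row 1 is shortened from the output above; row 2 is a made-up example.
raw_csv = io.StringIO(
    '4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,"@stellargirl I loooooooovvvvvveee my Kindle2."\n'
    '0,9,Mon May 11 03:30:00 UTC 2009,example,some_user,"This is a made-up negative tweet."\n'
)

# Hypothetical column names chosen for readability.
column_names = ["polarity", "tweet_id", "date", "query", "user", "tweet_text"]

# header=None keeps the first line as data instead of using it as column names.
tweets = pd.read_csv(raw_csv, header=None, names=column_names)

# Same label mapping as described above: 0, 2, 4 -> negative, neutral, positive.
labels = {0: "negative", 2: "neutral", 4: "positive"}
tweets["polarity"] = tweets["polarity"].map(labels)

print(tweets[["tweet_text", "polarity"]])
```

The outputs shown in this section were produced with the default header handling, so applying this variant to the real file would presumably add one row to the counts.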

In [11]:
def format_data(data):

  last_col = str(data.columns[-1])
  first_col = str(data.columns[0])

  data.rename(columns = {last_col: 'tweet_text', first_col: 'polarity'}, inplace=True) 

  # Change 0, 2, 4 to negative, neutral and positive
  labels = {0: 'negative', 2: 'neutral', 4: 'positive'}
  data['polarity'] = data['polarity'].map(labels)

  # Get only the two columns
  return data[['tweet_text', 'polarity']]

# Apply the transformation
data = format_data(sentiment_data)
data.head(3)

| | tweet_text | polarity |
|---|---|---|
| 0 | Reading my kindle2... Love it... Lee childs i... | positive |
| 1 | Ok, first assesment of the #kindle2 ...it fuck... | positive |
| 2 | @kenburbary You'll love your Kindle2. I've had... | positive |


In [22]:
def format_output(output_dict):
  """Map VADER's compound score to a polarity label."""
  polarity = "neutral"

  if output_dict['compound'] >= 0.05:
    polarity = "positive"

  elif output_dict['compound'] <= -0.05:
    polarity = "negative"

  return polarity

def predict_sentiment1(text):
  """Return the predicted polarity label for a text."""
  output_dict = SentimentAnalyzer.polarity_scores(text)
  return format_output(output_dict)

def predict_sentiment2(text):
  """Return the full dictionary of VADER scores for a text."""
  return SentimentAnalyzer.polarity_scores(text)

# Run the predictions: one column with the label, one with the raw scores
data["vader_prediction_polarity1"] = data.loc[:,("tweet_text")].apply(predict_sentiment1)
data["vader_prediction_polarity2"] = data.loc[:,("tweet_text")].apply(predict_sentiment2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["vader_prediction_polarity1"] = data.loc[:,("tweet_text")].apply(predict_sentiment1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["vader_prediction_polarity2"] = data.loc[:,("tweet_text")].apply(predict_sentiment2)
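
The message above is pandas' SettingWithCopyWarning: `format_data` returns a column subset of `sentiment_data`, and adding new columns to that subset is ambiguous to pandas. One common way to avoid it is to return an explicit copy. Below is a minimal sketch with a tiny made-up DataFrame; `select_columns` is a hypothetical stand-in for `format_data`.

```python
import pandas as pd

def select_columns(frame):
    """Return the two columns of interest as an independent DataFrame."""
    # .copy() makes the result its own object, so adding columns to it
    # later is unambiguous and leaves the original frame untouched.
    return frame[["tweet_text", "polarity"]].copy()

# Made-up rows for illustration only.
original = pd.DataFrame({
    "tweet_text": ["what a good day", "this is awful"],
    "polarity": ["positive", "negative"],
    "user": ["a", "b"],
})

subset = select_columns(original)
subset["n_words"] = subset["tweet_text"].str.split().str.len()

print(subset)
print(list(original.columns))  # the original frame keeps its three columns
```

The predictions themselves are unaffected by the warning; the copy only makes the intent explicit.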


In [25]:
# Show the full text of each cell instead of truncating it
pd.set_option("display.max_colwidth", None)

# Show 5 random rows of the data
data.sample(5)

| | tweet_text | polarity | vader_prediction_polarity1 | vader_prediction_polarity2 |
|---|---|---|---|---|
| 403 | yankees won mets lost. its a good day. | positive | positive | {'neg': 0.178, 'neu': 0.31, 'pos': 0.512, 'compound': 0.6486} |
| 255 | @uscsports21 LeBron is a monsta and he is only 24. SMH The world ain't ready. | positive | negative | {'neg': 0.3, 'neu': 0.7, 'pos': 0.0, 'compound': -0.6301} |
| 333 | ugh. the amount of times these stupid insects have bitten me. Grr.. | negative | negative | {'neg': 0.383, 'neu': 0.617, 'pos': 0.0, 'compound': -0.7351} |
| 16 | i love lebron. http://bit.ly/PdHur | positive | positive | {'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.6369} |
| 409 | this dentist's office is cold :/ | negative | negative | {'neg': 0.324, 'neu': 0.676, 'pos': 0.0, 'compound': -0.34} |


In [27]:
# Evaluating VADER based on original polarity and VADER-predicted polarity

accuracy = accuracy_score(data['polarity'], data['vader_prediction_polarity1'])

print("Accuracy: {}\n".format(accuracy))

# Show the classification report

print(classification_report(data['polarity'], data['vader_prediction_polarity1']))

Accuracy: 0.716297786720322

              precision    recall  f1-score   support

    negative       0.84      0.64      0.72       177
     neutral       0.67      0.70      0.68       139
    positive       0.67      0.81      0.73       181

    accuracy                           0.72       497
   macro avg       0.73      0.71      0.71       497
weighted avg       0.73      0.72      0.72       497



- Precision answers the following question: of all the tweets predicted to belong to a class, what proportion actually belongs to it?

- Recall answers the following question: of all the tweets that actually belong to a class, what proportion was identified correctly? For example, the negative class has a precision of 0.84 and a recall of 0.64: when VADER labels a tweet as negative it is right 84% of the time, but it finds only 64% of the truly negative tweets.

- The F1 score (an ML metric used to evaluate classification models) is the harmonic mean of precision and recall, so it is high only when both are high, that is, when there are few false positives and few false negatives. It ranges from 0 (the model is a total failure) to 1 (a perfect score).
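
The three definitions above can be computed by hand for a single class. This is a minimal sketch in plain Python on made-up labels, not the tweet data; `per_class_metrics` is a hypothetical helper written for illustration, while `classification_report` performs the equivalent computation for every class at once.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall and F1 for one label, treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Made-up labels for illustration only.
y_true = ["positive", "positive", "negative", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

for label in ("negative", "neutral", "positive"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label:>8}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy data, one positive tweet is mislabeled as negative, which lowers the precision of the negative class and the recall of the positive class while leaving the neutral class untouched.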