# Sentiment analysis - alternative ways of computing sentiment

## What will we cover?

In class, we mostly work with sentiment analysis conducted with SentiStrength. We have provided you with sentiment analysis results for all Twitter data collected for this class.

However, SentiStrength is only one of the common off-the-shelf methods available. In this bonus tutorial, we introduce another sentiment analysis tool, VADER. You can use this tool in GC2 if you want to work with YouTube data.

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based tool for analyzing the sentiment expressed in text. It assigns a polarity score to individual words based on their sentiment orientation (positive, negative, or neutral) and intensity, and then combines these scores into an overall sentiment score for the entire text. The tool also takes cues such as capitalization and punctuation into account to improve accuracy.

VADER's lexicon contains words and phrases manually annotated with sentiment scores, which are empirically derived and validated. It also incorporates rules to handle linguistic nuances and context-specific sentiment expressions. It is available as part of the Natural Language Toolkit (NLTK) library in Python.

Let's see how to apply it to Twitter data!

### Opening Twitter data and selecting relevant columns

In [2]:
import pandas as pd

In [3]:
#Opening file - note I specify the path to the file
df = pd.read_json('/Users/jstrych1/Documents/Digital_Analytics/2324s2_Joanna/Data/TheoAraujo.jsonl', lines=True)

In [4]:
df.head()

,data,includes,meta,__twarc,errors
0,[{'text': 'RT @I_Am_The_ICT: Screenshot this p...,"{'users': [{'username': 'Awoken_Soul_', 'prote...","{'newest_id': '1570085814817722368', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
1,[{'text': 'RT @jeffreyboadi_: All those exclai...,"{'users': [{'username': 'liligotthekeys', 'pro...","{'newest_id': '1570084255195070468', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'resource_id': '1569959556025036800', 'param..."
2,"[{'lang': 'en', 'id': '1570082754871582721', '...",{'users': [{'public_metrics': {'followers_coun...,"{'newest_id': '1570082754871582721', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
3,"[{'public_metrics': {'retweet_count': 130, 're...",{'users': [{'profile_image_url': 'https://pbs....,"{'newest_id': '1570082705618120705', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
4,"[{'id': '1570082631567691776', 'source': 'Twit...",{'users': [{'public_metrics': {'followers_coun...,"{'newest_id': '1570082631567691776', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
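
The `lines=True` argument tells pandas that the file is in JSON Lines format, with one JSON object per line. The sketch below illustrates this on a tiny in-memory sample instead of the real file. The two records and their values are made up for illustration and only mimic the `data`/`meta` layout visible in the output above.

```python
import io

import pandas as pd

# Two hypothetical response pages, one JSON object per line (JSON Lines).
sample = (
    '{"data": [{"id": "1", "text": "first tweet"}], "meta": {"result_count": 1}}\n'
    '{"data": [{"id": "2", "text": "second tweet"}, '
    '{"id": "3", "text": "third tweet"}], "meta": {"result_count": 2}}\n'
)

# lines=True parses each line as a separate record, giving one row per page.
pages = pd.read_json(io.StringIO(sample), lines=True)

print(pages.shape)                 # two rows (pages), two columns (data, meta)
print(len(pages.loc[1, "data"]))   # number of tweets stored in the second page
```

Each row therefore holds a whole page of tweets, which is why the `data` column has to be unnested before the individual tweets can be analyzed.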


In [5]:
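def get_public_metrics(row):
    # Copy each entry of the nested 'public_metrics' dict into its own
    # 'metric_<name>' column (e.g. metric_retweet_count)
    metrics = row.get('public_metrics')
    if isinstance(metrics, dict):
        for key, value in metrics.items():
            row['metric_' + str(key)] = value
    return row

def get_tweets(df):
    # Each row of df['data'] holds a list of tweet dicts (one page of results)
    if 'data' not in df.columns:
        return None
    # Build one DataFrame per page and concatenate them once at the end
    frames = [pd.DataFrame(item) for item in df['data'].values.tolist()]
    if not frames:
        return pd.DataFrame()
    results = pd.concat(frames)

    results = results.apply(get_public_metrics, axis=1)

    # Replace the per-page index with a clean 0..n-1 index
    results = results.reset_index(drop=True)

    return results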

In [6]:
#Unnesting the file
tweets = get_tweets(df)

In [7]:
tweets.head()

,text,entities,public_metrics,reply_settings,lang,source,possibly_sensitive,author_id,referenced_tweets,conversation_id,created_at,id,attachments,context_annotations,in_reply_to_user_id,geo,metric_retweet_count,metric_reply_count,metric_like_count,metric_quote_count
0,RT @I_Am_The_ICT: Screenshot this please so wh...,"{'mentions': [{'start': 3, 'end': 16, 'usernam...","{'retweet_count': 239, 'reply_count': 0, 'like...",everyone,en,Twitter for Android,False,1206072760239349761,"[{'type': 'retweeted', 'id': '1568180588191911...",1570085814817722368,2022-09-14T16:23:23.000Z,1570085814817722368,,,,,239,0,0,0
1,Do you have difficulty in preparing neetpg cho...,"{'urls': [{'start': 268, 'end': 291, 'url': 'h...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter Web App,False,1569551671969333249,,1570085811206627329,2022-09-14T16:23:22.000Z,1570085811206627329,{'media_keys': ['7_1570085212662669313']},,,,0,0,1,0
2,UHM? jd probably there cause i was rting shit ...,"{'urls': [{'start': 99, 'end': 122, 'url': 'ht...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1002918664113442816,,1570085801098383360,2022-09-14T16:23:20.000Z,1570085801098383360,{'media_keys': ['3_1570085798472568832']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",,,0,0,0,0
3,cant believe i missed so many q2han videos 😢 i...,,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,748598123732185088,,1570085777710907392,2022-09-14T16:23:14.000Z,1570085777710907392,,,,,0,0,0,0
4,@8x5tl8 The holy algorithm does that. :) When ...,"{'mentions': [{'start': 0, 'end': 7, 'username...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for iPhone,False,1035008103987769346,"[{'type': 'replied_to', 'id': '157007157427923...",1570071574279237636,2022-09-14T16:23:05.000Z,1570085737869070339,,,23709361.0,,0,0,0,0
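
The unnesting step can also be sketched with pandas built-ins. The example below is a minimal illustration on made-up records (not the course data): `explode` yields one tweet dictionary per row, and `pd.json_normalize` flattens the nested `public_metrics` dictionary into separate columns. The final rename is only there to mimic the `metric_` prefix used by `get_tweets`.

```python
import pandas as pd

# Hypothetical pages mimicking the structure of the 'data' column.
pages = pd.DataFrame({
    "data": [
        [{"id": "1", "text": "first",
          "public_metrics": {"retweet_count": 2, "like_count": 0}}],
        [{"id": "2", "text": "second",
          "public_metrics": {"retweet_count": 5, "like_count": 1}},
         {"id": "3", "text": "third",
          "public_metrics": {"retweet_count": 0, "like_count": 4}}],
    ]
})

# One tweet dict per list element, then flatten the nested dicts into columns.
tweet_dicts = pages["data"].explode().tolist()
flat = pd.json_normalize(tweet_dicts)

# json_normalize names nested fields 'public_metrics.<name>'; rename them.
flat = flat.rename(columns=lambda c: c.replace("public_metrics.", "metric_"))

print(flat.columns.tolist())
```

Unlike `get_tweets`, this variant does not keep the original `public_metrics` column, so it is an alternative rather than a drop-in replacement.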


In [8]:
#selecting relevant columns
tweets_texts = tweets[['text', 'id']]

In [9]:
tweets_texts.head()

,text,id
0,RT @I_Am_The_ICT: Screenshot this please so wh...,1570085814817722368
1,Do you have difficulty in preparing neetpg cho...,1570085811206627329
2,UHM? jd probably there cause i was rting shit ...,1570085801098383360
3,cant believe i missed so many q2han videos 😢 i...,1570085777710907392
4,@8x5tl8 The holy algorithm does that. :) When ...,1570085737869070339


### Applying sentiment analysis

Let's apply VADER to the text of the tweets. You need the `nltk` package to use VADER; if it is not installed yet, install it first by running `! pip install nltk`.

You also need to download, through `nltk`, the VADER lexicon that defines the sentiment scores of words.

In [11]:
! pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting click (from nltk)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2023.12.25-cp38-cp38-macosx_11_0_arm64.whl.metadata (40 kB)
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Downloading regex-2023.12.25-cp38-cp38-macosx_11_0_arm64.whl (291 kB)

In [12]:
import nltk

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jstrych1/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [13]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

In [15]:
analyzer.polarity_scores("I really LOVE strawberry ice cream!!!")

{'neg': 0.0, 'neu': 0.396, 'pos': 0.604, 'compound': 0.7965}

In [17]:
analyzer.polarity_scores("I really like strawberry ice cream!!!")['pos']

0.478
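
To see where the compound score of 0.7965 for the first sentence comes from, the arithmetic can be sketched by hand. The constants below (the lexicon valence of "love", the increments for capitalization, the intensifier "really", and exclamation marks, and the normalization constant) are quoted from memory of the VADER reference implementation and should be treated as assumptions; check them against the lexicon and source code of the version you have installed.

```python
import math

# Assumed constants from the VADER reference implementation (verify locally).
LOVE_VALENCE = 3.2             # lexicon score for "love"
CAPS_INCREMENT = 0.733         # added when a word is ALL CAPS in mixed-case text
BOOSTER_INCREMENT = 0.293      # added for a preceding intensifier such as "really"
EXCLAMATION_INCREMENT = 0.292  # added per "!" (only the first few are counted)
ALPHA = 15                     # normalization constant


def normalize(score, alpha=ALPHA):
    """Map an unbounded valence sum to the interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# "I really LOVE strawberry ice cream!!!"
raw = (LOVE_VALENCE + CAPS_INCREMENT + BOOSTER_INCREMENT
       + 3 * EXCLAMATION_INCREMENT)

print(round(normalize(raw), 4))  # 0.7965
```

Under these assumptions, the only word carrying sentiment is "love"; the capitalization, the intensifier, and the exclamation marks each push the raw sum upwards before it is squashed into the compound range. The second sentence, with lowercase "like", starts from a lower valence and gets no capitalization bonus, which is consistent with its lower positive score.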

In [19]:
def get_vader_score(row):
    # Called once per row: score the text a single time and reuse the result
    scores = analyzer.polarity_scores(row['text'])
    row['negative'] = scores['neg']
    row['positive'] = scores['pos']
    row['neutral'] = scores['neu']
    row['compound'] = scores['compound']
    return row

In [20]:
tweets_texts = tweets_texts.apply(get_vader_score, axis=1)

In [21]:
tweets_texts.head()

,text,id,negative,positive,neutral,compound
0,RT @I_Am_The_ICT: Screenshot this please so wh...,1570085814817722368,0.0,0.187,0.813,0.5893
1,Do you have difficulty in preparing neetpg cho...,1570085811206627329,0.123,0.051,0.826,-0.4515
2,UHM? jd probably there cause i was rting shit ...,1570085801098383360,0.338,0.0,0.662,-0.8253
3,cant believe i missed so many q2han videos 😢 i...,1570085777710907392,0.18,0.092,0.729,-0.4238
4,@8x5tl8 The holy algorithm does that. :) When ...,1570085737869070339,0.05,0.126,0.824,0.5106
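
The same four columns can also be produced by scoring the `text` column directly and expanding the resulting dictionaries into columns, which avoids passing whole rows through `apply`. The sketch below is one possible way to do this. `add_scores` is a hypothetical helper, and `toy_scorer` is a stand-in that returns fixed values only so the example runs without NLTK; it does not compute any sentiment.

```python
import pandas as pd


def toy_scorer(text):
    """Hypothetical stand-in with the same output keys as polarity_scores."""
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


def add_scores(frame, scorer, text_col="text"):
    # Score each text once; every result is a dict with the same four keys.
    score_dicts = frame[text_col].map(scorer).tolist()
    # Turn the list of dicts into columns aligned on the original index.
    scores = pd.DataFrame(score_dicts, index=frame.index)
    scores = scores.rename(
        columns={"neg": "negative", "pos": "positive", "neu": "neutral"}
    )
    return frame.join(scores)


demo = pd.DataFrame({"text": ["first text", "second text"], "id": ["1", "2"]})
print(add_scores(demo, toy_scorer).columns.tolist())
# ['text', 'id', 'negative', 'neutral', 'positive', 'compound']
```

With NLTK installed and the lexicon downloaded, calling `add_scores(tweets_texts, analyzer.polarity_scores)` would score the real tweets instead.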


In [22]:
tweets_texts.dtypes

text         object
id           object
negative    float64
positive    float64
neutral     float64
compound    float64
dtype: object
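
For reporting, the `compound` column is often converted into categorical labels. The sketch below uses the cut-offs commonly recommended for VADER (0.05 and -0.05). These thresholds are a convention, not something fixed by the tool, so treat them as an assumption and adjust them if your analysis requires it. The five compound values are copied from the output of `tweets_texts.head()` above.

```python
import numpy as np
import pandas as pd

# Compound scores of the first five tweets, copied from the output above.
scores = pd.DataFrame({"compound": [0.5893, -0.4515, -0.8253, -0.4238, 0.5106]})

# Assumed thresholds: >= 0.05 positive, <= -0.05 negative, otherwise neutral.
conditions = [scores["compound"] >= 0.05, scores["compound"] <= -0.05]
scores["label"] = np.select(conditions, ["positive", "negative"], default="neutral")

print(scores["label"].value_counts().to_dict())
# {'negative': 3, 'positive': 2}
```

On the full dataset, the same two lines applied to `tweets_texts` would give the share of positive, negative, and neutral tweets.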