### Part 2: NLP learning based methods
#### VADER
##### Q1: Briefly explain how this method works

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a freely available Python package that provides a lexicon- and rule-based sentiment analysis tool. It is often applied to social media data such as tweets to determine whether the words or statements in a piece of text carry a positive, negative or neutral sentiment.

The VADER lexicon is a list of words and phrases with sentiment ratings. Ten independent human raters scored over 9,000 token features on a scale from -4 (extremely negative) to 4 (extremely positive). Quality control was ensured by keeping only lexical features with a non-zero mean rating and a standard deviation below 2.5. As a result, VADER contains over 7,500 lexical token features whose scores indicate both the valence (score > 0 for positive, score < 0 for negative) and the sentiment intensity within the range mentioned above. For example, the word “good” has positive valence and a sentiment intensity score of 1.9.

In particular, VADER makes a raw categorization of each word into a positive, negative or neutral category. Given a sentence as input, VADER scores these categories by the proportion of the text that falls into each of them. As a result, the positive, negative and neutral scores add up to 1.

Moreover, it is important to note that these proportions are only the raw categorizations that the lexicon assigns to each word in the text. They do not include the VADER rule-based enhancements such as degree modifiers, word-order sensitivity for sentiment-laden multi-word phrases, word-shape amplifiers etc., which are described later.
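The ratio idea can be illustrated with a toy sketch. This is a deliberately simplified illustration, not the package's actual computation: it counts tokens by the sign of their lexicon valence, whereas the real package also weights tokens by valence magnitude, so its numbers will differ. `TOY_LEXICON` and `category_shares` are hypothetical names; only the value for "good" comes from the text above.

```python
# Toy lexicon: only "good" (1.9) is taken from the text above; the other
# entry is a hypothetical value chosen for illustration.
TOY_LEXICON = {"good": 1.9, "awful": -2.0}

def category_shares(tokens, lexicon=TOY_LEXICON):
    """Share of tokens categorized as negative, neutral and positive."""
    pos = sum(1 for t in tokens if lexicon.get(t.lower(), 0.0) > 0)
    neg = sum(1 for t in tokens if lexicon.get(t.lower(), 0.0) < 0)
    neu = len(tokens) - pos - neg      # everything not in the lexicon
    total = len(tokens)
    return {"neg": neg / total, "neu": neu / total, "pos": pos / total}

shares = category_shares("the food was good but the service was awful".split())
print(shares)
```

Whatever the input, the three shares sum to 1, which is the property described above.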

These rule-based enhancements are expressed in the compound score as described in the following.

To evaluate the sentiment of a whole sentence, the compound score is computed by summing the valence scores of all words in the sentence, each adjusted according to the valence rules (e.g. word-order sensitivity), and then normalizing this sum to a value between -1 (very negative sentiment) and +1 (very positive sentiment). This yields a useful unidimensional score for the overall sentiment of a sentence.
The authors recommend the following thresholds for interpreting the compound score:

1. positive sentiment: compound score >= 0.05
2. neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
3. negative sentiment: compound score <= -0.05
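The normalization and the thresholds above can be sketched as follows. This assumes the normalization takes the form x / sqrt(x² + alpha) with alpha = 15, which is our understanding of the reference implementation; treat both the form and the constant as assumptions. `normalize` and `classify` are illustrative names, not part of the package API.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    # alpha = 15 is an assumption about the reference implementation.
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def classify(compound):
    """Apply the recommended thresholds to a compound score."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# A sentence consisting only of "good" (valence 1.9, no rule adjustments).
print(classify(normalize(1.9)))  # positive
```

Under these assumptions a single "good" is normalized to roughly 0.44, well above the 0.05 threshold.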

Furthermore, here are examples of typical valence rules for sentiment analysis and of the types of text VADER can deal with:

- Typical negations (e.g. “not good”)
- Contractions as negations (e.g. “wasn’t very good”)
- Punctuation that signals increased sentiment intensity (e.g. “Good!!!!!!”)
- Word-shape (e.g. “BAAAAAD”, i.e. ALL CAPS for words/phrases)
- Degree modifiers that alter sentiment intensity (e.g. intensity boosters like “very” or dampeners like “kind of”)
- Sentiment-laden slang (e.g. “sux”)
- Sentiment-laden emoticons (e.g. “:)” or “:D”)
- UTF-8 encoded emojis
- Initialisms and acronyms (e.g. “lol”)

As a last remark, VADER also works in conjunction with NLTK, so that longer texts such as paragraphs or articles can be decomposed into sentences and analyzed at the sentence level.
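A minimal sketch of such a sentence-level analysis is given here. It uses a naive regular-expression splitter as a stand-in for a proper sentence tokenizer such as the one in NLTK, and it takes the scoring function as a parameter so that it does not depend on any installed package. `split_sentences` and `mean_compound` are hypothetical helper names.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def mean_compound(text, score_fn):
    """Average a per-sentence score over all sentences of a longer text."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(score_fn(s) for s in sentences) / len(sentences)

print(split_sentences("I love it! It is bad. Ok?"))
```

With the analyzer introduced below, `score_fn` could be `lambda s: analyzer.polarity_scores(s)["compound"]`. Averaging is only one possible aggregation; the minimum or maximum sentence score may be more informative for some texts.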



##### Q2: Provide a code snippet detailing how to use it for our task

In light of what you have learned about this method, reflect on pre-processing steps that might be unnecessary when using VADER.


###### Installing VADER package 


In [1]:
! pip install vaderSentiment

Defaulting to user installation because normal site-packages is not writeable


###### Importing the package

In [2]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

###### Analyzing dummy example 

In [3]:
analyzer = SentimentIntensityAnalyzer()

text = "I love the course Machine Learning for Healthcare! It's amazing!"

scores = analyzer.polarity_scores(text)

print(scores)

{'neg': 0.0, 'neu': 0.482, 'pos': 0.518, 'compound': 0.8619}


As one can see, VADER can process whole sentences directly. It performs steps such as tokenization internally, so this part of our pre-processing no longer seems necessary.

Additionally, the sentence has a compound score of 0.8619, which means that it has positive valence and a very high positive sentiment intensity (the maximum is 1).

In [4]:
import pandas as pd 
# loading raw data and pre-processed/cleaned data
raw_data = pd.read_csv('Data/TweetsCOV19.csv')
cleaned_data = pd.read_csv('Data/cleaned_data.csv')



In [5]:
#loading 10 sample Tweets

texts = raw_data['TweetText'][:10]
texts_cleaned = cleaned_data['TweetText'][:10]


print('raw texts')
print(texts)
print('---------------------')
print('clean texts')
print(texts_cleaned)

raw texts
0    From my blog: Californians support vaccine law...
1    Secretary of State should recall Stormont next...
2    While serving in Afghanistan in 2010, Marine C...
3    witch vixen season starts tomorrow. you all sh...
4    CGTN on the scene: \n\nAround 15,000 troops, 3...
5    Looking like it may be a fall-like weekend com...
6    i stopped caring what niggas think when i real...
7    #LIVE: Chaos expected on Oct 1 across Hong Kon...
8    I hold @kie_vs_theworld personally responsible...
9                                @FuckITripped Exactly
Name: TweetText, dtype: object
---------------------
clean texts
0    from my blog californian support vaccine law –...
1    secretary of state should recall stormont next...
2    while serving in afghanistan in 2010 marine co...
3    witch vixen season start tomorrow you all shou...
4    cgtn on the scene around 15000 troop 32 equipm...
5    looking like it may be a falllike weekend comi...
6    i stopped caring what nigga think when i

In [6]:
for text in texts: 
    score = analyzer.polarity_scores(text)
    print("{:-<65} {}".format(text, str(score)))

From my blog: Californians support vaccine laws – new poll diminishes anti-vaxxer power https://t.co/d5BaAda3ki {'neg': 0.0, 'neu': 0.828, 'pos': 0.172, 'compound': 0.4019}
Secretary of State should recall Stormont next Monday at 10am.
Those MLA’s who refuse to turn up (Whatever Party) s… https://t.co/ZdArVTrKar {'neg': 0.092, 'neu': 0.795, 'pos': 0.113, 'compound': 0.128}
While serving in Afghanistan in 2010, Marine Corporal Brandon Rumbaug was carrying a fellow Marine to safety when h… https://t.co/Dipa5CbN1A {'neg': 0.0, 'neu': 0.872, 'pos': 0.128, 'compound': 0.4215}
witch vixen season starts tomorrow. you all should be receiving the spell that turns you into your witchsona at mid… https://t.co/ruYcgfoSdI {'neg': 0.111, 'neu': 0.889, 'pos': 0.0, 'compound': -0.3612}
CGTN on the scene: 

Around 15,000 troops, 32 equipment units and 12 air formations composed of over 160 aircraft a… https://t.co/btVQ5kDgAQ {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Looking like it may be a

In [7]:
for text in texts_cleaned: 
    
    score = analyzer.polarity_scores(text)
    
    print("{:-<65} {}".format(text, str(score)))

from my blog californian support vaccine law – new poll diminishes antivaxxer power {'neg': 0.0, 'neu': 0.816, 'pos': 0.184, 'compound': 0.4019}
secretary of state should recall stormont next monday at 10am those mla ’ s who refuse to turn up whatever party s … {'neg': 0.085, 'neu': 0.811, 'pos': 0.104, 'compound': 0.128}
while serving in afghanistan in 2010 marine corporal brandon rumbaug wa carrying a fellow marine to safety when h … {'neg': 0.0, 'neu': 0.872, 'pos': 0.128, 'compound': 0.4215}
witch vixen season start tomorrow you all should be receiving the spell that turn you into your witchsona at mid … {'neg': 0.111, 'neu': 0.889, 'pos': 0.0, 'compound': -0.3612}
cgtn on the scene around 15000 troop 32 equipment unit and 12 air formation composed of over 160 aircraft a … {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
looking like it may be a falllike weekend coming up dry cold front look to move through the area friday with hig … {'neg': 0.0, 'neu': 0.898, 'pos': 0.102, 'c

It seems that VADER can process the raw data without pre-processing and is therefore very robust to unprocessed text; the only errors occur for NaN values. Its outputs are plausible even though no pre-processing was applied to our raw data.

Additionally, VADER has the advantage of making use of emoticons, UTF-8 encoded emojis, word-shape, slang, punctuation and initialisms/acronyms, which helps to determine the overall sentiment more precisely. Therefore, these text features should not be removed in the pre-processing step.


##### Example 1:

###### Raw data:

"i stopped caring what niggas think when i realized they will believe anything the next nigga tell them." {'neg': 0.191, 'neu': 0.667, 'pos': 0.142, 'compound': -0.0258} 

###### Preprocessed:

"i stopped caring what nigga think when i realized they will believe anything the next nigga tell them" {'neg': 0.09, 'neu': 0.758, 'pos': 0.152, 'compound': 0.3182}

##### Example 2:

###### Raw Data:

I hold @kie_vs_theworld personally responsible for what just happened 🤷🏾‍♂️😂 #RAW {'neg': 0.07, 'neu': 0.704, 'pos': 0.226, 'compound': 0.6369}

###### Preprocessed:

i hold kievstheworld personally responsible for what just happened raw {'neg': 0.0, 'neu': 0.796, 'pos': 0.204, 'compound': 0.3182}

##### Conclusion:

As our examples show, our pre-processing changes the scores noticeably. In Example 1 the compound score moves from -0.0258 to 0.3182, so a context with negative wording is pushed towards a positive connotation, and in Example 2 removing the emojis and the hashtag lowers the score from 0.6369 to 0.3182. Therefore, I would suggest not pre-processing the data when using VADER, because the pre-processing already built into the package is sufficient to reliably evaluate the provided social media data (Twitter texts).

As a result, our own pre-processing functions such as lemmatization, tokenization and the removal of URLs, emojis and punctuation are no longer necessary, since VADER already takes care of these problems and many more, such as categorizing misspellings as neutral or incorporating abbreviations so that they can be categorized (e.g. "LOL"). The only exception is missing values (NaN), which VADER cannot process and which still have to be removed.

##### Q3: Apply this method to our TweetsCOV19 dataset and comment on the performance obtained

In [8]:
raw_data.columns

Index(['Unnamed: 0', 'TweetId', 'Username', 'Timestamp', 'NoFollowers',
       'NoFriends', 'NoRetweets', 'NoFavorites', 'Entities', 'Sentiment',
       'Mentions', 'Hashtags', 'URLs', 'TweetText', 'UserLocation'],
      dtype='object')

In [9]:
raw_data['Sentiment'].head(10)

0    2 -1
1    2 -1
2    2 -3
3    1 -1
4    1 -1
5    2 -1
6    3 -1
7    1 -1
8    1 -1
9    1 -1
Name: Sentiment, dtype: object

`Sentiment` is the label column of our DataFrame. For each Tweet it contains a positive score and a negative score, separated by a space.

In [10]:
complete_texts = raw_data['TweetText']

In [11]:
compound_score= []
for text in complete_texts: 
    score = analyzer.polarity_scores(text)
    compound_score.append(score['compound'])
    
    

TypeError: 'float' object is not iterable

In [12]:
raw_data.loc[641055,:]

Unnamed: 0         https://t.co/YSpREbX5VD
TweetId         coast to coast & then some
Username                               NaN
Timestamp                              NaN
NoFollowers                            NaN
NoFriends                              NaN
NoRetweets                             NaN
NoFavorites                            NaN
Entities                               NaN
Sentiment                              NaN
Mentions                               NaN
Hashtags                               NaN
URLs                                   NaN
TweetText                              NaN
UserLocation                           NaN
Name: 641055, dtype: object

In [13]:
raw_data[raw_data['TweetText'].isna()]

Unnamed: 0.1,Unnamed: 0,TweetId,Username,Timestamp,NoFollowers,NoFriends,NoRetweets,NoFavorites,Entities,Sentiment,Mentions,Hashtags,URLs,TweetText,UserLocation
641055,https://t.co/YSpREbX5VD,coast to coast & then some,,,,,,,,,,,,,
644251,delay the testing and tracing of Jamaat attend...,,,,,,,,,,,,,,
644252,bank politics? #AarNoiMamata #আরনয়মমতা https:...,"Kolkata, India",,,,,,,,,,,,,
651840,Now the problem is solved. https://t.co/67sMxS...,,,,,,,,,,,,,,
663640,Now the problem is solved. https://t.co/67sMxS...,,,,,,,,,,,,,,


###### Conclusion 

We need to remove all rows with NaN values in the column 'TweetText' because VADER cannot process them.
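The error can be reproduced without VADER. A NaN is a float, and iterating over a float raises exactly this `TypeError`, presumably because the analyzer loops over the characters of its input (an assumption about its internals). The sketch below also shows a type-based filter as an alternative to the pandas filter; `keep_text_rows` is a hypothetical helper name.

```python
def keep_text_rows(values):
    """Return only the entries that are real strings (drops NaN and None)."""
    return [v for v in values if isinstance(v, str)]

sample = ["Now the problem is solved.", float("nan"), None]
clean = keep_text_rows(sample)

# Iterating over a float triggers the same error as seen above.
try:
    for _ in float("nan"):
        pass
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(clean)
print(error_message)  # 'float' object is not iterable
```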

In [14]:
#Removal of NaN's
new_data = raw_data[raw_data['TweetText'].notna()]

In [15]:
#retry VADER 

complete_texts= new_data['TweetText']
compound_score= []

for text in complete_texts: 
    score = analyzer.polarity_scores(text)
    compound_score.append(score['compound'])
    
    

In [16]:
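def sentimentPredict(sentiment):
    # Map each compound score to a class label using the recommended
    # thresholds. Returns a new list and leaves the input unchanged.
    labels = []
    for s in sentiment:
        if s >= 0.05:
            labels.append(1)       # "Positive"
        elif s <= -0.05:
            labels.append(-1)      # "Negative"
        else:
            labels.append(0)       # "Neutral"
    return labels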

In [17]:
sentiment_predict = sentimentPredict(compound_score)

In [18]:
tweet_sentiment= new_data['Sentiment']

In [19]:
#convert str from tweet_sentiment to integer values 
tweet_sentiment_int=[]

for s in tweet_sentiment:
    a, b = map(int, s.split())
    c=[a,b]
    tweet_sentiment_int.append(c)
    

In [20]:
tweet_sentiment_sum=[]
for number in tweet_sentiment_int:
    summed= sum(number)
    tweet_sentiment_sum.append(summed)
    

In [21]:
def convert_int_to_sentiment(sentiment):
    # Map each summed label score to a class by its sign.
    # Returns a new list and leaves the input unchanged.
    labels = []
    for s in sentiment:
        if s > 0:
            labels.append(1)       # "Positive"
        elif s < 0:
            labels.append(-1)      # "Negative"
        else:
            labels.append(0)       # "Neutral"
    return labels

In [25]:
tweet_sentiment= convert_int_to_sentiment(tweet_sentiment_sum)

In [26]:
from sklearn.metrics import balanced_accuracy_score

In [31]:
score= balanced_accuracy_score(tweet_sentiment, sentiment_predict, adjusted=False)
print(score)

0.5740915423089935


In [32]:
score_adjusted = balanced_accuracy_score(tweet_sentiment, sentiment_predict, adjusted=True)
print(score_adjusted)

0.36113731346349026


To evaluate the performance of the VADER package, we used the compound score it outputs as the prediction of the overall sentiment of each Tweet. For comparison, we took the labels of our dataset and summed the positive and negative label of each Tweet to obtain an overall sentiment score per Tweet.
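The label conversion performed in the cells above can be condensed into one function. This is an equivalent sketch of the same heuristic, not a different method; `label_to_class` is a hypothetical name.

```python
def label_to_class(label):
    """Turn a label such as '2 -1' into 1, 0 or -1 via the sign of the sum."""
    pos, neg = map(int, label.split())
    total = pos + neg
    return (total > 0) - (total < 0)   # sign of the sum as an integer

print([label_to_class(s) for s in ["2 -1", "2 -3", "1 -1"]])  # [1, -1, 0]
```

Note that the frequent label "1 -1" sums to 0 and is therefore treated as neutral under this heuristic.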



Moreover, we use the adjusted balanced accuracy score as the metric to evaluate the performance of the package. The balanced accuracy is the mean of the per-class recalls, and the adjusted variant rescales it to correct for chance.

With the adjusted balanced accuracy score, a score of 1 means perfect performance, while a score of 0 corresponds to random guessing. Therefore, with an adjusted balanced accuracy score of 0.36, VADER is better than random at classifying the sentiment of Twitter texts, but there is still a lot of room for improvement.
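The relation between the two reported scores can be checked by hand. The sketch below computes the balanced accuracy as the mean per-class recall and applies the chance correction (score - 1/K) / (1 - 1/K) for K classes, which is our understanding of what `adjusted=True` does in scikit-learn. The function names are illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def adjust_for_chance(score, n_classes):
    """Rescale so that random guessing maps to 0 and a perfect score to 1."""
    chance = 1.0 / n_classes
    return (score - chance) / (1.0 - chance)

# Three classes (positive, neutral, negative) and the score reported above.
print(round(adjust_for_chance(0.5740915423089935, 3), 4))  # 0.3611
```

With three classes the chance level is 1/3, so the unadjusted score of 0.574 corresponds to the adjusted score of 0.361.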

##### Conclusion of Q3

As a result, one can say that VADER is a good starting point for classifying the sentiment of Twitter texts, but it is far from being a perfect classifier. Reflecting on our methods, we relied on heuristics: we took the sum of the positive and negative scores from the sentiment labels in the TweetsCOV19 dataset and, for example, interpreted a positive sum as a positive sentiment. Furthermore, we applied the thresholds described on the VADER GitHub page to categorize the compound scores into positive, neutral or negative predictions.