### Pfizer Vaccine Tweets

This notebook has two parts. 

- Part 1 - Sentiment analysis of tweets using HuggingFace Transformers (analogous to Optimus Prime!!) and NLTK VADER (analogous to Darth Vader!!). Ignore the analogies if you aren't a fan of the movies.
- Part 2 - Sentence embedding generation using pretrained BERT.

## Part 1 (Optimus Prime vs Darth Vader!!)

In [None]:
from IPython.display import Image
Image(filename='/kaggle/input/compare/img.jpg')

We all are familiar with Optimus Prime and Darth Vader. Let's see how good they are.

> Just for your reference, I'm comparing HuggingFace Transformers to Optimus Prime and NLTK's VADER to Darth Vader. Interesting analogy, isn't it?

We'll run sentiment analysis with both (HuggingFace Transformers and NLTK's VADER) and see how they perform. Do they contradict each other?

In [None]:
import numpy as np 
import pandas as pd 
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv("/kaggle/input/pfizer-vaccine-tweets/vaccination_tweets.csv")

In [None]:
data.head()

### Basic EDA

A generic function for basic EDA

In [None]:
def basic_eda(df, row_limit=5, list_elements_limit=10):
    ### rows and columns
    print('Info : There are {} columns in the dataset'.format(df.shape[1]))
    print('Info : There are {} rows in the dataset'.format(df.shape[0]))
    
    print("==================================================")
    
    ## data types
    print("\nData type information of different columns")
    dtypes_df = pd.DataFrame(df.dtypes).reset_index().rename(columns={0:'dtype', 'index':'column_name'})
    cat_df = dtypes_df[dtypes_df['dtype']=='object']
    num_df = dtypes_df[dtypes_df['dtype']!='object']
    print('Info : There are {} categorical columns'.format(len(cat_df)))
    print('Info : There are {} numerical columns'.format(len(dtypes_df)-len(cat_df)))
    
    if list_elements_limit >= len(cat_df):
        print("Categorical columns : ", list(cat_df['column_name']))
    else:
        print("Categorical columns : ", list(cat_df['column_name'])[:list_elements_limit])
        
    if list_elements_limit >= len(num_df):
        print("Numerical columns : ", list(num_df['column_name']))
    else:
        print("Numerical columns : ", list(num_df['column_name'])[:list_elements_limit])
    
    #dtypes_df['dtype'].value_counts().plot.bar()
    display(dtypes_df.head(row_limit))
    
    print("==================================================")
    print("\nDescription of numerical variables")
    
    #### Describing numerical columns
    desc_df_num = df[list(num_df['column_name'])].describe().T.reset_index().rename(columns={'index':'column_name'})
    display(desc_df_num.head(row_limit))
    
    print("==================================================")
    print("\nDescription of categorical variables")
    
    desc_df_cat = df[list(cat_df['column_name'])].describe().T.reset_index().rename(columns={'index':'column_name'})
    display(desc_df_cat.head(row_limit))
    
    return

In [None]:
basic_eda(data)

### Sentiment Analysis : The Optimus Prime [HuggingFace Transformers]

![HF](https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/transformers_logo_name.png)

Transformers is a library released by [huggingface](https://huggingface.co/transformers/quicktour.html). It downloads pretrained models for Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG) tasks, such as completing a prompt with new text or translating into another language.

We'll use a pretrained model to find the sentiment of each tweet in our dataset.

Pros:

- Good Accuracy
- Very short and easy to use code
- No fancy preprocessing needed
- No fiddling with threshold values

Cons:

- Significantly slower than VADER
- Only works with 2 classes (POSITIVE and NEGATIVE) out of the box

In [None]:
from transformers import pipeline
sentiment_analysis = pipeline('sentiment-analysis')

In [None]:
transformer_sentiments = data.text.apply(sentiment_analysis)

In [None]:
labels = []
scores = []
for sentiment in transformer_sentiments:
    #print(f"label: {sentiment[0]['label']}, with score: {round(sentiment[0]['score'], 4)}")
    labels.append(sentiment[0]['label'])
    scores.append(round(sentiment[0]['score'], 4))

In [None]:
data['tf-sentiment'] = labels
data['tf-score'] = scores

In [None]:
data[['text', 'tf-sentiment', 'tf-score']].head(3)
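
The loop above relies on the shape of the pipeline's return value: for a single string, `pipeline('sentiment-analysis')` returns a list holding one dict with `label` and `score` keys. The sketch below uses hand-written stand-in results (the values are invented for illustration, not model output) to show the same unpacking written as list comprehensions.

```python
# Stand-in for what `data.text.apply(sentiment_analysis)` yields:
# one single-element list per tweet. These values are invented for illustration.
mock_results = [
    [{"label": "NEGATIVE", "score": 0.99871234}],
    [{"label": "POSITIVE", "score": 0.87654321}],
]

# Same unpacking as the loop above, written as comprehensions.
mock_labels = [result[0]["label"] for result in mock_results]
mock_scores = [round(result[0]["score"], 4) for result in mock_results]

print(mock_labels)
print(mock_scores)
```

Note that `score` is the model's confidence in the predicted label, so a NEGATIVE row with a score of 0.99 is strongly negative, not strongly positive.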

### Sentiment Analysis : The Darth Vader [NLTK VADER]

![nltk](https://static1.squarespace.com/static/538cea80e4b00f1fad490c1b/54668a77e4b00fb778d22a34/54668e11e4b00fb778d29051/1416008768215/?format=1500w)

NLTK ships with a built-in sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

Because VADER is lexicon- and rule-based and needs no training, you can get results more quickly than with many other analyzers. It is best suited to the language used on social media, such as short sentences with some slang and abbreviations. It's less accurate when rating longer, structured sentences, but it's often a good launching point.

Pros:

- Very Fast
- Very short and easy to use code
- No fancy preprocessing needed
- Provides three classes

Cons:

- Rule-based algorithm doesn't consider context
- Less accurate

In [None]:
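import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# SentimentIntensityAnalyzer needs the VADER lexicon; fetch it if it isn't already present.
nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()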

In [None]:
def find_sentiment(tweet):
    # Score the tweet once and reuse it; "compound" is VADER's overall score in [-1, 1].
    compound = sia.polarity_scores(tweet)["compound"]
    if compound > 0:
        return "POSITIVE"
    elif compound < 0:
        return "NEGATIVE"
    else:
        return "NEUTRAL"
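
The function above treats any non-zero compound score as positive or negative, so only tweets scoring exactly 0 land in NEUTRAL. The VADER documentation suggests a small neutral band instead (compound >= 0.05 is positive, compound <= -0.05 is negative). A minimal sketch of that rule, applied to a compound score that has already been computed; `label_from_compound` and its `threshold` parameter are hypothetical names, not part of NLTK:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-class label.

    `threshold` is the half-width of the neutral band; 0.05 is the
    value suggested in the VADER documentation and is assumed here.
    """
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


# Hand-picked scores for illustration, not output from the dataset.
for score in (0.6249, 0.02, -0.3182):
    print(score, label_from_compound(score))
```

Swapping this rule into `find_sentiment` would move weakly scored tweets into NEUTRAL, so the class distribution and the agreement numbers later in the notebook would change.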

In [None]:
vader_sentiments = data.text.apply(find_sentiment)

In [None]:
data['vader-sentiment'] = vader_sentiments

In [None]:
data[['text', 'vader-sentiment']].head(3)

### Comparison of Transformer with VADER

Now that we have both sets of sentiments, let's compare them.

In [None]:
## Make a df just for comparison (copy, so adding columns later doesn't write into a slice of `data`)
df = data[['text', 'tf-score', 'tf-sentiment', 'vader-sentiment']].copy()
df.head(3)

#### Woah! Are they performing exactly opposite? Let's find out.

In [None]:
print("Distribution of classes : Optimus Prime")
counts = df['tf-sentiment'].value_counts()
percent = counts/sum(counts)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

counts.plot(kind='bar', ax=ax1)
percent.plot(kind='bar', ax=ax2)
ax1.set_ylabel('Counts : TF Sentiments', size=12)
ax2.set_ylabel('Percentage : TF Sentiments', size=12)
plt.tight_layout()
plt.show()

Interestingly, the Transformer model is classifying most of the tweets as negative. Let's see what VADER does.

In [None]:
print("Distribution of classes : Darth Vader")
counts = df['vader-sentiment'].value_counts()
percent = counts/sum(counts)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

counts.plot(kind='bar', ax=ax1)
percent.plot(kind='bar', ax=ax2)
ax1.set_ylabel('Counts : VADER Sentiments', size=12)
ax2.set_ylabel('Percentage : VADER Sentiments', size=12)
plt.tight_layout()
plt.show()

Interesting! Let's see side by side.

In [None]:
def label_function(val):
    return f'{val / 100 * len(df):.0f}\n{val:.0f}%'

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 8))

df.groupby('vader-sentiment').size().plot(kind='pie', autopct=label_function, textprops={'fontsize': 12},
                                  colors=['tomato', 'gold', 'skyblue'], ax=ax1)
df.groupby('tf-sentiment').size().plot(kind='pie', autopct=label_function, textprops={'fontsize': 12},
                                 colors=['violet', 'lime'], ax=ax2)
ax1.set_ylabel('VADER Sentiments', size=12)
ax2.set_ylabel('Transformer Sentiments', size=12)
plt.tight_layout()
plt.show()
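
matplotlib passes each wedge's share to the `autopct` callable as a percentage (0 to 100), so `label_function` multiplies by the row count to recover the class size. A small standalone sketch of that arithmetic with the total passed in explicitly; `make_label_function` and the total of 200 are assumptions for illustration:

```python
def make_label_function(total):
    """Build an autopct-style formatter for a pie chart over `total` records."""
    def label(pct):
        # pct arrives as a percentage of the whole, e.g. 25.0 for a quarter.
        count = pct / 100 * total
        return f"{count:.0f}\n{pct:.0f}%"
    return label


label = make_label_function(total=200)
print(label(25.0))
```

Passing the total explicitly avoids relying on the global `df`, which matters when the same formatter is reused on a frame of a different length, such as one with the neutral rows filtered out.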

Clearly there is a lot of difference! Let's dig down further!

In [None]:
def same_or_diff(x):
    # x is a row holding the two labels; access them by position explicitly.
    if x.iloc[0] == x.iloc[1]:
        return "Same"
    else:
        return "Different"

In [None]:
print("Same or Different including the Neutral records")
df['same_or_diff_w_neut'] = df[['tf-sentiment', 'vader-sentiment']].apply(same_or_diff, axis=1)
df.head(3)

In [None]:
print("Same or Different including the Neutral records : Comparison")
counts = df['same_or_diff_w_neut'].value_counts()
percent = counts/sum(counts)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

counts.plot(kind='bar', ax=ax1)
percent.plot(kind='bar', ax=ax2)
ax1.set_ylabel('Number of records', size=12)
ax2.set_ylabel('Percentage of records', size=12)
plt.tight_layout()
plt.show()
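
The Same/Different flag hides which label pairs disagree. Because VADER emits three classes and the Transformers pipeline only two, every NEUTRAL row from VADER is counted as Different by construction. A minimal sketch of a pair count using only the standard library, on invented labels (in the notebook itself, `pd.crosstab(df['tf-sentiment'], df['vader-sentiment'])` would produce the equivalent table):

```python
from collections import Counter

# Invented labels for illustration; not taken from the dataset.
tf_labels = ["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
vader_labels = ["NEUTRAL", "NEGATIVE", "POSITIVE", "POSITIVE", "NEUTRAL"]

# Count each (transformer label, VADER label) combination.
pair_counts = Counter(zip(tf_labels, vader_labels))

# Rows where both tools agree, and rows VADER marked neutral.
agree = sum(n for (tf, vd), n in pair_counts.items() if tf == vd)
neutral = sum(n for (tf, vd), n in pair_counts.items() if vd == "NEUTRAL")

for pair, n in sorted(pair_counts.items()):
    print(pair, n)
print("agree:", agree, "| VADER neutral:", neutral)
```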

In [None]:
print("Same or Different after removing the Neutral records")
dfwn = df[df['vader-sentiment'] != 'NEUTRAL'].copy()
# Just to ensure
print("==================================\n")
print(dfwn['vader-sentiment'].value_counts())
dfwn['same_or_diff_wo_neut'] = dfwn[['tf-sentiment', 'vader-sentiment']].apply(same_or_diff, axis=1)
dfwn.head(3)

In [None]:
print("Same or Different after removing the Neutral records : Comparison")
counts = dfwn['same_or_diff_wo_neut'].value_counts()
percent = counts/sum(counts)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

counts.plot(kind='bar', ax=ax1)
percent.plot(kind='bar', ax=ax2)
ax1.set_ylabel('Number of records', size=12)
ax2.set_ylabel('Percentage of records', size=12)
plt.tight_layout()
plt.show()

#### Let's look at some records!!

In [None]:
df['color'] = df['same_or_diff_w_neut'].apply(lambda x : "green" if x == 'Same' else 'red')

fig = go.Figure(data=[go.Table(
    columnorder = [1,2,3,4],
    columnwidth = [400, 100, 100, 120],
    header=dict(values=['text', 'tf-sentiment', 'vader-sentiment', 'same_or_different'],
                fill_color='paleturquoise',
                line_color='black',
                align='center',
                height=40),
    cells=dict(values=[df['text'],df['tf-sentiment'], df['vader-sentiment'], df['same_or_diff_w_neut']],
               fill_color=[['lavender'], ['lavender'], ['lavender'], list(df.color)],
               line_color='black',
               align='left'))
])

fig.update_layout(height=700,
                 title="Comparison across Transformer and VADER")

fig.show()

#### We can clearly see the differences! 

## Part 2 - Sentence embedding generation of tweets using BERT

In [None]:
import math
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

In [None]:
## Loading pretrained model/tokenizer
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
##### Some utility functions
# function to find max length
def find_max_len(tokenized):
  max_len = 0
  for i in tokenized.values:
      if len(i) > max_len:
          max_len = len(i)
  return max_len


# function to extract bert features
def get_bert_features(df, text_col):
  tokenized = df[text_col].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
  max_len = find_max_len(tokenized)
  print("Max Len = ",max_len)
  padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
  attention_mask = np.where(padded != 0, 1, 0)

  input_ids = torch.tensor(padded)  
  attention_mask = torch.tensor(attention_mask)

  with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

  features = last_hidden_states[0][:,0,:].numpy()

  return features


# Applying above function in batches to avoid RAM issues
def extract_features(df, text_col, batch_size=1000):
    features = []
    labels = []

    no_of_batches = math.ceil(len(df)/batch_size)
    print("\nInitializing...")
    print("Total no of batches : ",str(no_of_batches))
    batch_no = 1

#     widgets = ['Generating BERT Embeddings: ', progressbar.AnimatedMarker()] 
#     bar = progressbar.ProgressBar(max_value=len(df), widgets=widgets).start() 

    for i in range (0,len(df),batch_size):
        #time.sleep(0.2)
        #bar.update(i)
        print()
        print("\nGenerating features for batch",str(batch_no),"of",str(no_of_batches))
        dfn = df[i:i+batch_size]
        tfeatures = get_bert_features(dfn, text_col)
        tfeatures = list(tfeatures)
        features.append(tfeatures)
        batch_no = batch_no + 1

    print("Done")
    features = np.concatenate(features)

    return features
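
`get_bert_features` pads every token-id list in a batch to the longest one with 0 (the `[PAD]` id in `bert-base-uncased`) and builds an attention mask that is 1 for real tokens and 0 for padding. A pure-Python sketch of that step with made-up token ids; `pad_and_mask` is a hypothetical helper, not part of the notebook or of Transformers:

```python
def pad_and_mask(token_id_lists, pad_id=0):
    """Right-pad each list to the longest length and return (padded, mask)."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1 if tok != pad_id else 0 for tok in row] for row in padded]
    return padded, mask


# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased;
# the ids in between are arbitrary placeholders.
batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 102]]
padded, mask = pad_and_mask(batch)
print(padded)
print(mask)
```

Because padding is computed per batch, `max_len` can differ from one batch to the next. That is harmless here, since only the hidden state at position 0 (the `[CLS]` token) is kept as the sentence embedding.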

In [None]:
features = extract_features(df=df, text_col = 'text', batch_size=1000)
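
`extract_features` walks the frame in slices of `batch_size` rows, so the number of batches is `ceil(len(df) / batch_size)` and the last batch may be shorter than the rest. A small sketch of the same slicing on a plain list; `iter_batches` and the row count of 2,500 are assumptions for illustration:

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rows = list(range(2500))  # stand-in for 2,500 tweets
batches = list(iter_batches(rows, batch_size=1000))
expected_batches = math.ceil(len(rows) / 1000)

print(len(batches), expected_batches, [len(b) for b in batches])
```

A smaller `batch_size` lowers peak memory per forward pass at the cost of more passes through the model.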

In [None]:
embeddings = list(features)

In [None]:
df['embedding'] = embeddings
df = df[['text', 'embedding']].head(10)

In [None]:
fig = go.Figure(data=[go.Table(
    columnorder = [1,2],
    columnwidth = [300,400],
    header=dict(values=['text', 'embedding'],
                fill_color='paleturquoise',
                line_color='black',
                align='center',
                height=40),
    cells=dict(values=[df['text'],df['embedding']],
               fill_color=[['lavender'], ['lavender']],
               line_color='black',
               align='left'))
])

fig.update_layout(height=700,
                 title="Embeddings generated using BERT")

fig.show()

### Thanks for viewing this notebook. If you found it interesting, consider UPVOTING it.