# <u>About</u>

The Pfizer–BioNTech COVID‑19 vaccine, sold under the brand name Comirnaty, is a COVID-19 vaccine developed by BioNTech in cooperation with Pfizer. It is both the first COVID-19 vaccine to be authorized by a stringent regulatory authority for emergency use and the first cleared for regular use. In December of 2020, the United Kingdom was the first country to authorize the vaccine on an emergency basis, soon followed by the United States, the European Union and several other countries globally.

<br>
    <figure>
        <center>
            <img src="https://d1e00ek4ebabms.cloudfront.net/production/4319492f-2f82-445b-9fef-ae8deda455bb.jpg" width=500 height=500/>
            <figcaption>Pfizer-BioNTech Vaccine</figcaption>
        </center>
    </figure>
<br>

<u>Problem statement</u>: Study the subjects of recent tweets about the vaccine developed jointly by Pfizer and BioNTech, and perform various NLP tasks on this data source.<br>
<u>Data source</u>: https://www.kaggle.com/gpreda/pfizer-vaccine-tweets

In this notebook, I cover 4 things:
- Classifying tweets with the Valence Aware Dictionary and sEntiment Reasoner (VADER).
- Extracting the top 50 words from negative and positive tweets.
- Analyzing positive and negative tweet counts over a period of 3 months.
- Identifying the countries that are leading the vaccination drive.

## *Loading Pfizer Tweets Dataset*

In [None]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.sentiment import SentimentIntensityAnalyzer

%matplotlib inline

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/pfizer-vaccine-tweets/vaccination_tweets.csv').fillna('')

print(f'No. of records: {data.shape[0]}')
print(f'No. of columns: {data.shape[1]}\n')
print(f'Column names:\n {data.columns.values}')
data.head(3)

## *Data Preprocessing*

In [None]:
stop_words = (set(stopwords.words('english')))
sno = SnowballStemmer('english')

def remove_html_tags(sentence):
    # Replace anything that looks like an HTML tag with a space
    return re.sub(r'<.*?>', ' ', sentence)

def remove_punctuations(word):
    # Strip common punctuation; inside a character class '|' is a literal character, not "or"
    return re.sub(r'[?!|"#\'.,)(/]', '', word)

def get_preprocessed_data(data, feature, cleaned_feature):
    # Clean one text column and store the result in a new column
    final_string = []
    for sentence in data[feature].values:
        filtered_sentence = []
        sentence = remove_html_tags(sentence)
        for word in sentence.split():
            for clean_word in remove_punctuations(word).split():
                # Keep alphabetic tokens longer than two characters that are not stop words
                if clean_word.isalpha() and len(clean_word) > 2:
                    if clean_word.lower() not in stop_words:
                        # Reduce each kept word to its stem, e.g. "vaccines" -> "vaccin"
                        filtered_sentence.append(sno.stem(clean_word.lower()))
        final_string.append(" ".join(filtered_sentence))
    data[cleaned_feature] = final_string
    return data

data = get_preprocessed_data(data, 'text', 'Tidy Tweet')
data = get_preprocessed_data(data, 'hashtags', 'Tidy hashtags')

## *Analyzing sentiments using pretrained sentiment analyzer: Valence Aware Dictionary and sentiment Reasoner (VADER)*

VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It relies on a sentiment lexicon: a list of lexical features (e.g., words) that are generally labelled according to their semantic orientation as either positive or negative.

<br>
    <figure>
        <center>
            <img src="https://www.tertiarycourses.com.sg/media/catalog/product/cache/1/image/650x/040ec09b1e35df139433887a97daa66f/n/a/natural-language-processing-nlp-python-nltk-training.jpg" 
                      width=200 height=200/>
            <img src="https://miro.medium.com/max/860/1*Xj8-Jpi5TppZHA8dFRml6A.jpeg" width=200 height=200/>
        </center>
    </figure>
<br>

For more info: https://www.nltk.org/_modules/nltk/sentiment/vader.html
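
As a rough illustration of where the `compound` score comes from, the sketch below reimplements VADER's final normalization step, `x / sqrt(x² + α)` with `α = 15`, in plain Python and maps the result to a label. The helper `label_from_compound` and its `threshold` parameter are hypothetical names for this example, not part of NLTK, and the summed valence values fed in are made up.

```python
import math

def normalize(score, alpha=15):
    """Squash a summed valence score into (-1, 1), as VADER does for `compound`."""
    return score / math.sqrt(score * score + alpha)

def label_from_compound(compound, threshold=0.0):
    """Map a compound score to a sentiment label (hypothetical helper).

    threshold=0.0 mirrors this notebook. The VADER authors suggest roughly
    0.05 as a typical cut-off, which widens the neutral band.
    """
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"

# Made-up summed valence values, for illustration only
for raw in (3.1, 0.0, -1.9):
    compound = normalize(raw)
    print(f"{raw:+.1f} -> {compound:+.4f} -> {label_from_compound(compound)}")
```

With a threshold of 0, any tweet containing a single mildly scored word leaves the neutral class, so a small non-zero threshold may be worth trying.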

In [None]:
sentiment = SentimentIntensityAnalyzer()

def get_sentiment(data):
    sentiment_list = []
    for text in list(data['Tidy Tweet'].values):
        # Score each tweet once and reuse the compound value
        compound = sentiment.polarity_scores(text)["compound"]
        if compound > 0:
            sentiment_list.append("Positive")
        elif compound < 0:
            sentiment_list.append("Negative")
        else:
            sentiment_list.append("Neutral")
    return sentiment_list
        
data['Sentiment'] = get_sentiment(data)
sns.countplot(x="Sentiment", data=data, palette="Set3")
print(data.Sentiment.value_counts())

## *Word Cloud on Tweets*

In [None]:
def get_word_cloud(sentiment):
    # Extend the English stop words with stemmed, topic-specific terms that would dominate every cloud
    stop_words = (set(stopwords.words('english')))
    remove_words = ['vaccin', 'pfizerbiontech', 'coronavirus', 'pfizer', 'covid', 'covidvaccin', 'pfizervaccin']
    stop_words = remove_words + list(stop_words)
    plt.figure(figsize=[15,15])
    # Join with a space so the last word of one tweet does not fuse with the first word of the next
    clean_tweets = " ".join(list(data[data['Sentiment']==sentiment]['Tidy Tweet'].values))
    wordcloud = WordCloud(width=700,height=400, background_color='white',colormap='plasma', max_words=50, stopwords=stop_words, collocations=False).generate(clean_tweets)
    plt.title(f"Top 50 {sentiment} words used in tweets", fontsize=20)
    plt.imshow(wordcloud)
    plt.show()

In [None]:
get_word_cloud(sentiment='Negative')

### Classifying tweets with the Valence Aware Dictionary and sEntiment Reasoner (VADER) gives good results for negative tweets. Some of the words that represent negative sentiment are *death, die, sore, risk, dead, serious, delay, allergy, etc.*

In [None]:
get_word_cloud(sentiment='Positive')

### Classifying tweets with the Valence Aware Dictionary and sEntiment Reasoner (VADER) gives good results for positive tweets as well. Some of the words that represent positive sentiment are *thank, like, new, good, hope, health, approve, receive, help, need, etc.*
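
A word cloud shows relative prominence but not exact counts. The following is a minimal sketch of how a ranked top-N list could be extracted with the standard library alone. The function `top_words`, the sample tweets and the `exclude` set are hypothetical and stand in for the `Tidy Tweet` column filtered by sentiment.

```python
from collections import Counter

def top_words(tweets, n=25, exclude=frozenset()):
    """Return the n most frequent whitespace-separated tokens across tweets."""
    counts = Counter(
        word
        for tweet in tweets
        for word in tweet.split()
        if word not in exclude
    )
    return counts.most_common(n)

# Made-up, already-cleaned tweets standing in for data['Tidy Tweet']
sample = ["sore arm vaccin", "arm sore risk", "risk delay vaccin sore"]
print(top_words(sample, n=3, exclude={"vaccin"}))
```

On the notebook's data this could be called as `top_words(data[data['Sentiment'] == 'Negative']['Tidy Tweet'], n=25)`, assuming the column holds the space-separated stems produced during preprocessing.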

## *Tweet Count w.r.t Date*

In [None]:
data['date'] = pd.to_datetime(data['date']).dt.date
negative_data = data[data['Sentiment']=='Negative'].reset_index()
positive_data = data[data['Sentiment']=='Positive'].reset_index()
grouped_data_neg = negative_data.groupby('date')['Sentiment'].count().reset_index()
grouped_data_pos = positive_data.groupby('date')['Sentiment'].count().reset_index()
merged_data = pd.merge(grouped_data_neg, grouped_data_pos, left_on='date', right_on='date', suffixes=(' Negative', ' Positive'))

merged_data.plot(x='date', y=['Sentiment Negative', 'Sentiment Positive'], figsize=(14, 7), marker='o', xlabel='Date', ylabel='Count', title='Tweet count over a period of time')

### The graph above shows that, over a period of 3 months, people were optimistic about the Pfizer vaccine. It also shows that on any given day, positive tweets outnumbered negative tweets. This indicates that people are hopeful and have high expectations of the Pfizer vaccine.
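
Note that `pd.merge` defaults to an inner join, so the plot only covers days that have at least one tweet in both classes. The sketch below assumes that days with zero tweets in one class should still appear, and builds the daily counts with `pd.crosstab` instead. The small `toy` frame is made up and stands in for `data`.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `data` frame
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-12-12", "2020-12-12", "2020-12-13", "2020-12-14", "2020-12-14"]
    ).date,
    "Sentiment": ["Positive", "Negative", "Positive", "Negative", "Neutral"],
})

# One row per day, one column per label; missing combinations become 0
daily = pd.crosstab(toy["date"], toy["Sentiment"])

# Keep only the two classes of interest, in a fixed order
daily = daily.reindex(columns=["Negative", "Positive"], fill_value=0)
print(daily)
```

The resulting frame can be plotted directly with `daily.plot(marker='o')`, since the dates are already the index.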

## *Tweets per Country/City*

In [None]:
loc_df = data['user_location'].str.split(',',expand=True)
loc_df=loc_df.rename(columns={0:'fst_loc',1:'snd_loc'})
loc_df['snd_loc'] = loc_df['snd_loc'].str.strip()

# Map states, provinces and cities to their country ('AB' is Alberta, so it maps to Canada)
state_fix = {'Ontario': 'Canada','United Arab Emirates': 'UAE','TX': 'USA','NY': 'USA','FL': 'USA','England': 'UK','Watford': 'UK','GA': 'USA','IL': 'USA','United Kingdom':'UK', 
             'Alberta': 'Canada','WA': 'USA','NC': 'USA','British Columbia': 'Canada','MA': 'USA','ON':'Canada','OH':'USA','MO':'USA','AZ':'USA','NJ':'USA','London':'UK',
             'CA':'USA','DC':'USA','AB':'Canada','PA':'USA','SC':'USA','VA':'USA','TN':'USA','New York':'USA','Dubai':'UAE','CO':'USA', 'MI':'USA', 'LA':'USA', 'MD':"USA"}
country = loc_df.replace({"snd_loc": state_fix}) 
top_tweets = loc_df['snd_loc'].value_counts()[:20]
tweet_df = pd.DataFrame(top_tweets)
tweet_df.reset_index(level=0, inplace=True)
tweet_df.columns = ['Country', 'Count']
tweet_df['Country'] = tweet_df['Country'].replace(state_fix, regex=False)
tweets_per_country = tweet_df.groupby('Country')['Count'].sum().reset_index().sort_values(by='Count')
tweets_per_country.plot.bar(x='Country', figsize=(14, 7), xlabel='Country', ylabel='Count', title='Tweet count per country')

### Most tweets came from the USA, the UK and India. These 3 countries are leading the vaccination drive. This [article](https://timesofindia.indiatimes.com/life-style/health-fitness/health-news/coronavirus-5-countries-which-are-leading-the-race/photostory/78844240.cms) by Times of India mentions that the USA, UK, China, India and Russia are playing a crucial role in vaccine manufacturing and production. Since Twitter is banned (blocked) in China and Russia, their numbers do not appear in the picture.
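
The location handling above boils down to taking the second comma-separated part of `user_location` and looking it up in a table. Here is a minimal standard-library sketch of that idea. The function `to_country` and the three-entry `REGION_TO_COUNTRY` table are hypothetical and far from complete, and free-text locations will often not match at all.

```python
# Hypothetical, deliberately tiny lookup table for illustration
REGION_TO_COUNTRY = {
    "TX": "USA",
    "England": "UK",
    "Ontario": "Canada",
}

def to_country(user_location, table=REGION_TO_COUNTRY):
    """Map 'City, Region' to a country; return None when there is no region part."""
    parts = [p.strip() for p in user_location.split(",")]
    if len(parts) < 2 or not parts[1]:
        return None
    region = parts[1]
    # Fall back to the region itself, which is often already a country name
    return table.get(region, region)

for loc in ("Austin, TX", "London, England", "Mumbai, India", "Earth"):
    print(loc, "->", to_country(loc))
```

Mapping every row before counting, rather than mapping only the 20 most frequent raw values, would also let smaller regions contribute to their country's total.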

<br>
    <figure>
        <center>
            <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSr0hAXZpPjP6H_SLVhBBZQ0t8lDLMGvk8ZTA&usqp=CAU" width=200 height=200/>
            <figcaption>If you liked my work, please upvote it.</figcaption>
        </center>
    </figure>
<br>