# <p style="font-size:36px; font-family:'Candara'; font-weight: bold; line-height:1.3">Sentiment Analysis of Pfizer COVID-19 Vaccine Tweets using VADER</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Analyst: Jordan Rich<br>KaggleID: JordanRich</p>

<p style="font-size:24px; font-family:'Candara'; font-weight: bold; line-height:1.3">Notebook Description</p>

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This notebook explores tweets about the Pfizer COVID-19 vaccine to understand public sentiment toward it. First, hashtags are examined to determine whether they should be included in sentiment scoring. The tweet text is then cleaned and tokenized with tools from the NLTK (Natural Language Toolkit) library. Next, positive and negative lexicon terms found in the tweets are counted to get an initial picture of sentiment. Finally, sentiment scoring is performed with VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER was selected as the lexicon- and rule-based scoring tool for this project because it is specifically attuned to sentiments expressed in social media.</p>

<p style="font-size:24px; font-family:'Candara'; font-weight: bold; line-height:1.3">Key Activities</p>
    <ul>
        <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Demonstrate how to analyze, tokenize, and clean text data obtained from Twitter</p></li> 
        <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Describe how to perform sentiment analysis using NLTK SentimentIntensityAnalyzer() and VADER</p></li>
        <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Provide results from sentiment analysis specific to the Pfizer COVID-19 Vaccine Tweets</p></li>
        <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Present conclusion of study</p></li>
    </ul>

In [None]:
pip install langdetect

In [None]:
#import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from langdetect import detect
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
import re, string
from PIL import Image
from wordcloud import WordCloud
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer

#NLTK downloads
nltk.download([
    "names",
    "stopwords",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
    "wordnet",  #required by WordNetLemmatizer
])

#configure jupyter to allow each cell to display multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
#load tweet csv file and view 10 sampled rows
tweets = pd.read_csv('../input/pfizer-vaccine-tweets/vaccination_tweets.csv')
tweets.sample(10)

In [None]:
# check whether there are duplicated rows
tweets.duplicated().value_counts()

# True when no row is duplicated
not tweets.duplicated().any()

<p style="font-size:24px; font-family:'Candara'; font-weight: bold">Extract and format hashtags to analyze</p>

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Hashtags are formatted and tokenized for easier analysis and manipulation. Here, the hashtags stored with each tweet are separated and collected into a single list using standard Python functions.</p>

In [None]:
%%capture
import ast

#drop NAN and NULL values and reset the index
x = tweets['hashtags'].dropna().reset_index(drop=True)

#each entry is the string form of a list (e.g. "['a', 'b']"); parse it safely with
#ast.literal_eval instead of exec, then flatten all hashtags into a single list
ht_list = list(itertools.chain.from_iterable(ast.literal_eval(entry) for entry in x))

#save into dataframe
hashtags = pd.DataFrame(ht_list, columns = ['hashtags'])

In [None]:
hashtags

In [None]:
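#create ranked barplot of the 50 most frequent hashtags
fig = plt.figure(figsize = [5, 20]);

#name the count column explicitly so the plot does not depend on the pandas version
ht_counts = hashtags['hashtags'].value_counts().head(50).rename_axis('hashtags').reset_index(name='count')
sns.barplot(data=ht_counts, y='hashtags', x='count');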
plt.title('Ranked Hashtags', fontsize=20)
plt.yticks(fontsize=14, fontweight='bold');
plt.xticks(fontsize=14);
plt.xlabel('Count', fontsize=16);

In [None]:
#list frequencies
ht_freqdist=FreqDist(ht_list)
ht_freqdist.most_common(50)

<p style="font-size:20px; font-family:'Calibri Light'; line-height:1.3; font-weight: bold">Conclusions about hashtags</p>

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Hashtags do not appear to contain useful information for sentiment analysis. They predominantly mention Pfizer, COVID, or vaccine (or a variation of these words), or governmental organizations associated with the COVID-19 pandemic response and management.</p>

# <p style="font-size:28px; font-family:'Candara'; font-weight: bold">Perform EDA on tweet text</p>
<ul>
    <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Detect tweet languages using the langdetect library so that English tweets can be isolated</p></li>
    <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Tokenize text to words</p></li>
    <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Clean text</p></li>
    <li><p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Analyze frequency distribution</p></li>
</ul>

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Language Detection</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Here the langdetect library is used to analyze individual tweets and return their ISO 639-1 language codes. Using a dataset created from the Wikipedia page on ISO 639 language codes, the codes are translated to English language names and stored in a dataframe along with their counts.
</p>

In [None]:
#drop NAN and NULL values; reset the index so positional lookups such as tweet_text[0] stay valid
tweet_text = tweets['text'].dropna().reset_index(drop=True)

#detect language of individual tweets and put into list
lang = [detect(text) for text in tweet_text]

In [None]:
#create count of unique languages
lang_un = np.unique(lang, return_counts=True)

In [None]:
#load table to translate ISO 639-1 language codes to English Language names
lang_codes=pd.read_csv('../input/language-codes/correctedMetadata.csv')
lang_codes.sample(10)

In [None]:
#translate ISO 639-1 language codes to English language names and place into dataframe with counts
a = []
for i in np.arange(0,len(lang_un[0])):
    if lang_un[0][i] == 'en':
        a.append(['English'])
    else:
        a.append(lang_codes.loc[lang_codes['Wikipedia.Language.Code'] == lang_un[0][i], 'Language.name..English.'].to_list())
lang_tweet = pd.DataFrame(a, columns=['Language'])
del a
lang_tweet['Count'] = lang_un[1]
lang_tweet
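
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">The counting and translation steps above can also be sketched with the standard library alone. This is a minimal illustration, not the notebook's method: <code>detected_codes</code> and <code>code_to_name</code> are hypothetical stand-ins for the langdetect output and the language-code table.</p>

```python
from collections import Counter

# Hypothetical detected codes, standing in for the output of langdetect.detect()
detected_codes = ["en", "en", "es", "fr", "en", "es"]

# Hypothetical excerpt of an ISO 639-1 code-to-name table
code_to_name = {"en": "English", "es": "Spanish", "fr": "French"}

# Count each code, then translate it; unknown codes fall back to the raw code
counts = Counter(detected_codes)
language_counts = [
    (code_to_name.get(code, code), n) for code, n in counts.most_common()
]
print(language_counts)
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Falling back to the raw code keeps a row for every detected language even when the lookup table has no matching entry.</p>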

In [None]:
#extract only tweets in english for this notebook
tweet_text_en = []
for i in np.arange(0,len(lang)):
    if lang[i] == 'en':
        tweet_text_en.extend([tweet_text[i]])

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Word Tokenization</p>

<p style="font-size:20px; font-family:'Candara'; font-weight: bold; line-height:1.3">When choosing a tokenizer to convert tweet text strings to words, it is important that URLs and hashtags keep their format so that they can be discarded during cleaning. Here two NLTK tokenizers, word_tokenize() and ToktokTokenizer(), are compared for suitability in this project.</p>

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">In this case, word_tokenize() did not retain the hashtag and URL formatting, making it more difficult to discard hashtags and URLs from tweet text. ToktokTokenizer() is the better choice for this project because it retains the URL and hashtag formatting, which makes them simpler to discard during cleaning.</p>

In [None]:
#initialize ToktokTokenizer()
tt = ToktokTokenizer()

# NLTK.word_tokenize vs. NLTK.ToktokTokenizer()
print('Original Tweet - \n\n{}\n\n    NLTK.word_tokenize() -\n\n    {}\n\n    vs.\n\n    NLTK.ToktokTokenizer() -\n\n    {}' .format(tweet_text[0], word_tokenize(tweet_text[0]),tt.tokenize(tweet_text[0])))
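
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Why intact hashtags and URLs matter can be illustrated without NLTK. The sketch below uses two stand-in tokenizers (a punctuation-splitting regex and a plain whitespace split) rather than word_tokenize() and ToktokTokenizer(); the tweet text and the helper <code>drop_tags_and_urls</code> are hypothetical.</p>

```python
import re

# Hypothetical tweet used only for illustration
tweet = "Got my first dose today! #PfizerBioNTech https://t.co/abc123"

# Stand-in for a tokenizer that splits on punctuation: '#' and URL parts separate
split_tokens = re.findall(r"\w+|[^\w\s]", tweet)

# Stand-in for a tokenizer that keeps hashtags and URLs whole (whitespace split)
whole_tokens = tweet.split()

def drop_tags_and_urls(tokens):
    """Discard tokens that are recognizably hashtags or URLs."""
    return [t for t in tokens if not t.startswith(("#", "http"))]

# With whole tokens, the hashtag and URL disappear cleanly
print(drop_tags_and_urls(whole_tokens))
# With split tokens, fragments such as 'PfizerBioNTech', 't', and 'co' survive
print(drop_tags_and_urls(split_tokens))
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Once a hashtag or URL has been broken apart, a simple prefix test can no longer identify its pieces, which is the difficulty described above.</p>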


# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">POS (Part of Speech) Tagging</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Here, the pos_tag() function from the NLTK library is used to tag tokens with their respective parts of speech (nouns, adjectives, verbs, etc.). For more information on POS tagging, see <a href=http://www.nltk.org/book/ch05.html>http://www.nltk.org/book/ch05.html</a>
</p>

In [None]:
print(pos_tag(tt.tokenize(tweet_text[0])))

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Lemmatization</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
     &nbsp;&nbsp;&nbsp;&nbsp;The function below performs lemmatization, i.e., it converts tokens to their root words, using the NLTK WordNetLemmatizer(). The lemmatize_sentence function will later be incorporated into a larger function that performs the overall cleaning of the tweet text data.
</p>

In [None]:
#Use the NLTK WordNetLemmatizer() to reduce words that share a common root to that root form.
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence
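
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">The tag-to-POS decision inside lemmatize_sentence can be isolated as a small helper, which makes the mapping easier to inspect. The helper name <code>penn_to_wordnet</code> and the sample (token, tag) pairs are hypothetical; the mapping itself mirrors the function above.</p>

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet POS letter used by lemmatize_sentence."""
    if tag.startswith("NN"):
        return "n"  # nouns
    if tag.startswith("VB"):
        return "v"  # verbs
    return "a"      # every other tag is treated as an adjective

# Hypothetical (token, tag) pairs in the shape returned by pos_tag()
tagged = [("vaccines", "NNS"), ("arrived", "VBD"), ("happier", "JJR"), ("today", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Note that adverbs and all remaining tags fall into the adjective branch. WordNet also accepts 'r' for adverbs, so a finer mapping is possible if that distinction turns out to matter.</p>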

In [None]:
print(lemmatize_sentence(tt.tokenize(tweet_text[0])))
print('\n{}'.format(tweet_text[0]))

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Function to Clean and Remove Noise</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Here a complete function for cleaning the text and removing noise is implemented. The function takes tokenized tweet text and a list of stopwords as inputs and returns lemmatized tokens with the noise removed. Noise here refers to text that provides no insight into sentiment, such as certain symbols, hashtags, URLs, and words that mean essentially the same thing as other words but cannot be merged by lemmatization (namely because they are slang). Regex substitutions handle most of the noise correction, and all text is converted to lower case. As a final step, the literal '...' token is removed by direct comparison, because an unescaped '...' pattern in a regular expression matches any three characters.
</p>

In [None]:
#look at raw tokens prior to cleaning to get an idea of what needs to go
#join with a space so the last word of one tweet is not fused with the first word of the next
fdist=FreqDist(tt.tokenize(' '.join(tweet_text)))

fdist.most_common(50)

In [None]:
#Function to clean tweet text - discards tokens and characters that are incompatible with analyses and merges inconsequential redundant terminology 
def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)
        token = re.sub("#pfizerbiontech","", token.lower())
        token = re.sub("…","", token.lower())
        token = re.sub("#covid19","covid19", token.lower())
        token = re.sub("’","", token.lower())
        token = re.sub("#vaccine","vaccine", token.lower())
        token = re.sub("#covidvaccine","vaccine", token.lower())
        token = re.sub("pfizer","", token.lower())
        token = re.sub("#pfizer","", token.lower())
        token = re.sub("covid","covid19", token.lower())
        token = re.sub("#pfizervaccine","vaccine", token.lower())
        token = re.sub("covid19-19","covid19", token.lower())
        token = re.sub("covid1919","covid19", token.lower())
        token = re.sub("&amp","", token.lower())
        token = re.sub(r"^amp$","", token.lower())  #anchored so that words such as 'example' are left intact
        token = re.sub("#vaccine","vaccine", token.lower())
        token = re.sub("vaccination","vaccine", token.lower())
        token = re.sub("vaccine.","vaccine", token.lower())
        token = re.sub("#coronavirus","covid19", token.lower())
        token = re.sub("vaccinate","vaccine", token.lower())
        token = re.sub("#moderna","", token.lower())
        token = re.sub("coronavirus","covid19", token.lower())
        token = re.sub("covid19-19","covid19", token.lower())
        token = re.sub("covid19vaccine","vaccine", token.lower())
        token = re.sub("vaccined","vaccine", token.lower())
        token = re.sub("#covid19vaccine","vaccine", token.lower())
        token = re.sub("#vaccine","vaccine", token.lower())
        token = re.sub("-biontech","", token.lower())
        token = re.sub("#astrazeneca","", token.lower())
        token = re.sub("#covid19","covid19", token.lower())
        token = re.sub("#covid19_19","covid19", token.lower())
        token = re.sub("covid19_19","covid19", token.lower())
        token = re.sub("/biontech","", token.lower())
        token = re.sub("vaccine.","vaccine", token.lower())
        token = re.sub("#biontech","", token.lower())
        token = re.sub("#ergotron","", token.lower())

        #Lemmatize tokens to root words
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())

    #Discard literal '...' tokens with a plain string comparison; as a regex, '...' would match any three characters.
    return [token for token in cleaned_tokens if token != '...']

In [None]:
#load stopwords (these were downloaded using NLTK in the setup cell above)
stop_words = stopwords.words('english')

#display output after cleaning and before for comparison
print(remove_noise(tt.tokenize(tweet_text[0]), stop_words))
print('\n{}'.format(tweet_text[0]))
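
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">As an alternative to a long chain of substitutions, the same kind of cleaning can be sketched in a table-driven style. This is a different technique from remove_noise (exact token matching rather than substring substitution), shown under stated assumptions: <code>DROP_PATTERNS</code>, <code>MERGE_TERMS</code>, and <code>normalize_token</code> are hypothetical names, and the table holds only a small subset of the terms merged above.</p>

```python
import re

# Patterns that delete a token outright (URLs and @mentions)
DROP_PATTERNS = [re.compile(r"https?://\S+"), re.compile(r"@\w+")]

# Hypothetical subset of the notebook's term merges; keys are lower-case tokens
MERGE_TERMS = {
    "#covid19": "covid19",
    "covid": "covid19",
    "coronavirus": "covid19",
    "#vaccine": "vaccine",
    "vaccination": "vaccine",
}

def normalize_token(token):
    """Lower-case a token, drop URLs/mentions, and merge known synonyms."""
    token = token.lower()
    for pattern in DROP_PATTERNS:
        token = pattern.sub("", token)
    # Exact-match lookup: tokens not in the table pass through unchanged
    return MERGE_TERMS.get(token, token)

tokens = ["@WHO", "Coronavirus", "vaccination", "works", "https://t.co/abc123"]
cleaned = [t for t in map(normalize_token, tokens) if t]
print(cleaned)
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Because each rule matches a whole token, the order of the rules no longer matters and one rule cannot accidentally rewrite part of an unrelated word.</p>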

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">WordCloud Figure</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Here the English tweet text is passed through the tokenizer and the noise-removal function, and a wordcloud figure is generated from the output. The larger a term appears in the wordcloud, the more frequently it is used across all tweets combined. Visible terms such as happy, grateful, great, and hope indicate positive sentiment about the vaccine. The only term that stands out as negative is side effect. The majority of terms appear to be neutral.
</p>

In [None]:
#clean tweet text and create bag of words for wordcloud figure
temp_text = ' '.join(tweet_text_en)
clean_tweets = remove_noise(tt.tokenize(temp_text), stop_words)
wordcloud_text = ' '.join(clean_tweets)

#specify circular mask
char_mask = np.array(Image.open('../input/circle-mask/circle.png'))

#create figure
wordcloud=WordCloud(background_color='black', mask=char_mask).generate(wordcloud_text)
fig = plt.figure(figsize=[20,20])
plt.imshow(wordcloud)
plt.axis('off')
plt.show();

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Frequency Distribution of Common Terms</p>

In [None]:
fdist=FreqDist(clean_tweets)

#split the (term, count) pairs into two parallel sequences for plotting
terms, counts = zip(*fdist.most_common(40))

plt.figure(figsize=[25,20]);
plt.title('40 most commonly used terms in Pfizer COVID19 Vaccine Tweets', fontsize=30, fontweight='bold')
sns.barplot(x = list(counts), y = list(terms), palette='hsv');
plt.yticks(fontsize=18, fontweight='bold');
plt.xticks(fontsize=16);
plt.xlabel('Count', fontsize=20, fontweight='bold');

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Term Lookup using FreqDist()</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Below is a demonstration of how to look up specific terms using the FreqDist function. This is useful when the key terms under investigation are known in advance.</p>

In [None]:
#look up the frequency of selected terms in the cleaned English tweets
lookup_terms = ["happy", "death", "hope", "grateful", "scared", "stress", "sad", "angry", "joy", "upset"]
print('Within the {} English-language tweets:'.format(len(tweet_text_en)))
for term in lookup_terms:
    print('    {} was used {} time(s).'.format(term.capitalize(), fdist[term]))
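
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">The lookup behavior can be tried on a small scale with collections.Counter, which nltk.FreqDist subclasses. The token list below is hypothetical and merely stands in for clean_tweets.</p>

```python
from collections import Counter

# Hypothetical cleaned tokens standing in for clean_tweets
tokens = ["happy", "vaccine", "hope", "happy", "dose", "hope", "happy"]

# FreqDist subclasses Counter, so indexing by term works the same way
freq = Counter(tokens)

for term in ["happy", "hope", "angry"]:
    # Terms that never occur return 0 instead of raising KeyError
    print(f"{term!r} was used {freq[term]} time(s)")
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Returning 0 for absent terms is what allows a fixed list of key terms to be checked without first testing whether each one occurs.</p>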

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Analysis of Positive and Negative Lexicons</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Here the tweets will be analyzed for positive and negative sentiment terms and frequency plots generated. The sentiment lexicons employed in this analysis were obtained at <a href=https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets>https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets</a>, 
which were compiled through the work described in:</p>
<p style="font-size:16px; font-family:'Calibri Light'; line-height:1.1">
Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA
</p>

In [None]:
#load positive and negative words, one term per line (recent pandas versions reject sep="\n" in read_csv)
def load_lexicon(path):
    with open(path, encoding="ISO-8859-1") as f:
        return [line.strip() for line in f if line.strip()]
neg_lex = load_lexicon('../input/lexicons/negative-lexicons.txt')
pos_lex = load_lexicon('../input/lexicons/positive-lexicons.txt')

In [None]:
#search through cleaned tweets for positive tokens and return index numbers
#(set membership replaces the nested loop over every lexicon entry)
pos_set = set(pos_lex)
pos_ind = [i for i, token in enumerate(clean_tweets) if token in pos_set]

#search through cleaned tweets for negative tokens and return index numbers
neg_set = set(neg_lex)
neg_ind = [i for i, token in enumerate(clean_tweets) if token in neg_set]

In [None]:
#create dataframe of positive token counts
pos_tok = []
for i in pos_ind:
    pos_tok.append(clean_tweets[i])
blah = pd.DataFrame(np.unique(pos_tok, return_counts=True)[0], columns = ['word'])
blah['count'] = np.unique(pos_tok, return_counts=True)[1]
blah.sort_values(by=['count'], ascending=False, inplace=True)
pos_tok = blah.copy()
del blah

#create dataframe of negative token counts
neg_tok = []
for i in neg_ind:
    neg_tok.append(clean_tweets[i])
blah = pd.DataFrame(np.unique(neg_tok, return_counts=True)[0], columns = ['word'])
blah['count'] = np.unique(neg_tok, return_counts=True)[1]
blah.sort_values(by=['count'], ascending=False, inplace=True)
neg_tok = blah.copy()
del blah

In [None]:
plt.figure(figsize=[25,20]);
plt.title('40 most commonly used positive terms in Pfizer COVID19 Vaccine Tweets', fontsize=30, fontweight='bold')
sns.barplot(x = pos_tok.head(40).iloc[:,1], y = pos_tok.head(40).iloc[:,0], palette='hsv');
plt.yticks(fontsize=18, fontweight='bold');
plt.xticks(fontsize=16);
plt.xlabel('Count', fontsize=20, fontweight='bold');
plt.ylabel('');
plt.xticks(fontsize=18);

In [None]:
plt.figure(figsize=[25,20]);
plt.title('40 most commonly used negative terms in Pfizer COVID19 Vaccine Tweets', fontsize=30, fontweight='bold')
sns.barplot(x = neg_tok.head(40).iloc[:,1], y = neg_tok.head(40).iloc[:,0], palette='hsv');
plt.yticks(fontsize=18, fontweight='bold');
plt.xticks(fontsize=16);
plt.xlabel('Count', fontsize=20, fontweight='bold');
plt.ylabel('');
plt.xticks(fontsize=18);

# <p style="font-size:24px; font-family:'Candara'; font-weight: bold">Collocation Analysis</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Below are the 50 most common trigram collocations, i.e., the sequences of three consecutive terms that occur most frequently in the cleaned tweet text.
</p>

In [None]:
#reuse the cleaned tokens; tt.tokenize() expects a string, not the list tweet_text_en
trigram_finder = nltk.collocations.TrigramCollocationFinder.from_words(clean_tweets)
trigram_finder.ngram_fd.most_common(50)
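
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">What the trigram count measures can be seen in a minimal standard-library sketch. It uses collections.Counter in place of NLTK's collocation finder, and the token list is hypothetical.</p>

```python
from collections import Counter

# Hypothetical cleaned tokens standing in for clean_tweets
tokens = ["first", "dose", "vaccine", "today", "first", "dose", "vaccine", "done"]

# Slide a window of width three over the tokens and count each triple
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

print(trigram_counts.most_common(1))
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Because the tokens of all tweets form one continuous sequence, some trigrams span the boundary between two tweets. This is worth keeping in mind when reading the low-frequency end of the list.</p>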

# <p style="font-size:28px; font-family:'Candara'; font-weight: bold">Sentiment Scoring Analysis</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Sentiment scores for each tweet are generated with the NLTK SentimentIntensityAnalyzer() (SIA), which uses the VADER lexicon. For each tweet, SIA returns a dictionary containing negative, neutral, positive, and compound scores. To simplify analysis, the scores are collected into lists and then compiled into a dataframe.
</p>

In [None]:
#initialize analyzer
sia = SentimentIntensityAnalyzer()

#initialize lists for storing scores
neg = []
neu = []
pos = []
compound = []

#loop through each tweet
for i in np.arange(0, len(tweet_text_en)):
    a = str(tweet_text_en[i])
    b = sia.polarity_scores(a)
    #store each score in respective list
    neg.append(b['neg'])
    neu.append(b['neu'])
    pos.append(b['pos'])
    compound.append(b['compound'])

#place scores into dataframe for easy analysis    
pol_stats = pd.DataFrame(
    {'neg': neg,
     'neu': neu,
     'pos': pos,
     'compound': compound
    })

In [None]:
#look at stats on score data
pol_stats['compound'].describe()
print('Compound Score Skew: {}'.format(round(pol_stats['compound'].skew(),3)))
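
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Compound scores are often summarized by sorting them into positive, neutral, and negative classes. The sketch below assumes the conventional VADER cut-offs of +0.05 and -0.05, which are not used elsewhere in this notebook; <code>label_compound</code> and the sample scores are hypothetical.</p>

```python
from collections import Counter

def label_compound(score, threshold=0.05):
    """Classify a VADER compound score using symmetric cut-offs (assumed convention)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores standing in for pol_stats['compound']
compound_scores = [0.6249, -0.4019, 0.0, 0.0258, -0.05]

labels = [label_compound(s) for s in compound_scores]
label_counts = Counter(labels)
print(label_counts)
```

<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">Applied to the dataframe, the same helper could be used as <code>pol_stats['compound'].apply(label_compound).value_counts()</code> to compare the number of positive and negative tweets directly.</p>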

In [None]:
#generate figure to describe scores

#create fontdict for axis labels
axlab2 = {'family': 'serif',
              'color': 'black',
              'weight': 'bold',
              'size': 16
         }

# create figure with 2 subplots (distribution on top, boxplot below)
fig = plt.figure(figsize=[15,6])
fig.suptitle("Sentiment Scores for Pfizer COVID-19 Vaccine Tweets", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
grid = plt.GridSpec(5, 1, wspace=0.3, hspace=0.1)

ax0 = plt.subplot(grid[0:4, 0]);
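#sns.distplot is deprecated; histplot with a KDE overlay gives the equivalent density view
sns.histplot(pol_stats['compound'], kde=True, stat='density', ax=ax0, color='dodgerblue');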
a = ax0.axvline(pol_stats['compound'].median(),color= "black", linestyle="--", label="median");
b = ax0.axvline(pol_stats['compound'].mean(),color= "red", linestyle="--", label="mean");
c = ax0.axvline(min(pol_stats['compound'].mean()+ 3 * pol_stats['compound'].std(), 1),color= "orange", linestyle="--", label="3sigma");
ax0.axvline(max(pol_stats['compound'].mean()- 3 * pol_stats['compound'].std(), -1),color= "orange", linestyle="--");
d = ax0.axvline(min(pol_stats['compound'].mean()+ 2 * pol_stats['compound'].std(), 1),color= "slategrey", linestyle="--", label="2sigma");
ax0.axvline(max(pol_stats['compound'].mean()- 2 * pol_stats['compound'].std(), -1),color= "slategrey", linestyle="--");
plt.tick_params(
    axis='x',          
    which='both',      
    bottom=False,      
    top=False,         
    labelbottom=False);
ax0.set_xlabel('', fontdict=axlab2);
ax0.set_xticks(np.arange(-1.5,1.6,0.5));
ax0.set_yticks([]);
plt.legend([a, b, d, c], ['median', 'mean','2sigma','3sigma'], loc='upper center', bbox_to_anchor=(0.92, 1), fontsize=14) 


ax1 = plt.subplot(grid[4, 0]);
sns.boxplot(x=pol_stats['compound'], ax=ax1, color='honeydew');
ax1.axvline(pol_stats['compound'].median(),color= "black", linestyle="--", label="median");
ax1.axvline(pol_stats['compound'].mean(),color= "red", linestyle="--", label="mean");
ax1.axvline(min(pol_stats['compound'].mean()+ 3 * pol_stats['compound'].std(), 1),color= "orange", linestyle="--", label="3sigma");
ax1.axvline(max(pol_stats['compound'].mean()- 3 * pol_stats['compound'].std(), -1),color= "orange", linestyle="--");
ax1.axvline(min(pol_stats['compound'].mean()+ 2 * pol_stats['compound'].std(), 1),color= "slategrey", linestyle="--", label="2sigma");
ax1.axvline(max(pol_stats['compound'].mean()- 2 * pol_stats['compound'].std(), -1),color= "slategrey", linestyle="--");
ax1.set_xlabel('Compound Score', fontdict=axlab2);
plt.xticks(fontsize=14);
ax1.set_xticks(np.arange(-1.5,1.6,0.5));
ax1.set_xticklabels([' ','-1.0','-0.5','0.0', '0.5', '1.0', ' '],fontdict={'color': 'black', 'size': 14});

# <p style="font-size:28px; font-family:'Candara'; font-weight: bold">Conclusion</p>
<p style="font-size:18px; font-family:'Calibri Light'; line-height:1.3">
    &nbsp;&nbsp;&nbsp;&nbsp;Based on the analysis in this notebook, and as demonstrated by the sentiment scores above, public sentiment toward the Pfizer COVID-19 vaccine is neutral to positive overall. There is clearly some negative sentiment about the vaccine; it concerns the death toll of COVID-19, worry, delay, and various possible side effects, such as soreness, headaches, and fatigue. However, the data show more positive than negative sentiment. Some of the standout positive sentiments about the vaccine are hope, thankfulness, gratefulness, approval, happiness, and trust.
</p>