# Pfizer Vaccine Tweets Analysis

<center>
    <img src="https://i.guim.co.uk/img/media/f9cb10ec580aab5cdb195116cf9e8496472cdd2d/0_227_5625_3375/master/5625.jpg?width=300&quality=85&auto=format&fit=max&s=590c232179cd32826ca9734144336a52">
</center>
<center> Pfizer's vaccine from The Guardian </center> 

[The Guardian](https://www.theguardian.com/world/2020/nov/27/hospitals-england-told-prepare-early-december-covid-vaccine-rollout-nhs)

In this notebook, we analyze tweets about the Pfizer vaccine. We will cover:

1. [Exploratory data analysis](#eda)
2. [Text mining](#tm)
3. [Sentiment analysis](#sm)
4. [Conclusion](#conc)

Let's start

# Load libraries and prepare the data

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import normaltest
from warnings import filterwarnings

In [1]:
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy
from spacy import displacy
from pprint import pprint 
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [1]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_val_predict
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.preprocessing import MinMaxScaler, LabelEncoder 
from xgboost import XGBRFClassifier
from sklearn.naive_bayes import MultinomialNB 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error, confusion_matrix, r2_score
# The plot_* helpers were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay, PrecisionRecallDisplay
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score, f1_score
from sklearn.metrics import recall_score, precision_score
    
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [1]:
sns.set(style='whitegrid')
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 10000)
filterwarnings('ignore')
pd.plotting.register_matplotlib_converters()
%matplotlib inline
print("Setup Complete")

In [1]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [1]:
!pip install tweet-preprocessor

In [1]:
tweets = pd.read_csv('/kaggle/input/pfizer-vaccine-tweets/vaccination_tweets.csv')

In [1]:
tweets.tail()

In [1]:
tweets.info()

## Some features explained
1. **user_name**: the name of the user, as they have defined it.
2. **user_location**: the user-defined location for this account's profile.
3. **user_description**: the user-defined UTF-8 string describing their account
4. **user_verified**: when true, indicates that the user has a verified account.
5. **user_followers**: the number of followers this account currently has. 
6. **user_friends**: the number of users this account is following.
7. **user_favorites**: the number of tweets this user has liked in the account's lifetime. 
8. **user_created**: the UTC datetime that the user account was created on twitter.
9. **hashtags**: the hashtags used in the tweet. A hashtag is any word or phrase immediately preceded by the # symbol; clicking or tapping on one shows other tweets containing the same keyword or topic.
10. **retweets**: the number of times the tweet has been retweeted, i.e. forwarded by a user to their followers.
11. **favorites**: the number of times the tweet has been liked (favorited).

**Checking missing values**

In [1]:
tweets.isnull().sum()[tweets.isnull().sum()>0]

Because user_location, user_description, hashtags and source are of object (string) type, we fill their missing values with a blank string using `.fillna`.

In [1]:
tweets.fillna(' ', inplace=True) #imputation

In [1]:
tweets.isnull().sum()[tweets.isnull().sum()>0]
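
As a minimal sketch of the imputation step above, the following uses a small made-up DataFrame (the `toy` frame and its values are assumptions, not the real dataset). It lists the columns with missing values, then fills only the non-numeric columns with a blank string so that numeric columns keep their dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tweets dataframe
toy = pd.DataFrame({
    'user_location': ['London', np.nan, 'Paris'],
    'hashtags': [np.nan, "['vaccine']", np.nan],
    'retweets': [0, 3, 1],
})

# Columns that contain at least one missing value, with their counts
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Fill only the non-numeric (text) columns with a blank string
text_cols = toy.select_dtypes(exclude='number').columns
toy[text_cols] = toy[text_cols].fillna(' ')

# Number of missing values left after imputation
remaining = int(toy.isnull().sum().sum())
print(remaining)
```

Restricting the fill to text columns is a design choice for this sketch; the notebook itself fills every column.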

**Convert object dates to datetime format**

In [1]:
tweets['user_created'] = pd.to_datetime(tweets['user_created'])
tweets['date'] = pd.to_datetime(tweets['date'])

**check**

In [1]:
tweets.info()

# Feature engineering

We will create new features:

1. **user_account_lifetime**: the number of years a user has been on Twitter.
2. **user_year_created**: the year in which the Twitter account was created.
3. **user_month_created**: the month in which the Twitter account was created.
4. **user_time_created**: the time of day at which the Twitter account was created.
5. **user_day_created**: the day of the week on which the Twitter account was created.
6. **user_date_created**: the date on which the Twitter account was created.

7. **user_year_write**: the year in which a user writes about the Pfizer vaccine.
8. **user_month_write**: the month in which a user writes about the Pfizer vaccine.
9. **user_day_write**: the day of the week on which a user writes about the Pfizer vaccine.
10. **user_time_write**: the time of day at which a user writes about the Pfizer vaccine.
11. **user_date_write**: the date on which a user writes about the Pfizer vaccine.

In [1]:
tweets['user_account_lifetime'] = tweets.date.dt.year - tweets.user_created.dt.year
tweets['user_year_created'] = tweets.user_created.dt.year
tweets['user_month_created'] = tweets.user_created.dt.month
tweets['user_time_created'] = tweets.user_created.dt.time
tweets['user_day_created'] = tweets.user_created.dt.dayofweek
tweets['user_date_created'] = tweets.user_created.dt.date

In [1]:
tweets['user_year_write'] = tweets.date.dt.year
tweets['user_month_write'] = tweets.date.dt.month
tweets['user_day_write'] = tweets.date.dt.dayofweek
tweets['user_time_write'] = tweets.date.dt.time
tweets['user_date_write'] = tweets.date.dt.date

**rename**

In [1]:
#rename month and weekday numbers to their names
# an explicit mapping does not depend on every value being present in the data
month_names = {1: 'january', 2: 'february', 3: 'march', 4: 'april', 5: 'may', 6: 'june', 7: 'july',
               8: 'august', 9: 'september', 10: 'october', 11: 'november', 12: 'december'}
# dt.dayofweek counts from monday=0 to sunday=6
day_names = {0: 'monday', 1: 'tuesday', 2: 'wednesday', 3: 'thursday', 4: 'friday', 5: 'saturday', 6: 'sunday'}

tweets['user_month_created'] = tweets['user_month_created'].map(month_names)
tweets['user_month_write'] = tweets['user_month_write'].map(month_names)
tweets['user_day_write'] = tweets['user_day_write'].map(day_names)
tweets['user_day_created'] = tweets['user_day_created'].map(day_names)
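
pandas can also produce month and weekday names directly from a datetime column. The sketch below, on made-up timestamps, uses `dt.month_name()` and `dt.day_name()`; the `toy` frame and the columns `month_name` and `day_name` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical timestamps standing in for the tweets' date column
toy = pd.DataFrame({'date': pd.to_datetime(['2020-12-15 10:30:00', '2021-01-04 08:00:00'])})

# Names come straight from the datetime accessor; lowercase them to match the notebook's labels
toy['month_name'] = toy['date'].dt.month_name().str.lower()
toy['day_name'] = toy['date'].dt.day_name().str.lower()

print(toy)
```

This avoids maintaining a list of names by hand, at the cost of depending on the default (English) locale.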

In [1]:
tweets.tail(2)

In [1]:
tweets.drop(columns=['id'], inplace=True)

<a id = 'eda'></a>

# Exploratory data analysis

## Descriptive analysis

In [1]:
tweets.describe()

**Interpretation**

We have (mean $\pm$ standard deviation, as reported by `describe()`):
1. $user\_followers = 63430 \pm 476213$: 50% of accounts have fewer than 606 followers, while one account reaches 13714930 followers.
2. $user\_friends = 1171 \pm 2469$: 50% of accounts follow fewer than 441 users, while one account follows 64441.

and so on...

In [1]:
#month in which the most twitter accounts were created
tweets.user_month_created.mode()

In [1]:
#day of the week on which the most twitter accounts were created
tweets.user_day_created.mode()

In [1]:
#year in which the most twitter accounts were created
tweets.user_year_created.mode()

In [1]:
#month in which users write the most about the pfizer vaccine
tweets.user_month_write.mode()

In [1]:
#day of the week on which users write the most about the pfizer vaccine
tweets.user_day_write.mode()

In [1]:
#source most used by users
tweets.source.mode()

In [1]:
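# keep numeric and boolean columns only: recent pandas versions raise an error on text columns
tweets.select_dtypes(include=['number', 'bool']).corr()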

1. $corr(retweets, favorites) = 0.837641$ **means that tweets with many favorites also tend to have many retweets.**
2. $corr(user\_year\_created, user\_account\_lifetime) = -0.9951$ **is expected by construction: user_account_lifetime is the tweet year minus user_year_created, so accounts created later have a shorter lifetime.**

The other features show only weak linear correlation.
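
To see why the lifetime and the creation year are so strongly anti-correlated, here is a small sketch on simulated data. The years are randomly generated assumptions, not the real tweets; only the way `user_account_lifetime` is computed mirrors the feature engineering cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Made-up creation years (2006 to 2020) and tweet years (2020 or 2021)
toy = pd.DataFrame({
    'user_year_created': rng.integers(2006, 2021, size=n),
    'user_year_write': rng.integers(2020, 2022, size=n),
})

# Same definition as in the notebook: tweet year minus creation year
toy['user_account_lifetime'] = toy['user_year_write'] - toy['user_year_created']

# Pearson correlation between creation year and lifetime
corr = toy['user_year_created'].corr(toy['user_account_lifetime'])
print(round(corr, 3))
```

Because the tweet year barely varies, the lifetime is almost a mirror image of the creation year, so the correlation carries little information about user behaviour.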

## Visualization: Distribution

### Twitter Accounts creation

We visualize the features.

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_month_created', data=tweets, hue='user_verified')
plt.title('Monthly Twitter accounts creation')
plt.show()

Whatever the month of creation, most accounts are not verified.

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_day_created', data=tweets, hue='user_verified')
plt.title('Daily Twitter accounts creation')
plt.show()

For every day of the week, most accounts are not verified.

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_year_created', data=tweets, hue='user_verified')
plt.title('Yearly Twitter accounts creation')
plt.show()

Among accounts created in 2009, only about half are verified.

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_account_lifetime', data=tweets, hue='user_verified')
plt.title('Twitter accounts lifetime')
plt.show()

Verified accounts are found mainly among users with an 11-year account lifetime.

**We learn**
1. Most Twitter accounts in the dataset are not verified.
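
The count plots show absolute numbers; a row-normalized cross-tabulation gives the share of verified accounts directly. The sketch below uses a made-up `toy` frame (an assumption, not the real data) with the same two column names.

```python
import pandas as pd

# Hypothetical accounts: creation month and verification flag
toy = pd.DataFrame({
    'user_month_created': ['january', 'january', 'january', 'march', 'march'],
    'user_verified': [False, False, True, False, False],
})

# Readable labels for the verification flag
status = toy['user_verified'].map({True: 'verified', False: 'not verified'})

# Share of verified / not verified accounts within each creation month (rows sum to 1)
share = pd.crosstab(toy['user_month_created'], status, normalize='index')
print(share)

# Overall share of verified accounts
overall_verified = toy['user_verified'].mean()
print(overall_verified)
```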

### When Twitter users write about the Pfizer vaccine

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_month_write', data=tweets, hue='user_verified')
plt.title('Monthly Twitter user write about Pfizer vaccine')
plt.show()

In December, people learned about the Pfizer vaccine rollout and started writing about it.

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_day_write', data=tweets, hue='is_retweet')
plt.title('Daily Twitter user write about Pfizer vaccine')
plt.show()

Tuesday and Wednesday are the two days on which users write the most about the Pfizer vaccine.

In [1]:
plt.figure(figsize=(15,5))
sns.countplot(x='user_year_write', data=tweets, hue='is_retweet')
plt.title('Yearly Twitter user write about Pfizer vaccine')
plt.show()

In [1]:
plt.figure(figsize=(5,10))
sns.countplot(y='source', data=tweets)
plt.title('Source')
plt.show()

Most tweets are sent from the Twitter apps for Android and iPhone and from the web app.

## Correlation plot

In [1]:
plt.figure(figsize=(15,5))
sns.regplot(x='favorites', y='retweets', data=tweets)  # recent seaborn versions require x and y as keywords
plt.title('correlation feature plot')
plt.show()

In [1]:
plt.figure(figsize=(15,5))
sns.regplot(x='user_account_lifetime', y='user_year_created', data=tweets)  # recent seaborn versions require x and y as keywords
plt.title('correlation feature plot')
plt.show()

user_year_created and user_account_lifetime are strongly negatively correlated.

## Time series

In [1]:
sns.catplot(x='user_account_lifetime', y='user_followers', hue='user_verified', data=tweets,kind="swarm")
plt.show()

Only verified accounts have very large follower counts.

In [1]:
tweets.plot(x='user_time_write', y='favorites', figsize=(15,5), title='favorites time series')
plt.show()

In [1]:
tweets.plot(x='user_date_write', y='favorites', figsize=(15,5), title='favorites time series')
plt.show()

**Summary**
1. Tweets with many favorites also tend to have many retweets.
2. user_account_lifetime and user_year_created are strongly negatively correlated, which follows from how the lifetime is computed (tweet year minus creation year).
3. Most Twitter accounts in the dataset are not verified.
4. In December, people learned about the Pfizer vaccine rollout and started writing about it.
5. Most tweets are sent from the Twitter apps for Android and iPhone and from the web app.
6. Only verified accounts have very large follower counts.

<a id = 'tm'></a>

# Text mining

In [1]:
import preprocessor as p

In [1]:
def tokenizer(sentence):
    import string
    from spacy.lang.en import English
    from spacy.lang.en.stop_words import STOP_WORDS

    # Create our list of punctuation marks
    punctuations = string.punctuation

    # Create our list of stopwords
    # (the unused spacy.load('en') call is dropped: that shortcut no longer exists in spaCy 3)
    stop_words = STOP_WORDS

    # Blank English pipeline: tokenizer only, without tagger, parser or NER
    parser = English()

    #clean tweet text
    sentence = p.clean(sentence)

    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lowercase each token, using its lemma when the pipeline provides one.
    # "-PRON-" is the spaCy 2 pronoun placeholder; an empty lemma means no lemmatizer ran.
    mytokens = [word.lemma_.lower().strip() if word.lemma_ not in ("", "-PRON-") else word.lower_
                for word in mytokens]

    # Removing stop words and punctuation
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]

    # return preprocessed list of tokens
    return mytokens

In [1]:
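def get_bigrams(corpus, n=None):
    # Despite its name, this function counts 3-grams (ngram_range=(3, 3))
    # and returns the n most frequent ones together with their counts
    vec = CountVectorizer(tokenizer=tokenizer, ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # split the (term, count) pairs into two parallel lists
    x, y = map(list, zip(*words_freq[:n]))

    return x, y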
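
To illustrate what the n-gram counting does, here is a minimal sketch on a made-up three-sentence corpus. It uses scikit-learn's default tokenizer rather than the spaCy-based `tokenizer` above, so it runs without spaCy or tweet-preprocessor; the corpus and variable names are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus
corpus = [
    'pfizer biontech vaccine dose',
    'pfizer biontech vaccine news',
    'first vaccine dose today',
]

# ngram_range=(3, 3) keeps only sequences of exactly three consecutive tokens
vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)

# Total count of every trigram over the whole corpus
counts = np.asarray(vec.transform(corpus).sum(axis=0)).ravel()

# Pair every trigram with its count and sort, most frequent first
freq = sorted(((term, int(counts[idx])) for term, idx in vec.vocabulary_.items()),
              key=lambda pair: pair[1], reverse=True)

top_term, top_count = freq[0]
print(top_term, top_count)
```

Keeping terms and counts together as pairs until the very end makes it harder to mix them up when sorting or plotting.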

In [1]:
text = tweets.loc[:, ['user_location','user_description','text','hashtags']]

In [1]:
text.tail()

## Where are users located?

In [1]:
local = ' '.join(u for u in list(text.user_location))

In [1]:
token_local = tokenizer(local)

In [1]:
from collections import Counter

In [1]:
count_token = Counter(token_local)

In [1]:
#show 30 most common
pprint(count_token.most_common(30))

## Word cloud: user description

In [1]:
from wordcloud import WordCloud

In [1]:
#clean tweet
desc = p.clean(' '.join([u for u in text.user_description]))

In [1]:
wordcloud = WordCloud().generate(' '.join(u for u in tokenizer(desc)))

In [1]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('User description word cloud')
plt.axis('off')
plt.show()

Some users are interested in:
1. Opinion
2. nurse
3. love
4. life
5. clinical
6. science
7. and so on ...

### 3-gram user description

In [1]:
desc_x, desc_y = get_bigrams(np.array([desc]), 30)

In [1]:
plt.figure(figsize=(10,20))
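# keep each 3-gram paired with its own count (sorting both lists separately would mix them up);
# reversing puts the most frequent 3-gram at the top of the chart
plt.barh(desc_x[::-1], desc_y[::-1])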
plt.title('3 grams user description')
plt.show()

## Wordcloud: hashtags

In [1]:
hashtag = p.clean(' '.join(u for u in text.hashtags))
w_hash = WordCloud().generate(' '.join(u for u in tokenizer(hashtag)))

In [1]:
plt.figure(figsize=(14,10))
plt.imshow(w_hash, interpolation='bilinear')
plt.title('User hashtags word cloud')
plt.axis('off')
plt.show()

The most prominent hashtags are Vaccine, covid19, pfizer and biontech.

## Text analysis

In [1]:
# we clean the tweet text using tweet-preprocessor
def preprocess_tweet(sent):
    return p.clean(sent['text'])

In [1]:
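def word_frequence(dtext, n=1):
    # TF-IDF scores of the n-grams of dtext (used here with a single concatenated document)
    tfvector = TfidfVectorizer(tokenizer=tokenizer, ngram_range=(n, n))
    transformed_text = tfvector.fit_transform(dtext)
    transformed_text_as_array = transformed_text.toarray()

    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    terms = tfvector.get_feature_names_out()

    # construct a dataframe of (term, score) for the last document,
    # which is the only document when dtext holds a single string
    tf_idf_tuples = list(zip(terms, transformed_text_as_array[-1]))
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples,
    columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

    return one_doc_as_df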

In [1]:
text['text'] = text.apply(preprocess_tweet, axis=1)

In [1]:
dtext = text.text

### Word frequency

We compute word frequencies to see which words users write most in their tweets.

**1-gram**

In [1]:
#we transform our n-dim text to 1-dim text
tw_text = np.array([" ".join([u for u in dtext])])

In [1]:
word_freq = word_frequence(tw_text)

In [1]:
word_freq.head()

In [1]:
word_freq[:10].plot(kind="bar", x='term',y='score', figsize=(15,5))
plt.ylabel('frequency')
plt.title('word frequency')
plt.show()

Vaccine is the word users write most, followed by the others shown in the plot.
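
Note that `word_frequence` applies `TfidfVectorizer` to a single concatenated document. In that situation every inverse-document-frequency weight equals 1, so the scores are simply the term counts divided by their Euclidean norm. The sketch below checks this on a made-up one-document corpus with scikit-learn's default settings (the document and variable names are assumptions).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical single-document corpus
doc = ['vaccine vaccine dose']

# Raw term counts and TF-IDF scores for the same document
counts = CountVectorizer().fit_transform(doc).toarray()[0].astype(float)
tfidf = TfidfVectorizer().fit_transform(doc).toarray()[0]

# With one document the idf weights are all 1, leaving only the L2 normalization
normalized_counts = counts / np.linalg.norm(counts)

print(tfidf)
print(normalized_counts)
```

The ranking of terms is therefore the same as with raw counts; only the scale of the score axis differs.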

### collocation

Here, we identify words that commonly co-occur in the tweet text.

**2-grams**

In [1]:
two_grams = word_frequence(tw_text, n=2)

In [1]:
two_grams.head()

In [1]:
two_grams[:30].plot(kind="bar", x='term', y='score', figsize=(15,5))
plt.ylabel('frequency')
plt.title('30 most common 2-grams ')
plt.show()

This graph shows very well that users mostly write **covid-19 vaccine, pfizer biontech, 1 dose, ...** and that the word vaccine co-occurs with many other words in the text.

**3-grams**

Let's look at the trigrams.

In [1]:
tri_grams = word_frequence(tw_text, n=3)

In [1]:
tri_grams.head()

In [1]:
tri_grams[:30].plot(kind="bar", x='term', y='score', figsize=(15,5))
plt.ylabel('frequency')
plt.title('30 most common 3-grams')
plt.show()

We can identify five important words: **Pfizer, Biontech, Coronavirus, Vaccine, Covid19**

## Word cloud

In [1]:
cloud = WordCloud().generate(" ".join([u for u in dtext]))

In [1]:
plt.figure(figsize=(15,10))
plt.imshow(cloud, interpolation='bilinear')
plt.title('Tweet text word cloud')
plt.axis('off')
plt.show()

Keywords: **covid, vaccine, dose, pfizer, biontech**

<a id='sm'></a>

# Sentiment Analysis

In this section, we use the TextBlob model.

## TextBlob

In [1]:
from textblob import TextBlob

In [1]:
docs = [TextBlob(u) for u in dtext]

In [1]:
len(docs)

In [1]:
docs[1].tags

In [1]:
docs[0].noun_phrases

In [1]:
sentiment_polarity = [u.sentiment.polarity for u in docs]

In [1]:
sentiment_subjectivity = [u.sentiment.subjectivity for u in docs]

In [1]:
text['polarity'] = ['positive' if score >= 0.1 else 'negative' for score in sentiment_polarity]
text['subjective'] = sentiment_subjectivity
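
TextBlob polarity scores range from -1 to 1. The sketch below applies the same threshold rule as the cell above to a few made-up scores; `label_polarity` is a hypothetical helper introduced only for illustration. With a threshold of 0.1, a neutral score of 0 ends up in the negative class, which is worth keeping in mind when reading the percentages.

```python
def label_polarity(score, threshold=0.1):
    """Return 'positive' when score >= threshold, otherwise 'negative'."""
    return 'positive' if score >= threshold else 'negative'


# Made-up polarity scores, including a neutral one (0.0)
examples = [-0.5, 0.0, 0.05, 0.1, 0.8]
labels = [label_polarity(score) for score in examples]

for score, label in zip(examples, labels):
    print(score, label)
```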

In [1]:
text.head()

In [1]:
text.tail()

We are going to analyse our result below.

In [1]:
text.polarity.value_counts().plot(kind='pie', figsize=(15,5))
plt.title('Sentiment opinion about Pfizer-Biontech vaccine')
plt.show()

In [1]:
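val_count = text.polarity.value_counts()
# index by label so each percentage matches its class, whichever class is more frequent
p1 = 100*(val_count['negative']/val_count.sum())
p2 = 100*(val_count['positive']/val_count.sum())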

In [1]:
print(f'The percentage of users who have negative opinion about Pfizer-Biontech vaccine is: {p1}.')
print(f'The percentage of users who have positive opinion about Pfizer-Biontech vaccine is: {p2}.')
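
The class percentages can also be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up label series (the `polarity` values below are assumptions, not results from the dataset):

```python
import pandas as pd

# Hypothetical polarity labels
polarity = pd.Series(['negative', 'negative', 'negative', 'positive', 'positive'])

# normalize=True returns proportions; multiply by 100 for percentages
pct = polarity.value_counts(normalize=True) * 100

# Access by label rather than by position
negative_pct = pct['negative']
positive_pct = pct['positive']
print(pct.round(1))
```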

In [1]:
text.polarity.value_counts().plot(kind='bar', figsize=(15,5))
plt.title('Sentiment opinion about Pfizer-Biontech vaccine')
plt.show()

Okay, we can now see the subjectivity of all users.

In [1]:
plt.figure(figsize=(15,5))
sns.violinplot(y='polarity', x='subjective', data=text, hue='polarity')
plt.title('Objective and subjective opinion')
plt.show()

In [1]:
opinion = pd.pivot_table(text, values='subjective', index='text',columns='polarity')

In [1]:
opinion.describe()

1. The median subjectivity of the negative tweets is zero, which means that at least half of the tweets labeled negative are objective. Many users are not comfortable with the Pfizer-BioNTech vaccine.
2. The median subjectivity of the positive tweets is 0.5, which means that half of the tweets labeled positive are at least moderately subjective. This may show that these users remain in doubt about the Pfizer vaccine.
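
A per-class summary of subjectivity can also be computed by grouping on the label, without pivoting on the tweet text. The sketch below uses made-up values (the `toy` frame is an assumption) with the same column names as `text`.

```python
import pandas as pd

# Hypothetical labels and subjectivity scores
toy = pd.DataFrame({
    'polarity': ['negative', 'negative', 'negative', 'positive', 'positive', 'positive'],
    'subjective': [0.0, 0.0, 0.4, 0.3, 0.5, 0.9],
})

# Full descriptive statistics per class (count, mean, std, quartiles, ...)
summary = toy.groupby('polarity')['subjective'].describe()
print(summary)

# Median subjectivity per class
medians = toy.groupby('polarity')['subjective'].median()
print(medians)
```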

<a id='conc'></a>

# Conclusion 

What we can say about Pfizer's vaccine is:

1. **60%** of tweets are labeled negative, which suggests that these users do not trust Pfizer's vaccine, and over **50%** of the users with **negative opinions** are **objective** about what they say, while those who have a **positive opinion (40%)** are **subjective** about what they write.
2. Therefore, the Pfizer-BioNTech vaccine does not appear to be welcome.

# Sentiment classification

Now that we know the percentage of negative and positive opinions about the Pfizer-BioNTech vaccine, we can automatically identify whether a user's tweet is positive or negative. Let's create a classification model using TextBlob.

In [1]:
#divide our text to trains and test
ptext = text[['text','polarity']]
xtrain, xtest, ytrain, ytest = train_test_split(ptext.text, ptext.polarity, stratify=ptext.polarity,
                                                random_state=0, test_size=0.2)

In [1]:
#let's create a function which returns the data as a list of (text, label) tuples
def create_train_test(x,y):
    # pair each text with its label, the format expected by TextBlob classifiers
    return list(zip(x, y))

In [1]:
train = create_train_test(xtrain, ytrain)
test = create_train_test(xtest, ytest)

In [1]:
train[:3]

In [1]:
test[:3]

In [1]:
from textblob.classifiers import NaiveBayesClassifier

In [1]:
cl = NaiveBayesClassifier(train)

In [1]:
# evaluation
cl.accuracy(test)
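
For comparison, a similar Naive Bayes baseline can be sketched with scikit-learn. This is a different implementation from TextBlob's `NaiveBayesClassifier` (multinomial Naive Bayes on word counts), swapped in here only for illustration, and the tiny training set below is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled tweets
train_texts = [
    'great vaccine good news',
    'happy safe effective vaccine',
    'bad side effects',
    'terrible painful reaction',
]
train_labels = ['positive', 'positive', 'negative', 'negative']

# Bag-of-words counts followed by multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict the label of two unseen made-up tweets
predictions = list(model.predict(['good effective news', 'terrible side effects']))
print(predictions)
```

With real data, the same pipeline could be evaluated on the held-out split with `accuracy_score` or `classification_report`.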

In [1]:
cl.show_informative_features(10)

### Conclusion

Our model performs reasonably well, so we can use it for sentiment classification. From the informative features, we can infer the sentiment of a user when these words appear in their tweets.

**Feel free to share and download this notebook. I hope that it helps everyone understand the opinions about the Pfizer-BioNTech vaccine.**