# *PFIZER Vaccine Tweets Analysis.*

* This notebook analyzes the publicly available Pfizer vaccine tweets dataset from Kaggle.
* First, I load the data.
* Then I inspect the columns and the number of rows.
* Next comes Data Engineering and Data Cleaning.
* Finally, the Data Analysis.

# Data Sourcing and Understanding

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [None]:
df = pd.read_csv("../input/pfizer-vaccine-tweets/vaccination_tweets.csv")

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

# Data Engineering and Data Cleaning

In [None]:
plt.figure(figsize=(14,8))
sns.heatmap(df.isnull(), cbar=False, cmap='magma')
plt.xlabel('Columns')
plt.title('Missing Values Exploration by Columns')

1. Columns with yellow lines contain missing values.
2. These values need a closer look to decide whether they can be filled or whether the columns should be dropped entirely.

In [None]:
msno.bar(df, color='darkblue')

* Bar representation of the same information, showing precisely which columns have missing values and in what quantities.
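The same information can also be read off as plain numbers. Below is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "user_location": ["Portugal", np.nan, "Colorado, USA", np.nan],
    "user_description": ["Nurse", "Dad", np.nan, "Student"],
    "hashtags": [np.nan, "['PfizerBioNTech']", np.nan, np.nan],
    "text": ["a", "b", "c", "d"],
})

# Number of missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)

# Keep only the columns that actually have gaps.
missing = missing[missing > 0]
print(missing)
```

Running the same two lines on `df` instead of `toy` should list the three columns handled in the next section.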

# Dealing with Missing Columns: user_location, user_description & Hashtags

# User Location

In [None]:
df['user_location'].count()

In [None]:
df['user_location'].nunique()

In [None]:
df['user_location'].unique()

* I first look at the exact count of the user_location column.
* The exact count is 1734. This means that 1734 rows have some value in the user_location column.
* The number of unique values in this column is 778. This means there are 778 different user locations.
* After analyzing the unique values in user_location, I found that many people have not entered their real location. Some rows say 'YOUR BED' or 'MORON CANADA', which shows that this column does not hold much valuable information for our analysis.
* There is also no consistency in the data. Some values are 'Portugal' while others are 'Colorado, USA'.
* Therefore I have decided to fill the missing values with 'Not Reported'.
* We could also delete this column, but I chose to keep it and fill in the values.
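To make the relationship between these numbers concrete, here is a small sketch on a made-up Series. The `locations` values are hypothetical; the point is only that `count()` and `nunique()` both skip missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of free-text locations, including missing entries.
locations = pd.Series(
    ["Portugal", "Colorado, USA", "YOUR BED", np.nan, "Portugal", np.nan]
)

non_missing = locations.count()    # count() skips NaN
n_unique = locations.nunique()     # nunique() also skips NaN by default
n_missing = locations.isna().sum()

# Non-missing plus missing always adds up to the number of rows.
assert non_missing + n_missing == len(locations)

# Filling turns every gap into one shared placeholder category.
filled = locations.fillna("Not Reported")
print(non_missing, n_unique, n_missing)
print(filled.value_counts())
```

After the fill, 'Not Reported' shows up as a category of its own in `value_counts()`, so it should be excluded from any later location ranking.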

In [None]:
df['user_location'] = df['user_location'].fillna('Not Reported')

# User Description

In [None]:
df['user_description'].nunique()

In [None]:
df['user_description'].count()

* Descriptions are free text, so they differ from one user to the next.
* Having a placeholder value in this column does not cause any problem for the analysis.
* We will not delete the column but will fill in the missing information.
* We will fill in the missing values with "Not Available".

In [None]:
df['user_description'] = df['user_description'].fillna('Not Available')

# HashTags

In [None]:
df['hashtags'].count()

* Hashtags are optional in tweets.
* The column has missing values because people simply did not put hashtags in their tweets.
* To deal with the null values, we will fill them with 'No HashTag'.
* We cannot delete this column as it holds valuable information.

In [None]:
df['hashtags'] = df['hashtags'].fillna('No HashTag')

# One final dropna on the entire dataset to delete any null values that were not dealt with.

In [None]:
df.dropna(inplace=True)

In [None]:
msno.bar(df, color='darkred')

# Data is now Cleaned!

# User Created and Date column Data Type 

In [None]:
df['user_created'] = pd.to_datetime(df['user_created'])
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.info()

# We will now begin our Analysis!

In [None]:
df['date'] = df['date'].dt.date

In [None]:
date = df.groupby('date').count().reset_index()

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# marker (singular) draws a point at each day; markers only applies together with style
g = sns.lineplot(x='date', y='id', data=date, color='orange', marker="o")
sns.despine(left=True)
g.set_title('Tweets per Day', fontsize=25)
g.set_xlabel('Date', fontsize=20)
g.set_ylabel('Number of Tweets', fontsize=20)


* Date-wise analysis of the data. It shows one spike in the number of tweets between 2020-12-13 and 2020-12-17.
* A second spike occurs between 2020-12-17 and 2020-12-25.
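The daily counts behind such a plot can be sketched with `groupby(...).size()`. The timestamps below are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical timestamps standing in for the `date` column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "date": pd.to_datetime([
        "2020-12-13 08:00", "2020-12-13 14:20", "2020-12-13 21:30",
        "2020-12-14 10:15",
        "2020-12-15 09:00", "2020-12-15 23:59",
    ]),
})

# size() counts rows per group, regardless of missing values in other columns.
per_day = toy.groupby(toy["date"].dt.date).size()

# The day with the most tweets.
busiest_day = per_day.idxmax()
print(per_day)
print(busiest_day, per_day.max())
```

Unlike `count()`, which counts non-null values per column, `size()` returns a single number per day, which makes the intent a little more explicit.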

In [None]:
# rename_axis + reset_index(name=...) gives the same column names on every pandas version
htags = df['hashtags'].value_counts().rename_axis('Hashtags').reset_index(name='counts')
htags = htags[htags['Hashtags'] != 'No HashTag'][:10]

In [None]:
sns.set_style('white')
plt.figure(figsize=(12,8))
h=sns.barplot(x='Hashtags', y='counts', data=htags, palette='magma_r')
sns.despine(left=True)
h.set_title('Top 10 Hashtag Most Used', fontsize=25)
h.set_xlabel('#Hashtags', fontsize=20)
h.set_ylabel('Number of Tags', fontsize=20)
plt.xticks(rotation=90)

* It is surprising to see that 6 of the top 10 hashtag entries include PfizerBioNTech, with the top one used in around 250 tweets.
* Moderna makes it into the 10th spot, in an entry it shares with PfizerBioNTech.
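Because each row's hashtags are counted as one combined value, the same tag can appear in several of the top entries. If the goal is to count individual tags instead, one possible approach is sketched below. It assumes that each non-missing entry is a stringified Python list such as `"['PfizerBioNTech', 'COVID19']"`; that format is an assumption and should be checked against `df['hashtags'].head()` first. The `raw` values are made up.

```python
import ast
import pandas as pd

# Assumption: each non-missing entry is a stringified Python list.
raw = pd.Series([
    "['PfizerBioNTech']",
    "['PfizerBioNTech', 'COVID19']",
    "['Moderna', 'PfizerBioNTech']",
    None,
    "['COVID19']",
])

# Turn each string into a real list; missing entries are dropped first.
parsed = raw.dropna().apply(ast.literal_eval)

# explode() gives one row per tag, so combinations no longer count separately.
tag_counts = parsed.explode().value_counts()
print(tag_counts)
```

On the cleaned `df`, rows holding the 'No HashTag' placeholder would need to be filtered out before parsing, since that string is not a list literal.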

In [None]:
top_10 = htags['Hashtags']
retweet = df[df['hashtags'].isin(top_10)]

sns.set_style('white')
plt.figure(figsize=(12,8))
k=sns.barplot(x='hashtags', y='retweets', data=retweet, color='red')
sns.despine(left=True)
k.set_title('Top 10 Hashtag Most Retweeted', fontsize=25)
k.set_xlabel('#Hashtags', fontsize=20)
k.set_ylabel('Number of Retweets', fontsize=20)
plt.xticks(rotation=90)

* It seems that the most used hashtags were not retweeted at the same rate.
* The most retweeted hashtag entry was COVID19 together with PfizerBioNTech.
* The hashtags are different, but they tell the same story as the previous visual: people are talking about the Pfizer vaccine and retweeting it.
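One detail worth keeping in mind when reading this chart: with its default settings, `sns.barplot` plots the mean of `y` for each category, so the bars show average retweets per tweet rather than total retweets. The sketch below uses invented numbers to show how the two can rank hashtags differently.

```python
import pandas as pd

# Hypothetical data: hashtag A is used often, hashtag B only once.
toy = pd.DataFrame({
    "hashtags": ["A", "A", "A", "A", "B"],
    "retweets": [0, 1, 0, 3, 10],
})

# Usage count, mean retweets per tweet, and total retweets for each hashtag.
summary = toy.groupby("hashtags")["retweets"].agg(["count", "mean", "sum"])
print(summary)
```

A hashtag used only once with a single popular tweet can therefore top the mean-based chart, which is one possible reason why the most used hashtags are not the most retweeted ones here.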

In [None]:
df['user_created'] = df['user_created'].dt.date

In [None]:
sns.set_style('dark')
plt.figure(figsize=(12,12))
b = sns.scatterplot(x='user_created', y='id', data=df, palette='magma', hue='user_verified', size='user_verified', sizes= (50,200), size_order=[True, False])
sns.despine(left=True)
b.set_title('Verified Account Check for Users!', fontsize=25)
b.set_xlabel('Date on which Accounts were created', fontsize=20)
b.set_ylabel('Number of Accounts', fontsize=20)
#plt.xticks(rotation=90)
b.set_xlim(df['user_created'].min(),df['user_created'].max())
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

* It seems that the accounts tweeting about Pfizer were created at dates spread across the years, with no obvious cluster.
* If account creation had spiked in a particular month of a year, this might indicate that the tweets were coming from potentially fake accounts.
* However, the user activity here seems natural.
* The unverified accounts that are tweeting also do not show any pattern that would mark them as fake accounts.
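The visual check for creation-date spikes can be backed by a simple count per month. This is only a sketch: the dates are invented and the 30% threshold is an arbitrary, hypothetical rule of thumb, not a validated test for fake accounts.

```python
import pandas as pd

# Hypothetical account creation dates.
created = pd.Series(pd.to_datetime([
    "2009-03-14", "2012-07-01", "2015-11-20",
    "2020-12-01", "2020-12-05", "2020-12-09", "2020-12-11",
]))

# Count accounts per calendar month of creation.
per_month = created.dt.to_period("M").value_counts().sort_index()

# Hypothetical rule of thumb: flag months holding more than 30% of all accounts.
share = per_month / per_month.sum()
suspicious = share[share > 0.30]
print(suspicious)
```

The `.dt` accessor needs a datetime column, so on `df` this would have to run before `user_created` is reduced to plain dates, or after converting it back with `pd.to_datetime`.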

In [None]:
top_10_df = df[df['hashtags'].isin(top_10)]

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(10,8))
j = sns.scatterplot(x='user_friends', y='user_followers', data=top_10_df, hue='hashtags', palette='inferno_r', s=70)
j.set_xlabel('User Friends', fontsize=15)
j.set_ylabel('User Followers', fontsize=15)
j.set_title('User Friends and Followers According to Hashtag Used', fontsize=20)
sns.despine(left=True)
plt.xlim(0,3000)
# The y axis shows followers, so its lower limit comes from user_followers
plt.ylim(top_10_df['user_followers'].min(),3000)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

* This shows how the friend and follower counts of users are distributed for each of the top hashtags.
* It seems that most of the users with more than 500 friends and followers are tweeting with the top 3 hashtags.

In [None]:
top_source = top_10_df['source'].value_counts()
# Name the columns explicitly so the result is the same on every pandas version
top_source = top_source.rename_axis('source').reset_index(name='counts')

In [None]:
sns.set_style('white')
plt.figure(figsize=(14,8))
o = sns.barplot(data=top_source, x='source', y='counts', alpha=0.7, palette='RdBu')
sns.despine(left=True)
o.set_xlabel('Source', fontsize=15)
o.set_ylabel('Counts', fontsize=15)
o.set_title('Sources Used for Tweets', fontsize=20)

* It shows that most of the users who tweeted with the top 10 hashtags were using Twitter from an iPhone.
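To put a number on "most", the counts can be turned into shares with `value_counts(normalize=True)`. The `sources` values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the `source` column.
sources = pd.Series([
    "Twitter for iPhone", "Twitter Web App", "Twitter for iPhone",
    "Twitter for Android", "Twitter for iPhone",
])

# normalize=True returns fractions that sum to 1 instead of raw counts.
shares = sources.value_counts(normalize=True)
top_source_name = shares.index[0]
print(shares)
```

If the leading share is below 0.5, "the most common source" would be a more accurate wording than "most of the users".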

In [None]:
df.sort_values(ascending=False, by='retweets')[:30][['id','user_name', 'user_verified', 'text', 'retweets']]

* The above table shows the top 30 tweets by number of retweets.
* We can observe that most of these tweets are from verified accounts.
* This suggests that a tweet is more likely to be retweeted if it is created by a verified account.
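One way to check this observation is to compare the share of verified accounts among the most retweeted tweets with the share in the whole dataset. The sketch below uses invented numbers and a smaller `top_n` than the notebook's 30.

```python
import pandas as pd

# Hypothetical tweets with a verification flag and a retweet count.
toy = pd.DataFrame({
    "user_verified": [True, True, False, False, False, False, True, False],
    "retweets":      [120,  80,   60,    3,     0,     1,     40,   2],
})

top_n = 4  # the notebook looks at the top 30
top = toy.nlargest(top_n, "retweets")

# The mean of a boolean column is the fraction of True values.
share_top = top["user_verified"].mean()
share_all = toy["user_verified"].mean()
print(f"verified share in top {top_n}: {share_top:.2f}, overall: {share_all:.2f}")
```

A clearly higher share among the top tweets than overall supports the observation, although it remains a correlation and not proof of cause.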

In [None]:
df_1 = df[(df['user_verified'] == True) & (df['hashtags'] != 'No HashTag') &(df['hashtags'].isin(top_10))]
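# numeric_only=True keeps the sum to numeric and boolean columns (text and dates cannot be summed)
df_1 = df_1.groupby('hashtags').sum(numeric_only=True).sort_values(ascending=False, by='retweets').reset_index()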

In [None]:
df_1

In [None]:
sns.set_style('dark')
plt.figure(figsize=(12,8))
#sns.lineplot(y='user_verified', x='hashtags', data=df_1, color='orange')
s = sns.scatterplot(y='retweets', x='user_verified', data=df_1, hue='user_verified', size='user_verified', sizes=(80,200))
sns.despine(left=True)
s.set_xlabel('Number of Verified Users', fontsize=15)
s.set_ylabel('Number of Retweets', fontsize=15)
s.set_title('Retweets of Top 10 Hashtags by Verified Accounts', fontsize=20)
plt.xticks(rotation=60)

* This shows that the tweets by verified accounts were mostly made with the hashtag PfizerBioNTech.
* So the retweets increased as the number of verified-account tweets for a particular hashtag increased.
