# Exploratory Data Analysis of the progress of vaccinations around the world

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Relevant imports

import numpy as np, pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt, seaborn as sns
import scipy.stats  # import the submodule explicitly; scipy.stats.mannwhitneyu is used below
import warnings
import plotly.express as px
from itertools import product
import statsmodels.api as sm
import datetime
from tqdm import tqdm
warnings.filterwarnings('ignore')

In [None]:
import sympy 

## Read data and understand the data schema

Read the CSV file into a pandas DataFrame.

In [None]:
data = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv')

Let's form a general idea of the types of data present in the different columns of the dataset, and of the shape of the dataset.

In [None]:
data

In [None]:
data.shape

## Handle Missing data

Let's get a statistic of the number of NaNs present in the dataset

In [None]:
data.isna().sum()
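
Raw NaN counts are easier to compare once they are expressed as a share of all rows. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [10.0, np.nan, 5.0, 8.0],
    "people_vaccinated": [np.nan, np.nan, 5.0, np.nan],
})

# isna().mean() is the fraction of missing values per column;
# multiply by 100 to read it as a percentage.
nan_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(nan_share)
```

The same expression can be applied to `data` to see which columns are mostly empty.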

### -> Column: `total_vaccinations`

`total_vaccinations` is the most important column in the dataset. We simply drop rows having NaN in this column


In [None]:
data = data.dropna(subset=['total_vaccinations'])

Now, we re-evaluate the statistic of the number of NaN after removing all rows with NaN value for `total_vaccinations`

In [None]:
data.isna().sum()

Although the number of NaNs has decreased, we still need to remove the remaining NaN values to be able to process the data.
Let's plot the correlation matrix to check whether the NaN values in any column can be derived from other columns.

In [None]:
plt.subplots(figsize=(8, 8))
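# restrict to numeric columns: recent pandas versions raise an error when corr() meets string columns
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, square=True)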
plt.show()

We can observe that `total_vaccinations` and `people_vaccinated` are very highly correlated. Let's remove the rows having NaN
in `people_vaccinated` and perform the Mann-Whitney U test to see if the columns can be derived from each other.

### -> Column: `people_vaccinated`

In [None]:
people_vaccinated_data = data.dropna(subset=['people_vaccinated'])
people_vaccinated_data.isna().sum()

In [None]:
people_vaccinated_data.head()

We can see that total_vaccinations and people_vaccinated have almost the same values. Let's perform the Mann-Whitney U test to confirm our hypothesis.

In [None]:
scipy.stats.mannwhitneyu(people_vaccinated_data.total_vaccinations, people_vaccinated_data.people_vaccinated, alternative='two-sided')

However, we see that the p-value of the Mann-Whitney U test between `people_vaccinated` and `total_vaccinations` is far below
0.05. Hence we have to reject our hypothesis: the test indicates that the two columns do not follow the same distribution.
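
A high correlation and a very small Mann-Whitney p-value are not contradictory: correlation measures whether two columns move together, while the test asks whether their values come from the same distribution. The sketch below illustrates this on synthetic data. The assumption that one column is exactly half of the other (as if every person had received two doses) is made up for illustration and is not a property of the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "total doses" and a column that is exactly half of it.
total = rng.exponential(scale=1000.0, size=500)
people = 0.5 * total

# The two columns are perfectly correlated...
r = np.corrcoef(total, people)[0, 1]

# ...but their distributions differ, so the test rejects equality.
u_stat, p_value = stats.mannwhitneyu(total, people, alternative="two-sided")

print(f"correlation = {r:.3f}, p-value = {p_value:.2e}")
```

In other words, failing this test does not by itself rule out that one column could be estimated from the other.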

Since the columns having the highest correlation failed the Mann-Whitney U test, it is highly likely that the columns having lower
correlation would also fail it. Hence, we fill the remaining NaN values with 0.

In [None]:
data = data.fillna(0)
data.isna().sum()
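
It can help to check how zero-filling interacts with later aggregations. The sketch below uses a made-up toy frame (an assumption for illustration): per-country maxima of non-negative values are unchanged by the fill, whereas a mean is pulled towards zero.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks days without a reported figure.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "people_vaccinated": [10.0, np.nan, 30.0, np.nan, 5.0],
})
filled = toy.fillna(0)

# max() skips NaN, and a filled 0 never beats a positive value.
max_before = toy.groupby("country")["people_vaccinated"].max()
max_after = filled.groupby("country")["people_vaccinated"].max()

# mean() skips NaN before the fill but counts the zeros afterwards.
mean_before = toy["people_vaccinated"].mean()    # (10 + 30 + 5) / 3
mean_after = filled["people_vaccinated"].mean()  # (10 + 0 + 30 + 0 + 5) / 5

print(max_before.equals(max_after), mean_before, mean_after)
```

The rankings in the following sections rely on `max()`, so under this reasoning they should not be affected by the zeros.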

## Visualization of different metrics

### total_vaccinations

Let's rank the different countries on the basis of the number of vaccinations.

In [None]:
# select the columns needed to represent each country
columns = ["country", 'total_vaccinations', 'iso_code', 'vaccines', 'total_vaccinations_per_hundred']

# group the columns by country name
vaccinations_data = data[columns].groupby('country').max().sort_values('total_vaccinations', ascending=False)
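
Taking the per-country maximum works here on the assumption that `total_vaccinations` is a cumulative count, so the largest value for a country is also its most recent total. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical countries.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03",
             "2021-01-01", "2021-01-02"],
    "total_vaccinations": [100, 250, 400, 30, 90],
})

# One row per country: the maximum of a cumulative column is the latest total.
latest = (
    toy.groupby("country")["total_vaccinations"]
    .max()
    .sort_values(ascending=False)
)
print(latest)
```

For string columns such as `iso_code`, `max()` returns the lexicographically largest value, which is harmless as long as the value is constant within a country.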

The countries with the highest number of vaccinations are as follows:

In [None]:
vaccinations_data.head()

The countries with lowest number of vaccinations are as follows:

In [None]:
vaccinations_data.tail()

Now let's visualize the total number of vaccinations in each country using a Bar Chart.

In [None]:
plt.figure(figsize=(30, 10))
plt.bar(vaccinations_data.index, vaccinations_data.total_vaccinations)
plt.xticks(rotation = 90)
plt.ylabel('Total number of vaccinations')
plt.xlabel('Countries')
plt.show()

As we can see, the highest numbers of vaccinations are in the USA and China, closely followed by India. However, we should also take the huge populations of these
countries into account: population is a major factor in the number of vaccinations. This is also evident among the countries with the lowest numbers of vaccinations, whose populations are very small.
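
To make the effect of population concrete, an absolute count can be normalised to a per-hundred figure. The sketch below uses made-up numbers and a hypothetical `population` column, which is an assumption for illustration and not necessarily part of the dataset:

```python
import pandas as pd

# Made-up figures for two hypothetical countries.
toy = pd.DataFrame(
    {
        "total_vaccinations": [1_000_000, 50_000],
        "population": [100_000_000, 100_000],
    },
    index=["Large country", "Small country"],
)

# Vaccinations per hundred inhabitants.
toy["per_hundred"] = toy["total_vaccinations"] / toy["population"] * 100
print(toy.sort_values("per_hundred", ascending=False))
```

The large country leads in absolute numbers, but the small one is far ahead once the figures are scaled by population.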

To get a more intuitive understanding of this ranking, let's visualize this data on a geographical world map.

In [None]:
fig = px.choropleth(locations=vaccinations_data.iso_code, color=vaccinations_data.total_vaccinations, title='Total number of vaccinations',
                   color_continuous_scale='rainbow')
fig.show('notebook')

### total_vaccinations_per_hundred

Let's rank the different countries on the basis of the number of vaccinations per hundred people.

In [None]:
# we already have the relevant columns grouped by countries in the dataframe. We just need to sort it by total_vaccinations_per_hundred

# sort the dataframe by total_vaccinations_per_hundred
vaccinations_data = vaccinations_data.sort_values('total_vaccinations_per_hundred', ascending=False)

Listing the top and bottom countries ranked by `total_vaccinations_per_hundred`.

In [None]:
vaccinations_data.head()

In [None]:
vaccinations_data.tail()

And a complementary bar chart.

In [None]:
plt.figure(figsize=(30, 10))
plt.bar(vaccinations_data.index, vaccinations_data.total_vaccinations_per_hundred)
plt.xticks(rotation = 90)
plt.ylabel('Total vaccinations per hundred people')
plt.xlabel('Countries')
plt.show()

Israel, the UAE and Gibraltar have the highest numbers of vaccinations per hundred people.

However, we shouldn't forget that these countries have relatively small populations. This might be the reason for such high figures.

The United Kingdom (along with England, Northern Ireland, Scotland and Wales) also has really high results, which is all the more notable because its population is almost 7 times larger than the UAE's and Israel's and, what is really incredible, <u>2016</u> times larger than Gibraltar's!

Again, we visualize this statistic on a map for a more intuitive understanding.

In [None]:
fig = px.choropleth(locations=vaccinations_data.iso_code, color=vaccinations_data.total_vaccinations_per_hundred, title='Total vaccinations per hundred people',
                   color_continuous_scale='rainbow')
fig.show('notebook')

As we can observe, the USA has been very thorough with its vaccinations, as its `total_vaccinations_per_hundred` is also very high.
At the lowest level we have Russia, Mexico, and countries in South America and Asia.

## Vaccines ranked by popularity

In [None]:
# we group the vaccinations_data dataframe by vaccine to be able to rank vaccines. Our ranking metric
# however, is still total_vaccinations. So we effectively rank the vaccines by the total number of 
# vaccinations done with it.
vaccines_usage = vaccinations_data.groupby('vaccines').sum().sort_values('total_vaccinations', ascending=False)

Listing the most popular and the least popular vaccines.

In [None]:
vaccines_usage.head()

In [None]:
vaccines_usage.tail()

Complementary bar chart visualizing usage of the vaccines.

In [None]:
plt.figure(figsize=(10, 5))
plt.bar(vaccines_usage.index, vaccines_usage.total_vaccinations)
plt.xticks(rotation = 90)
plt.ylabel('Total number of vaccinations')
plt.xlabel('Vaccines')
plt.show()

Complementary map visualization for better intuition.

In [None]:
fig = px.choropleth(locations=vaccinations_data.iso_code, color=vaccinations_data.vaccines, title='Vaccines used in different parts of the world', 
                   color_continuous_scale='rainbow')
fig.show()

The Pfizer/BioNTech vaccine is arguably the most popular and widespread vaccine in the world. It is used throughout Europe and North America, and in Japan.

The Sputnik V vaccine is used in Russia and in surrounding countries like Kazakhstan, Iran, Ukraine, Mongolia and Belarus, along with South American countries like Argentina.

Covaxin is used in India.
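
The ranking above treats each distinct string in the `vaccines` column as one vaccine. If that column holds comma-separated combinations such as "Moderna, Pfizer/BioNTech" (an assumption about the data), the combinations can be split to count how many countries use each individual vaccine. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; each entry lists the vaccines used in one hypothetical country.
toy = pd.DataFrame(
    {
        "vaccines": [
            "Pfizer/BioNTech",
            "Moderna, Pfizer/BioNTech",
            "Sputnik V",
            "Moderna, Pfizer/BioNTech",
        ]
    },
    index=pd.Index(["A", "B", "C", "D"], name="country"),
)

# Split each combination into a list, give every vaccine its own row,
# then count the number of countries per vaccine.
per_vaccine = toy["vaccines"].str.split(", ").explode().value_counts()
print(per_vaccine)
```

This counts countries rather than doses, since a country-level total cannot be attributed to individual vaccines without further information.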

