# **Problem Context**

COVID-19 is an infectious disease that causes respiratory illness. It was first seen in Wuhan, China in 2019. It rapidly spread throughout China, causing an epidemic, and then to the rest of the world, resulting in a global pandemic.

When this project was started on 26 February 2021, there had been 113,889,503 recorded cases of COVID-19 and 2,526,142 people had died. The pandemic has changed the way we all live: we have to be more isolated and distant from others to prevent the spread.

Scientists from around the world have been racing to develop a safe vaccine against the virus and they have achieved this in less than a year from when the virus was first identified and sequenced. 

From this data set the aim is to be able to answer the following questions:<br>
1) Which countries have the most advanced vaccination programmes?<br>
2) Which countries are using which vaccines?<br>
3) Which countries are vaccinating the most people per day, as a percentage of the total population?<br>

# **Understanding the Data**

This dataset has 16 columns of data. These consist of:<br>
* <b>Country</b> - This is the country that has provided its vaccine data. <br>
* <b>ISO</b> - This is the three letter ISO code for the country. <br>
* <b>Date</b> - Date of data entry. Sometimes different totals have been provided on different days such as a daily or cumulative total.<br>
* <b>Total Number of Vaccinations</b> - The total so far of vaccines given in that country.<br>
* <b>Total Number of People Vaccinated</b> - The number of vaccinations will differ from the number of people vaccinated as most vaccines require two doses.<br>
* <b>Total Number of People Fully Vaccinated</b> - Number of people who have received the entire set of vaccines needed (i.e. both vaccinations). <br>
* <b>Daily Vaccinations(raw)</b> - Number of Vaccines given for the specific day in each country.<br>
* <b>Daily Vaccinations</b> - Number of Vaccines given for the specific day in each country.<br>
* <b>Total Vaccinations per Hundred</b> - Ratio in percent <br>
* <b>Total Number of people Vaccinated Per Hundred</b> - Ratio in percent.<br>
* <b>Total Number of people Fully Vaccinated Per Hundred</b> -Ratio in percent.<br>
* <b>Number of Vaccinations per Day</b> - Number of Vaccines given for the specific day in each country.<br>
* <b>Daily Vaccinations per Million</b> - Ratio, in parts per million, between the number of daily vaccinations and the total population.<br>
* <b>Vaccines Used in the Country</b> - Names of the vaccines used in the country, given as a comma-separated list.<br>
* <b>Source Name</b> - Source for the data (i.e. national organization, local authority).<br>
* <b>Source Website</b> - Web address for the data source.<br>

 Data Description source: https://www.kaggle.com/gpreda/covid-world-vaccination-progress
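To illustrate how the ratio columns relate to the raw counts, the sketch below derives per-hundred and per-million figures from made-up numbers. The population value and the helper function names are assumptions for illustration only; the dataset already ships these ratios precomputed.

```python
def per_hundred(count, population):
    """Return count per 100 people."""
    return count / population * 100


def per_million(count, population):
    """Return count per 1,000,000 people."""
    return count / population * 1_000_000


def per_million_to_percent(value_per_million):
    """Convert a per-million figure into a percentage of the population."""
    return value_per_million / 10_000


# Hypothetical country: 2,000,000 people, 500,000 doses in total, 10,000 doses today
example_population = 2_000_000
example_total_per_hundred = per_hundred(500_000, example_population)
example_daily_per_million = per_million(10_000, example_population)
example_daily_percent = per_million_to_percent(example_daily_per_million)

print(example_total_per_hundred, example_daily_per_million, example_daily_percent)
```

The last conversion is the one needed for the third question above: a daily per-million value divided by 10,000 gives the share of the population vaccinated per day as a percentage.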

In [None]:
#Import Libraries That will be used in this project
import numpy as np
import pandas as pd
import plotly.express as px

import os

from datetime import date


In [None]:
#Read in the data so it can be investigated
vaccine_data = pd.read_csv('../input/d/gpreda/covid-world-vaccination-progress/country_vaccinations.csv')

# **Investigating the Data**

Now the file has been read in, it is important to take a look at it and see if there are any obvious issues which need correcting. Then any issues can be resolved so that the data is in a good format to be used. <br>

From an initial look at the data there appear to be a lot of duplicates. This is due to data being added on a daily basis, so the data can be grouped by country to make it easier to manage and analyse later. There are also some columns we will not need here, so they can be dropped. There are many NaN values in this dataset, so they will be changed to 0.0. Grouping the data by country also shows some clear issues with country names: Wales, for example, should be part of the UK, there is a spelling issue for Faroe Islands, and the United States should be called the United States of America.

The dataset contains data from 2020-12-13 00:00:00 to 2021-02-25 00:00:00.

In [None]:
#Check Data Types in Columns
print(vaccine_data.dtypes)

#Convert the date to the correct type so it can be used
vaccine_data['date'] = pd.to_datetime(vaccine_data['date'])

In [None]:
#Print the Date range for this data set
print(f"This dataset contains data from {vaccine_data['date'].min()} to {vaccine_data['date'].max()}")
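As a small illustration of why the date column is converted, the sketch below parses a few date strings and measures the span between the earliest and latest. The three strings are made up for the example, apart from the two end dates, which are the ones reported for this dataset.

```python
import pandas as pd

# A few date strings in the same format as the raw 'date' column
raw_dates = pd.Series(['2021-01-03', '2020-12-13', '2021-02-25'])

# After conversion the values are timestamps rather than plain strings
parsed_dates = pd.to_datetime(raw_dates)

# Timestamps support arithmetic, so the length of the period is a subtraction
span_days = (parsed_dates.max() - parsed_dates.min()).days

print(f"From {parsed_dates.min().date()} to {parsed_dates.max().date()}: {span_days} days")
```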

In [None]:
#Drop unwanted data
smaller_vaccine_data = vaccine_data.drop(['daily_vaccinations_raw', 'people_vaccinated_per_hundred', 'source_name', 'source_website',], axis = 1)

#Take a look at the remaining columns
smaller_vaccine_data

In [None]:
#Check for nulls, then replace all null/NaN values with 0.0
print(smaller_vaccine_data.isnull().sum())
corrected_vaccine_data = smaller_vaccine_data.fillna(value = 0.0)
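One thing worth keeping in mind is that replacing NaN with 0.0 is harmless for the `max()` aggregations used later, but it would pull down any average. The sketch below shows this with made-up numbers; the `toy_totals` frame is hypothetical and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up cumulative totals with one missing report
toy_totals = pd.DataFrame({
    'country': ['A', 'A', 'A'],
    'total_vaccinations': [100.0, np.nan, 300.0],
})

toy_filled = toy_totals.fillna(value=0.0)

# max() is unaffected by the fill, because totals are never negative
max_raw = toy_totals['total_vaccinations'].max()
max_filled = toy_filled['total_vaccinations'].max()

# mean() changes: NaN is skipped by default, but 0.0 is counted as a value
mean_raw = toy_totals['total_vaccinations'].mean()
mean_filled = toy_filled['total_vaccinations'].mean()

print(max_raw, max_filled, mean_raw, mean_filled)
```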

In [None]:
#Correct issues with Country Names

#Merge UK countries
corrected_vaccine_data['country'] = corrected_vaccine_data['country'].replace(['England', 'Wales', 'Scotland', 'Northern Ireland'], 'United Kingdom')
#Change United States to United States of America
corrected_vaccine_data['country'] = corrected_vaccine_data['country'].replace(['United States'], 'United States of America')
#Correct Spelling of Faroe Islands
corrected_vaccine_data['country'] = corrected_vaccine_data['country'].replace(['Faeroe Islands'], 'Faroe Islands')
#Change Northern Cyprus to Cyprus
corrected_vaccine_data['country'] = corrected_vaccine_data['country'].replace(['Northern Cyprus'], 'Cyprus')

print(f"There are {corrected_vaccine_data['country'].nunique()} different countries in the dataset ")
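The same kind of renaming can also be written as a single mapping from raw name to corrected name. The sketch below uses a made-up list of names, and `name_fixes` is a hypothetical dictionary introduced only for this example.

```python
import pandas as pd

# Made-up sample of country names as they might appear in the raw file
raw_names = pd.Series(['England', 'Wales', 'United States', 'Faeroe Islands', 'France'])

# Hypothetical mapping from raw name to the name used in the analysis
name_fixes = {
    'England': 'United Kingdom',
    'Wales': 'United Kingdom',
    'United States': 'United States of America',
    'Faeroe Islands': 'Faroe Islands',
}

# Series.replace returns a new Series; names not in the mapping are kept as they are
cleaned_names = raw_names.replace(name_fixes)

print(cleaned_names.tolist())
print(cleaned_names.nunique())
```

Keeping the corrections in one dictionary makes it easier to see every rename at a glance and to add a new one later.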

In [None]:
corrected_vaccine_data

# **Analysis of the Dataset**

This next part will analyse the dataset. It will look at which vaccines are the most used, which countries use which vaccines, and which countries are vaccinating the most people.

As of 2021-02-25 there are 102 countries with vaccine data (the countries in the UK have been combined so it is represented as one country) out of 195 countries in the world. This shows that there is still a way to go, and that some countries may require assistance in acquiring and administering the vaccine.

In [None]:
#Create a dataframe of which countries use which vaccine
country_vaccine = corrected_vaccine_data[['country', 'iso_code', 'vaccines']]
#Rename the column to match the name used when plotting the map
country_vaccine = country_vaccine.rename(columns = {'vaccines': 'vaccine'})
#Keep one row per unique combination
country_vaccine = country_vaccine.drop_duplicates()
country_vaccine
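Because each country's vaccines are stored as one comma-separated string, counting how many countries use an individual vaccine needs the string to be split first. The sketch below shows one possible way to do that on made-up rows; `toy_country_vaccine` is hypothetical and only mimics the shape of the frame built above.

```python
import pandas as pd

# Made-up rows in the same shape as the country/vaccine frame
toy_country_vaccine = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'vaccine': ['Moderna, Pfizer/BioNTech', 'Pfizer/BioNTech', 'Sputnik V'],
})

# Split the comma-separated string into a list, then give each vaccine its own row
exploded = toy_country_vaccine.assign(
    vaccine=toy_country_vaccine['vaccine'].str.split(', ')
).explode('vaccine')

# Count how many distinct countries use each individual vaccine
countries_per_vaccine = (
    exploded.groupby('vaccine')['country'].nunique().sort_values(ascending=False)
)

print(countries_per_vaccine)
```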

## **Which Countries Are Using Each Vaccine**

The below map shows which countries are using which vaccines. Many countries are using more than one vaccine to immunize their population. 
You can hover your mouse over each country to see which vaccines are in use there, as some parts of the map are quite small. You can also zoom in on the map and move it around. 

You can see here in the UK we are using the Oxford/AstraZeneca and Pfizer/BioNTech vaccines.

In [None]:
#plot the vaccines on a map, using iso_code to link the vaccines to the correct place on the map
vaccine_map = px.choropleth(country_vaccine, locations = 'iso_code', color = 'vaccine')
vaccine_map.show()

In [None]:
#Vaccine Country
vaccine_country_fig = px.scatter( x = corrected_vaccine_data['country'], y = corrected_vaccine_data['vaccines'])
vaccine_country_fig.update_layout(title = 'Vaccines Used by Each Country')
vaccine_country_fig.show()

## **Top 50 Countries with the Highest Number of Total Vaccinations**

The data below shows the top 50 countries with the highest number of total vaccinations. The number of vaccines given may be larger than the population of a country because people can have multiple doses of a vaccine.

In [None]:
#Aggregate the data so that it can be displayed nicely
total_vaccine_grp = corrected_vaccine_data.groupby(['country'])['total_vaccinations'].max().reset_index()
#Keep the 50 largest totals, dropping the old index for aesthetics
total_vaccine_grp = total_vaccine_grp.nlargest(50, ['total_vaccinations']).reset_index(drop = True)
top_50 = total_vaccine_grp['country']
display((total_vaccine_grp).style.background_gradient(cmap = 'Greens'))

The graph below represents the above data in an easier-to-understand way. You can see that the United States of America is leading the way, having given the largest total number of vaccines so far, 68.27 million. It also highlights the difference between the countries issuing the most vaccines and the least. This graph doesn't account for population size, though: some countries have very small populations, so a country's progress in vaccinating its population is better measured by the total vaccinations per hundred people.

In [None]:
#top vaccine graph
top_country_fig = px.bar( x = total_vaccine_grp['country'], y = total_vaccine_grp['total_vaccinations'])
top_country_fig.update_layout(title = 'Total COVID-19 Vaccinations (Top 50)')
top_country_fig.show()
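The aggregation behind the table and chart above relies on `total_vaccinations` being cumulative, so the maximum per country is that country's latest total. The sketch below shows the same group-then-rank pattern on made-up numbers; `toy_vaccinations` is a hypothetical frame, and the 0.0 stands in for a missing report that was filled with zero.

```python
import pandas as pd

# Made-up cumulative totals, several rows per country
toy_vaccinations = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C'],
    'total_vaccinations': [10.0, 25.0, 40.0, 0.0, 5.0],
})

# One row per country, holding the largest (latest) cumulative total
latest_totals = toy_vaccinations.groupby('country')['total_vaccinations'].max().reset_index()

# Keep the two largest totals and renumber the rows from zero
top_2 = latest_totals.nlargest(2, 'total_vaccinations').reset_index(drop=True)

print(top_2)
```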

## **Top 50 Countries by Total Vaccinations per 100 People**

The data below shows which countries have given the most vaccinations per 100 people, which gives a better overall view of the progress being made by each country in vaccinating its population. Currently it looks like Gibraltar has vaccinated most of its population. It is worth noting that the population there is very small, so this doesn't mean that it has the most advanced vaccination programme: vaccinating a larger population requires more resources, distribution and staff. This measure also doesn't take into account that some people may have had two doses while others have not received any yet.

In [None]:
#Data Aggregation
total_per_100 = corrected_vaccine_data.groupby(['country'])['total_vaccinations_per_hundred'].max().reset_index()
#Keep the 50 largest values, dropping the old index for presentation
total_per_100 = total_per_100.nlargest(50, ['total_vaccinations_per_hundred']).reset_index(drop = True)

display((total_per_100).style.background_gradient(cmap = 'Greens'))

In [None]:
total_100_fig = px.bar( x = total_per_100['country'], y = total_per_100['total_vaccinations_per_hundred'])
total_100_fig.update_layout(title = 'Total COVID-19 Vaccinations Per 100 People (Top 50)')
total_100_fig.show()

## **Top 50 Countries by People Fully Vaccinated per 100 People**

The data below shows the top 50 countries by people fully vaccinated per 100 people. For most of the vaccines, being fully vaccinated requires two doses given 1 to 3 months apart.

This gives a better overview of which countries have made the most progress fully vaccinating their population against COVID-19.

In [None]:
#Data Aggregation
full_per_100 = corrected_vaccine_data.groupby(['country'])['people_fully_vaccinated_per_hundred'].max().reset_index()
#Keep the 50 largest values, dropping the old index for presentation
full_per_100 = full_per_100.nlargest(50, ['people_fully_vaccinated_per_hundred']).reset_index(drop = True)

display((full_per_100).style.background_gradient(cmap = 'Greens'))

In [None]:
full_100_fig = px.bar( x = full_per_100['country'], y = full_per_100['people_fully_vaccinated_per_hundred'])
full_100_fig.update_layout(title = 'Total Amount of People Fully Vaccinated against COVID-19 Per 100 People (Top 50)')
full_100_fig.show()

# **Vaccine Usage By Day**

This section looks at which countries are using which vaccines on which days. The total vaccinations are a running total, so they increase each day rather than being the specific amount used on a given day. The running total has been chosen because the daily figure is incomplete or missing for many countries, whereas the cumulative total is updated more often than not.

It gives a more detailed insight into which countries are using which vaccines and how many they are using on a frequent basis. 

## **Global Vaccinations by Day**

The graph below shows the number of vaccines given globally, by day.

In [None]:
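#Group data: take one value per country per day (max avoids double-counting the merged UK rows)
daily_vac_global = corrected_vaccine_data.groupby(['date', 'country'])['daily_vaccinations'].max().reset_index()
#Sum across countries to get a single global figure for each day
daily_vac_global = daily_vac_global.groupby('date')['daily_vaccinations'].sum().reset_index()
#Accumulate the daily figures in date order to get the running total
daily_vac_global['running_total'] = daily_vac_global['daily_vaccinations'].cumsum()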

In [None]:
daily_fig = px.line(daily_vac_global, x = 'date', y = 'running_total', title = 'Global Vaccinations by Day')
daily_fig.show()
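A global running total can be sketched as a per-day sum across countries followed by a cumulative sum. The example below uses made-up daily figures for two hypothetical countries, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up daily figures for two countries over three days
toy_daily = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03']),
    'country': ['A', 'B', 'A', 'B', 'A'],
    'daily_vaccinations': [100.0, 50.0, 120.0, 30.0, 200.0],
})

# Sum across countries to get one global figure per day (groupby sorts by date)
toy_global = toy_daily.groupby('date')['daily_vaccinations'].sum().reset_index()

# Accumulate the daily figures in date order
toy_global['running_total'] = toy_global['daily_vaccinations'].cumsum()

print(toy_global)
```

Note that grouping by date before calling `cumsum()` on an already aggregated frame would leave each value unchanged, since every date then forms a group of one row.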

## **Pfizer/BioNTech**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Pfizer/BioNTech'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Sputnik V**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Sputnik V'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Oxford/AstraZeneca**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Oxford/AstraZeneca'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Moderna, Oxford/AstraZeneca, Pfizer/BioNTech**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Oxford/AstraZeneca, Sputnik V**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Oxford/AstraZeneca, Sputnik V'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Pfizer/BioNTech, Sinopharm/Beijing**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Pfizer/BioNTech, Sinopharm/Beijing'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Oxford/AstraZeneca, Sinovac**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Oxford/AstraZeneca, Sinovac'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Sinopharm/Beijing**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Sinopharm/Beijing'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Moderna, Pfizer/BioNTech**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Moderna, Pfizer/BioNTech'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Pfizer/BioNTech, Sinovac**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Pfizer/BioNTech, Sinovac'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Sinopharm/Beijing, Sinopharm/Wuhan, Sinovac**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Sinopharm/Beijing, Sinopharm/Wuhan, Sinovac'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Oxford/AstraZeneca, Pfizer/BioNTech**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Oxford/AstraZeneca, Pfizer/BioNTech'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Moderna, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Covaxin, Oxford/AstraZeneca**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Covaxin, Oxford/AstraZeneca'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Sinovac**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Sinovac'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Oxford/AstraZeneca, Sinopharm/Beijing**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Oxford/AstraZeneca, Sinopharm/Beijing'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Oxford/AstraZeneca, Sinopharm/Beijing, Sputnik V**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Oxford/AstraZeneca, Sinopharm/Beijing, Sputnik V'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))

## **Johnson & Johnson**

In [None]:
display(corrected_vaccine_data[corrected_vaccine_data['vaccines']== 'Johnson&Johnson'][['country', 'vaccines', 'total_vaccinations', 'date']].sort_values(by = 'total_vaccinations', ascending = False).style.background_gradient(cmap='Reds'))
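The cells in this section all repeat the same filter with a different vaccine combination. One possible way to avoid the repetition is a small helper, sketched below on made-up rows; `rows_for_vaccines` and `toy_usage` are hypothetical names introduced for this example and are not part of the notebook.

```python
import pandas as pd


def rows_for_vaccines(data, vaccines):
    """Return the rows whose 'vaccines' value matches exactly, largest totals first."""
    columns = ['country', 'vaccines', 'total_vaccinations', 'date']
    matches = data[data['vaccines'] == vaccines]
    return matches[columns].sort_values(by='total_vaccinations', ascending=False)


# Made-up rows in the same shape as the cleaned dataset
toy_usage = pd.DataFrame({
    'country': ['A', 'A', 'B', 'C'],
    'vaccines': ['Sinovac', 'Sinovac', 'Pfizer/BioNTech', 'Sinovac'],
    'total_vaccinations': [10.0, 30.0, 50.0, 20.0],
    'date': pd.to_datetime(['2021-02-01', '2021-02-02', '2021-02-02', '2021-02-02']),
})

sinovac_rows = rows_for_vaccines(toy_usage, 'Sinovac')
print(sinovac_rows)
```

In the notebook, a helper like this could be called in a loop over `corrected_vaccine_data['vaccines'].unique()`, with the same styling applied to each result, instead of keeping one cell per combination.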