# Task: Construct a score for how well countries are doing at their vaccine rollout for COVID-19.

The difficulty is the ambiguity in the prompt. How should we define "how well"? What is the score being used for? That additional context would be useful, but let's start by breaking the problem down.


### TL;DR

- The method creates a relative score based on the percentage of people fully vaccinated.
- The benefits are:
1. We can better compare how countries are doing relative to others.
2. Our scores approximately fit a normal distribution, which makes other types of modeling and analysis easier.
3. The results are easy to communicate.


### Approach 
- Pre-process data: Check the data for NA values, zeros, or other issues.
- Methodology: I used people_fully_vaccinated_per_hundred as the topline metric. It is easily interpretable and can be analyzed quickly. The final score is a relative score of the percentage of people fully vaccinated. The data looked roughly log-normal, so I log-transformed it to bring it closer to a normal distribution. This way, the score can be plugged into a more sophisticated model or analysis.
  - Without a qualitative analysis of how factors such as GDP per capita, population density, pre-vaccine COVID conditions, and vaccine utilization correlate with the topline score, a more sophisticated scoring method was difficult to produce. Given my domain knowledge at this point in time, I stuck with the simpler approach.
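
The transformation described above can be sketched in a few lines. This is a minimal illustration on made-up percentages (the country labels `A` to `F` and their values are hypothetical, not taken from the dataset), assuming the score is simply the standardized log of the fraction vaccinated:

```python
import numpy as np
import pandas as pd

# Hypothetical percentages of people fully vaccinated (not real data)
pct = pd.Series(
    [0.5, 1.2, 3.0, 7.5, 20.0, 45.0],
    index=["A", "B", "C", "D", "E", "F"],
    name="people_fully_vaccinated_per_hundred",
)

# Log-transform the fraction vaccinated; right-skewed values become more symmetric
log_frac = np.log(pct / 100)

# Standardize: subtract the mean and divide by the sample standard deviation
z_score = (log_frac - log_frac.mean()) / log_frac.std()

print(z_score.round(2))
```

A positive z-score means a country is ahead of the typical country on the log scale, and a negative one means it is behind.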

### Future work
- What other factors impact the vaccine rollout that could be included in the score?
    - Prior COVID cases/deaths: If a country has a higher proportion of the world's cases, then vaccines should be prioritized for it.
    - GDP/capita: Developed countries have more resources and, as a result, will likely obtain larger volumes of vaccine relative to poorer countries.
    - Vaccine utilization: Given how many vaccines are available to the country, how effective is the country at distributing them?
    - Supply chain limitations
- Create a time series of the scores to see how countries perform over time. Compare scores along a third dimension: days since first vaccination. (For example, if the USA's first vaccination was at the start of December and Japan's was a month later, comparing the two on the same calendar date isn't necessarily equivalent.)
- Improve the quality of the data. There was a lot of missing data, and this analysis had to take that into consideration.
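
The "days since first vaccination" idea could be prototyped roughly as follows. This is only a sketch on hypothetical rows (the country names, dates, and values are invented), and it assumes that a country's earliest row in the dataset is a reasonable proxy for its first vaccination date:

```python
import pandas as pd

# Hypothetical rows shaped like the vaccination dataset (not real data)
df = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B", "Country B"],
    "date": ["2020-12-01", "2020-12-31", "2021-01-01", "2021-01-31"],
    "people_fully_vaccinated_per_hundred": [0.0, 1.5, 0.0, 0.4],
})
df["date"] = pd.to_datetime(df["date"])

# Earliest reported date per country, broadcast back onto every row
first_date = df.groupby("country")["date"].transform("min")

# Days elapsed since that country's first reported date
df["days_since_first_vaccination"] = (df["date"] - first_date).dt.days

print(df)
```

Comparing countries at the same value of `days_since_first_vaccination` rather than on the same calendar date puts early and late starters on a common footing.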

   


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np, pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt, seaborn as sns
import scipy
import warnings
import plotly.express as px
from itertools import product
import statsmodels.api as sm
import datetime
from tqdm import tqdm
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# load
data = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv')

In [None]:
data.head()

# Pre-Processing

In [None]:
def quick_stats(df):
    """
    Displays a variety of statistics for a pandas dataframe.
    
    parameters: 
        df: pandas dataframe
        
    returns:
        None
    """
    print("Dimensions of dataframe: " + str(df.shape))
    print("Data types of dataframe: " + str(df.dtypes))
    print(df.describe())
    print("Nulls of dataframe: " + str(df.isnull().sum()))
    # count of exact zeros per column
    print("Zeros of dataframe: " + str((df == 0).sum()))
    
quick_stats(data)

The data is very incomplete.

Notes: 
* There are 175 countries in the dataset. Of those 175, only 101 have data for 'people_fully_vaccinated_per_hundred'

Many key fields have a large number of null values. For this analysis, I only considered countries with a non-NA, non-zero value for people_fully_vaccinated_per_hundred, as that is the key metric I am working with.
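
As a rough illustration of that filter, here is a minimal sketch on hypothetical rows (the countries `A`, `B`, `C` and their values are invented). It assumes a country counts as usable when its highest reported value is present and strictly positive:

```python
import numpy as np
import pandas as pd

# Hypothetical rows (not real data); NaN marks a missing report
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "people_fully_vaccinated_per_hundred": [np.nan, 2.5, np.nan, np.nan, 0.0, 0.0],
})

col = "people_fully_vaccinated_per_hundred"

# Highest reported value per country; max() skips NaN and yields NaN if all are missing
latest = df.groupby("country")[col].max()

# Keep countries whose best value is both present and strictly positive
usable = latest[latest.notna() & (latest > 0)]

print(f"{len(usable)} of {len(latest)} countries have a usable value")
```

Here only country `A` survives: `B` never reports a value and `C` only reports zeros.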

In [None]:
## investigate missing data.
data[ data['iso_code'] == "USA"].head(10)

In [None]:
fig = px.line(data, x="date", y="people_fully_vaccinated_per_hundred", color='country')
fig.show()
# we can see the missing values, but also see that Gibraltar and Israel are leading the charts as the highest performers.

In [None]:
plt.subplots(figsize=(8, 8))
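# restrict to numeric columns; recent pandas versions raise an error if corr() sees string columns
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, square=True)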
plt.show()

I considered discarding this dataset in favor of data from https://ourworldindata.org/covid-vaccinations, but given the time it would take to gather and vet data from other sources, I decided to focus on the scoring methodology rather than on alternative datasets.

# Creating Score

In [None]:
# take each country's maximum reported value for every column
temp = data.groupby(['iso_code']).max()
# keep countries with a non-null, non-zero value
# (the parentheses matter: & binds more tightly than !=)
col = temp['people_fully_vaccinated_per_hundred']
max_vax_df = temp[col.notna() & (col != 0.0)].copy()
max_vax_df.sort_values(by=['people_fully_vaccinated_per_hundred'], ascending=False, axis=0)

In [None]:
max_vax_df['log_transform_percentage'] = np.log(max_vax_df['people_fully_vaccinated_per_hundred']/100)
max_vax_df

In [None]:
max_vax_df['people_fully_vaccinated_per_hundred'].plot(kind="hist", bins = 20)


In [None]:
# max_vax_df['log_transform_percentage'].plot(kind="hist", bins = 20)
max_vax_df['log_transform_percentage'].plot.kde()

In [None]:
mu = max_vax_df['log_transform_percentage'].mean()
print(mu)
std =  max_vax_df['log_transform_percentage'].std()
print(std)

In [None]:
max_vax_df['z_score'] = (max_vax_df['log_transform_percentage'] - mu) / std
# shifts and rescales our distribution to have mean 0 and variance 1

In [None]:
# max_vax_df['z_score'].plot(kind="hist", bins = 20)
max_vax_df['z_score'].plot.kde()
# this is approximately standard normal, provided the log-transformed values are roughly normal

In [None]:
print(max_vax_df['z_score'].mean())
print(max_vax_df['z_score'].std())
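
The z-scores have mean 0 and standard deviation 1 by construction, but that alone does not make them normally distributed. One way to check the log-normal assumption is a Shapiro-Wilk test before and after the log transform. The sketch below uses a synthetic log-normal sample as a stand-in (it is generated, not taken from the dataset), so it only illustrates the procedure:

```python
import numpy as np
from scipy import stats

# Synthetic log-normal sample standing in for the vaccination percentages
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=101)

# Shapiro-Wilk test: a small p-value is evidence against normality
_, p_raw = stats.shapiro(sample)
_, p_log = stats.shapiro(np.log(sample))

print(f"raw p-value: {p_raw:.4f}")
print(f"log p-value: {p_log:.4f}")
```

On the real data, the same two calls could be applied to `people_fully_vaccinated_per_hundred` and `log_transform_percentage` to see whether the transform actually helps.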

In [None]:
# max_vax_df['z_score'].sort_values(ascending = False).head(50)

import plotly.express as px
fig = px.bar(max_vax_df.sort_values(by = 'z_score',ascending = False), x='country', y='z_score')
fig.show()

In [None]:
# from scipy.stats import norm

# x = np.linspace(-10,10,100)
# y = norm.cdf(x)

# plt.plot(x, y)

In [None]:
# import scipy
# import seaborn as sns


# plt.figure(figsize=(15,10))
# x= sns.barplot(x = max_vax_df.sort_values(by = 'z_score',ascending = True)['country'], y = scipy.stats.norm.cdf(max_vax_df['z_score'].sort_values()))
# x.set_xticklabels(x.get_xticklabels(),rotation=90)

# scipy.stats.norm.cdf(max_vax_df['z_score'])

# norm_cdf(max_vax_df['z_score']).sort_values() #.plot()
# max_vax_df['z_score'].sort_values().plot(kind = 'bar')

In [None]:
import plotly.graph_objects as go
from scipy import stats

# sort once so the x labels, the bar heights, and the colors all share the same order
ranked = max_vax_df.sort_values(by='z_score', ascending=True)

# highlight two bars: position 42 and the last (highest) position in the sorted order
colors = ['blue'] * len(ranked)
colors[42] = 'crimson'
colors[-1] = 'crimson'

fig = go.Figure(data=[go.Bar(x=ranked['country'], y=stats.norm.cdf(ranked['z_score']), marker_color=colors)])
fig.show()
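
The bar heights in the chart above are the standard normal CDF of each z-score, which can be read as a percentile-style score. A minimal sketch of that mapping on hypothetical z-scores (the values are illustrative, and scaling to 0 to 100 is an assumption for readability):

```python
import numpy as np
from scipy import stats

# Hypothetical z-scores (not real data)
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Standard normal CDF: the share of a standard normal population below each z-score
score = 100 * stats.norm.cdf(z)

print(score.round(1))  # [ 2.3 15.9 50.  84.1 97.7]
```

A country with a z-score of 0 lands at 50, meaning it sits in the middle of the pack, while a z-score of 2 lands near 98.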

Let's compare Gibraltar to Macao.

In [None]:
max_vax_df[max_vax_df['country'] == "Gibraltar"]

In [None]:
max_vax_df[max_vax_df['country'] == "Macao"]

The benefits are:
1. We can better compare how countries are doing relative to others.
2. Our scores approximately fit a normal distribution, which makes other types of modeling and analysis easier.
3. The results are easy to communicate.