# BCG - COVID-19 AI Challenge - Hack 2


![](https://s3.ap-southeast-1.amazonaws.com/images.deccanchronicle.com/dc-Cover-qo4lkc94037m2ug36rq5r9q8a6-20200716133831.Medi.jpeg)

The BCG vaccine has been in the news a lot over the past few months, after some [researchers](https://www.medrxiv.org/content/10.1101/2020.03.24.20042937v1) linked the lower death counts in certain countries to the universal BCG vaccination policies in place there. This led to a wider debate on the topic, and several countries started clinical trials to test whether the BCG vaccine may actually provide some protection from Covid in the absence of a reliable vaccine or medication. 

A few weeks ago I came across this challenge, whose purpose is to find insights that might help the ongoing clinical trials for BCG, and I have been working on this hack since then. The purpose of this notebook is to understand the impact of Covid across the globe and to determine whether the severity of the disease differs with the BCG vaccine policy of a country. During my research for this problem, I came across a number of ongoing clinical trials, as well as analyses studying the impact of the BCG vaccine and whether it can be used to reduce deaths caused by Covid. Since the ongoing clinical trials might take some time to complete, hopefully these insights will provide some useful information in the meantime.

# Table of Contents - 
* [Importing the raw datasets & basic overview of datasets](#introduction)
* [Potential Areas of Analysis](#analysis)
* [Status of Covid Pandemic Across the world](#global)
* [BCG - Impact of Vaccination on Covid-19](#bcg)
* [Spain vs Portugal](#spain)
* [Ireland - Cork vs Derry](#ireland)
* [East Germany vs West Germany](#germany)

## Importing the Raw Datasets<a name="introduction"></a>

The competition provides the following files - 

```covid_data.csv```- Contains information about the daily Covid related Confirmed Cases & Deaths for all countries across the globe, sourced from Our World in Data and contains data till 20 June

```bcg_world_atlas.csv``` - Contains the information about BCG policies across countries, start & end date of BCG vaccination, BCG strain used etc.

```germany_province_data.csv``` - Contains information about Covid related cases across Germany.

In addition to these files I also gathered some datasets from the internet, based on the questions that I was looking to answer through my analysis -

```bcg_pnas.csv``` - Dataset used for PNAS research containing details on BCG Coverage and minimum and maximum age groups vaccinated across European countries.

```unicef_bcg_coverage.csv``` - Dataset obtained from the UNICEF data warehouse providing stats on BCG coverage by country from 1980 onwards


```ireland_data.csv``` - Contains information about the covid cases in Ireland across different counties

```spain_age_group.csv``` - Contains information about the covid cases in Spain for different age groups, contains stats like Hospitalization, ICU & Deaths across age groups. 

```portugal_age_group.csv``` - Contains information about Covid deaths in Portugal across different age groups. Sourced from Wikipedia

```covid_de.csv``` - Contains Covid stats across different German provinces

```stringency_index.csv``` - Contains Daily Stringency Index for countries, derived from strictness of lockdown & social distancing measures enforced by countries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import plotly.express as px
import datetime  # the module (not the class) is what the cells below use, e.g. datetime.date(...)
from scipy import stats

import re
from plotly.subplots import make_subplots

import plotly.graph_objects as go

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input/johnshopkinscovid/csse_covid_19_data/csse_covid_19_daily_reports/'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

covid_data=pd.read_csv('/kaggle/input/hackathon/task_2-owid_covid_data-21_June_2020.csv')#OWID Covid cases data till 20 June
germany_province_data=pd.read_csv('/kaggle/input/hackathon/task_2-Gemany_per_state_stats_20June2020.csv',thousands=',')#Covid stats for Germany provinces
#tb_data=pd.read_csv('/kaggle/input/hackathon/task_2-Tuberculosis_infection_estimates_for_2018.csv')
bcg_world_atlas=pd.read_csv('/kaggle/input/hackathon/BCG_world_atlas_data-2020.csv')#BCG World Atlas dataset
#pop_age_group=pd.read_csv('/kaggle/input/country-population-by-age-group/population-by-broad-age-group.csv')
spain_age_group=pd.read_csv('/kaggle/input/portugal-spain-covid-cases-by-age-group/spain_cases_age_group.csv',thousands=",")#Covid cases by age group in Spain
portugal_age_group=pd.read_csv('/kaggle/input/portugal/portugal_cases_age_group.csv')#Covid cases by age group in Portugal
#ni_ireland_covid_data=pd.read_csv('/kaggle/input/ireland-covid-data/Covid_19_Northern_Ireland_Daily_Indicators.csv',parse_dates=True, dayfirst=True)
ireland_data=pd.read_csv('/kaggle/input/ireland-covid-data/ireland_cso_data.csv')#Covid stats from Ireland counties
bcg_pnas=pd.read_csv('../input/pnas-bcg-data/bcg_data_pnas_europe.csv')#BCG Coverage dataset for Europe from PNAS
de_covid_data=pd.read_csv('/kaggle/input/covid19-tracking-germany/covid_de.csv')#Germany covid related stats
unicef_bcg_coverage=pd.read_csv('../input/unicef-bcg-coverage/unicef_bcg_data.csv')#UNICEF BCG coverage related stats
stringency_index=pd.read_csv('../input/owid-covid-stringency-index/covid-stringency-index.csv')

## Potential Areas of Analysis <a name="analysis"></a>

Given all the datasets that are available to us, we can look to answer the following questions - 

* How do daily Covid cases & deaths look across different countries in the world?
* How does the Covid Fatality Rate change across countries? How does the Fatality Rate vary with testing rates?
* How are different variables like Population Density, Diabetes & Cardiovascular disease prevalence, Income status, Number of hospital beds etc. related to Covid deaths in countries?
* Is there a difference in Mortality Rate for countries based on the Mandatory status of the BCG Vaccine?
* Which are the commonly used BCG vaccines across the world? How does the Mortality Rate vary across the different BCG strains being used?
* Is the Fatality Rate consistent across BCG mandatory countries? If it's different, what could explain the difference in Fatality Rates?
* How does the Covid rate vary across two neighboring counties in Ireland with different coverage for the BCG Vaccine?
* How does the impact of Covid vary across Spain & Portugal, neighboring countries with different BCG vaccine policies?
* Does the BCG vaccine provide Covid protection in individuals belonging to higher age groups?
* How does the impact of Covid vary across East & West German provinces, which implemented different vaccination policies after 1950?

## Status of Covid-19 Pandemic Across the World<a name="global"></a>

The OWID Covid dataset available to us provides daily Covid stats for all countries in the world till 20 June. We will start out by performing some exploratory analysis on the dataset to understand the spread of the Covid-19 pandemic and how different regions of the world have been impacted by it.

In [None]:
bcg_world_atlas = bcg_world_atlas.rename(columns={'Contry Name (Mandatory field)': 'location', 'Is it mandatory for all children?': 'mandatory'})
bcg_world_atlas_col=bcg_world_atlas[['location','mandatory','BCG Strain ']].copy()  # copy so the assignments below do not hit a view of bcg_world_atlas
bcg_world_atlas_col['mandatory']=bcg_world_atlas_col['mandatory'].fillna('Unknown')
bcg_world_atlas_col

bcg_world_atlas_col['BCG Strain ']=bcg_world_atlas_col['BCG Strain '].fillna('NA')
bcg_world_atlas_col.loc[bcg_world_atlas_col['BCG Strain '].str.contains('Danish|Staten|SSI',regex=True),'BCG Strain ']='Danish'
bcg_world_atlas_col.loc[bcg_world_atlas_col['BCG Strain '].str.contains('Japan|Tokyo',regex=True),'BCG Strain ']='Japan'
bcg_world_atlas_col.loc[bcg_world_atlas_col['BCG Strain '].str.contains('Pasteur',regex=True),'BCG Strain ']='Pasteur'

In [None]:
stringency_index['Date']= pd.to_datetime(stringency_index['Date']) 
stringency_index.columns=['location','code','Date','stringency_index']

covid_data['date']=pd.to_datetime(covid_data['date']).dt.date
covid_data['Fatality_Rate']=covid_data['total_deaths']/covid_data['total_cases']

covid_data_no_world=covid_data.loc[covid_data['iso_code']!='OWID_WRL']

fig = px.line(covid_data_no_world, x="date", y="total_cases", color='location')
fig.update_layout(title_text='Total Covid confirmed Cases ',xaxis_title_text='date',yaxis_title_text='Total Cases',width=800,height=400)
fig.show()

fig = px.line(covid_data_no_world, x="date", y="total_deaths", color='location')
fig.update_layout(title_text='Total Covid Deaths',xaxis_title_text='date',yaxis_title_text='Total Deaths',width=800,height=400)
fig.show()

The line charts above run from the start of the year till 20 June. Two countries, the US & Brazil, clearly stand out above everyone else in the growth of the number of cases. The US and Brazil, as well as a few European countries like the UK, France & Spain, have very high death counts. Double clicking on a country name in the legend isolates the Covid cases and deaths for that individual country.

In [None]:
covid_data_no_world=covid_data.loc[covid_data['iso_code']!='OWID_WRL']

fig = px.choropleth(covid_data_no_world, locations="iso_code",
                    color="total_deaths", # color each country by its total_deaths value
                    hover_name="location", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.update_layout(title='Covid Deaths across Countries')
fig.show()

The following can be observed from the chart above - 
* **North America** - The **US** has the highest number of deaths in the world. Apart from the US, Mexico seems to have been hit hard by Covid as well, while Canada has a lower number of deaths than the US & Mexico
* **South America** - **Brazil** has been impacted worse than any other country in South America

* **Europe** - There is a **big difference in the number of deaths between Eastern & Western European countries**. Most of the Western European countries seem to have a very high number of deaths, with the exception of Portugal & Ireland. Among the Eastern European countries, Russia has a high number of deaths

* **Asia** - **India & Iran** stand out as the countries with the highest number of deaths. China, the original epicentre of Covid, seems to have fallen behind other countries in Asia in terms of deaths

* **Africa** - The entire continent seems fairly consistent in reporting low deaths. This could potentially be due to deaths not being tracked because of low testing numbers. It could also be due to less travel to African countries from the initial hubs of Covid

In [None]:
covid_data_no_world_jun=covid_data_no_world.loc[covid_data_no_world['date']==datetime.date(year=2020,month=6,day=20)]
covid_data_no_world_2000=covid_data_no_world_jun.loc[covid_data_no_world_jun['total_cases']>=10000]  # mask built from the same frame; the cut-off is 10,000 cases despite the "_2000" name
fig = px.scatter(covid_data_no_world_2000, x="total_cases_per_million", y="Fatality_Rate",log_x=True, color="continent",hover_name="location",text="location")
fig.update_traces(textposition='top center')
fig.update_layout(
    height=600,
    title_text='Total Cases vs Fatality Rate',
    yaxis_tickformat = '%'
)
fig.show()

In [None]:
tests=covid_data_no_world_2000.loc[covid_data_no_world_2000['total_tests'].notna()]
fig = px.scatter(tests, x="total_cases_per_million", y="Fatality_Rate",log_x=True,size='total_tests_per_thousand', color="continent",hover_name="location",text="location",size_max=60)
fig.update_traces(textposition='top center')
fig.update_layout(
    height=600,
    title_text='Total Cases vs Fatality Rate - Bubble size indicates tests_per_thousand',
    yaxis_tickformat = '%'
)
fig.show()

The following emerges from the scatter plots above - 

* A large number of countries have a Fatality Rate between **0-5%**
* The European countries **Belgium, UK, France, Netherlands & Italy** stand out from the pack when it comes to Fatality Rate. Belgium has a Fatality Rate of 15%, which means that out of every 100 confirmed Covid cases, 15 ended in death, much higher than the rest of the world
* **Mexico** has the highest Fatality Rate in North America at more than 10%, which is much higher than the rate for the United States
* **Ecuador** has the highest Fatality Rate in South America, at more than 10%
* Middle East countries like **Qatar, Bahrain & Saudi Arabia** have the highest number of tests conducted per thousand people
* In Africa, testing stats are available only for Nigeria & South Africa. For South America, these numbers are only available for Peru, Chile & Colombia
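
The Fatality Rate used throughout this notebook is the case fatality rate, i.e. cumulative deaths divided by cumulative confirmed cases. A minimal sketch on invented numbers (the three rows below are illustrative and are not taken from the OWID file), which also shows one possible way to guard against a zero case count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the OWID file, the values do not.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [1000, 200, 0],
    "total_deaths": [150, 4, 0],
})

# Turn zero case counts into NaN so the ratio is undefined instead of 0/0.
cases = toy["total_cases"].replace(0, np.nan)
toy["Fatality_Rate"] = toy["total_deaths"] / cases
print(toy)
```

Country A mirrors the Belgium reading above: 150 deaths out of 1000 confirmed cases is a 15% rate. Because the denominator is confirmed cases, the rate depends on how widely a country tests.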

There seems to be a trend emerging that high income countries tend to report a higher number of Covid related deaths than lower income countries. Lower income countries like Nigeria, India, Pakistan, Peru etc. report a lower number of tests as well as lower Fatality Rates.

Let us now look at correlation between covid deaths and different country level variables available in the OWID dataset to understand which factors have a high correlation with covid deaths.

In [None]:
#Correl Matrix different variables
import seaborn as sns
# Keep numeric columns only (no 'date') and list each column once, so that fillna/mean/corr behave predictably
covid_data_sel=covid_data_no_world_jun[['total_cases','total_deaths','Fatality_Rate','total_cases_per_million','gdp_per_capita','diabetes_prevalence','total_tests','total_tests_per_thousand','population_density','population','median_age','aged_65_older','hospital_beds_per_thousand','stringency_index']].copy()
covid_data_sel['total_tests']=covid_data_sel['total_tests'].fillna(0)
#covid_data_sel['Test Positivity Rate']=covid_data_sel['total_cases']/covid_data_sel['total_tests']
covid_data_sel_cor=covid_data_sel.fillna(covid_data_sel.mean())
corr = covid_data_sel_cor.corr(method='spearman')
corr
ax = sns.heatmap(corr,linewidths=.5,cmap='YlGnBu')

Since the trend for Covid cases in most countries is non-linear, it may not make sense to look at the linear Pearson correlation. Instead we looked at Spearman's rank correlation between the Covid related numbers and the other variables available in the base dataset. Not many strong correlations emerge from this analysis. The following are the only strong relationships I was able to observe - 
* **gdp_per_capita** has a strong correlation with **total_tests_per_thousand**, indicating that maybe only countries with huge resources are able to test extensively. Poorer countries may not be able to test a lot and could be underreporting the total number of cases
* **gdp_per_capita** also has a strong correlation with **median_age**, meaning that richer countries also have significantly older populations. My assumption is that most of these are European countries
* **median_age** is strongly correlated with **hospital_beds_per_thousand**, showing that countries with older populations also have greater healthcare resources at their disposal
* **Fatality_Rate** isn't strongly correlated with any of the other variables available in the dataset
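
To illustrate why Spearman's correlation was preferred, here is a small sketch on synthetic data (the exponential relationship is an assumption chosen for illustration, not something measured in the OWID file). For a relationship that is monotonic but not linear, the rank-based measure stays at 1 while the linear one drops:

```python
import numpy as np
import pandas as pd

# Synthetic monotonic but non-linear relationship (illustrative only).
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 2.0)})

# Pearson measures linear association, Spearman measures rank agreement.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman only uses the ordering of the values, so it is also less sensitive to a handful of very large countries dominating the result.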

Let us look at how Fatality Rates for Covid varies across different countries.

In [None]:
#Sort dataframe on a column
covid_data_no_world_jun_rel=covid_data_no_world_jun.loc[covid_data_no_world_jun['total_cases']>1000]
covid_data_no_world_jun_rel_sort=covid_data_no_world_jun_rel.sort_values('Fatality_Rate',ascending=False)
max_fat=covid_data_no_world_jun_rel_sort.head(5)
min_fat=covid_data_no_world_jun_rel_sort.tail(5)
max_fat
fig = make_subplots(rows=1, cols=2,subplot_titles=("Maximum Fatality Rate", "Minimum Fatality Rate"))

fig.add_trace(
    go.Bar(x=max_fat['location'], y=max_fat['Fatality_Rate'],name='Countries'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=min_fat['location'], y=min_fat['Fatality_Rate'],name='Countries'),
    row=1, col=2
)

fig.update_layout(height=400, width=800, title_text="Countries with highest & lowest Fatality Rate")
fig.update_yaxes(tickformat='%')  # apply the percent format to both subplots, not only the first y-axis
fig.show()

In the chart above we have plotted the 5 countries with the highest and the 5 countries with the lowest Covid fatality rates. Interestingly, the countries with the highest fatality rates all belong to Europe and the countries with the lowest fatality rates all belong to Asia. In the upcoming steps of the notebook it will also be interesting to look at the status of BCG vaccine administration in these countries. Let's also look at the daily progression of the fatality rate in these countries through a line chart.

In [None]:
# Fatality rate over time for the 5 countries with the highest fatality rates
fig = go.Figure()

country_list=['France','Belgium','Italy','United Kingdom','Hungary']
for country in country_list:
    country_data=covid_data_no_world[covid_data_no_world['location']==country]
    fig.add_trace(go.Scatter(x=country_data['date'], y=country_data['Fatality_Rate'],name=country))
fig.update_layout(title='Countries with Highest Fatality Rates',yaxis_tickformat = '%',height=400)
fig.show()


# Same chart for the 5 countries with the lowest fatality rates
fig1=go.Figure()
country_list=['Uzbekistan','Bahrain','Nepal','Qatar','Singapore']
for country in country_list:
    country_data=covid_data_no_world[covid_data_no_world['location']==country]
    fig1.add_trace(go.Scatter(x=country_data['date'], y=country_data['Fatality_Rate'],name=country,mode='lines'))
fig1.update_layout(title='Countries with Lowest Fatality Rates',yaxis_tickformat = '%',height=400)

fig1.show()

For the countries with the highest fatality rates, Italy saw the spike in cases first, around the 1st week of April, reportedly due to incoming travelers from China. The other countries witnessed the start of their spike around mid-April. The Fatality Rate continued to increase in May and eventually settled around the 1st week of June. All these countries saw Fatality Rates well in excess of 10%. 

For the countries with the lowest fatality rates, even at the spikes the fatality rate was around 1%, much lower than in the countries mentioned earlier. These countries saw a short spike, after which the fatality rate dropped steeply to less than 1%. Most of these countries are small, with very small populations.

## Impact of BCG Vaccination on Covid-19<a name="bcg"></a>

After having gained some initial understanding of the spread of Covid-19 pandemic across the globe, let's start taking a deeper look at the BCG World Atlas dataset and begin understanding the prevalence of BCG vaccine across the globe and whether it has any effect on developing immunity against Covid-19. 


![](https://specials-images.forbesimg.com/imageserve/5e9292cb62b1b400075cb0d9/960x0.jpg?fit=scale)

As shown in the image above, most countries in Africa, South America & Asia currently have a BCG vaccination policy for all their citizens. The majority of European countries once had a national vaccination policy but have since moved away from it, either because of the global vaccine shortage or because of low TB incidence rates.

In [None]:
unicef_bcg_data=unicef_bcg_coverage[['Geographic area','TIME_PERIOD','OBS_VALUE']].copy()  # copy before renaming columns
unicef_bcg_data.columns=['location','year','% BCG Coverage']
country_continent=covid_data[['location','continent']].drop_duplicates(keep='first')
unicef_bcg_data_continent=pd.merge(unicef_bcg_data,country_continent,how='inner',on='location')  # 'location' is the only shared column
#unicef_bcg_data_continent
unicef_bcg_data_2015=unicef_bcg_data_continent.loc[unicef_bcg_data_continent['year']==2015]
#unicef_bcg_data_2015
fig = px.histogram(unicef_bcg_data_2015, x="% BCG Coverage",title='Distribution of BCG coverage in 2015 across Countries',height=400)
fig.show()
unicef_bcg_data_2015[unicef_bcg_data_2015['% BCG Coverage']<=60]

unicef_bcg_data_continent_gp=unicef_bcg_data_continent.groupby(['continent','year']).agg({'% BCG Coverage':'median'}).reset_index()
fig = px.line(unicef_bcg_data_continent_gp, x="year", y="% BCG Coverage", title='BCG Coverage trends by Continent',color='continent')
fig.show()

One of the things we need to look at is the coverage rate for BCG vaccines, i.e. the share of newborn infants who are administered this vaccine. I managed to find a dataset on the UNICEF website which provides stats on % BCG Coverage by country from the year 1980. It seems that by 2015 the majority of countries had more than 90% BCG coverage, which means that the countries which had mandated the use of the BCG vaccine were able to implement their policies effectively. There were a few outliers as well: countries with a coverage rate below 60%, typically in Europe, where vaccine shortages could have been one reason for the drop in coverage. 

I also looked at the trends in BCG coverage across continents using the median coverage values. These values were pretty low in the early 1980s all across the world except for Europe, but they picked up by the 1990s and have stayed high ever since.

In [None]:
country_continent=covid_data[['location','continent']]
country_continent=country_continent.drop_duplicates(subset=['location'], keep='first')
bcg_world_atlas['BCG Strain ']=bcg_world_atlas['BCG Strain '].fillna('NA')

bcg_world_atlas_country=pd.merge(bcg_world_atlas,country_continent,how="left",left_on='location',right_on='location')

#bcg_world_atlas_country['BCG Strain '].str.contains('Danish')

bcg_world_atlas_country.loc[bcg_world_atlas_country['BCG Strain '].str.contains('Danish|Staten|SSI',regex=True),'BCG Strain ']='Danish'
bcg_world_atlas_country.loc[bcg_world_atlas_country['BCG Strain '].str.contains('Japan|Tokyo',regex=True),'BCG Strain ']='Japan'
bcg_world_atlas_country.loc[bcg_world_atlas_country['BCG Strain '].str.contains('Pasteur',regex=True),'BCG Strain ']='Pasteur'

# # #bcg_world_atlas_country.loc[gp['BCG Strain'].str.contains('Danish'|'Staten'),'BCG Strain']='Danish'

gp=bcg_world_atlas_country.groupby(['BCG Strain ','continent']).agg({'location':'nunique'}).reset_index(drop=False)
gp.columns=['BCG Strain','continent','location']
gp.sort_values('location',ascending=False,inplace=True)
gp = gp[gp['BCG Strain']!='NA']
k=gp.pivot(index='BCG Strain', columns='continent', values='location').reset_index()
k.fillna(0, inplace=True)
k['total']=k.iloc[:,1:].sum(axis=1)  # sum over every continent column, however many the pivot produced
k.sort_values('total',ascending=False).head(5)

Many different strains are in use across the globe. Within the dataset the same strain is sometimes listed under different names, so we have aggregated these under one name. The Danish strain is the most common BCG strain used across the globe: as many as 16 countries in Europe use it, and it is also used on the other continents. Japanese, Pasteur, Serum Institute of India and Moscow are some of the other strains. 
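
The aggregation of strain names described above can be sketched as follows. The regex patterns are the ones used in this notebook, while the raw labels are hypothetical examples and not rows copied from the BCG World Atlas:

```python
import pandas as pd

# Hypothetical raw labels; the real atlas contains many more spellings.
strains = pd.Series([
    "BCG Danish 1331", "Statens Serum Institut", "Tokyo 172", "Pasteur 1173 P2", None,
])

# Regex pattern -> canonical strain name (same patterns as the notebook cells).
patterns = {"Danish|Staten|SSI": "Danish", "Japan|Tokyo": "Japan", "Pasteur": "Pasteur"}

clean = strains.fillna("NA")
for pattern, name in patterns.items():
    # Replace every label matching the pattern with the canonical name.
    clean = clean.mask(clean.str.contains(pattern, regex=True), name)
print(clean.tolist())
```

Keeping the patterns in one dictionary means the same mapping can be reused wherever the strain column is cleaned, rather than repeating the three `.loc` assignments.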

We will now be looking at fatality rates in countries based on the status of BCG vaccine administration within the country. As the BCG world atlas data is incomplete, we will classify countries into three groups
*  BCG Mandatory Yes - Countries where BCG vaccination is mandatory for all
*  BCG Mandatory No - Countries where BCG vaccination is not mandatory and is only given to certain high risk populations
*  BCG Status Unknown - Countries for which data on mandatory status is not available in BCG World Atlas
  
  We will now be diving deeper into the **BCG Mandatory Yes** and **BCG Mandatory No** groups and looking at fatality rates within these countries
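
When a country appears more than once in the atlas, it matters which row survives deduplication. The sketch below is one possible approach, not the method used in the cells that follow: it assumes the status values are the lowercase strings `yes` and `no` plus the filled-in `Unknown`, and uses an ordered categorical so that a known status is preferred over `Unknown`. The toy rows are invented.

```python
import pandas as pd

# Hypothetical atlas rows: country X appears twice, once without a status.
atlas = pd.DataFrame({
    "location": ["X", "X", "Y", "Z"],
    "mandatory": ["yes", None, "no", None],
})
atlas["mandatory"] = atlas["mandatory"].fillna("Unknown")

# Explicit ordering so 'yes'/'no' sort ahead of 'Unknown'
# (a plain string sort puts the capitalized 'Unknown' first).
order = pd.CategoricalDtype(["yes", "no", "Unknown"], ordered=True)
atlas["mandatory"] = atlas["mandatory"].astype(order)

unique = (atlas.sort_values("mandatory")
               .drop_duplicates(subset="location", keep="first")
               .sort_values("location")
               .reset_index(drop=True))
print(unique)
```

With this ordering country X keeps its `yes` row, and only country Z, which has no known status at all, ends up in the Unknown group.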

In [None]:
bcg_world_atlas = bcg_world_atlas.rename(columns={'Contry Name (Mandatory field)': 'location', 'Is it mandatory for all children?': 'mandatory'})
bcg_world_atlas_col=bcg_world_atlas[['location','mandatory','BCG Strain ']].copy()  # copy so the assignments below do not hit a view of bcg_world_atlas
bcg_world_atlas_col['mandatory']=bcg_world_atlas_col['mandatory'].fillna('Unknown')
bcg_world_atlas_col

bcg_world_atlas_col['BCG Strain ']=bcg_world_atlas_col['BCG Strain '].fillna('NA')
bcg_world_atlas_col.loc[bcg_world_atlas_col['BCG Strain '].str.contains('Danish|Staten|SSI',regex=True),'BCG Strain ']='Danish'
bcg_world_atlas_col.loc[bcg_world_atlas_col['BCG Strain '].str.contains('Japan|Tokyo',regex=True),'BCG Strain ']='Japan'
bcg_world_atlas_col.loc[bcg_world_atlas_col['BCG Strain '].str.contains('Pasteur',regex=True),'BCG Strain ']='Pasteur'

In [None]:

#bcg_world_atlas_col.loc[bcg_world_atlas_col['mandatory']=='Unknown','mandatory']='Z'
bcg_world_atlas_col=bcg_world_atlas_col.sort_values('mandatory',ascending=True)
bcg_world_atlas_col_unique=bcg_world_atlas_col.drop_duplicates(subset=['location'],keep='first').copy()
#bcg_world_atlas_col.loc[bcg_world_atlas_col['mandatory']=='Z','mandatory']='Unknown'

covid_data_no_world=covid_data_no_world.copy()  # work on a copy to avoid SettingWithCopyWarning
covid_data_no_world.loc[covid_data_no_world['location']=='United States','location']='United States of America'
bcg_world_atlas_col_unique.loc[bcg_world_atlas_col_unique['location']=='France','mandatory']='no'
covid_data_no_world_bcg=pd.merge(left=covid_data_no_world,right=bcg_world_atlas_col_unique,how='inner',on='location')
covid_data_no_world_bcg.groupby(['mandatory']).agg({'location':'nunique'}).reset_index()

In [None]:
covid_data_no_world_bcg[covid_data_no_world_bcg['mandatory']=='Unknown']['location'].unique()
#covid_data_no_world_bcg.location.unique()

In [None]:
k=covid_data_no_world_bcg.groupby(['date','mandatory']).agg({'total_deaths':['sum'],'total_cases':['sum']}).reset_index()
k.columns=['date','mandatory','total_deaths','total_cases']
k['Fatality_Rate']=k['total_deaths']/k['total_cases']
fig = px.line(k, x="date", y="Fatality_Rate", color='mandatory')
fig.update_layout(title_text='Comparison of Fatality Rate by BCG Vaccine Status',yaxis_tickformat = '%')
fig.show()

yes_sample=covid_data_no_world_bcg.loc[(covid_data_no_world_bcg['mandatory'].isin(['yes']))&(covid_data_no_world_bcg['date']==datetime.date(year=2020,month=6,day=20))]
no_sample=covid_data_no_world_bcg.loc[(covid_data_no_world_bcg['mandatory'].isin(['no']))&(covid_data_no_world_bcg['date']==datetime.date(year=2020,month=6,day=20))]

yes_sample=yes_sample['Fatality_Rate']
no_sample=no_sample['Fatality_Rate']
#yes_sample

stats.ttest_ind(yes_sample,no_sample)

The above chart shows how fatality rates differ over time between countries grouped by BCG status. It's interesting that there is a big difference between the groups. **Around June 20, non-BCG Mandatory countries had a combined Fatality Rate of 7% compared to the Fatality Rate of 3% for countries where BCG is Mandatory**. However, a lot of the countries where BCG isn't mandatory are in Europe & North America, regions with a low prevalence of TB that have also reported a high number of fatalities. 

Also, the Fatality Rate in BCG Mandatory countries remained flat from the start with no large spike, whereas for non-BCG Mandatory countries there was a continuous rise in Fatality Rate before it stabilized around June. 

We also performed a **t-test** to see if the difference in Fatality Rates between BCG mandatory and non-mandatory countries was due to chance. The p-value turned out to be <0.05, indicating that the difference is statistically significant. However, we cannot be sure that the underlying populations of the two groups are comparable, given the differences in demographics, income, testing etc.
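
Since the two groups are unlikely to have equal variances, Welch's variant of the t-test is a reasonable alternative to the default test used above. The sketch below runs on made-up fatality rates (not the notebook's data) and also skips missing values, which would otherwise turn the result into NaN:

```python
import numpy as np
from scipy import stats

# Made-up country-level fatality rates, for illustration only.
yes_sample = np.array([0.02, 0.03, 0.025, 0.04, np.nan, 0.035])
no_sample = np.array([0.05, 0.09, 0.12, 0.04, 0.07])

# equal_var=False selects Welch's test; nan_policy='omit' skips missing rates.
result = stats.ttest_ind(yes_sample, no_sample, equal_var=False, nan_policy="omit")
print(result.statistic, result.pvalue)
```

A negative statistic here means the first group has the lower mean. The test still only compares group means and says nothing about why they differ.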

It would not be wise to just look at the above line chart and make a conclusion that lower Fatality Rate in BCG mandatory countries is due to the vaccine. There could be other demographic, social & economic factors that could be responsible for low spread, and it could also be the case that these countries are under-reporting covid cases and deaths due to lack of testing infrastructure.

In [None]:
covid_data_no_world_bcg_1000=covid_data_no_world_bcg.loc[(covid_data_no_world_bcg['total_cases']>=1000)&(covid_data_no_world_bcg['date']==datetime.date(year=2020,month=6,day=20))]
bcg_yes_no_comp=covid_data_no_world_bcg_1000.groupby('mandatory').agg({'population':'median','total_cases':'median','total_cases_per_million':'median','total_tests':'median','total_tests_per_thousand':'median','extreme_poverty':'mean','diabetes_prevalence':'mean','cvd_death_rate':'median','hospital_beds_per_thousand':'mean','Fatality_Rate':'median','aged_65_older':'median','life_expectancy':'mean','female_smokers':'mean','male_smokers':'mean'})
bcg_yes_no_comp

We now look at some of the economic and demographic variables to understand the differences between BCG & non-BCG countries. We only included countries with at least 1000 cases in this analysis. Following are the observations -  
* Non-BCG countries had **twice** as many **total_tests_per_thousand** compared to BCG countries
* BCG countries had **twice** the value of **extreme_poverty** compared to non-BCG countries
* BCG countries had **twice** the value of **cvd_death_rate**(Cardiovascular Death Rate) showing higher percentage of citizens in these countries were vulnerable to heart diseases
* Non-BCG countries had higher **hospital_beds_per_thousand**
* Non-BCG countries had **thrice** the value of **aged_65_older**, showing that these countries had higher percentage of vulnerable population

These pointers clearly show that it's not wise to claim that BCG Mandatory countries have lower Fatality Rates solely due to the vaccine. There are other factors at work, such as lower testing and a lower percentage of older population, which could be the reason for the lower number of deaths. Thorough hypothesis testing and regression analysis would allow us to determine the contribution of BCG vaccine status to a country's Fatality Rate.
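
To make the confounding argument concrete, here is a sketch on fully synthetic data. The assumptions are stated in the comments: fatality is generated from the share of older population alone, and BCG-mandatory countries are simply made younger on average. A naive group comparison then shows a gap, while a regression that adjusts for age attributes almost none of it to BCG:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic countries: BCG-mandatory ones are made younger on average (assumption).
bcg = rng.integers(0, 2, size=n)
aged_65 = np.where(bcg == 1, 6.0, 16.0) + rng.normal(0, 3, size=n)

# Fatality is generated from age alone; BCG has no true effect in this toy setup.
fatality = 0.01 + 0.003 * aged_65 + rng.normal(0, 0.005, size=n)

# Naive difference in group means vs. least-squares fit adjusting for age.
naive_gap = fatality[bcg == 1].mean() - fatality[bcg == 0].mean()
X = np.column_stack([np.ones(n), bcg, aged_65])
coef, *_ = np.linalg.lstsq(X, fatality, rcond=None)
print(f"naive gap: {naive_gap:.4f}, adjusted BCG coefficient: {coef[1]:.4f}")
```

This does not show that BCG has no effect in the real data. It only shows that a gap of the kind seen above can arise from an age difference alone, which is why an adjusted analysis is needed.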

In [None]:
tests=covid_data_no_world_bcg_1000.loc[covid_data_no_world_bcg_1000['total_tests'].notna()]
tests=tests.loc[tests['mandatory'].notna()]  # build the mask from the frame being filtered

fig = px.scatter(tests, x="total_cases_per_million", y="Fatality_Rate",log_x=True,size='total_tests_per_thousand', color="mandatory",hover_name="location",text="location",size_max=60)
fig.update_traces(textposition='top center')
fig.update_layout(
    height=600,
    title_text='BCG Status - Total Cases vs Fatality Rate - Bubble Size indicates total_tests_per_thousand',
    yaxis_tickformat = '%'
)
fig.show()

We also look at the spread of cases within individual countries through ```total_cases_per_million``` as well as ```Fatality_Rate```, broken out by the BCG status of the countries -
* Non-BCG mandatory countries like Canada, Greece, Switzerland, France etc. have a higher fatality rate than other countries
* ```total_tests_per_thousand``` seems to be even across BCG Mandatory & non-mandatory countries, with some BCG mandatory countries having higher tests. This includes Bahrain & Qatar in the Middle East
* ```total_cases_per_million``` also seems to be even across BCG Mandatory & non-mandatory countries

In [None]:
fig = px.violin(covid_data_no_world_bcg_1000, x='mandatory', color='mandatory',y='Fatality_Rate',box=True) 
fig.update_layout(template='seaborn',title='Distribution of Fatality Rate for Countries divided by BCG Status',legend_title_text='BCG mandatory',yaxis_tickformat = '%')

fig.show()

The violin plot above shows the distribution of individual countries' Fatality Rates for the different categories of BCG vaccine status, as a snapshot on 20 June. The median fatality rate for BCG mandatory countries is 3% compared to 5% for non-BCG mandatory countries. Even within the BCG mandatory group there are countries with a Fatality Rate above 10%; we would need to dive deeper into the data for those countries to understand the factors behind it.
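
The median here differs from the combined rate on the earlier line chart because the two are computed differently: the line chart pools deaths and cases over all countries in a group, while the violin plot takes the median of the per-country rates. A small sketch on invented numbers shows how one large country can pull the pooled rate well above the median:

```python
import pandas as pd

# Hypothetical group of three countries (numbers invented for illustration).
toy = pd.DataFrame({
    "location": ["Large", "Small1", "Small2"],
    "total_cases": [100000, 2000, 1000],
    "total_deaths": [10000, 40, 10],
})
toy["Fatality_Rate"] = toy["total_deaths"] / toy["total_cases"]

# Pooled rate weights each country by its case count; the median does not.
pooled = toy["total_deaths"].sum() / toy["total_cases"].sum()
median = toy["Fatality_Rate"].median()
print(f"pooled: {pooled:.3f}, median of country rates: {median:.3f}")
```

Which of the two is more appropriate depends on whether the question is about the typical country or about the typical patient.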

In [None]:
bcg_yes=covid_data_no_world_bcg_1000.loc[covid_data_no_world_bcg_1000['mandatory']=='yes']

fig = px.violin(bcg_yes, x='BCG Strain ',y='Fatality_Rate',box=True) 
fig.update_layout(template='seaborn',title='Distribution of Fatality Rate for Countries divided by BCG Strain',legend_title_text='State',yaxis_tickformat = '%')

fig.show()

We also looked at Fatality Rates for different strains of the BCG vaccine to understand their efficacy. To get a more reliable estimate of Fatality Rate per strain, we only included countries where the BCG vaccine was mandatory, since that is where coverage, and therefore any protection offered, would be most complete. At an initial glance there is not much difference between strains, however the **Serum Institute of India** & **Japan** strains seem to have lower Fatality Rates of around 2%, roughly 1-2 percentage points lower than other strains.

In [None]:
bcg_pnas_col=bcg_pnas[['Country','Age oldest  vaccinated','BCG Coverage']]

bcg_merge=pd.merge(bcg_pnas_col,covid_data_no_world_bcg_1000,left_on='Country',right_on='location',how='inner')

fig = px.scatter(bcg_merge, x="Age oldest  vaccinated", y="Fatality_Rate", trendline="ols",hover_name="location",text="location")
fig.update_traces(textposition='top center')
fig.update_layout(
    height=300,
    title_text='Age Oldest Vaccinated vs Fatality Rate',
    yaxis_tickformat = '%'
)
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

We also looked at the PNAS dataset available here - https://www.pnas.org/content/early/2020/07/07/2008410117

It provides the age of the oldest vaccinated person in a country, based on the BCG vaccine start date. This data is available for European countries only. We want to see whether a longer history of vaccination provides greater protection against Covid.
* The R-squared was low at 0.25, although this could partly reflect how the data is distributed; a few data transformations might raise it
* The countries in Europe that do not have any BCG vaccination policy, **Belgium & Netherlands**, have the highest Fatality Rate
* The majority of the Eastern European countries lie in the **60-80** bracket and enjoy **lower fatality rates** than the Western European countries

This piece of analysis is a bit incomplete, as we haven't looked at the youngest vaccinated age: some countries, like Sweden, started vaccination early in 1945 but stopped it around 1975. I will be looking at this a bit deeper in future versions of the notebook.
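To illustrate the R-squared point above, here is a minimal sketch of how the fit quality of a straight line can be compared before and after a transformation. The data points are hypothetical (they are not the PNAS values), and the log transform is just one candidate, not a claim about what works best for the real data.

```python
import numpy as np

# Hypothetical (age of oldest vaccinated, fatality rate) pairs, not the PNAS data
age_oldest = np.array([0, 0, 40, 55, 60, 70, 75, 80], dtype=float)
fatality = np.array([0.15, 0.12, 0.09, 0.11, 0.04, 0.05, 0.03, 0.04])

def r_squared(x, y):
    """R-squared of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_raw = r_squared(age_oldest, fatality)          # fit on the raw rates
r2_log = r_squared(age_oldest, np.log(fatality))  # fit on log-transformed rates
print(f"R-squared raw: {r2_raw:.2f}, log-transformed: {r2_log:.2f}")
```

For a single predictor this R-squared equals the squared Pearson correlation, so it is sensitive to outliers and to skewed distributions.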

In [None]:
bcg_yes=covid_data_no_world_bcg.loc[covid_data_no_world_bcg['mandatory']=='yes']
bcg_fat_max=bcg_yes.loc[bcg_yes['date']==datetime.date(year=2020,month=6,day=20)]
bcg_fat_max=bcg_fat_max.loc[bcg_fat_max['total_cases']>=1000]

bcg_fat_max=bcg_fat_max.sort_values('Fatality_Rate',ascending=False)
head=bcg_fat_max.head(5)
#head

bcg_no=covid_data_no_world_bcg.loc[covid_data_no_world_bcg['mandatory']=='no']
bcg_no_fat_max=bcg_no.loc[bcg_no['date']==datetime.date(year=2020,month=6,day=20)]
bcg_no_fat_max=bcg_no_fat_max.loc[bcg_no_fat_max['total_cases']>=1000]
bcg_no_fat_max=bcg_no_fat_max.sort_values('Fatality_Rate',ascending=False)

In [None]:

bcg_fatality_top5=covid_data_no_world_bcg[covid_data_no_world_bcg['location'].isin(['Hungary','Mexico','Algeria','Ireland','Niger'])]
fig = px.line(bcg_fatality_top5, x="date", y="Fatality_Rate", color='location')
fig.update_xaxes(nticks=10)
fig.update_layout(title='BCG Mandatory Status Countries - Highest Fatality Rate',yaxis_tickformat = '%',height=400)

fig.show()

bcg_fatality_bottom5=covid_data_no_world_bcg[covid_data_no_world_bcg['location'].isin(['Oman','Maldives','Qatar','Nepal','Singapore'])]
fig = px.line(bcg_fatality_bottom5, x="date", y="Fatality_Rate", color='location')
fig.update_layout(title='BCG Mandatory Status Countries - Lowest Fatality Rate',yaxis_tickformat = '%',height=400)
fig.show()

In [None]:
bcg_fatality_top5.groupby('location').agg({'total_cases':'max','total_cases_per_million':'median','population':'median','total_tests':'max','total_tests_per_thousand':'max','extreme_poverty':'mean','diabetes_prevalence':'mean','cvd_death_rate':'median','hospital_beds_per_thousand':'mean','Fatality_Rate':'median','aged_65_older':'median','life_expectancy':'mean','female_smokers':'mean','male_smokers':'mean'})

In [None]:
bcg_fatality_bottom5.groupby('location').agg({'total_cases':'max','total_cases_per_million':'median','population':'median','total_tests':'max','total_tests_per_thousand':'max','extreme_poverty':'mean','diabetes_prevalence':'mean','cvd_death_rate':'median','hospital_beds_per_thousand':'mean','Fatality_Rate':'median','aged_65_older':'median','life_expectancy':'mean','female_smokers':'mean','male_smokers':'mean'})

Diving deeper into the results of the violin plot above, we looked at the BCG mandatory countries and identified those with the lowest and highest fatality rates. Algeria, Hungary, Ireland, Mexico & Niger were among the countries with the highest fatality rates, although all of these except Mexico had a fairly low number of total cases. Mexico had around 175k cases and a 12% Fatality Rate, which still seems to be rising. Except for Hungary and Mexico, the Fatality Rate seems to have stabilized.

* **Mexico seems to have a much higher prevalence of Diabetes at 13.06** compared to the other countries here, although it had a lower Cardiovascular death rate
* **Mexico** had very low **tests_per_thousand**, which means that the true impact of Covid might be worse than what is visible through these numbers
* **Niger** had a low **life expectancy of 62**, pointing towards poorer healthcare infrastructure in the country. It also has a very **high extreme_poverty value**
* **Hungary** had a **high percentage of population older than 65**
* Countries with the lowest Fatality Rates all had a life expectancy over 70.

## Spain vs Portugal <a name="spain"></a>

After having had a look at the global stats, let's now look at pairs of neighboring countries or regions to understand the variations in Covid deaths based on BCG policy. These comparisons should help us make more reliable claims, since neighboring regions can be expected to be similar in economic, demographic, ethnic, social & cultural terms, which reduces the differences that could otherwise be attributed to these variables.

The very first case we look at is Spain vs Portugal, since there is a significant difference in Covid fatalities between these countries despite them being neighbors. The two countries also differ in their BCG status: Portugal has the BCG vaccine mandatory for all its citizens, whereas Spain stopped its vaccination program in 1981.

In [None]:
sppo=covid_data_no_world_bcg.loc[covid_data_no_world_bcg['location'].isin(['Spain','Portugal'])]

# Copy the slice so that adding a column below does not trigger SettingWithCopyWarning
sp=sppo.loc[sppo['date']>=datetime.date(year=2020,month=3,day=10)].copy()

sp['cases_per_million']=(sp['total_cases']/sp['population'])*1000000

fig = px.line(sp, x="date", y="total_cases", color='location')
fig.update_layout( title_text="Spain vs Portugal - Total Covid Cases",height=400)
fig.show()

fig = px.line(sp, x="date", y="cases_per_million", color='location')
fig.update_layout( title_text="Spain vs Portugal - Covid Cases per Million",height=400)
fig.show()

fig = px.line(sp, x="date", y="Fatality_Rate", color='location')
fig.update_layout( title_text="Spain vs Portugal - Fatality Rate",height=400,yaxis_tickformat = '%')
fig.show()



In [None]:
splo=stringency_index[(stringency_index.location.isin(['Spain','Portugal']))&(stringency_index['Date']>='2020-03-15')&(stringency_index['Date']<='2020-06-20')]

fig = px.line(splo, x="Date", y="stringency_index", color='location')
fig.update_layout( title_text="Spain vs Portugal - Lockdown Stringency Index",height=400)
fig.show()

As we can see above, there is a big difference in the total number of cases reported in Spain & Portugal: Portugal reported around 38k cases by 20 June, while Spain reported around 245k cases by the same date. However, this difference could be due to the difference in size and population between the two countries, as Spain's population is more than 4 times that of Portugal.

A more reliable metric is therefore **total_cases_per_million**. Even here Spain is higher, at 5280 cases per million by 20 June against 3770 for Portugal, but the gap is much smaller than for total cases.

Next we look at the **Fatality Rate**, and this is where a huge difference is noticeable. In Spain the Fatality Rate went from around 2% in mid-March to 12% in June, whereas in Portugal it only increased from 0% to 4% over the same period. On 20 June there was a difference of 8 percentage points between the Fatality Rates of Spain & Portugal.

This shows that Portugal had lower **total_cases_per_million** as well as a lower **Fatality Rate** than Spain.
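The effect of normalizing by population can be sketched with a few lines of arithmetic. The case counts below are the rounded figures quoted above; the population values are approximate assumptions added for illustration and do not come from the dataset, so the results will only roughly match the per-million numbers in the text.

```python
def per_million(count, population):
    """Scale a raw count to a rate per one million inhabitants."""
    return count / population * 1_000_000

# Case counts: rounded figures from the text.
# Populations: approximate values assumed for illustration only.
spain_cases, spain_population = 245_000, 46_750_000
portugal_cases, portugal_population = 38_000, 10_200_000

raw_ratio = spain_cases / portugal_cases
per_million_ratio = (per_million(spain_cases, spain_population)
                     / per_million(portugal_cases, portugal_population))

print(f"Spain/Portugal ratio, total cases: {raw_ratio:.1f}")
print(f"Spain/Portugal ratio, cases per million: {per_million_ratio:.1f}")
```

The ratio shrinks considerably once population is taken into account, which is why the per-million metric is the fairer comparison.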

Lockdown measures in both countries were almost equally strict, with Portugal relaxing its lockdown measures slightly earlier than Spain.

Next we dive deeper into this data and look at hospitalizations & ICU cases in Spain, as well as fatalities by age group in Portugal.

In [None]:
spain_age_group.loc[spain_age_group['Age Group']=='80-89','Age Group']='80+'
spain_age_group.loc[spain_age_group['Age Group']=='?90','Age Group']='80+'

spain_age_group=spain_age_group.groupby('Age Group').agg({'Cases':'sum','Hospitalization':'sum','ICU':'sum','Deaths':'sum'}).reset_index()

spain_age_group.columns=['Age Group','Cases','Hospitalization','ICU','Deaths']

spain_age_group['percent_hospitalization']=spain_age_group['Hospitalization']/spain_age_group['Cases']
spain_age_group['percent_ICU']=spain_age_group['ICU']/spain_age_group['Cases']
spain_age_group['percent_death']=spain_age_group['Deaths']/spain_age_group['Cases']

In [None]:
fig = make_subplots(rows=3, cols=1)

fig.append_trace(go.Bar(
   x=spain_age_group['Age Group'], y=spain_age_group['percent_hospitalization'],name='Percent_Hospitalization'
), row=1, col=1)

fig.append_trace(go.Bar(
   x=spain_age_group['Age Group'], y=spain_age_group['percent_ICU'],name='Percent_ICU'
), row=2, col=1)

fig.append_trace(go.Bar(
   x=spain_age_group['Age Group'], y=spain_age_group['percent_death'],name='Percent_Deaths'
), row=3, col=1)


fig.update_layout(height=600, width=800, title_text="Spain - Covid Stats by Age Group",yaxis_tickformat = '%')
fig.show()

* The percentage of hospitalizations for the **0-9 age group** is high at **28%**. This could be due to infants being born to mothers with Covid-19 or being infected in the hospital. The good news is that only 4% of cases in this age group require ICU care and fatalities are close to zero
* In the **10-40** age groups, hospitalizations are between **10-20%**, however ICU admissions and deaths are close to zero, indicating that these patients recover with regular hospital care
* In the **60+** age groups, hospitalizations are very high: more than **40%** of patients in these age groups required hospitalization, and more than 15% of patients in the **70+** age groups succumbed to Covid-19
* One thing that stands out from this chart is that the **percentage of ICU cases for the 80+ age group is very low** compared to the 60-69 & 70-79 age groups, even though mortality in this age group is very high. This could be due to scarcity of resources at the peak of the pandemic, with hospitals prioritizing younger age groups that had a higher chance of survival. Perhaps if the peak had been more spread out, some of these patients could have received the necessary healthcare and the fatality rate could have been lower
* Spain started BCG vaccination around 1965, which means the oldest vaccinated citizens would be about 55 in 2020, and older age groups would have no protection from BCG. This coincides with the steep rise in Fatality Rate as we move from the younger age groups to the 50+ age groups.
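The age arithmetic in the last bullet can be written down as a small helper. This is a sketch under a simplifying assumption that is not stated in the data: that people were vaccinated at birth from the first year of the programme. The function names are hypothetical.

```python
def oldest_vaccinated_age(programme_start_year, reference_year=2020):
    """Age in reference_year of someone vaccinated at birth in the programme's first year."""
    return reference_year - programme_start_year

def band_may_include_vaccinated(band_lower_age, programme_start_year, reference_year=2020):
    """True if at least the youngest people in an age band could have been vaccinated."""
    return band_lower_age <= oldest_vaccinated_age(programme_start_year, reference_year)

spain_oldest = oldest_vaccinated_age(1965)  # programme start year taken from the text
for lower in (40, 50, 60, 70, 80):
    covered = band_may_include_vaccinated(lower, 1965)
    print(f"{lower}+ band may include vaccinated people: {covered}")
```

Under this assumption the 50-59 band is only partly covered and every band from 60 upwards is not covered at all, which matches the reasoning in the bullet.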

In [None]:
# The '10-19' label was turned into the date-like string 'Oct-19' in the source file; restore it
portugal_age_group.loc[portugal_age_group['Age Group']=='Oct-19','Age Group']='10-19'
portugal_age_group['percent_deaths']=portugal_age_group['Deaths']/portugal_age_group['Cases']

# spain_age_group was already relabelled and aggregated into '80+' in the earlier cell


fig = go.Figure(data=[
    go.Bar(name='Portugal', x=portugal_age_group['Age Group'], y=portugal_age_group['percent_deaths']),
    go.Bar(name='Spain', x=spain_age_group['Age Group'], y=spain_age_group['percent_death'])
])
# Change the bar mode
fig.update_layout(barmode='group',title='Spain vs Portugal - Percentage of Positive cases by Age Group leading to Deaths')
fig.show()

Next we compare the Fatality Rate of Spain and Portugal across age groups. It is apparent from this chart that Portugal has a lower death rate than Spain across the age groups, including the older ones. The 80+ age group in both countries would not have received any BCG vaccination, and the Fatality Rates for this age group are similar in both countries.

In [None]:
covid_data_no_world_bcg.loc[(covid_data_no_world_bcg['location'].isin(['Spain','Portugal']))& (covid_data_no_world_bcg['date']==datetime.date(year=2020,month=6,day=20))]

## Ireland - Cork vs Kerry <a name="ireland"></a>

During my research for this hack, I came across an interesting study for the BCG vaccine here - https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-019-4026-z#:~:text=BCG%20vaccination%20policy%20in%20Ireland,as%20of%202016%20%5B10%5D

This study was conducted in Ireland to measure the efficacy of the BCG vaccine against Tuberculosis. It compared three regions across two counties in Southern Ireland that differed in the BCG coverage of their population. The aim of the study was to examine the impact of three different BCG vaccination policies on the observed incidence of TB disease in the South of Ireland over a 13-year period. The study is also interesting in that the two counties are geographically contiguous and, belonging to the same country, have a similar ethnic & economic background.

We can also look at these counties to check whether different rates of BCG coverage have any correlation with total cases of Covid-19.

In [None]:
ireland_county_data=pd.read_csv('../input/ireland-covid-data/ireland_cso_data.csv')
ireland_county_daily_data=pd.read_csv('../input/ireland-covid-data/Covid19CountyStatisticsHPSCIreland.csv')
ireland_covid_stats=pd.read_csv('../input/ireland-covid-data/Covid19CountyStatisticsHPSCIrelandOpenData.csv')
ireland_covid_stats=ireland_covid_stats[['CountyName','PopulationCensus16','Density']].drop_duplicates(keep='first')
ireland_county_data.columns=['County', 'total_deaths', 'median_age_deaths', 'total_cases','median_age_cases']
ireland_county_data=ireland_county_data.sort_values('total_cases',ascending=False)
ireland_county_data=pd.merge(ireland_county_data,ireland_covid_stats,left_on='County',right_on='CountyName',how='inner')

ireland_county_data['total_deaths']=pd.to_numeric(ireland_county_data['total_deaths'],errors='coerce').fillna(0)
ireland_county_data['Fatality_Rate']=ireland_county_data['total_deaths']/ireland_county_data['total_cases']
# Per-million rates: multiply by 1,000,000 (the original factor had one zero too many)
ireland_county_data['cases_per_million']=(ireland_county_data['total_cases']/ireland_county_data['PopulationCensus16'])*1000000
ireland_county_data['deaths_per_million']=(ireland_county_data['total_deaths']/ireland_county_data['PopulationCensus16'])*1000000

# fig = px.bar(ireland_county_data, x='CountyName', y='ConfirmedCovidCases')
# fig.update_layout(title='Ireland - Confirmed Cases by County')
# fig.show()

colors = ['lightslategray',] * 26
colors[1] = 'blue'
colors[18] = 'crimson'


fig = make_subplots(rows=3, cols=1,subplot_titles=("Confirmed Covid Cases", "Covid Fatality Rate","Population Density"))

fig.add_trace(
    go.Bar(
    x=ireland_county_data['County'],
    y=ireland_county_data['total_cases'],text=ireland_county_data['total_cases'],
    marker_color=colors # marker color can be a single color value or an iterable
),
    row=1, col=1
)

fig.add_trace(
    go.Bar(
    x=ireland_county_data['County'],
    y=ireland_county_data['Fatality_Rate'],text=ireland_county_data['Fatality_Rate'],
    marker_color=colors # marker color can be a single color value or an iterable
),
    row=2, col=1
)

fig.add_trace(
    go.Bar(
    x=ireland_county_data['CountyName'],
    y=ireland_county_data['Density'],text=ireland_county_data['Density'],
    marker_color=colors # marker color can be a single color value or an iterable
),
    row=3, col=1
)

#fig.update_layout(height=400, width=800, title_text="Fatailty_Rate")



fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(title_text='Ireland - Confirmed Cases & Population Density by County',height=1000)

fig.show()

The county-level data on confirmed Covid cases for Ireland shows that the majority of cases occurred in the capital, Dublin. Taking a look at the two counties of interest, **Cork & Kerry**, we see that **Cork has 5 times the number of confirmed Covid cases compared to Kerry**, however that could also be due to **Cork having twice the population density of Kerry**. Densely populated areas are likely to see a greater spread of Covid. Let's now normalize these numbers and look at ```cases_per_million``` & ```deaths_per_million``` for the two counties, as well as the overall growth of Covid.

In [None]:
ireland_county_daily_data['date']=pd.to_datetime(ireland_county_daily_data['TimeStamp']).dt.date
ireland_county_daily_data['cases_per_million']=(ireland_county_daily_data['ConfirmedCovidCases']/ireland_county_daily_data['PopulationCensus16'])*1000000
ireland_county_daily_data_cork_kerry=ireland_county_daily_data.loc[(ireland_county_daily_data['CountyName']=='Cork')|(ireland_county_daily_data['CountyName']=='Kerry')]

fig = px.line(ireland_county_daily_data_cork_kerry, x="date", y="cases_per_million", color='CountyName')
fig.update_layout( title_text="Cork vs Kerry - Cases per Million",height=300,width=800)
fig.show()

k=ireland_county_data.loc[(ireland_county_data['County']=='Cork')|(ireland_county_data['County']=='Kerry')]
fig1 = px.bar(k, x='County', y='deaths_per_million')
fig1.update_layout(title='Cork vs Kerry - Deaths per Million',height=300)
fig1.show()


In [None]:
ireland_county_data.loc[(ireland_county_data['County']=='Cork')|(ireland_county_data['County']=='Kerry')]

As seen above, Cork has higher ```cases_per_million``` & ```deaths_per_million``` numbers than Kerry, which may be some evidence that people in Kerry had higher protection from Covid due to the mandatory BCG vaccination policy. The ```cases_per_million``` in Cork is 1.5 times the number in Kerry, whereas ```deaths_per_million``` is twice that of Kerry. The Fatality Rate in Cork is also 1.5 times that of Kerry. This does suggest that Cork was hit much harder by Covid than Kerry, but we can't be sure at this stage whether that was due to BCG vaccination policy.

In [None]:
fig=go.Figure()
fig.add_trace(go.Violin(y=ireland_county_data['median_age_cases'],name='median_age_cases',box_visible=True,meanline_visible=True))
fig.add_trace(go.Violin(y=ireland_county_data['median_age_deaths'],name='median_age_deaths',box_visible=True,meanline_visible=True))
fig.update_layout(title='Median age for Cases & Deaths across Counties in Ireland')
fig.update_yaxes(title="Median Age")

fig.show()

We also looked at the difference between the median age of cases & the median age of deaths due to Covid across all counties in Ireland. The Q1-Q3 range for cases across all counties was 47-51, indicating that similar age groups were infected across all counties. The median age of deaths was much higher than the median age of cases, standing at 82. The Q1-Q3 range for deaths is 81-84, which again indicates that a similar age group suffered most of the casualties across all counties in Ireland.
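The Q1-Q3 ranges quoted above are read off the box inside each violin. As a rough sketch, the same quartiles can be computed with pandas; the values below are hypothetical county medians, not the Irish data, and the plotting library may use a slightly different quartile convention than pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical per-county median ages of cases (not the Irish data)
median_age_cases = pd.Series([45, 47, 48, 49, 50, 51, 53], name="median_age_cases")

# Quartiles using pandas' default linear interpolation
q1, q2, q3 = median_age_cases.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

A narrow interquartile range, as in the text, means the counties are very similar to each other on this measure.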

Finally, as we discussed earlier, let's also look at how much of a role population density plays in the overall spread of Covid in Ireland by checking the correlation between population density and the confirmed cases of Covid.

In [None]:
ireland_county_data[['Density','total_cases','Fatality_Rate','PopulationCensus16','cases_per_million','deaths_per_million','median_age_cases','median_age_deaths']].corr()

As seen above, Population Density has a **correlation** of **0.99** with the number of confirmed Covid cases within Ireland, which is extremely high. Therefore we cannot reliably say that the difference in total_cases_per_million between Cork & Kerry is only due to the difference in BCG vaccination policy. It could equally be due to the higher population density in the urban areas of Cork. Fatality Rate is also highly correlated with median_age_cases, which is to be expected.
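One caveat when reading a correlation this high is that a single extreme observation (for Ireland, plausibly Dublin) can dominate the Pearson coefficient. The sketch below uses invented county values to show the effect; it is an illustration of the statistic, not a re-analysis of the Irish data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical counties: the first one is a very dense outlier
density = [1400, 70, 35, 30, 60, 45]
total_cases = [12000, 400, 300, 700, 250, 1500]

corr_all = pearson(density, total_cases)           # includes the dense outlier
corr_rest = pearson(density[1:], total_cases[1:])  # outlier removed
print(f"with outlier: {corr_all:.2f}, without outlier: {corr_rest:.2f}")
```

If the coefficient drops sharply once the outlier is removed, the headline correlation says more about that one county than about the relationship across the rest.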

## Germany - East Germany vs West Germany <a name="germany"></a>

This is another interesting case, where a lot of studies have looked at the difference in the spread of Covid across East & West Germany. Divergent BCG vaccination policies existed in the politically divided country (1949–1989) before German reunification in 1990. In East Germany, BCG vaccination programs were established by the communist government in 1951 and became compulsory in 1953, leading to near-universal (99.8%) BCG vaccination of newborns by day 3. By contrast, voluntary BCG vaccination (recommended since 1955) was far less common in West Germany, due to the low incidence of the disease after the Second World War. In the early years, only 7–20% of all newborns were BCG-vaccinated in West Germany, with almost complete cessation of vaccination between 1975 and 1977. More information is available here - https://www.nature.com/articles/s41375-020-0871-4

Thankfully I did not have to search a lot to get access to a detailed German Covid dataset. It is directly available on Kaggle here - https://www.kaggle.com/headsortails/covid19-tracking-germany. We were able to get **Daily Age Group & Gender Covid Stats by Province**, the most detailed data for any country in this notebook. Let's now dig deep and study how the spread of Covid has evolved over time.

In [None]:
de_covid_data=pd.read_csv('/kaggle/input/covid19-tracking-germany/covid_de.csv')
germany_province_data=pd.read_csv('/kaggle/input/hackathon/task_2-Gemany_per_state_stats_20June2020.csv')
id_mapping=pd.read_csv('/kaggle/input/idmapping/state_id_mapping_de.csv')


uniques=germany_province_data[['State in Germany (German)','East/West','Population']].drop_duplicates()

de_covid_data['date']=pd.to_datetime(de_covid_data['date']).dt.date
de_covid_data
#germany_province_data
#de_covid_data.state.unique()

de_covid_data_ew=pd.merge(de_covid_data,uniques,left_on='state',right_on='State in Germany (German)',how='left')
#de_covid_data_ew.state.unique()

de_covid_data_ew.loc[de_covid_data_ew.state.str.contains('Baden'),'East/West']='West'
de_covid_data_ew.loc[de_covid_data_ew.state.str.contains('Thueringen'),'East/West']='East'
de_covid_data_ew.loc[de_covid_data_ew.state.str.contains('Baden'),'Population']=10879618
de_covid_data_ew.loc[de_covid_data_ew.state.str.contains('Thueringen'),'Population']=2170714

de_sum=de_covid_data_ew.groupby(['state','East/West','date','Population']).agg({'cases':'sum','deaths':'sum','recovered':'sum'}).reset_index()
de_sum.columns=['state','East/West','date','Population','cases','deaths','recovered']
#de_sum.state.unique()

de_sum['total_cases']=de_sum.groupby(by=['state','East/West','Population'])['cases'].cumsum()
de_sum['total_deaths']=de_sum.groupby(by=['state','East/West','Population'])['deaths'].cumsum()
#de_sum

de_sum['Fatality_Rate']=de_sum['total_deaths']/de_sum['total_cases']
de_sum=pd.merge(de_sum,id_mapping,how='left')
de_304=de_sum.loc[de_sum['date']==datetime.date(year=2020,month=4,day=30)]
de_305=de_sum.loc[de_sum['date']==datetime.date(year=2020,month=5,day=30)]
de_236=de_sum.loc[de_sum['date']==datetime.date(year=2020,month=6,day=23)]

In [None]:
#germany_province_data[['State in Germany (German)','East/West']].drop_duplicates()
#de_sum
from urllib.request import urlopen
import json


with urlopen('https://raw.githubusercontent.com/isellsoap/deutschlandGeoJSON/master/2_bundeslaender/2_hoch.geo.json') as response:
    geojson = json.load(response)

fig = px.choropleth(de_304, geojson=geojson, color="Fatality_Rate",
                    locations="id",color_continuous_scale="Viridis"
                    ,hover_name="East/West",
                   )
fig.update_geos(fitbounds="locations")
fig.update_layout(title='Germany - 30 April Covid Fatality Rate',height=400)
fig.show()

fig1 = px.choropleth(de_305, geojson=geojson, color="Fatality_Rate",
                    locations="id",color_continuous_scale="Viridis"
                    ,hover_name="East/West"
                   )
fig1.update_geos(fitbounds="locations")
fig1.update_layout(title='Germany - 30 May Covid Fatality Rate',height=400)

fig1.show()


fig2 = px.choropleth(de_236, geojson=geojson, color="Fatality_Rate",
                    locations="id",color_continuous_scale="Viridis"
                    ,hover_name="East/West"
                   )
fig2.update_geos(fitbounds="locations")
fig2.update_layout(title='Germany - 23 June Covid Fatality Rate',height=400)

fig2.show()

We looked at the Fatality Rate across German provinces on three different dates - 30 April, 30 May & 23 June - to see how it has changed over time. The following patterns emerge -
* There is not a big variation in Covid Fatality Rate across German provinces on any of the three dates. **Fatality rates generally lie between 3%-6%**
* There is **not a huge variation in Covid Fatality Rate between East & West German provinces**

Let's now look at distribution of Covid Fatality Rates across provinces to get a better sense of differences across East & West Germany.

In [None]:
germany_concat=pd.concat([de_304,de_305,de_236]).reset_index()
germany_concat.date.unique()
fig = px.violin(germany_concat, x='date', color='East/West', y='Fatality_Rate',box=True, hover_name='state') 
fig.update_layout(template='seaborn',title='Distribution of Fatality Rates Across East/West Germany',legend_title_text='Region',xaxis = {
   'tickformat': '%d-%m',
   'tickmode': 'auto',
#   'nticks': value, [where value is the max # of ticks]
#   'tick0': value, [where value is the first tick]
#   'dtick': value [where value is the step between ticks]
},yaxis_tickformat = '%')

fig.show()

As we can see above, the median Fatality Rate across German provinces is not very different between East & West Germany over the three-month period. **East German provinces however have a slightly lower Fatality Rate on all three dates**. Let's now look at the daily progression of Covid cases across East/West Germany and at fatality rates by age group.

In [None]:
group=de_sum.groupby(['East/West','date']).agg({'total_deaths':'sum','total_cases':'sum'}).reset_index()
reg_pop=germany_province_data.groupby('East/West').agg({'Population':'sum'}).reset_index()
group=pd.merge(group,reg_pop,how='inner')

group['Fatality Rate']=group['total_deaths']/group['total_cases']
group['Deaths_per_Million']=(group['total_deaths']/group['Population'])*1000000


fig = px.line(group, x="date", y="total_cases", color='East/West')
fig.update_layout( title_text="Germany - Total Cases by Region",height=300,width=800)
fig.show()

fig = px.line(group, x="date", y="total_deaths", color='East/West')
fig.update_layout( title_text="Germany - Total Deaths by Region",height=300,width=800)
fig.show()

fig = px.line(group, x="date", y="Fatality Rate", color='East/West')
fig.update_layout( title_text="Germany - Fatality Rate by Region",yaxis_tickformat = '%',height=300,width=800)
fig.show()

fig = px.line(group, x="date", y="Deaths_per_Million", color='East/West')
fig.update_layout( title_text="Germany - Deaths per Million by Region",height=300,width=800)
fig.show()

The following things emerge from the charts above -
* **West Germany has almost 9 times the number of reported Covid cases compared to East Germany**, and this difference remains consistent throughout the analysis period. This means that the total spread of the virus was much higher in West Germany
* The same can be seen for total_deaths: **West Germany has about 10 times the total number of deaths compared to East Germany**, and this difference also remains roughly consistent over the analysis period
* To normalize for the huge difference in total cases & population across Germany, we look at Fatality Rate, which is the percentage of cases that resulted in death. This number also remains fairly consistent over the analysis period, with a **Fatality Rate of around 5% for West Germany and 4% for East Germany**, which is not a huge difference
* **Deaths per million, however, is twice as high for West Germany as for East Germany**.
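The last two bullets are linked by a simple identity: deaths per million equals Fatality Rate multiplied by cases per million, since Deaths/Population = (Deaths/Cases) × (Cases/Population). A region can therefore have a similar Fatality Rate but far fewer deaths per million if fewer infections were recorded there. The sketch below shows the region-level calculation on invented state figures (the states and numbers are hypothetical; only the column names follow this notebook).

```python
import pandas as pd

# Hypothetical state-level totals, invented for illustration
states = pd.DataFrame({
    "state": ["W1", "W2", "E1", "E2"],
    "East/West": ["West", "West", "East", "East"],
    "Population": [10_000_000, 2_000_000, 4_000_000, 2_000_000],
    "total_cases": [40_000, 4_000, 6_000, 2_000],
    "total_deaths": [2_000, 160, 240, 80],
})

# Pool counts per region first, then compute the rates from the pooled totals
region = states.groupby("East/West")[["Population", "total_cases", "total_deaths"]].sum()
region["Fatality_Rate"] = region["total_deaths"] / region["total_cases"]
region["Cases_per_Million"] = region["total_cases"] / region["Population"] * 1_000_000
region["Deaths_per_Million"] = region["total_deaths"] / region["Population"] * 1_000_000
print(region)
```

Pooling the counts before dividing weights each state by its size, which is different from averaging the per-state rates.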

The difference in confirmed Covid cases and Deaths per Million between East & West Germany may indicate that BCG provides some protection against infection, as well as reducing the chance of death once a person has been infected. East Germany, which had universal BCG vaccination, does much better than West Germany, which had lower BCG coverage. Let's now look at Fatality Rate & Deaths per Million across age groups in East & West Germany.

In [None]:
de_age=de_covid_data_ew.groupby(['East/West','age_group']).agg({'cases':'sum','deaths':'sum'}).reset_index()
de_age_pop=pd.merge(de_age,reg_pop,how='inner')
de_age_pop.age_group = de_age_pop.age_group.astype('str')

de_age_pop['Fatality_Rate']=de_age_pop['deaths']/de_age_pop['cases']
de_age_pop['Deaths_per_Million']=(de_age_pop['deaths']/de_age_pop['Population'])*1000000

fig=px.bar(de_age_pop, x="age_group", y="Fatality_Rate", color='East/West',barmode='group')
fig.update_layout(xaxis_type='category',yaxis_tickformat = '%',title_text="Germany - Fatality Rate Age Group",height=250,margin=dict(l=20, r=20, t=25, b=25))
fig.show()


fig=px.bar(de_age_pop, x="age_group", y="Deaths_per_Million", color='East/West',barmode='group')
fig.update_layout(xaxis_type='category',title_text="Germany - Deaths per Million by Age Group",height=250,margin=dict(l=20, r=20, t=25, b=0))
fig.show()

The charts above tell us the following -
* Fatality Rates across Germany were close to zero for age groups 0-34
* Fatality Rates for age groups 35-59 were close to 1%
* Fatality Rates rose steadily for the 60+ age groups, at around 7-8% for 60-79 and above 20% for the 80+ age group
* Deaths per Million were similarly very high for the older age groups. However, both the Fatality Rate and Deaths per Million were much higher in West Germany than in East Germany

In [None]:
germany_province_data=pd.read_csv('/kaggle/input/hackathon/task_2-Gemany_per_state_stats_20June2020.csv',thousands=',')#Covid stats for Germany provinces
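# Deaths may still be read as text with thousands separators; coerce it to numeric either way
germany_province_data['Deaths'] = pd.to_numeric(germany_province_data['Deaths'].astype(str).str.replace(',', ''),errors='coerce')
# Scale to per-million rates so that the column names match their contents
germany_province_data['cases_per_million']=(germany_province_data['Cases']/germany_province_data['Population'])*1000000
germany_province_data['deaths_per_million']=(germany_province_data['Deaths']/germany_province_data['Population'])*1000000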
germany_province_data['Fatality_Rate']=germany_province_data['Deaths']/germany_province_data['Cases']

germany_province_data[['Population Density','Fatality_Rate','cases_per_million','deaths_per_million','Population','Deaths','Cases']].corr()

#germany_province_data


Finally, as for Ireland, let's also look at the correlation matrix for different variables in Germany. What is interesting here is that, unlike in Ireland, total cases per province are not correlated with population density. This is likely because we are looking at provinces, which can be geographically quite large, whereas for Ireland we were looking at counties, which are much smaller geographical units. So even though a province may have a few densely populated cities with a high spread of the virus, its overall density may still be low.

In [None]:
# covid_data_no_world_bcg_1000['mandatory'=='no','mandatory_flag']=0
# covid_data_no_world_bcg_1000['mandatory'=='yes','mandatory_flag']=1
# covid_data_no_world_bcg_1000['mandatory'=='Unknown','mandatory_flag']=2
# data=covid_data_no_world_bcg_1000
# data['Test_positivity_Rate']=data['total_cases']/data['total_tests']

# One-hot encode the BCG 'mandatory' policy column so each policy value gets its own indicator column
dummies=pd.get_dummies(covid_data_no_world_bcg_1000['mandatory'])
covid_data_no_world_bcg_1000_d=pd.concat([covid_data_no_world_bcg_1000,dummies],axis=1)
covid_data_no_world_bcg_1000_d

In [None]:
covid_data_no_world_bcg_1000_d.columns

## Conclusion & Next Steps - <a name="conclusion"></a>
Our initial analysis does suggest that countries with long-running universal BCG vaccination programs have been less severely affected than countries lacking such programs. We also saw that many other economic, demographic & cultural factors may play a role in the spread of Covid-19 in a particular country, so we cannot reliably make a claim about the role of BCG in reducing these deaths from this analysis alone.

The next step would be to perform regression analyses across countries to determine the contribution of the BCG vaccine to lowering the fatality rate. I will perform that analysis in the next iteration of the notebook. Meanwhile, please suggest improvements and provide feedback in the comments. Thanks!
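
As a rough sketch of what such a regression might look like, the example below fits an ordinary least squares model with a BCG indicator and one control variable. The data is synthetic and constructed so that the coefficients are known in advance. The names `bcg_flag` and `median_age` are hypothetical, and a real analysis would need many more controls and proper uncertainty estimates.

```python
import numpy as np

# Synthetic data built for illustration only (not real country figures)
bcg_flag = np.array([1, 1, 1, 0, 0, 0], dtype=float)          # 1 = universal BCG policy
median_age = np.array([28, 35, 42, 38, 44, 47], dtype=float)  # control variable

# Outcome constructed as: fatality = 1.0 - 2.0 * bcg_flag + 0.1 * median_age
fatality_rate = 1.0 - 2.0 * bcg_flag + 0.1 * median_age

# Design matrix with an intercept column, the BCG indicator and the control
X = np.column_stack([np.ones_like(bcg_flag), bcg_flag, median_age])

# Ordinary least squares fit
coef, residuals, rank, _ = np.linalg.lstsq(X, fatality_rate, rcond=None)
intercept, bcg_effect, age_effect = coef

print("intercept:", intercept)
print("BCG effect (holding median age fixed):", bcg_effect)
print("median age effect:", age_effect)
```

Because the outcome was generated without noise, the fit simply recovers the coefficients used to build it. The point is only to show how the BCG coefficient is read: it is the difference in fatality rate associated with the policy after accounting for the control variable.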

### References
* [Ireland's Covid-19 Data Hub](https://covid19ireland-geohive.hub.arcgis.com/search?groupIds=7e244cadac05461fb60b287a37b5ed2b)
* [Ireland County level data](https://www.leinsterexpress.ie/news/coronavirus/551969/how-many-people-have-died-in-your-county-from-covid-19-breakdown-of-coronavirus-deaths-in-ireland.html)
* [WHO TB Dataset](https://www.who.int/tb/country/data/download/en/)
* [Research on BCG Vaccine protection from severe coronavirus diseases](https://www.pnas.org/content/early/2020/07/07/2008410117)
* [Severity of Covid-19 in East & West Germany due to Covid ](https://www.nature.com/articles/s41375-020-0871-4)
* [Covid-19 Insights - Could BCG help fight Covid?](https://www.youtube.com/watch?v=-VpEU3utSJE&feature=youtu.be)
* [Impact of BCG vaccination on incidence of tuberculosis disease in southern Ireland](https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-019-4026-z#:~:text=BCG%20vaccination%20policy%20in%20Ireland,as%20of%202016%20%5B10%5D)
* [Further Evidence of a Possible Correlation Between the Severity of Covid-19 and BCG Immunization](https://www.medrxiv.org/content/10.1101/2020.04.07.20056994v1.full.pdf)
* [History of BCG Vaccine](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749764/)