In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install plotly

In [None]:
import plotly.graph_objs as go
import plotly.express as px

# Before I Start

This time I will be working with ***plotly*** instead of ***seaborn***

## Assumptions

- There should be two waves of Covid outbreaks: one around April 2020 and a second at the end of 2020 or the beginning of 2021.
- Testing numbers should increase after the introduction of the green pass, because of the restrictions on people who do not have one.
- Vaccination should ramp up at the start and slow down around summer because of the holiday season.
- Countries with a higher GDP per capita and a younger median age should suffer less from Covid.

## Questions I Want to Answer

0. What is the total number of Covid cases, tests performed and vaccinated people?
1. How dense is the population in every country?
2. What is the median age in every country, and how does it compare with others?
3. How did Covid progress in terms of total cases, new cases, total deaths and new deaths?
4. How did the testing rate change during the pandemic?
5. How did tests per case change in Lithuania, Japan and the USA?
6. How did vaccination progress change over time?
7. How strictly did governments respond with restrictions?
8. Do countries with more smokers suffer more from Covid?
9. How do gdp_per_capita and median age correlate with total cases?


# Basic Insights on Data

## Loading Data

In [None]:
df = pd.read_csv('../input/covid19-timeline-analysis/owid-covid-data.csv')
df.head()

## Basic information about the structure of dataset

In [None]:
df.shape

In [None]:
len(df.iso_code.unique())

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.date.min(), df.date.max()

### Observations

We have 60 features and 103143 entries covering 231 distinct ISO codes. Many columns have a lot of missing data, which will need further attention when working with them. There are also some negative values in the columns new_cases, new_deaths and new_cases_per_million; I'll have to keep that in mind and investigate the cause. Plenty of columns are of no use for this analysis, so I should probably drop them to make working with the data more effective. Finally, the data spans 2020-01-01 to 2021-07-17, and the date column should be converted to datetime.
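
The two follow-ups mentioned above (converting the dates and locating the negative values) could look roughly like this. It is only a sketch on a small made-up frame: `toy` and its values are invented for illustration, and only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the real frame; the values are invented for illustration.
toy = pd.DataFrame({
    'iso_code': ['AAA', 'AAA', 'BBB', 'BBB'],
    'date': ['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02'],
    'new_cases': [5.0, -2.0, 3.0, None],
    'new_deaths': [0.0, 1.0, -1.0, 0.0],
})

# Parse the text dates into real datetimes so they sort and plot correctly.
toy['date'] = pd.to_datetime(toy['date'])

# Count negative entries per column (NaN compares as False, so it is not counted).
negatives = (toy[['new_cases', 'new_deaths']] < 0).sum()
print(negatives)
```

The same two statements should work on `df` once the relevant column names are substituted.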

#### Stringency index

The nine metrics used to calculate the Stringency Index are: school closures; workplace closures; cancellation of public events; restrictions on public gatherings; closures of public transport; stay-at-home requirements; public information campaigns; restrictions on internal movements; and international travel controls.

A higher score indicates a stricter response (i.e. 100 = strictest response). If policies vary at the subnational level, the index is shown as the response level of the strictest sub-region.

It’s important to note that this index simply records the strictness of government policies. It does not measure or imply the appropriateness or effectiveness of a country’s response. A higher score does not necessarily mean that a country’s response is ‘better’ than others lower on the index.
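
As a rough illustration of how such a composite score works: as far as I understand the published methodology, the index for a given day is the plain mean of the nine metric scores, each rescaled to 0-100. The sketch below assumes exactly that and leaves out how each individual metric is scored; the function name and the example scores are hypothetical.

```python
def stringency_index(sub_indices):
    """Average nine sub-index scores, each expected in the range 0-100.

    Assumption: the composite is an unweighted mean of the nine metrics.
    """
    if len(sub_indices) != 9:
        raise ValueError('expected nine sub-index scores')
    return sum(sub_indices) / len(sub_indices)

# Hypothetical scores for one country on one day.
example = [100, 100, 50, 75, 0, 50, 100, 50, 75]
print(round(stringency_index(example), 2))
```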

# Cleaning up dataset

In [None]:
df.columns

In [None]:
cols_to_drop = ['continent', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million',
               'weekly_icu_admissions_per_million', 'tests_units', 'extreme_poverty', 'cardiovasc_death_rate',
               'diabetes_prevalence', 'handwashing_facilities', 'life_expectancy', 'human_development_index', 'excess_mortality']

In [None]:
df.drop(columns=cols_to_drop, inplace=True)
df.columns

In [None]:
df.info()

# Answering Questions

## 0. What is the total number of Covid cases, tests performed and vaccinated people?

### Total number of covid cases

In [None]:
total_cases = df.groupby('iso_code')['total_cases'].last().sum()
total_cases

### Total number of tests made

In [None]:
total_tests = df.groupby('iso_code')['total_tests'].last().sum()
total_tests

### Total number of vaccinated people

In [None]:
total_vac = df.groupby('iso_code')['total_vaccinations'].last().sum()
total_vac
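
One caveat for the three totals above: if the file also contains aggregate rows (the OWID export carries world and continent rows under codes such as `OWID_WRL`), summing the last value of every `iso_code` counts the same people more than once. A minimal sketch on invented data, assuming aggregates can be recognised by the `OWID_` prefix. That prefix rule is an assumption to verify against the actual codes, since a real territory might share it and would then need to be kept explicitly.

```python
import pandas as pd

# Invented data: two countries plus one aggregate row group.
toy = pd.DataFrame({
    'iso_code': ['AAA', 'AAA', 'BBB', 'BBB', 'OWID_WRL', 'OWID_WRL'],
    'date': pd.to_datetime(['2021-07-16', '2021-07-17'] * 3),
    'total_cases': [10.0, 12.0, 20.0, None, 30.0, 32.0],
})

# Keep only rows that are not flagged as aggregates (assumed 'OWID_' prefix).
countries = toy[~toy['iso_code'].str.startswith('OWID_')]

# Sort by date first so that .last() really means "latest reported value";
# GroupBy.last() skips NaN, so BBB falls back to its previous value.
latest = countries.sort_values('date').groupby('iso_code')['total_cases'].last()
print(latest.sum())
```

Note also that `total_vaccinations` counts doses administered, so it is an upper bound on the number of vaccinated people rather than the number itself.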

## 1. How dense is the population in every country?

In [None]:
df.groupby('iso_code')['population_density'].first()

### Sanity check

In [None]:
df.loc[df.iso_code == 'ABW',['population_density']].head()

In [None]:
density_df = df.groupby('iso_code')[['population_density', 'location']].first().reset_index()
density_df.head()

In [None]:
density_df.population_density.min(), density_df.population_density.max()

In [None]:
df.population_density.describe()

In [None]:
fig = px.choropleth(density_df, locations='iso_code', color='population_density',
                   hover_name='location', projection='natural earth',
                   title='Population Density',
                   range_color=(0,500),
                   color_continuous_scale='greens'
                   )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Observations

A few entries have a massive population density, but these are very small territories or city-states, for example Monaco and Macao. By default, Plotly does not draw these regions on the map.

***TO DO:*** find a better GeoJSON 

## 2. What is the median age in every country, and how does it compare with others?

In [None]:
age_df = df.groupby('iso_code')[['location', 'median_age']].first().reset_index()
age_df.head()

In [None]:
age_df.describe()

In [None]:
fig = px.choropleth(age_df, locations='iso_code', color='median_age',
                   hover_name='location', projection='orthographic',
                   title='Median Age',
                   range_color=(15,49),
                   color_continuous_scale='greens'
                   )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## 3. How did Covid progress in terms of total cases, new cases, total deaths and new deaths?

### Covid Progression by Total Cases

In [None]:
df.sort_values('date', inplace=True)

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'total_cases', 'date']], locations='iso_code',
                    color='total_cases',
                    animation_frame='date', 
                    title='Total Cases of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Covid Progression by New Cases

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'new_cases_smoothed', 'date']], locations='iso_code',
                    color='new_cases_smoothed',
                    animation_frame='date', 
                    title='New Cases of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Covid Progression by Total Deaths

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'total_deaths', 'date']], locations='iso_code',
                    color='total_deaths',
                    animation_frame='date', 
                    title='Total Deaths of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Covid Progression by New Deaths

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'new_deaths_smoothed', 'date']], locations='iso_code',
                    color='new_deaths_smoothed',
                    animation_frame='date', 
                    title='New Deaths of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
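
Animating one frame per day produces several hundred frames per map, which can make the figures slow to build and large to store. A possible lighter variant, sketched here on invented data, keeps one frame per country per month. The helper name `monthly_frames` and the `month` column are hypothetical.

```python
import pandas as pd

def monthly_frames(data, value_col):
    """Keep the last available row per country and calendar month."""
    out = data[['iso_code', 'date', value_col]].copy()
    out['date'] = pd.to_datetime(out['date'])
    out['month'] = out['date'].dt.strftime('%Y-%m')
    out = out.sort_values('date')
    # tail(1) keeps the last row of each (country, month) group.
    return out.groupby(['iso_code', 'month']).tail(1).reset_index(drop=True)

toy = pd.DataFrame({
    'iso_code': ['AAA'] * 4,
    'date': ['2020-03-30', '2020-03-31', '2020-04-01', '2020-04-02'],
    'total_cases': [1, 2, 3, 4],
})
frames = monthly_frames(toy, 'total_cases')
print(frames)
```

The result can then be passed to `px.choropleth` with `animation_frame='month'` in place of `'date'`.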

## 4. How did the testing rate change during the pandemic?

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'new_tests', 'date']], locations='iso_code',
                    color='new_tests',
                    animation_frame='date', 
                    title='New Tests of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

#fig.update(layout_coloraxis_showscale=False)
fig.show()

## 5. How did tests per case change in Lithuania, Japan and the USA?

In [None]:
df.sort_index(inplace=True)
df.head()

In [None]:
df.columns

In [None]:
# Select the three countries of interest by ISO code
ds = df.loc[df.iso_code.isin(['JPN', 'USA', 'LTU']), ['date', 'iso_code', 'tests_per_case']]

In [None]:
fig = px.line(ds, x='date', y='tests_per_case', color='iso_code')

fig.update_xaxes(
    dtick="M1",
    tickformat='%b\n%Y'
)

fig.show()

## 6. How did vaccination progress change over time?

In [None]:
vac = df.groupby('date')['new_vaccinations'].agg(['sum'])
vac.reset_index(inplace=True)

vac.rename(columns={'sum':'Vaccines'}, inplace=True)
vac = vac[vac.Vaccines != 0]
vac.head()

In [None]:
fig = px.line(vac, x='date', y='Vaccines')

fig.update_xaxes(
    dtick="M1",
    tickformat='%b\n%Y'
)

fig.show()
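
Daily vaccination counts tend to be spiky (reporting gaps, weekends), so a rolling mean may make the trend easier to read. A sketch on invented numbers, assuming one row per day; the column name `Vaccines_7d` is made up for the example.

```python
import pandas as pd

# Invented daily series that alternates between 0 and 70 doses.
toy_vac = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=10, freq='D'),
    'Vaccines': [0, 70, 0, 70, 0, 70, 0, 70, 0, 70],
})

# 7-day rolling mean; min_periods=1 avoids NaN at the start of the series.
toy_vac['Vaccines_7d'] = toy_vac['Vaccines'].rolling(window=7, min_periods=1).mean()
print(toy_vac.tail(3))
```

Plotting `Vaccines_7d` instead of `Vaccines` with `px.line` would then show the smoothed curve.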

## 7. How strictly did governments respond with restrictions?

In [None]:
fig = px.choropleth(df.sort_values('date').loc[:,['iso_code', 'stringency_index', 'date', 'location']], locations='iso_code', color='stringency_index',
                   hover_name='location', projection='natural earth',
                   animation_frame='date',  # one frame per day; without it every date is drawn on the same map
                   title='Stringency Index',
                   range_color=(0,100),
                   color_continuous_scale='greens'
                   )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## 8. Do countries with more smokers suffer more from Covid?

In [None]:
df.columns

In [None]:
smokers = df.groupby('iso_code')[['female_smokers', 'male_smokers', 'total_cases_per_million']].last()

In [None]:
smokers = smokers.dropna().sort_values('total_cases_per_million')

In [None]:
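# Sum of the female and male smoking rates: a rough proxy, not the share of smokers in the whole population
smokers['total_smokers'] = smokers['female_smokers'] + smokers['male_smokers']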

In [None]:
smokers.loc[:,['total_cases_per_million', 'total_smokers']].corr()

In [None]:
smokers.head(10), smokers.tail(10)

### Observations

Looking at the correlation between total_smokers and total_cases_per_million, we can see a weak correlation, which suggests there could be some slight evidence that countries with more smokers are at greater risk of having more Covid cases. That said, correlation does not establish causality, and looking at the top 10 and bottom 10 countries by Covid cases per million people, the percentage of smokers varies a little but is broadly similar.

## 9. How do gdp_per_capita and median age correlate with total cases?

In [None]:
gdp_age = df.groupby('iso_code')[['gdp_per_capita', 'median_age', 'total_cases_per_million']].last()
gdp_age = gdp_age.dropna().sort_values('total_cases_per_million')
gdp_age.head(10), gdp_age.tail(10)

In [None]:
gdp_age.corr()

### Observations

By Pearson correlation (the default method of `corr()`), we can see a medium to strong positive dependency between median age and total cases per million, and a weaker positive dependency between GDP per capita and total cases per million. Once again, correlation does not establish causality, but looking at the 10 best (fewest cases) and 10 worst (most cases) countries, there is a tendency for a younger median age to go with a smaller number of cases.
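
`DataFrame.corr()` computes the Pearson correlation unless a `method` is given. To compare it with a rank-based measure, both can be computed side by side. The frame below is invented purely to show the call and says nothing about the real data.

```python
import pandas as pd

# Invented values: a monotonic relationship with one extreme outlier.
toy = pd.DataFrame({
    'median_age': [18, 25, 30, 41, 47],
    'total_cases_per_million': [100, 400, 900, 1600, 50000],
})

pearson = toy.corr(method='pearson')    # linear association (the default)
spearman = toy.corr(method='spearman')  # rank-based, less sensitive to outliers

print(pearson.loc['median_age', 'total_cases_per_million'])
print(spearman.loc['median_age', 'total_cases_per_million'])
```

On the real data the same call would be `gdp_age.corr(method='spearman')`; the two methods can differ noticeably when a few locations have extreme values.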

# Final Thoughts

To begin with, I want to thank [Alexa](https://www.kaggle.com/saumya5679), who provided this wonderful [dataset](https://www.kaggle.com/saumya5679/covid19-timeline-analysis) and allowed me to practise EDA and data visualisation with Plotly.

1. As expected, cases rise until the summer of 2020, then drop and plateau for a few months, and around autumn (or fall, if you are American) total cases start to increase again.
2. The number of new tests increases in the middle of the pandemic and starts to drop in 2021, which I attribute to people completing their vaccinations and gaining immunity, so fewer tests are needed.
3. Vaccinations ramp up tremendously at the start and then plateau, with regular spikes as people get their vaccines. The number of vaccinations per day should drop later on, because most people will have had their vaccines and fewer people will remain unvaccinated.
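4. As expected, countries with a younger median age have lower total cases per million people, with a correlation of about 0.6. GDP per capita has a lower correlation of around 0.45. This supports the median-age half of the assumption; the GDP correlation is also positive, however, so richer countries report more cases per million, not fewer.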
