<h1>Tracking the spread of 2019 Coronavirus</h1>

<img src="https://storage.googleapis.com/kaggle-datasets-images/544069/992803/500beb47c451ac68fae29a8eb95ae45c/dataset-card.jpg" width=400></img>

# Introduction

The 2019-nCoV is a highly contagious coronavirus that originated in Wuhan (Hubei province), Mainland China. This new strain of virus has struck fear in many countries as cities are quarantined and hospitals are overcrowded.

We use the Kaggle Dataset [Coronavirus 2019-nCoV](https://www.kaggle.com/gpreda/coronavirus-2019ncov), which is updated daily and is based on [Johns Hopkins data](https://github.com/CSSEGISandData/COVID-19/).

The Kernel will be rerun frequently to reflect the daily evolution of the cited dataset.

We start by analyzing the data for Mainland China, where the pandemic originated. We show the time evolution and snapshots of Confirmed and Recovered cases as well as Deaths. Then we explore the evolution of the pandemic in the rest of the World.


We also compare log-curves of Confirmed cases and Deaths for several countries to monitor the evolution in time at country level.
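
Preparing the data for such a log-curve comparison might look like the sketch below. It assumes the `Country/Region`, `Date` and `Confirmed` columns of the dataset; `log_curves` is a hypothetical helper, and the toy values are illustrative only, not real counts.

```python
import numpy as np
import pandas as pd


def log_curves(df, countries, value="Confirmed"):
    """Return a Date x Country table of log10(value), NaN where the count is zero."""
    subset = df[df["Country/Region"].isin(countries)]
    # Sum over provinces/states so that each country has one value per date.
    totals = (
        subset.groupby(["Date", "Country/Region"])[value]
        .sum()
        .unstack("Country/Region")
    )
    # log10 is undefined at zero, so mask non-positive counts first.
    return np.log10(totals.where(totals > 0))


# Illustrative values only (not taken from the dataset).
toy_cases = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "B", "A"],
    "Province/State": ["A1", "A2", None, None, "A1"],
    "Date": pd.to_datetime(
        ["2020-03-01", "2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]
    ),
    "Confirmed": [60, 40, 0, 10, 1000],
})
curves = log_curves(toy_cases, ["A", "B"])
print(curves)
```

Calling `curves.plot()` on the result would draw one line per country. On a logarithmic scale, exponential growth appears as a straight line, which makes the growth rates of different countries easier to compare.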

Heatmaps are also used to display the geographical distribution of Confirmed cases and Deaths.


For both Mainland China and the rest of the World, we also show the snapshot and time evolution of mortality, calculated in two ways: as Deaths / Confirmed cases (most probably an underestimate) and as Deaths / Recovered cases (most probably an overestimate).
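
The two mortality estimates can be sketched on a toy table as follows. The numbers are made-up illustrative values, not figures from the dataset, and `add_mortality` is a hypothetical helper name; the column names `Confirmed`, `Recovered` and `Deaths` are those used throughout this Kernel.

```python
import numpy as np
import pandas as pd


def add_mortality(df):
    """Return a copy of df with two mortality estimates, in percent."""
    out = df.copy()
    # Deaths / Confirmed: described in the text as most probably an underestimate.
    out["Mortality (D/C)"] = 100 * out["Deaths"] / out["Confirmed"].replace(0, np.nan)
    # Deaths / Recovered: described in the text as most probably an overestimate.
    out["Mortality (D/R)"] = 100 * out["Deaths"] / out["Recovered"].replace(0, np.nan)
    return out


# Illustrative values only (not taken from the dataset).
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-01", "2020-02-02"]),
    "Confirmed": [1000, 2000],
    "Recovered": [100, 400],
    "Deaths": [20, 50],
})
toy_mortality = add_mortality(toy_df)
print(toy_mortality[["Date", "Mortality (D/C)", "Mortality (D/R)"]])
```

Zero denominators are replaced with `NaN` so that dates with no Confirmed or Recovered cases yield a missing value instead of an infinite ratio.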

In [None]:
import datetime as dt
dt_string = dt.datetime.now().strftime("%d/%m/%Y")
print(f"Kernel last updated: {dt_string}")

# Analysis preparation

## Load packages

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns 
import datetime as dt
import folium
from folium.plugins import HeatMap, HeatMapWithTime
%matplotlib inline

## Load the data

There are multiple files in the coronavirus data folder; we will take the last updated one.
We also include GeoJSON data for China and for the World.
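
Picking the most recently updated file could be done along these lines. This is a minimal sketch, not necessarily how this Kernel selects its input: `latest_csv` is a hypothetical helper, and it assumes that the file modification time reflects which file was updated last.

```python
import glob
import os


def latest_csv(folder):
    """Return the path of the most recently modified CSV file in folder, or None."""
    csv_paths = glob.glob(os.path.join(folder, "*.csv"))
    if not csv_paths:
        return None
    # Assumption: modification time is a good proxy for "last updated".
    return max(csv_paths, key=os.path.getmtime)


# Prints None when the folder does not exist or contains no CSV files.
print(latest_csv("/kaggle/input/coronavirus-2019ncov"))
```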

In [None]:
print(os.listdir('/kaggle/input'))
DATA_FOLDER = "/kaggle/input/coronavirus-2019ncov"


In [None]:
data_df = pd.read_csv(os.path.join(DATA_FOLDER, "covid-19-all.csv"))

# Preliminary data exploration

## Glimpse the data

We check the shape of the data, look at a few rows, and check for missing data.

In [None]:
print(f"Rows: {data_df.shape[0]}, Columns: {data_df.shape[1]}")

In [None]:
data_df.head()

In [None]:
data_df.tail()

In [None]:
for column in data_df.columns:
    print(f"{column}:{data_df[column].dtype}")

In [None]:
print(f"Date - unique values: {data_df['Date'].nunique()} ({min(data_df['Date'])} - {max(data_df['Date'])})")

In [None]:
data_df['Date'] = pd.to_datetime(data_df['Date'])

In [None]:
for column in data_df.columns:
    print(f"{column}:{data_df[column].dtype}")

In [None]:
print(f"Date - unique values: {data_df['Date'].nunique()} ({min(data_df['Date'])} - {max(data_df['Date'])})")

In [None]:
def missing_data(data):
    """Return the count and percentage of missing values per column, plus column dtypes."""
    total = data.isnull().sum()
    percent = total / len(data) * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    tt['Types'] = [str(data[col].dtype) for col in data.columns]
    # Transpose so that each original column becomes a column of the summary
    return tt.T

Let's look at the missing values.

In [None]:
missing_data(data_df)

Let's check the spread of the 2019-nCoV in various Regions/Countries and Provinces/States.

In [None]:
print(f"Countries/Regions:{data_df['Country/Region'].nunique()}")
print(f"Province/State:{data_df['Province/State'].nunique()}")

Next we show how the Confirmed cases, Deaths and Recovered cases evolved in time, grouped by Province/State in Mainland China.

In [None]:
def plot_time_variation(df, y='Confirmed', hue='Province/State', size=1, is_log=False):
    """Plot column y over time, with one line per value of hue."""
    f, ax = plt.subplots(1, 1, figsize=(4*size, 3*size))
    sns.lineplot(x="Date", y=y, hue=hue, data=df, ax=ax)
    plt.xticks(rotation=90)
    plt.title(f'{y} cases grouped by {hue}')
    if is_log:
        # Logarithmic y axis, so small and large provinces stay readable together
        ax.set(yscale="log")
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.show()

## Mainland China - time evolution

In [None]:
data_cn = data_df.loc[data_df['Country/Region']=="China"]

In [None]:
plot_time_variation(data_cn, size=4, is_log=True)

In [None]:
plot_time_variation(data_cn, y='Recovered', size=4, is_log=True)

## Mainland China - overall

Let's compare overall values for Mainland China (Confirmed, Recovered, Deaths).


In [None]:
def plot_time_variation_all(df, title='Mainland China', size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,2*size))
    g = sns.lineplot(x="Date", y='Confirmed', data=df, color='blue', label='Confirmed')
    g = sns.lineplot(x="Date", y='Recovered', data=df, color='green', label='Recovered')
    g = sns.lineplot(x="Date", y='Deaths', data=df, color = 'red', label = 'Deaths')
    plt.xlabel('Date')
    plt.ylabel(f'Total {title} cases')
    plt.xticks(rotation=90)
    plt.title(f'Total {title} cases')
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.show()  


In [None]:
data_cn = data_df.loc[data_df['Country/Region']=="China"]
data_cn = data_cn.sort_values(by = ['Province/State','Date'], ascending=False)
# Sum only the case-count columns across provinces for each date
data_cn_agg = data_cn.groupby('Date')[['Confirmed', 'Recovered', 'Deaths']].sum().reset_index()
plot_time_variation_all(data_cn_agg, size=3)