# Tracking the Coronavirus
This data exploration uses a [dataset](https://www.kaggle.com/imdevskp/corona-virus-report) that a [Kaggle user](https://www.kaggle.com/imdevskp) updates every 24 hours. We will look at how the outbreak develops and what the currently available data can tell us. The data contain the number of cases, recoveries and deaths per province in each country, together with the date and rough coordinates.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../input/corona-virus-report/covid_19_clean_complete.csv')
df.Date = pd.to_datetime(df.Date)
df.head()

This is the available data. As a brief summary, the total numbers per day can be interesting although they convey a limited amount of information.

In [None]:
# Select the numeric columns before summing so text columns (province, country) are never aggregated
total = df.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()
total.head()

These are the numbers so far. However, it has to be kept in mind that there are probably many cases that never present to a healthcare professional and therefore never appear in any statistic.

In [None]:
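# The counts are cumulative per day (assumption based on the dataset description),
# so the most recent day already holds the running totals; summing over days would overcount
latest_total = total.sort_values('Date').iloc[-1]
print('A total of: %s cases were confirmed.' % latest_total.Confirmed)
print('A total of: %s (%.2f %%) deaths were recorded.' % (latest_total.Deaths, 100 * latest_total.Deaths / latest_total.Confirmed))
print('A total of: %s (%.2f %%) recovered from the infection.' % (latest_total.Recovered, 100 * latest_total.Recovered / latest_total.Confirmed))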

Relating this to seasonal influenza is difficult. Most flu patients never present to healthcare professionals, and even those who do are unlikely to be tested, so the true numbers will always be an estimate. In the UK the flu mortality rate is a [controversial issue](https://www.bmj.com/content/361/bmj.k2795/rr-6), and the office of statistics has been accused of liberally counting a large proportion of winter deaths in the flu statistics.

Let us plot the current total number of cases, to get a feeling for the spread so far. 

In [None]:
from folium.plugins import HeatMap
import folium


center_lat = df[df['Province/State']=='Hubei'].iloc[0].Lat
center_lon = df[df['Province/State']=='Hubei'].iloc[0].Long


hmp = folium.Map(
    location=[center_lat, center_lon],
    zoom_start=2,
    tiles='OpenStreetMap', 
    width='100%')

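# Use the most recent day only, keep locations with at least one case,
# and pass the confirmed count as the third (weight) value of each point
latest_day = df[df.Date == df.Date.max()]
heat_data = latest_day.groupby(['Lat', 'Long'])['Confirmed'].sum().reset_index()
heat_data = heat_data[heat_data.Confirmed > 0]

HeatMap(data=heat_data.values.tolist(), radius=10, max_zoom=13).add_to(hmp)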

hmp


The map above is only a static snapshot of the total case numbers, which is of limited value. An animated map is more informative because it shows how the virus spread across the world over time.
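
To make the input of the animated map concrete, here is a minimal sketch of the nested-list shape it expects: one list per time step, each holding `[lat, lon, weight]` entries, plus a matching list of date labels. It uses plain Python so the structure is visible without pandas. The records and the helper name `build_frames` are made up for illustration and are not part of the dataset or of folium.

```python
from collections import defaultdict

# Hypothetical records: (date, lat, lon, confirmed); all values are made up
records = [
    ("2020-01-22", 30.97, 112.27, 444),
    ("2020-01-22", 35.0, 105.0, 0),
    ("2020-01-23", 30.97, 112.27, 444),
    ("2020-01-23", 15.0, 101.0, 3),
]


def build_frames(rows):
    """Group rows by date; return (sorted dates, one list of [lat, lon, weight] per date)."""
    by_day = defaultdict(list)
    for date, lat, lon, confirmed in rows:
        if confirmed > 0:  # locations without cases add nothing to the heat map
            by_day[date].append([lat, lon, confirmed])
    dates = sorted(by_day)  # ISO-formatted dates sort chronologically as strings
    return dates, [by_day[d] for d in dates]


time_index, frames = build_frames(records)
print(time_index)
print(frames)
```

Building the date labels from the same grouping as the frames keeps the two lists aligned even if a day is missing from the data. Note that in this sketch a day without any positive count produces no frame at all.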

In [None]:
from folium.plugins import HeatMapWithTime


df_day_list = []
time_index = []

# One frame per day: a list of [lat, lon, confirmed] for every location with cases
for day in df.Date.sort_values().unique():

    temp = df.loc[df.Date == day, ['Lat', 'Long', 'Confirmed']].groupby(['Lat', 'Long']).sum().reset_index()
    df_day_list.append(temp[temp.Confirmed>0].reset_index(drop=True).values.tolist())
    # Label each frame with its own date instead of assuming the first row is the
    # earliest day and that all days are consecutive
    time_index.append(pd.Timestamp(day).strftime('%Y-%m-%d'))

def generateBaseMap(default_location=[center_lat, center_lon], default_zoom_start=2):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map


base_map = generateBaseMap(default_zoom_start=2)

HeatMapWithTime(df_day_list, 
                radius=10, 
                index=time_index,
                auto_play=True,
                use_local_extrema=False).add_to(base_map)
base_map

What we can see is that the virus does not necessarily spread to neighbouring countries first but jumps across the world. This is to be expected, as air traffic lets people quickly cover distances that were previously not possible. Because many viruses, including the coronavirus and influenza, can be infectious before the patient shows any symptoms, this illustrates why it is so difficult to contain a viral disease within a country.

Next, we will look at the global numbers and see what they can tell us. We will subtract the number of recovered patients and the number of deaths from the confirmed total to get a rough idea of how many people are currently infected.

In [None]:
total['Currently_infected'] = total.Confirmed-(total.Deaths+total.Recovered)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('viridis')

fig, axes = plt.subplots(1, 2, sharex=True, figsize=(14,5))

# Column name -> legend label for each plotted series
series = {
    'Confirmed': 'Confirmed',
    'Deaths': 'Deaths',
    'Recovered': 'Recovered',
    'Currently_infected': 'Current infections',
}

# Draw the same four series on both panels
for ax in axes:
    for column, label in series.items():
        sns.lineplot(x='Date', y=column, data=total, label=label, ax=ax)
    ax.tick_params(axis='x', labelrotation=40)
    ax.legend()

# Only the right-hand panel uses a logarithmic y-axis
axes[1].set_yscale('log')

axes[0].set_title('Absolute numbers linear')
axes[1].set_title('Absolute numbers log')

axes[0].set(ylabel='Absolute numbers')
axes[1].set(ylabel='Absolute numbers (log scale)')

sns.despine()
plt.tight_layout()

# plt.savefig('log_comparison.png')


The figure above shows the absolute case numbers on a linear scale on the left and on a logarithmic scale on the right. On the logarithmic scale, equal vertical distances correspond to equal percentage changes rather than equal absolute changes. This compensates for the fact that the numbers are continuously increasing but have very different magnitudes. There are only very few deaths compared to the number of confirmed cases. What is visible, though, is a dramatic jump on the 25th of February, where the cases increase rapidly. This is the time the [case definition was changed](https://www.gov.uk/government/publications/wuhan-novel-coronavirus-initial-investigation-of-possible-cases/investigation-and-initial-clinical-management-of-possible-cases-of-wuhan-novel-coronavirus-wn-cov-infection) to include a broader set of symptoms.
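
The point about the logarithmic scale can be checked with a small sketch. It assumes a made-up series that grows by a constant 20% per day (not taken from the dataset) and compares the day-to-day steps on a linear and on a log10 scale.

```python
import math

# Hypothetical cumulative counts growing by 20% per day (made-up numbers)
cases = [100 * 1.2 ** day for day in range(6)]

# On a linear scale the daily increases keep getting larger ...
absolute_steps = [b - a for a, b in zip(cases, cases[1:])]

# ... while on a log10 scale every step has the same size, log10(1.2)
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(cases, cases[1:])]

print([round(step, 1) for step in absolute_steps])
print([round(step, 4) for step in log_steps])
```

A constant percentage growth therefore appears as a straight line on the log panel, and any bend in that line indicates that the growth rate itself has changed.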


The data above are global figures, so they cannot tell us anything about differences in mortality rates between countries. It would be expected that different healthcare systems cope differently with the outbreak and this could affect the mortality rate. 

### Explore death rates
The death rates per country are given as a percentage of the total case number. It is actually shocking how high the mortality rate in the Philippines is. Second in line is San Marino, a microstate (population 33,400) within Italy. This is interesting because a few days ago its mortality rate was also 5%, but with a much lower case number: cases and deaths have both increased since then, while the mortality rate appears unchanged. Italy, which surrounds San Marino, has a lower mortality rate. One explanation could be a difference in demographics. The demographics do differ, but Italy, with 21% of its population over the age of 65, has an older population than San Marino (16%), so age alone does not explain the gap. Another explanation could be differences in the provision of healthcare, which is very difficult to assess.
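
Because some of these rates rest on very few cases, it can help to see how much uncertainty a percentage carries at different case numbers. The sketch below uses the Wilson score interval, a standard method that this notebook does not otherwise use. The helper name `wilson_interval` and the counts are hypothetical and chosen only so that both examples have the same observed rate of 5%.

```python
import math


def wilson_interval(deaths, cases, z=1.96):
    """Approximate 95% Wilson score interval for deaths / cases, as fractions."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    p = deaths / cases
    denom = 1 + z ** 2 / cases
    centre = (p + z ** 2 / (2 * cases)) / denom
    half_width = z * math.sqrt(p * (1 - p) / cases + z ** 2 / (4 * cases ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts, both with an observed rate of 5%
small = wilson_interval(1, 20)
large = wilson_interval(50, 1000)

print('1 death in 20 cases:    %.1f%% to %.1f%%' % (100 * small[0], 100 * small[1]))
print('50 deaths in 1000 cases: %.1f%% to %.1f%%' % (100 * large[0], 100 * large[1]))
```

The interval for the small count is much wider, so a rate computed from a handful of cases should be compared with other countries only with caution.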

In [None]:
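# The counts are cumulative per day (assumption based on the dataset description),
# so use the most recent day only and sum the provinces within each country
latest_by_country = df[df.Date == df.Date.max()]
country_sums = latest_by_country.groupby('Country/Region')[['Confirmed', 'Deaths', 'Recovered']].sum()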

In [None]:
country_sums['deathrate'] = (100*country_sums.Deaths)/country_sums.Confirmed
deaths = country_sums[country_sums.deathrate>0]

In [None]:
country_sums.sort_values('deathrate', ascending = False)

In [None]:
import matplotlib.pyplot as plt
from matplotlib import cm
from math import log10

data = deaths.sort_values('deathrate',ascending=False)


labels = list(data.index)
data = list(data.deathrate.values.ravel())
#number of data points
n = len(data)
#find max value for full ring
k = 10 ** int(log10(max(data)))
m = k * (1 + max(data) // k)

#radius of donut chart
r = 1.5
#calculate width of each ring
w = r / n 

#create colors along a chosen colormap
colors = [cm.Reds_r(i / n) for i in range(n)]

#create figure, axis
fig, ax = plt.subplots(figsize=(8,8))
ax.axis("equal")

#create rings of donut chart
for i in range(n):
    #hide labels in segments with textprops: alpha = 0 - transparent, alpha = 1 - visible
    innerring, _ = ax.pie(
        [m - data[i], data[i]],
        radius = r - i * w,
        startangle = 90,
        labels = ["", labels[i]],
        labeldistance = 1 - 1 / (1.5 * (n - i)),
        textprops = {"alpha": 0},
        colors = ["white", colors[i]])
    plt.setp(innerring, width = w, edgecolor = "white")

plt.legend( bbox_to_anchor= (1.2, 0.85))
plt.tight_layout()
plt.savefig('Death_rates.png')
plt.show()


The chart above shows mortality as a percentage of cases. One country, the Philippines, is completely out of proportion to the others. Over the last few days the Philippines have had a mortality rate of 33% (out of 102 cases). This is a shocking figure and needs explanation. For an in-depth explanation see my article on [Medium]().

Lastly we will make an interactive plot showing the numbers of cases per region.

In [None]:
from folium.plugins import MarkerCluster

mp = folium.Map(
    location=[center_lat, center_lon],
    zoom_start=2,
    tiles='OpenStreetMap', 
    width='100%')

mp.add_child(folium.LatLngPopup())

marker_cluster = MarkerCluster().add_to(mp)

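# One marker per location for the most recent day, instead of one per location and day
latest_rows = df[df.Date == df.Date.max()]

for _, row in latest_rows.iterrows():
    # Fall back to the country name where no province/state is given
    region = row['Province/State'] if pd.notna(row['Province/State']) else row['Country/Region']
    date = row.Date.strftime('%Y-%m-%d')

    # HTML here in the pop up
    popup = '<b>{}</b><br><i>Confirmed cases on {}: {}</i>'.format(region, date, row.Confirmed)

    folium.Marker([row.Lat, row.Long], popup=popup, tooltip=str(region)).add_to(marker_cluster)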

In [None]:
mp