## What happens when society stops working for a day (or so)? [A mini-EDA]

Don't you wonder what happens when medical services, or the organizations responsible for record keeping, become lax for a day or two? Maybe a couple of graphs will make you a little more curious about it:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/kaggle/input/covid-latest/covid_19_clean_complete.csv')

(Note that since the data is getting updated less and less often, I uploaded up-to-date data collected with the script the dataset creator uses. A link to the relevant GitHub page can be found in the dataset's description. Latest data as of this notebook's creation: 2020-08-24.)

Let's see if we can spot anything weird in the reported new cases for Germany and Italy during the last full month (July 2020):

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(30, 8))

# Plot daily new confirmed cases for July 2020, one panel per country
for axis, country in zip(ax, ['Germany', 'Italy']):
    # Combine all conditions in one boolean mask instead of chaining df[...][...][...]
    mask = (df['Country/Region'] == country) & (df.Date >= '2020-07-01') & (df.Date < '2020-08-01')
    X = df[mask]

    # diff() turns the cumulative count into daily new cases (the first value is NaN)
    axis.bar(range(X.shape[0]), X.Confirmed.diff())

    axis.set_xticks(np.arange(X.shape[0]))
    axis.set_xticklabels(pd.to_datetime(X.Date).dt.date, rotation=90)
    axis.set_title(f'New Confirmed Cases - {country}')

plt.show()

Can you see the pattern? There seems to be a periodic fluctuation with a stable period of 7 days!
Nevertheless, let's explore this further by tweaking the dataset a bit.
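One way to put a number on that weekly rhythm, rather than just eyeballing the bars, is to look at the autocorrelation of the daily series at different lags. The sketch below is only an illustration: the series is synthetic (a flat baseline with a dip on two days out of every seven), not taken from the dataset, and the names `new_cases`, `autocorr` and `best_lag` are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily new-case counts (hypothetical, NOT from the dataset):
# a flat baseline with a dip on two days out of every seven.
days = np.arange(56)
weekly_factor = np.where(days % 7 >= 5, 0.6, 1.0)
new_cases = pd.Series(500 * weekly_factor)

# Correlation of the series with itself shifted by 1..10 days
autocorr = {lag: new_cases.autocorr(lag=lag) for lag in range(1, 11)}

# The lag with the highest autocorrelation hints at the period of the pattern
best_lag = max(autocorr, key=autocorr.get)
print(best_lag)
```

On real data the peak would be less clean, but a clear maximum at a lag of 7 days would back up what the plots suggest.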

Firstly, we're going to drop all of the columns that are not useful to us:

In [None]:
df.info()

Or maybe it's going to be easier to keep just the ones we'll need:

In [None]:
df = df[['Province/State','Country/Region','Date','Confirmed']]

Now, the confirmed cases for some of the countries are split into provinces/states, but most countries don't have that privilege. So we're going to sum the cases of those that do, to get one total per country and date:

In [None]:
# Sum the provincial counts so that each country has exactly one row per date.
# Countries without provinces already have one row per date and pass through unchanged.
# (DataFrame.append was removed in pandas 2.0, so a single groupby is used instead of a loop.)
# Selecting only 'Confirmed' also drops the 'Province/State' column.
df = df.groupby(['Country/Region', 'Date'], as_index=False)['Confirmed'].sum()
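A quick sanity check after this kind of aggregation is to confirm that no (country, date) pair appears more than once. Here is a minimal sketch on toy data; the country names and numbers are hypothetical and only serve to show the idea.

```python
import pandas as pd

# Toy data (hypothetical values): 'Xland' is split into provinces, 'Yland' is not
toy = pd.DataFrame({
    'Province/State': ['A', 'B', 'A', 'B', None, None],
    'Country/Region': ['Xland', 'Xland', 'Xland', 'Xland', 'Yland', 'Yland'],
    'Date': ['2020-07-01', '2020-07-01', '2020-07-02', '2020-07-02',
             '2020-07-01', '2020-07-02'],
    'Confirmed': [10, 5, 12, 9, 3, 4],
})

# Before aggregating, Xland has two rows per date
duplicates_before = toy.duplicated(['Country/Region', 'Date']).sum()

# One total per country and date
per_country = toy.groupby(['Country/Region', 'Date'], as_index=False)['Confirmed'].sum()

# After aggregating, every (country, date) pair should be unique
duplicates_after = per_country.duplicated(['Country/Region', 'Date']).sum()
print(duplicates_before, duplicates_after)
```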

Also, let's sort based on country name and date:

In [None]:
df.sort_values(['Country/Region','Date'], ascending=[True, True], inplace=True)
df.reset_index(drop=True, inplace=True)

It would be really useful to add a column holding the day of the week for each row. So, after we convert the Date column from an object type to an actual datetime type, we'll add that column:

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Weekday'] = df['Date'].dt.day_name()
df.head()

While we're at it, let's also check that we have data for the same number of days for every country:

In [None]:
country_row_count = df[df['Country/Region']==df['Country/Region'].unique()[0]].shape[0]
print(country_row_count, ((df.groupby('Country/Region')['Country/Region'].count()-country_row_count)==0).all())


It would also be very useful to group the rows into weeks. Since the dataframe is sorted by date within each country, and we know that the first date in the data is a Wednesday, we can offset the day counter by 2 so that each week runs from Monday to Sunday:

In [None]:
weeknum = [(x//7)+1 for x in range(2,country_row_count+2)]*len(df['Country/Region'].unique())
df['Weeknum'] = weeknum
df
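The list-based week numbering relies on every country having the same, gap-free run of dates. An alternative that doesn't depend on row order is to compute the week number from the dates themselves. This is just a sketch under the assumption that weeks are counted from the Monday on or before the first date; `dates`, `first_monday` and `weeknum` are names used only for this example.

```python
import pandas as pd

# Two weeks of dates starting on 2020-01-22, which is a Wednesday
dates = pd.Series(pd.date_range('2020-01-22', periods=14))

# Monday on or before the first date (weekday() is 0 for Monday)
first_monday = dates.min() - pd.Timedelta(days=dates.min().weekday())

# Whole weeks elapsed since that Monday, numbered from 1
weeknum = (dates - first_monday).dt.days // 7 + 1
print(weeknum.tolist())
```

For a gap-free series starting on a Wednesday this gives the same numbers as the `range(2, ...)` trick, with week 1 covering only Wednesday to Sunday.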

One last thing we should do is translate the cumulative Confirmed cases into New Confirmed cases per day:

In [None]:
df['New_Confirmed'] = df.groupby(['Country/Region']).Confirmed.diff()
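# The first day of each country has no previous value, so its diff is NaN; use the cumulative count there.
# Assigning the result back avoids relying on inplace=True on a column selection.
df['New_Confirmed'] = df['New_Confirmed'].fillna(df['Confirmed'])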
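To see why the `fillna` step is needed, here is the same cumulative-to-daily conversion on a tiny made-up frame (the countries and counts are hypothetical). Grouping by country keeps the first row of one country from being subtracted from the last row of the previous one.

```python
import pandas as pd

# Hypothetical cumulative counts for two countries, three days each
toy = pd.DataFrame({
    'Country/Region': ['Xland'] * 3 + ['Yland'] * 3,
    'Confirmed': [2, 5, 9, 1, 1, 4],
})

# Per-country difference: the first row of each country becomes NaN
toy['New_Confirmed'] = toy.groupby('Country/Region')['Confirmed'].diff()

# On the first day, the new cases equal the cumulative count itself
toy['New_Confirmed'] = toy['New_Confirmed'].fillna(toy['Confirmed'])
print(toy['New_Confirmed'].tolist())  # [2.0, 3.0, 4.0, 1.0, 0.0, 3.0]
```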

Ok, we need to acknowledge some important facts before we proceed.
* Not all countries have free weekends. Some of them have them earlier in the week.
* Not all countries have free days at all.
* Most western countries have either free Sundays or free Weekends.

So, let's try to filter out some of the countries that don't have free Weekends/Sundays. We're going to do this by collecting data from [this Wikipedia article](https://en.wikipedia.org/wiki/Workweek_and_weekend):

In [None]:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Workweek_and_weekend"

res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')
# All rows of the first wikitable, skipping the header row
req = soup.find('table', class_='wikitable').find_all('tr')[1:]

def strip_td(cell):
    # Remove the enclosing <td> tags from a table cell
    return str(cell).replace('<td>','').replace('</td>','')

final = []

for ln in req:
    if len(ln)>5:
        cells = list(ln)
        # Child 1 holds the country name, child 5 the workdays column (only its first token is kept)
        final.append([strip_td(cells[1]), strip_td(cells[5]).split()[0]])

sc = pd.DataFrame(final,columns=['Country','Workdays'])
# Free Sunday: Sunday is not listed as a workday, but Monday, Friday or Saturday is
sc['Free_Sunday'] = sc.Workdays.apply(lambda x: ('Sunday' not in x) and any(d in x for d in ('Monday','Friday','Saturday')))
No_Sunday_Countries = sc[~sc.Free_Sunday].Country.to_list()

print('\033[1mList of countries listed in our dataset but not in the Wikipedia table:\n\033[0m', [cn for cn in df['Country/Region'].unique() if cn not in sc['Country'].unique()])
print('\n\033[1mList of countries with working Sundays that are in our dataset:\n\033[0m', [cn for cn in No_Sunday_Countries if cn in df['Country/Region'].unique()])

# Drop the countries that were matched as having working Sundays
df.drop(df[df['Country/Region'].isin(No_Sunday_Countries)].index, inplace=True)
df.reset_index(drop=True, inplace=True)

So, without getting too technical, we scrape the page's table and we actually do find some of those countries. Unfortunately, due to mismatches in country names between the dataset and the Wikipedia data, as well as some disputed countries, a complete filter would need some manual work. That's a topic for another EDA though. We are just dropping the ones we actually found and leaving all of the others in.

Now, let's use the data that we added.

In [None]:
WeeklyMean = df.groupby(['Country/Region','Weeknum']).New_Confirmed.mean()
WeeklyMeanGlobal = df.groupby(['Weeknum']).New_Confirmed.mean()

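# .copy() avoids pandas' SettingWithCopyWarning when adding a column to the filtered frame
X = df[df['Weekday'].isin(['Sunday','Monday'])].copy()
# Rows alternate Sunday, Monday within each country, so on Monday rows this is Monday minus the preceding Sunday
X['Confirmed_diff'] = X.New_Confirmed.diff()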

Xglobal = X.groupby(['Date','Weeknum','Weekday']).New_Confirmed.mean().reset_index()
Xglobal['Confirmed_diff'] = Xglobal.New_Confirmed.diff()

Firstly, we're going to examine our hypothesis on a global level. Taking into account the mean new confirmed cases per country per day, the mean daily confirmed cases per week, and the difference between the new cases each Monday and the new cases on the preceding Sunday, we're going to make an interesting visualization:

In [None]:
fig, ax = plt.subplots(figsize=(15,8))

c_name='Global'

Y = Xglobal[Xglobal['Weekday']=='Monday']

ax.bar(range(Y.shape[0]),WeeklyMeanGlobal.to_list()[:Y.shape[0]],color='peru', label='Confirmed Weekly (per day) Avg.')
ax.bar(range(Y.shape[0]),Y.New_Confirmed, alpha=0.9,label="Confirmed Next Monday")
ax.bar(range(Y.shape[0]),Y.Confirmed_diff, alpha=0.55,label='Monday/Sunday Difference',color='red',width=0.33)

ax.set_xticks(np.arange(Y.shape[0]))
ax.set_xticklabels(Y.Date.dt.date, rotation=90)
ax.set_title(c_name)
ax.legend()

So, a few insights follow:
* First of all, the bigger the difference between the previous week's daily average and the following Monday's cases, while the Monday-Sunday difference is also significant, the more our hypothesis is solidified.
* We can actually see that, on global average, confirmed cases on Mondays fall short of the average daily confirmed cases of the previous week.
* We can also see that, on global average, confirmed cases on Mondays differ noticeably from those on the preceding Sundays.
* We can agree though that, at the global level, this remains a rough assumption rather than a proven fact.
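If you'd rather have a number than a visual impression for the first two points, one option is the ratio of each Monday's new cases to the previous week's daily average, and the share of Mondays where that ratio is below 1. The sketch below uses made-up values; `weekly_mean`, `monday_cases`, `ratio` and `share_below` are hypothetical names, not variables from the cells above.

```python
import pandas as pd

# Hypothetical numbers, for illustration only.
# Mean daily new cases per week, indexed by week number:
weekly_mean = pd.Series([100.0, 120.0, 150.0], index=[1, 2, 3])
# New cases on the Monday that opens each following week:
monday_cases = pd.Series([80.0, 90.0, 160.0], index=[2, 3, 4])

# Shift the weekly means forward one week so that each Monday (week k)
# lines up with the average of week k-1
prev_week_mean = weekly_mean.copy()
prev_week_mean.index = prev_week_mean.index + 1

# Ratio below 1 means that Monday fell short of the previous week's average
ratio = monday_cases / prev_week_mean
share_below = (ratio < 1).mean()
print(ratio.round(2).tolist(), share_below)
```

A share close to 1 over many weeks would be stronger support for the hypothesis than a handful of low Mondays.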

With that said, I'm going to leave you with the same plot, only this time for every country that we didn't drop before. Feel free to examine whether, in your country of interest, the organizations responsible for testing and confirming cases actually report fewer cases after a weekend than during the rest of the week!

In [None]:
from bokeh.io import show ,output_notebook
from bokeh.plotting import figure
from bokeh.models import Panel, Tabs
from datetime import timedelta
output_notebook()

tabs = []
p = []

for c_name in np.sort(abs(X.groupby(['Country/Region']).Confirmed_diff.mean()).sort_values().index.to_list()):
    Y = X[(X['Country/Region']==c_name)&(X['Weekday']=='Monday')]
    p.append(figure(plot_width=920, plot_height=480,x_axis_type='datetime',min_border=0))
    p[-1].xaxis.major_label_orientation = "vertical"
    p[-1].vbar(x=Y.Date.dt.date, top=WeeklyMean[c_name].to_list()[:Y.shape[0]], width=timedelta(days=5),color="royalblue",legend_label='Confirmed Weekly per Day Avg.')
    p[-1].vbar(x=Y.Date.dt.date, top=Y.New_Confirmed, width=timedelta(days=5),color="peru",fill_alpha=0.6,legend_label='Confirmed Next Monday')
    p[-1].vbar(x=Y.Date.dt.date, top=Y.Confirmed_diff, width=timedelta(days=1),color="red",fill_alpha=0.25,legend_label='Monday/Sunday Difference')
    p[-1].legend.location = "top_left"
    p[-1].legend.click_policy="hide"
    tabs.append(Panel(child=p[-1], title=c_name))

show(Tabs(tabs=tabs))

Some improvements could be made. I may update this notebook if I come up with some interesting tweaks or new ideas.