# **Coronavirus study**

**Introduction**

The coronavirus disease 2019 (COVID-19) was first identified in December 2019 in Wuhan, China, and has caused an ongoing pandemic. The disease is contagious and spread very fast: by the end of August 2020, more than 20 million cases had been confirmed and more than 800,000 people had died.

**Motivation**

I am a student, and I first got interested in data science early this year. I decided to learn it, so I watched videos and visited specialized websites, but I figured out that the best way to learn was to build my own project. That is when the idea of a second lockdown due to COVID-19 came up in the news in France. It reminded me that there is a lot of data on this disease, and I wanted to check the raw data to see whether a second lockdown was a good idea.

In this kernel, we will first look at the global situation of the virus, then see which countries have the most cases, and finally conclude on the relevance of a second lockdown in France. It's a short kernel.

**PS:** The 'per month' and 'per country' studies do not take into account the values after 08/31/2020, although the dataset is updated each day.

As a beginner, I hope this kernel can help other beginners with Seaborn and DataFrame manipulation. I tried to comment my code.


In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns 
import random


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os, glob
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename)) 

In [None]:
#We first load the data (the paths are copied from the list of input files printed above)
df_data = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv")
df_ts_confirmed = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv")
df_ts_recovered = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_recovered.csv")
df_ts_confirmed_us = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed_US.csv")
df_ts_deaths = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv")
df_ts_death_us = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths_US.csv")
df_ll = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/COVID19_line_list_data.csv")
df_oll = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_line_list.csv")

# **Global Study**

1. Confirmed/deaths/recovered covid-19 cases according to the date

In [None]:
#We will first look at the Covid 'data'
df_data.head(3)

In [None]:
df_data[['Confirmed','Deaths','Recovered']] = df_data[['Confirmed','Deaths','Recovered']].astype(int) #Convert the float columns to int

#Next we want to plot the number of deaths/recovered/confirmed cases according to the date
#So we sum all the confirmed/deaths/recovered cases grouped by date
#(the columns are selected with a list, because selecting them with a bare tuple fails on recent pandas versions)
df_plot = df_data.groupby("ObservationDate")[['Confirmed','Deaths','Recovered']].sum().reset_index().sort_values("ObservationDate",ascending=True)
df_plot.head()

In [None]:
#We keep just some dates to get a better-looking graph (otherwise there are too many x labels to read anything):
L = np.linspace(0,219,150).astype(int) #150 row labels spread between 0 and 219
df_plot_reduced = df_plot.drop(index=L) #Drop all of these rows at once
df_plot_reduced.head()
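Another way to thin out the x-axis is to keep one row out of every n with `iloc` slicing. This is only a sketch on a made-up frame: the name `toy_plot`, its values, and the step of 3 are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for df_plot: 10 consecutive days of made-up cumulative counts
toy_plot = pd.DataFrame({
    "ObservationDate": pd.date_range("2020-01-22", periods=10).strftime("%m/%d/%Y"),
    "Confirmed": [1, 2, 4, 7, 11, 16, 22, 29, 37, 46],
})

# Keep one row out of every three; the step is an arbitrary choice
toy_reduced = toy_plot.iloc[::3]
print(toy_reduced)
```

With this approach the number of labels stays readable even if the dataset grows, because the step can be adjusted instead of a fixed list of row labels.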

- Number of covid-19 confirmed cases according to the date

In [None]:
#Graph that shows the number of confirmed cases according to the date
plt.figure(figsize=(20,5))
ax = sns.barplot(x=df_plot_reduced['ObservationDate'],
                 y=df_plot_reduced['Confirmed'],
                 data = df_plot_reduced, 
                 palette = sns.cubehelix_palette(86,start = 2.5)) #Palette of color purple/blue

ax.set(xlabel='Date',
       ylabel='Number of confirmed cases',
      title='Number of covid-19 confirmed cases')

#To be able to read the x-axis, we rotate the labels
ax.set_xticklabels(ax.get_xticklabels(),
                   rotation = 90, 
                   horizontalalignment='right');

The number of confirmed cases grew exponentially until late August, and then the growth seems to slow down. Either the number of tests is stabilizing, or the measures taken in most countries (social distancing, mask wearing) are well respected and effective.
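The chart above plots cumulative totals, which can only go up, so a slowdown is easier to read on daily increments. Here is a minimal sketch of how to get them with `diff()`. The frame `toy` and its numbers are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up cumulative counts, for illustration only
toy = pd.DataFrame({
    "ObservationDate": ["08/27/2020", "08/28/2020", "08/29/2020",
                        "08/30/2020", "08/31/2020"],
    "Confirmed": [100, 150, 210, 260, 300],
})

# diff() turns the cumulative series into new cases per day;
# the first day has no previous value, so it keeps its cumulative count
toy["NewConfirmed"] = (
    toy["Confirmed"].diff().fillna(toy["Confirmed"]).astype(int)
)
print(toy)
```

If the daily increments shrink from one day to the next, the cumulative curve is flattening, which is what "stagnating" means here.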

- Number of recovery cases according to the date

In [None]:
#Number of recovery cases :
plt.figure(figsize=(20,5))
ax = sns.barplot(x=df_plot_reduced['ObservationDate'],y=df_plot_reduced['Recovered'],data = df_plot_reduced, palette = sns.cubehelix_palette(86,start =1.6,rot=0.1))

ax.set(xlabel='Date',
       ylabel='Number of recovered cases',
      title='Number of covid-19 recovered cases')

#To be able to read the x-axis, we rotate the labels
ax.set_xticklabels(ax.get_xticklabels(),
                   rotation = 90, 
                   horizontalalignment='right');

The recovered chart has almost the same shape as the confirmed one, which suggests that it is the number of tests that is rising slowly.

- Number of deaths according to the date

In [None]:
#Number of deaths :
plt.figure(figsize=(20,5))
ax = sns.barplot(x=df_plot_reduced['ObservationDate'],y=df_plot_reduced['Deaths'],data = df_plot_reduced, palette = sns.cubehelix_palette(86,start=1.1,rot=0.1))

ax.set(xlabel='Date',
       ylabel='Number of deaths',
      title='Number of covid-19 deaths')

#To be able to read the x-axis, we rotate the labels
ax.set_xticklabels(ax.get_xticklabels(),
                   rotation = 90, 
                   horizontalalignment='right');

2.  Global study per month

In [None]:
#Dataframe with just the last day of each month, used later to find the number of confirmed/deaths/recovered cases per month (and not the cumulative number)
labels=['01/31/2020','02/29/2020','03/31/2020','04/30/2020','05/31/2020','06/30/2020','07/31/2020','08/31/2020']

df_month = pd.DataFrame() #New DataFrame
for date in labels: #We just want the last day of each month, so we concatenate that day's row with the rows already collected
    df_month = pd.concat(
        [df_plot.loc[df_plot['ObservationDate']==date] , df_month], 
        ignore_index=True)
    
df_month = df_month.sort_values(by='ObservationDate',ascending=False) #The first line is for August (the rows are already in this order, so this sort is not strictly necessary)
df_month.head(10) #Total number of confirmed/deaths/recovered cases at the end of each month

In [None]:
#Number of confirmed cases/recovered/deaths per month
#For example : Number in August = (number on 8/31/20) - (number on 7/31/20)

df_month.loc['August'] = df_month.loc[0][['Confirmed','Deaths','Recovered']] - df_month.loc[1][['Confirmed','Deaths','Recovered']]
df_month.loc['July'] = df_month.loc[1][['Confirmed','Deaths','Recovered']] - df_month.loc[2][['Confirmed','Deaths','Recovered']]
df_month.loc['June'] = df_month.loc[2][['Confirmed','Deaths','Recovered']] - df_month.loc[3][['Confirmed','Deaths','Recovered']]
df_month.loc['May'] = df_month.loc[3][['Confirmed','Deaths','Recovered']] - df_month.loc[4][['Confirmed','Deaths','Recovered']]
df_month.loc['April'] = df_month.loc[4][['Confirmed','Deaths','Recovered']] - df_month.loc[5][['Confirmed','Deaths','Recovered']]
df_month.loc['March'] = df_month.loc[5][['Confirmed','Deaths','Recovered']] - df_month.loc[6][['Confirmed','Deaths','Recovered']]
df_month.loc['February'] = df_month.loc[6][['Confirmed','Deaths','Recovered']] - df_month.loc[7][['Confirmed','Deaths','Recovered']]
df_month.loc['January'] = df_month.loc[7][['Confirmed','Deaths','Recovered']]

#We keep just the monthly cases and use iloc to sort it the right way
df_month = df_month.drop([i for i in range(0,8)]).drop('ObservationDate',axis=1).iloc[::-1]
df_month.head(8)
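The eight subtractions above can also be written as a single `diff()` call once the month-end rows are in chronological order. This is a sketch on made-up numbers: the frame `toy_month_end` and its values are assumptions, not the real totals.

```python
import pandas as pd

# Made-up cumulative totals on the last day of three months
toy_month_end = pd.DataFrame(
    {"Confirmed": [10, 90, 800], "Deaths": [0, 3, 40], "Recovered": [1, 40, 170]},
    index=["January", "February", "March"],
)

# Rows are in chronological order, so diff() gives the increase per month;
# January has no previous month and keeps its cumulative value
toy_monthly = toy_month_end.diff().fillna(toy_month_end).astype(int)
print(toy_monthly)
```

The result has the same layout as `df_month`: one row per month, one column per kind of case.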

Number of confirmed/deaths/recovered covid-19 cases per month :

In [None]:
labels=['January','February','March','April','May','June','July','August'] #x label as well as the name of the df_month lines

fig, axs= plt.subplots(ncols=3,figsize=(21,5)) #Several charts on one figure
ax1=sns.barplot(x=labels,
                y=df_month['Confirmed'],
                data=df_month,
                ax=axs[0], #First chart
                palette = 'Blues')
ax1.set(xlabel='Month',ylabel='Number of confirmed cases',title='Number of covid-19 confirmed cases per month');

ax2=sns.barplot(x=labels,
                y=df_month['Deaths'],
                data=df_month,
                ax=axs[1], #Second chart
                color='Grey')
ax2.set(xlabel='Month',ylabel='Number of deaths',title='Number of covid-19 deaths per month');

ax3=sns.barplot(x=labels,
                y=df_month['Recovered'],
                data=df_month,
                ax=axs[2], #Third chart
                palette = 'Greens')
ax3.set(xlabel='Month',ylabel='Number of recovered cases',title='Number of covid-19 recovered cases per month');

- We can see that even though confirmed covid-19 cases keep growing, the number of recovered cases rises in the same way, so the number of deaths tends to stagnate.
- We can also see that the global lockdown seems effective: the number of deaths decreased between April and May, and the lockdown was in place during April or May in most countries. The number of deaths has tended to rise slightly again since the lockdown was lifted. In addition, the timing of the lockdown decision seems right, because the number of deaths skyrocketed in April.

# The impact of the virus per country 

In this section we will first look at the situation on 08/31/2020: which countries have the most confirmed cases, deaths and recovered cases?

Then we will show the top 5 countries with the most confirmed/deaths/recovered cases per month.

**1. Situation on 08/31/2020**

- First we want to see which countries have the most confirmed cases on 8/31/2020 :

In [None]:
df_ts_confirmed.head() #To see this dataframe

In [None]:
df_confirmed_top20 = df_ts_confirmed.sort_values(by='8/31/20',ascending=False).head(20) #because we want the 20 countries with the most confirmed cases

plt.figure(figsize=(20,5)) #To have a bigger figure
ax=sns.barplot(
    x='Country/Region',
    y='8/31/20',
    data=df_confirmed_top20, 
    palette = "Blues_d"
)

ax.set(xlabel='Country',
       ylabel='Number of cases',
       title='Top 20 countries with the most confirmed cases on 8/31/2020');
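The time-series files also carry a 'Province/State' column, so a country split into provinces may show up as several rows in the ranking above. Below is a minimal sketch of summing by country before ranking. The frame `toy_ts`, the country names and the values are all made up; only the column names follow the dataset.

```python
import pandas as pd

# Made-up rows in the same wide layout as df_ts_confirmed
toy_ts = pd.DataFrame({
    "Province/State": [None, "Province A", "Province B", None],
    "Country/Region": ["Country X", "Country Y", "Country Y", "Country Z"],
    "8/31/20": [500, 300, 400, 600],
})

# Sum the provinces of each country, then keep the largest totals
toy_by_country = (
    toy_ts.groupby("Country/Region")["8/31/20"]
    .sum()
    .sort_values(ascending=False)
    .head(2)
    .reset_index()
)
print(toy_by_country)
```

Here "Country Y" only reaches first place once its two provinces are added together.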


Scatter graph :

In [None]:
#Scatter graph using plotly
fig = go.Figure(data=[go.Scatter(
    x=df_confirmed_top20['Country/Region'],
    y=df_confirmed_top20['8/31/20'],
    mode='markers',
    marker=dict(size=(df_confirmed_top20['8/31/20']/30000))
)])

fig.update_layout(title='Top 20 countries with the most confirmed cases on 8/31/2020',
    xaxis_title="Country",
    yaxis_title="Confirmed Cases"
)
fig.show()

- Top 20 countries with the most recovered cases

In [None]:
df_recovered_top20 = df_ts_recovered.sort_values(by='8/31/20',ascending=False).head(20) #because we want the 20 countries where there is the most recovered cases

plt.figure(figsize=(20,5)) #To have a bigger figure
ax=sns.barplot(
    x='Country/Region',
    y='8/31/20',
    data=df_recovered_top20,
    palette = "Greens_d"
)

ax.set(xlabel='Country',
       ylabel='Number of cases',
       title='Top 20 countries with the most recovered cases on 8/31/2020');

- The 20 countries with the most covid-19 deaths

In [None]:
df_deaths_top20 = df_ts_deaths.sort_values(by='8/31/20',ascending=False).head(20) #because we want the 20 countries with the most deaths

plt.figure(figsize=(20,5)) #To have a bigger figure
ax=sns.barplot(
    x='Country/Region',
    y='8/31/20',
    data=df_deaths_top20,
    palette = 'Greys_d'
)

ax.set(xlabel='Country',
       ylabel='Number of deaths',
       title='Top 20 countries with the most deaths on 8/31/2020');

- The 20 countries with the highest (deaths / confirmed cases) ratio on 08/31/2020

In [None]:
#We first create a new Dataframe
df_ratio = pd.DataFrame() 

df_ratio['Country/Region'] = df_ts_confirmed['Country/Region'] #To have the countries
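#Deaths/confirmed ratio as a percentage, keeping only the rows without a province
#(the division matches rows by index, so both files must list their rows in the same order)
deaths_no_province = df_ts_deaths.loc[df_ts_deaths['Province/State'].isnull()]['8/31/20']
confirmed_no_province = df_ts_confirmed.loc[df_ts_confirmed['Province/State'].isnull()]['8/31/20']
df_ratio['Ratio deaths/confirmed'] = (deaths_no_province / confirmed_no_province) * 100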
df_ratio = df_ratio.sort_values(by='Ratio deaths/confirmed',ascending = False).head(20)
df_ratio.head()
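The ratio above pairs the deaths and confirmed rows by their position in each file. A sketch that pairs them by country name instead uses `merge`. The frames `toy_confirmed` and `toy_deaths` and their numbers are made up for illustration, and the suffixes are arbitrary names.

```python
import pandas as pd

# Made-up totals; both frames list the same countries in a different order
toy_confirmed = pd.DataFrame(
    {"Country/Region": ["A", "B", "C"], "8/31/20": [1000, 200, 50]}
)
toy_deaths = pd.DataFrame(
    {"Country/Region": ["C", "A", "B"], "8/31/20": [5, 20, 10]}
)

# Merge on the country name so each ratio uses matching rows
toy_ratio = toy_confirmed.merge(
    toy_deaths, on="Country/Region", suffixes=("_confirmed", "_deaths")
)
toy_ratio["Ratio deaths/confirmed"] = (
    toy_ratio["8/31/20_deaths"] / toy_ratio["8/31/20_confirmed"] * 100
)
toy_ratio = toy_ratio.sort_values(by="Ratio deaths/confirmed", ascending=False)
print(toy_ratio[["Country/Region", "Ratio deaths/confirmed"]])
```

This gives the same ratios when the two files are in the same order, and still gives correct ones when they are not.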

In [None]:
plt.figure(figsize=(20,5)) #To have a bigger figure
ax=sns.barplot(
    x='Country/Region',
    y='Ratio deaths/confirmed',
    data=df_ratio,
    palette='Reds_d'
)

ax.set(xlabel='Country',
       ylabel='Percentage',
       title='Top 20 countries with the highest deaths/confirmed ratio on 8/31/2020')

ax.set_xticklabels(ax.get_xticklabels(),
                   rotation = 90, 
                   horizontalalignment='right');

The MS Zaandam cruise ship ranks 2nd because of its low number of passengers.

We can see that the highest ratios are located in Europe, so we can suggest a few possible explanations:
   - People in Europe are more sensitive to the virus (this is surely not the case)
   - Countries in Europe don't run as many tests as some other big countries: running more tests would detect more confirmed cases, so the ratio would decrease
   - In these countries the virus infected more vulnerable people (elderly people, for example)
    
    
It would be nice to divide these values by the population of each country or by the number of tests in each country to draw reliable conclusions.

**2. Top 5 countries with the most confirmed/deaths/recovered cases per month**

- The 5 countries with the most confirmed cases per month

In [None]:
#We will create a new dataframe which contains the number of confirmed cases per month
df_new_confirmed = pd.DataFrame() #Creation of the dataframe

df_new_confirmed['Country/Region'] = df_ts_confirmed['Country/Region'] #We copy the country column

#We calculate the number of new confirmed cases for each month
#(cumulative count on the last day of the month minus the count on its first available day)
df_new_confirmed['Number of new confirmed cases in August'] = df_ts_confirmed['8/31/20']-df_ts_confirmed['8/1/20']
df_new_confirmed['Number of new confirmed cases in July'] = df_ts_confirmed['7/31/20']-df_ts_confirmed['7/1/20']
df_new_confirmed['Number of new confirmed cases in June'] = df_ts_confirmed['6/30/20']-df_ts_confirmed['6/1/20']
df_new_confirmed['Number of new confirmed cases in May'] =df_ts_confirmed['5/31/20']-df_ts_confirmed['5/1/20']
df_new_confirmed['Number of new confirmed cases in April'] = df_ts_confirmed['4/30/20']-df_ts_confirmed['4/1/20']
df_new_confirmed['Number of new confirmed cases in March'] = df_ts_confirmed['3/31/20']-df_ts_confirmed['3/1/20']
df_new_confirmed['Number of new confirmed cases in February'] = df_ts_confirmed['2/29/20']-df_ts_confirmed['2/1/20']
df_new_confirmed['Number of new confirmed cases in January'] = df_ts_confirmed['1/31/20']-df_ts_confirmed['1/22/20']
df_new_confirmed.head()
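The columns above subtract the count on the first day of the month, which leaves out the cases reported on that first day. A possible variant, sketched here on made-up data, subtracts the previous month-end column instead, in the same way as the global study per month. The frame `toy_ts` and the lists `month_ends` and `month_names` are assumptions for illustration.

```python
import pandas as pd

# Made-up cumulative counts in the wide layout of the time-series files
toy_ts = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/31/20": [10, 0],
    "2/29/20": [50, 5],
    "3/31/20": [200, 80],
})

month_ends = ["1/31/20", "2/29/20", "3/31/20"]  # last day of each month
month_names = ["January", "February", "March"]

# diff(axis=1) subtracts the previous month-end column from each column
toy_new = toy_ts[month_ends].diff(axis=1)
# The first month has no previous column, so it keeps its cumulative value
toy_new[month_ends[0]] = toy_ts[month_ends[0]]
toy_new = toy_new.astype(int)
toy_new.columns = month_names
toy_new.insert(0, "Country/Region", toy_ts["Country/Region"])
print(toy_new)
```

The two approaches give close results for large countries, so the rankings in the charts are unlikely to change much.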

In [None]:
#Graph that shows the 5 countries with the most confirmed cases of covid-19 per month
fig, axs=plt.subplots(nrows=2,ncols=4,figsize = (25,15))

ax1=sns.barplot(x='Country/Region',
                y='Number of new confirmed cases in January',
                data=df_new_confirmed.sort_values(by='Number of new confirmed cases in January',ascending=False).head(5), #Top 5 countries
                ax=axs[0][0],
                palette="Blues_d")
ax1.set(xlabel='Country',
        ylabel='Number of confirmed cases',
        title='The number of confirmed cases in January')

ax2=sns.barplot(x='Country/Region',y='Number of new confirmed cases in February',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in February',ascending=False).head(5),ax=axs[0][1],palette="Blues_d")
ax2.set(xlabel='Country',ylabel='',title='The number of confirmed cases in February')

ax3=sns.barplot(x='Country/Region',y='Number of new confirmed cases in March',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in March',ascending=False).head(5),ax=axs[0][2],palette="Blues_d")
ax3.set(xlabel='Country',ylabel='',title='The number of confirmed cases in March')

ax4=sns.barplot(x='Country/Region',y='Number of new confirmed cases in April',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in April',ascending=False).head(5),ax=axs[0][3],palette="Blues_d")
ax4.set(xlabel='Country',ylabel='',title='The number of confirmed cases in April')

ax5=sns.barplot(x='Country/Region',y='Number of new confirmed cases in May',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in May',ascending=False).head(5),ax=axs[1][0],palette="Blues_d")
ax5.set(xlabel='Country',ylabel='Number of confirmed cases',title='The number of confirmed cases in May')

ax6=sns.barplot(x='Country/Region',y='Number of new confirmed cases in June',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in June',ascending=False).head(5),ax=axs[1][1],palette="Blues_d")
ax6.set(xlabel='Country',ylabel='',title='The number of confirmed cases in June')

ax7=sns.barplot(x='Country/Region',y='Number of new confirmed cases in July',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in July',ascending=False).head(5),ax=axs[1][2],palette="Blues_d")
ax7.set(xlabel='Country',ylabel='',title='The number of confirmed cases in July')

ax8=sns.barplot(x='Country/Region',y='Number of new confirmed cases in August',data=df_new_confirmed.sort_values(by='Number of new confirmed cases in August',ascending=False).head(5),ax=axs[1][3],palette="Blues_d")
ax8.set(xlabel='Country',ylabel='',title='The number of confirmed cases in August')

plt.show()
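The eight calls above all repeat the same "sort, then keep the first 5" step. As a sketch, that step can be moved into a small helper and called in a loop. The helper name `top_countries` and the frame `toy_new` are hypothetical, and the values are made up.

```python
import pandas as pd

# Made-up monthly counts, for illustration only
toy_new = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "March": [5, 40, 12, 30],
    "April": [50, 20, 70, 10],
})

def top_countries(frame, column, n=2):
    """Return the n rows with the largest value in `column`."""
    return frame.nlargest(n, column)[["Country/Region", column]]

# One call per month instead of one hand-written line per month
for month in ["March", "April"]:
    print(top_countries(toy_new, month))
```

Each result could then be passed as the `data` argument of `sns.barplot`, with the loop index choosing the subplot.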


The US is in first position in most months, because the virus affects many people there and many tests are carried out. We can see that in March and April the virus was circulating more in Europe. Then South America and India became the center of the pandemic.

- The 5 countries with the most deaths per month

In [None]:
#Again we create a new dataframe (because the 'Country/Region' may not be in the same order between the confirmed and the deaths database )
df_new_deaths = pd.DataFrame()
df_new_deaths['Country/Region'] = df_ts_deaths['Country/Region']

df_new_deaths['Number of deaths in August'] = df_ts_deaths['8/31/20']-df_ts_deaths['8/1/20']
df_new_deaths['Number of deaths in July'] = df_ts_deaths['7/31/20']-df_ts_deaths['7/1/20']
df_new_deaths['Number of deaths in June'] = df_ts_deaths['6/30/20']-df_ts_deaths['6/1/20']
df_new_deaths['Number of deaths in May'] = df_ts_deaths['5/31/20']-df_ts_deaths['5/1/20']
df_new_deaths['Number of deaths in April'] = df_ts_deaths['4/30/20']-df_ts_deaths['4/1/20']
df_new_deaths['Number of deaths in March'] = df_ts_deaths['3/31/20']-df_ts_deaths['3/1/20']
df_new_deaths['Number of deaths in February'] = df_ts_deaths['2/29/20']-df_ts_deaths['2/1/20']
df_new_deaths['Number of deaths in January'] = df_ts_deaths['1/31/20']-df_ts_deaths['1/22/20']
df_new_deaths.head()

In [None]:
#Graph that shows the 5 countries with the most deaths due to covid-19 per month
fig, axs=plt.subplots(nrows=2,ncols=4,figsize = (25,15))

ax1=sns.barplot(x='Country/Region',y='Number of deaths in January',data=df_new_deaths.sort_values(by='Number of deaths in January',ascending=False).head(5),ax=axs[0][0],palette=sns.dark_palette('white'))
ax1.set(xlabel='Country',ylabel='Number of deaths',title='The number of deaths in January')

ax2=sns.barplot(x='Country/Region',y='Number of deaths in February',data=df_new_deaths.sort_values(by='Number of deaths in February',ascending=False).head(5),ax=axs[0][1],palette=sns.dark_palette('white'))
ax2.set(xlabel='Country',ylabel='',title='The number of deaths in February')

ax3=sns.barplot(x='Country/Region',y='Number of deaths in March',data=df_new_deaths.sort_values(by='Number of deaths in March',ascending=False).head(5),ax=axs[0][2],palette=sns.dark_palette('white'))
ax3.set(xlabel='Country',ylabel='',title='The number of deaths in March')

ax4=sns.barplot(x='Country/Region',y='Number of deaths in April',data=df_new_deaths.sort_values(by='Number of deaths in April',ascending=False).head(5),ax=axs[0][3],palette=sns.dark_palette('white'))
ax4.set(xlabel='Country',ylabel='',title='The number of deaths in April')

ax5=sns.barplot(x='Country/Region',y='Number of deaths in May',data=df_new_deaths.sort_values(by='Number of deaths in May',ascending=False).head(5),ax=axs[1][0],palette=sns.dark_palette('white'))
ax5.set(xlabel='Country',ylabel='Number of deaths',title='The number of deaths in May')

ax6=sns.barplot(x='Country/Region',y='Number of deaths in June',data=df_new_deaths.sort_values(by='Number of deaths in June',ascending=False).head(5),ax=axs[1][1],palette=sns.dark_palette('white'))
ax6.set(xlabel='Country',ylabel='',title='The number of deaths in June')

ax7=sns.barplot(x='Country/Region',y='Number of deaths in July',data=df_new_deaths.sort_values(by='Number of deaths in July',ascending=False).head(5),ax=axs[1][2],palette=sns.dark_palette('white'))
ax7.set(xlabel='Country',ylabel='',title='The number of deaths in July')

ax8=sns.barplot(x='Country/Region',y='Number of deaths in August',data=df_new_deaths.sort_values(by='Number of deaths in August',ascending=False).head(5),ax=axs[1][3],palette=sns.dark_palette('white'))
ax8.set(xlabel='Country',ylabel='',title='The number of deaths in August')

plt.show()

- The 5 countries with the most recovered cases per month

In [None]:
#Same thing as before but with the recovered database
df_new_recovered = pd.DataFrame()
df_new_recovered['Country/Region'] = df_ts_recovered['Country/Region']
df_new_recovered['Number of recovery cases in August']= df_ts_recovered['8/31/20']-df_ts_recovered['8/1/20']
df_new_recovered['Number of recovery cases in July']= df_ts_recovered['7/31/20']-df_ts_recovered['7/1/20']
df_new_recovered['Number of recovery cases in June']= df_ts_recovered['6/30/20']-df_ts_recovered['6/1/20']
df_new_recovered['Number of recovery cases in May']= df_ts_recovered['5/31/20']-df_ts_recovered['5/1/20']
df_new_recovered['Number of recovery cases in April']= df_ts_recovered['4/30/20']-df_ts_recovered['4/1/20']
df_new_recovered['Number of recovery cases in March']= df_ts_recovered['3/31/20']-df_ts_recovered['3/1/20']
df_new_recovered['Number of recovery cases in February']= df_ts_recovered['2/29/20']-df_ts_recovered['2/1/20']
df_new_recovered['Number of recovery cases in January']= df_ts_recovered['1/31/20']-df_ts_recovered['1/22/20']
df_new_recovered.head()


In [None]:
#Graph that shows the 5 countries with the most recovered cases of covid-19 per month
fig, axs=plt.subplots(nrows=2,ncols=4,figsize = (25,15))
ax1=sns.barplot(x='Country/Region',y='Number of recovery cases in January',data=df_new_recovered.sort_values(by='Number of recovery cases in January',ascending=False).head(5),ax=axs[0][0],palette=sns.light_palette('green'))
ax1.set(xlabel='Country',ylabel='Number of recovery cases',title='The number of recovery cases in January')

ax2=sns.barplot(x='Country/Region',y='Number of recovery cases in February',data=df_new_recovered.sort_values(by='Number of recovery cases in February',ascending=False).head(5),ax=axs[0][1],palette=sns.light_palette('green'))
ax2.set(xlabel='Country',ylabel='',title='The number of recovery cases in February')

ax3=sns.barplot(x='Country/Region',y='Number of recovery cases in March',data=df_new_recovered.sort_values(by='Number of recovery cases in March',ascending=False).head(5),ax=axs[0][2],palette=sns.light_palette('green'))
ax3.set(xlabel='Country',ylabel='',title='The number of recovery cases in March')

ax4=sns.barplot(x='Country/Region',y='Number of recovery cases in April',data=df_new_recovered.sort_values(by='Number of recovery cases in April',ascending=False).head(5),ax=axs[0][3],palette=sns.light_palette('green'))
ax4.set(xlabel='Country',ylabel='',title='The number of recovery cases in April')

ax5=sns.barplot(x='Country/Region',y='Number of recovery cases in May',data=df_new_recovered.sort_values(by='Number of recovery cases in May',ascending=False).head(5),ax=axs[1][0],palette=sns.light_palette('green'))
ax5.set(xlabel='Country',ylabel='Number of recovery cases',title='The number of recovery cases in May')

ax6=sns.barplot(x='Country/Region',y='Number of recovery cases in June',data=df_new_recovered.sort_values(by='Number of recovery cases in June',ascending=False).head(5),ax=axs[1][1],palette=sns.light_palette('green'))
ax6.set(xlabel='Country',ylabel='',title='The number of recovery cases in June')

ax7=sns.barplot(x='Country/Region',y='Number of recovery cases in July',data=df_new_recovered.sort_values(by='Number of recovery cases in July',ascending=False).head(5),ax=axs[1][2],palette=sns.light_palette('green'))
ax7.set(xlabel='Country',ylabel='',title='The number of recovery cases in July')

ax8=sns.barplot(x='Country/Region',y='Number of recovery cases in August',data=df_new_recovered.sort_values(by='Number of recovery cases in August',ascending=False).head(5),ax=axs[1][3],palette=sns.light_palette('green'))
ax8.set(xlabel='Country',ylabel='',title='The number of recovery cases in August')
plt.show()

# Coronavirus in France

So let's see whether a new lockdown is a good idea.

In [None]:
#In this section we rebuild a DataFrame whose rows are the months and whose columns are the confirmed, deaths and recovered cases

#First we select the 'France' rows of 'df_data' that have no province
is_france = df_data['Country/Region']=='France'
no_province = df_data['Province/State'].isnull()
df_fr_date = df_data.loc[is_france & no_province].sort_values(by='ObservationDate',ascending=False)
df_fr_date = df_fr_date.drop(df_fr_date.columns[[0,2,3,4]],axis=1)
df_fr_date.head()

In [None]:
#Dataframe with just the last day of each month, used later to find the number of confirmed/deaths/recovered cases per month (and not the cumulative number)
labels=['01/31/2020','02/29/2020','03/31/2020','04/30/2020','05/31/2020','06/30/2020','07/31/2020','08/31/2020']

df_fr= pd.DataFrame() #New DataFrame
for date in labels: #We just want the last day of each month, so we concatenate that day's row with the rows already collected
    df_fr = pd.concat(
        [df_fr_date.loc[df_fr_date['ObservationDate']==date] , df_fr], 
        ignore_index=True)
    
df_fr = df_fr.sort_values(by='ObservationDate',ascending=False) #The first line is for August (the rows are already in this order, so this sort is not strictly necessary)
df_fr.head(10) #Total number of confirmed/deaths/recovered cases at the end of each month

In [None]:
#Number of confirmed cases/recovered/deaths per month
#For example : Number in August = (number on 8/31/20) - (number on 7/31/20)

df_fr.loc['August'] = df_fr.loc[0][['Confirmed','Deaths','Recovered']] - df_fr.loc[1][['Confirmed','Deaths','Recovered']]
df_fr.loc['July'] = df_fr.loc[1][['Confirmed','Deaths','Recovered']] - df_fr.loc[2][['Confirmed','Deaths','Recovered']]
df_fr.loc['June'] = df_fr.loc[2][['Confirmed','Deaths','Recovered']] - df_fr.loc[3][['Confirmed','Deaths','Recovered']]
df_fr.loc['May'] = df_fr.loc[3][['Confirmed','Deaths','Recovered']] - df_fr.loc[4][['Confirmed','Deaths','Recovered']]
df_fr.loc['April'] = df_fr.loc[4][['Confirmed','Deaths','Recovered']] - df_fr.loc[5][['Confirmed','Deaths','Recovered']]
df_fr.loc['March'] = df_fr.loc[5][['Confirmed','Deaths','Recovered']] - df_fr.loc[6][['Confirmed','Deaths','Recovered']]
df_fr.loc['February'] = df_fr.loc[6][['Confirmed','Deaths','Recovered']] - df_fr.loc[7][['Confirmed','Deaths','Recovered']]
df_fr.loc['January'] = df_fr.loc[7][['Confirmed','Deaths','Recovered']]

#We keep just the monthly cases and use iloc to sort it the right way
df_fr = df_fr.drop([i for i in range(0,8)]).drop('ObservationDate',axis=1).iloc[::-1]
df_fr.head(8) #There are some errors in the database for the recovered cases

Number of confirmed/deaths/recovered covid-19 cases per month in France :

In [None]:
labels=['January','February','March','April','May','June','July','August'] #x labels as well as the names of the df_fr lines

fig, axs= plt.subplots(ncols=3,figsize=(21,5)) #Several charts on one figure
ax1=sns.barplot(x=labels,
                y=df_fr['Confirmed'],
                data=df_fr,
                ax=axs[0], #First chart
                palette = 'Blues')
ax1.set(xlabel='Month',ylabel='Number of confirmed cases',title='Number of covid-19 confirmed cases per month in France');

ax2=sns.barplot(x=labels,
                y=df_fr['Deaths'],
                data=df_fr,
                ax=axs[1], #Second chart
                color='Grey')
ax2.set(xlabel='Month',ylabel='Number of deaths',title='Number of covid-19 deaths per month in France');

ax3=sns.barplot(x=labels,
                y=df_fr['Recovered'],
                data=df_fr,
                ax=axs[2], #Third chart
                palette = 'Greens')
ax3.set(xlabel='Month',ylabel='Number of recovered cases',title='Number of covid-19 recovered cases per month in France');

First, we can see that the lockdown has been very effective in France (it was in place from the 17th of March to the 11th of May).

In August, the number of confirmed cases skyrocketed, but the number of deaths was very low compared to the other months. With this data alone, the idea of a second lockdown looks a bit exaggerated, because the rise in confirmed cases in August may be caused by a rise in the number of tests.

So I searched for a little more information on this matter, especially on this website: https://www.cascoronavirus.fr/test-depistage/france

We see that around 4% of the tests were positive in late August, whereas only 1-2% of them were positive in June-July. So the rise in confirmed cases cannot be explained by more testing alone.
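The share of positive tests matters because, if more testing alone explained the rise in cases, this share would stay roughly flat. The calculation itself is a simple percentage. In the sketch below, `tests_done` and `positive_tests` are hypothetical figures chosen for illustration, not values taken from the website.

```python
# Hypothetical figures, not taken from the cited website
tests_done = 900_000
positive_tests = 36_000

# Positivity rate: share of tests that come back positive, in percent
positivity_rate = positive_tests / tests_done * 100
print(f"Positivity rate: {positivity_rate:.1f}%")
```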

To conclude, the idea of a second lockdown in France is well founded, though perhaps premature, and the prospect is scary.

# Other ideas that can provide some useful information :

* First, we could look at which countries saw their confirmed/deaths/recovered cases rise between two consecutive months
* We could plot charts relative to each country's population, to see what proportion of the population is affected, but this information is not part of the dataset
* We could plot charts relative to the number of tests per country
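As a sketch of the population idea above, cases can be expressed per 100,000 inhabitants once a population column is available. Everything in the frame `toy` is a placeholder: the counts and the populations are made up, and a real population table would have to come from another source.

```python
import pandas as pd

# Placeholder numbers only: neither the counts nor the populations are real
toy = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Confirmed": [50_000, 50_000],
    "Population": [10_000_000, 1_000_000],
})

# Cases per 100,000 inhabitants make countries of different sizes comparable
toy["Confirmed per 100k"] = toy["Confirmed"] / toy["Population"] * 100_000
print(toy)
```

Both toy countries have the same number of cases, but the smaller one is hit ten times harder relative to its population.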

# Documentation

- Go see this wonderful kernel from Fedi Ben Messaoud : https://www.kaggle.com/fedi1996/covid-19-analysis-visualization-and-comparaisons . His sum operation on the 'df_data' dataframe and his scatter plot using plotly inspired me

- Plotting data using Seaborn : https://seaborn.pydata.org/api.html
- Operations on DataFrames columns : https://queirozf.com/entries/pandas-dataframe-examples-column-operations#add-new-column
- Rotate x or y label : https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib


# Thanks for your interest !