# Objectives of the analysis 

The whole world is now overwhelmed by the coronavirus. Confirmed cases are growing exponentially, and each day brings more new cases than the day before. Because the virus is highly contagious, it is important to understand its spread pattern in order to act in advance.
In this analysis, I focus on two topics. First, I present a general view of the current state of the outbreak, both globally and by continent. Second, I implement different metrics for measuring the speed of spread. The results serve two purposes: to see whether the trend is slowing down (and, if not, when it probably will), and to evaluate the effectiveness of the lockdowns being enacted in many countries.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
import plotly.express as px
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the dataset
my_data= pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv')
countryContinent= pd.read_csv('/kaggle/input/countrycontinent/countryContinent.csv', encoding = "ISO-8859-1")



my_data["Last Update"] = pd.to_datetime(my_data["Last Update"])
my_data["ObservationDate"] = pd.to_datetime(my_data["ObservationDate"])
my_data['Country/Region'] = np.where(my_data['Country/Region'] == "Mainland China","China" ,  my_data['Country/Region']) 
my_data = my_data.rename(columns={"ObservationDate": "Date", "Country/Region": "Country"})




In any time series analysis, the daily change is important and is most likely to be the target of the analysis. Here, I add the daily change for confirmed cases, deaths, and recoveries.
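
As a minimal sketch of what "daily change" means here, using made-up numbers and the hypothetical country names `A` and `B`: the cumulative series is differenced within each country, so the first day of every country has no value.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical countries
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "Confirmed": [10, 15, 27, 100, 100, 130],
})

toy = toy.sort_values(["Country", "Date"]).reset_index(drop=True)
# diff() is applied inside each country, so values never leak across countries
toy["New_Confirmed"] = toy.groupby("Country")["Confirmed"].diff()
print(toy)
```

The first row of each country is `NaN` because there is no previous day to compare against.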

In [None]:
# Aggregate provinces/states into one row per country and date
country_data = my_data.groupby(['Country','Date'])[['Confirmed','Deaths','Recovered']].sum()
country_data.sort_index(inplace=True)

# Daily change per country: diff() within each country, so the first day of each country is NaN
daily_change = country_data.groupby(level='Country')[['Confirmed','Deaths','Recovered']].diff()
country_data['Country_New_Confirmed'] = daily_change['Confirmed']
country_data['Country_New_Deaths'] = daily_change['Deaths']
country_data['Country_New_Recovered'] = daily_change['Recovered']

country_data = country_data.reset_index()

# merge the data with continent information    
country_data = country_data.merge(countryContinent,  how='left', 
                            left_on='Country', 
                            right_on='Country',
                            suffixes=('','_right'))



# Latest data

###  Continent

In [None]:
Continent = country_data.groupby(["Date",'Continent'])['Confirmed'].sum().to_frame().reset_index()

Continent = Continent.pivot(index='Date', columns='Continent', values='Confirmed').reset_index()

Continent_percentage  = Continent.iloc[:,1:8].div(Continent.iloc[:,1:8].sum(axis=1), axis=0)
Continent_percentage= round(Continent_percentage,3) *100
Continent_percentage["Date"] =Continent["Date"]
# make the date the first column
cols = Continent_percentage.columns.tolist()
cols = cols[-1:] + cols[:-1]
Continent_percentage = Continent_percentage[cols]


Continent_percentage.tail(5)
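
A toy illustration of the share computation, with made-up counts and only two regions: each row is divided by its own total, so the shares on a given date add up to 100.

```python
import pandas as pd

# Made-up cumulative counts per region (hypothetical values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
    "Asia": [80, 90],
    "Europe": [20, 60],
})

counts = toy.drop(columns="Date")
# Divide each row by its own total, then express the result as a percentage
share = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
# Put the date back as the first column
share.insert(0, "Date", toy["Date"])
print(share)
```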


### Country - daily change

In [None]:
last_day = country_data["Date"].max()
country_data[country_data["Date"] == last_day].sort_values("Country_New_Confirmed", ascending = False).iloc[:,[0,5,6,7]].head(10).reset_index(drop=True).style.background_gradient(cmap='Blues')

### Country - accumulated 

In [None]:
last_day = country_data["Date"].max()
country_data[country_data["Date"] == last_day].sort_values("Confirmed", ascending = False).iloc[:,[0,2,3,4]].head(10).reset_index(drop=True).style.background_gradient(cmap='Blues')

# Global trend

In [None]:
# Select the columns with a list; tuple-style selection fails in recent pandas versions
world_data = my_data.groupby('Date')[['Confirmed','Deaths','Recovered']].sum()
world_data.reset_index(inplace=True)
world_data["Golbal_New_Confirmed"] = world_data["Confirmed"].diff()
world_data["Golbal_New_Deaths"] = world_data["Deaths"].diff()
world_data["Golbal_New_Recovered"] = world_data["Recovered"].diff()
world_data.tail(1)

In [None]:
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)


world_data.plot(x='Date',y=['Confirmed','Deaths','Recovered'],kind='line',ax=ax1)
world_data.plot(x='Date',y=['Golbal_New_Confirmed','Golbal_New_Deaths','Golbal_New_Recovered'],kind='line', ax=ax2)


As of the day I uploaded this file, there are almost 661,000 confirmed cases worldwide, and the death toll has reached 30,625. The number of confirmed cases rose dramatically around March 12, when both Europe and North America began to experience large-scale outbreaks.


*The peak on Feb 12 is due to a change in the reporting standard in China, which reclassified most of the uncertain cases as confirmed cases.


In [None]:
country_data = country_data.merge(world_data[['Date','Golbal_New_Confirmed','Golbal_New_Deaths','Golbal_New_Recovered']], how='inner', 
                            left_on='Date', 
                            right_on='Date',
                            suffixes=('','_world'))

country_data=country_data.sort_values(['Date', 'Country_New_Confirmed'], ascending=[True, False])
# Top 5 countries per date; .copy() so the columns below are added to an independent frame
ranking = country_data.groupby("Date").head(5).copy()
ranking['Country_New_Confirmed'] = ranking['Country_New_Confirmed'].astype(str)
ranking["Info"] = ranking['Country'].str.cat(ranking['Country_New_Confirmed'],sep=" : ")


In [None]:
plot_graph = ranking.groupby(['Date','Golbal_New_Confirmed'])['Info'].apply(list).to_frame()
plot_graph = plot_graph.reset_index()

# Plotly creates its own figure, so no matplotlib figure is needed here
fig = px.line(plot_graph,x="Date", y="Golbal_New_Confirmed",hover_data=[ 'Info'])
fig.show()

The chart above shows, on hover, the five countries with the most new daily cases. A few points are worth mentioning:
1. The outbreak outside China started at the end of February, when there were more new cases in South Korea than in China.
2. From the start of March, Iran entered the top 5 (possibly starting with religious gatherings). Italy also began to report a large number of new cases at this time.
3. From March 10, apart from Iran, all the top 5 countries were in either Europe or North America, with Italy almost always reporting the most new daily cases.
4. From March 19 until now, the US has been at the top.


In [None]:


fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(211)
world_data.plot(x='Date',y=['Confirmed'],kind='line',ax=ax1)


Continent.plot(x='Date',y=['Asia', 'Northern America', 'Oceania', 'South America', 'Europe','Africa', 'Western Asia'],kind='line',ax=ax1)
#Continent_plot[["Asia","Date"].plot(x='Date',y=['Confirmed'],kind='line',ax=ax1)


It is obvious that before March, Asia, mainly China, dominated the global case count. The epicenter of the outbreak then shifted from Asia to Europe. In the future, the center is expected to be North America, mainly the US.

## Country Trend 

In [None]:
# Rebuild the per-country table from the raw data (drops the columns added by the earlier merges)
country_data = my_data.groupby(['Country','Date'])[['Confirmed','Deaths','Recovered']].sum()
country_data.sort_index(inplace=True)

# Daily change per country: diff() within each country, so the first day of each country is NaN
daily_change = country_data.groupby(level='Country')[['Confirmed','Deaths','Recovered']].diff()
country_data['Country_New_Confirmed'] = daily_change['Confirmed']
country_data['Country_New_Deaths'] = daily_change['Deaths']
country_data['Country_New_Recovered'] = daily_change['Recovered']

country_data = country_data.reset_index()


In [None]:
# Top 15 countries by confirmed cases on the latest date
last_day = country_data['Date'].max()
severe_country = (country_data[country_data["Date"] == last_day]
                  .sort_values('Confirmed', ascending=False)
                  .head(15)["Country"].to_list())

# Keep only days with more than 100 confirmed cases and count the days since then, per country
severe_data = country_data[country_data.Country.isin(severe_country) & (country_data['Confirmed'] > 100)].copy()
severe_data = severe_data.sort_values(['Country', 'Date'])
severe_data["Day_after_100th"] = severe_data.groupby("Country").cumcount()

fig = px.line(severe_data, x="Day_after_100th", y="Confirmed", color='Country')
fig.update_layout(xaxis_rangeslider_visible=False, yaxis_type='log')
fig.update_layout(
    title= "US, Spain, and Italy are increasing faster than China was",
    xaxis_title="Day after 100th case",
    yaxis_title='Accumulated cases')
fig.show()

- The US, Spain, and Italy are surpassing China's case numbers and still show no sign of slowing down.

# Growing Speed By Country 

Rather than the accumulated statistics, we are more interested in the speed, that is, the daily change. I borrowed ideas from the following link (https://www.youtube.com/watch?v=Kas0tIxDvrg) and came up with several metrics to measure the growth of cases.

1. growing_ratio: the ratio between a day’s total confirmed cases and the total of the day before, which can be seen as the daily multiplier of the exponential growth. A value of one means there were no new cases that day.


2. growth_factor: the ratio between a day’s new confirmed cases and those of the day before. If the factor is one, the number of cases is growing at a constant speed. If it is larger than one, the speed is increasing; if it is smaller than one, the speed is decreasing. If it is zero, there were no new confirmed cases that day.

growth_factor is very sensitive to daily changes, while growing_ratio becomes insensitive once the base (the accumulated cases) gets too large. I used both for a more holistic view.
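
To make the two definitions concrete, here is a small sketch on made-up cumulative counts for one hypothetical country. It writes the ratios out explicitly; `pct_change() + 1` gives the same values.

```python
import pandas as pd

# Made-up cumulative confirmed cases for one hypothetical country
confirmed = pd.Series([100, 150, 225, 300, 360], name="Confirmed")

new_cases = confirmed.diff()                     # daily new cases
growing_ratio = confirmed / confirmed.shift(1)   # total today / total yesterday
growth_factor = new_cases / new_cases.shift(1)   # new today / new yesterday

summary = pd.DataFrame({
    "Confirmed": confirmed,
    "New": new_cases,
    "growing_ratio": growing_ratio,
    "growth_factor": growth_factor,
})
print(summary)
```

On the last day the total still grows (growing_ratio of 1.2), but growth_factor drops to 0.8, which shows the daily increase starting to shrink.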


At the beginning of a pandemic, the number of confirmed cases grows at an increasing rate (growing_ratio > 1 and growth_factor > 1). After reaching the inflection point, it grows at a decreasing rate (growth_factor < 1) and then approaches its end. 
The total number of confirmed cases is estimated to be about twice the number of cases at the inflection point.
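
The "twice the cases at the inflection point" rule holds exactly if the cumulative curve follows a logistic function, which is an assumption on my part rather than something the data guarantees. A minimal check, with hypothetical parameters `K` (final size), `r` (growth rate), and `t0` (midpoint):

```python
import numpy as np

# Hypothetical logistic parameters, chosen only for illustration
K, r, t0 = 10_000.0, 0.3, 40.0

t = np.arange(0, 101)
cumulative = K / (1.0 + np.exp(-r * (t - t0)))

# For a logistic curve, the growth rate is r * C * (1 - C / K)
rate = r * cumulative * (1.0 - cumulative / K)

# The inflection point is the day on which the growth rate peaks
inflection_day = int(t[np.argmax(rate)])
cases_at_inflection = cumulative[inflection_day]

print("inflection day:", inflection_day)
print("final size / cases at inflection:", K / cases_at_inflection)
```

Real outbreaks are rarely this symmetric, so the factor of two is best read as a rough estimate.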


In [None]:
def exponential_rate(Country, q_date='2020-1-1', lockdown=False):
    """Plot cumulative cases, daily new cases, growing_ratio, and the 5-day
    average growth_factor for one country, starting from its 100th confirmed case.
    If lockdown is True, a red vertical line marks q_date."""

    # .copy() so the new columns are added to an independent frame
    country = country_data[country_data["Country"] == Country]
    country = country[country["Confirmed"] >= 100].copy()
    country["growing_ratio"] = country["Confirmed"].pct_change() + 1
    country["growth_factor"] = country["Country_New_Confirmed"].pct_change() + 1
    country['Five_day_average_growth_factor'] = country["growth_factor"].rolling(window=5, min_periods=2).mean()
    country["Day_after_100th"] = range(len(country))

    fig = plt.figure(figsize=(12,12))

    # Cumulative confirmed cases
    ax1 = fig.add_subplot(411)
    country.plot(x="Date", y="Confirmed", kind='line', ax=ax1)

    # Daily new confirmed cases
    ax2 = fig.add_subplot(412)
    country.plot(x="Date", y="Country_New_Confirmed", kind='line', ax=ax2)

    # growing_ratio, with a reference line at 1
    ax3 = fig.add_subplot(413)
    country.plot(x="Date", y="growing_ratio", kind='line', ax=ax3)
    ax3.axhline(y=1, color="deepskyblue", linestyle='--')

    # 5-day average growth_factor, with a reference line at 1
    ax4 = fig.add_subplot(414)
    country.plot(x="Date", y="Five_day_average_growth_factor", kind='line', ax=ax4)
    ax4.axhline(y=1, color="deepskyblue", linestyle='--')
    ax4.set_ylim([0, 5])
    if lockdown:
        # Mark the lockdown date on the growth factor panel
        ax4.axvline(x=pd.to_datetime(q_date), color="red")

- This function shows the information by day after a country reached its 100th confirmed case, when the disease starts to spread at scale.
- The red line marks the date on which the country started its national lockdown.
- The blue dashed line marks growth_factor = 1; values below it indicate that the growth may be slowing down.
- To smooth out the large daily fluctuations in the growth factor, I used a moving average with a 5-day window.
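
A small sketch of the smoothing step on made-up growth factors with one artificial spike. The window of 5 and `min_periods=2` match the function; the values themselves are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up daily growth factors with one spike (hypothetical values)
growth_factor = pd.Series([np.nan, 1.2, 0.8, 4.0, 1.0, 1.0, 0.9])

# 5-day moving average; a value is produced once at least 2 observations are available
smoothed = growth_factor.rolling(window=5, min_periods=2).mean()

print(pd.DataFrame({"growth_factor": growth_factor, "smoothed": smoothed}))
```

The spike of 4.0 is pulled down to at most 2.0 in the smoothed series, at the cost of it lingering for as long as it stays inside the window.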


### South Korea

In [None]:
exponential_rate("South Korea")

- The growth seems to be slowing down, but there are still around 100 new cases per day.
- Consider using South Korea as a benchmark for further analysis, since it seems to have finished its first wave.
- March 12th holds the record for daily new cases; the figure was confirmed against local news and was caused by a cluster outbreak.

### Italy 

In [None]:
exponential_rate("Italy",'2020-03-09',lockdown = True)

- The speed of spread has stopped increasing, at roughly 6000 cases per day.
- The stabilization started 2 weeks after the national lockdown and could be a result of it.
- The extremely high growth rate on March 16 may be caused by the very low number of new daily cases on the previous day (March 15).

### Spain 

In [None]:
exponential_rate("Spain",'2020-03-14',lockdown = True)

- Sadly, I can't find any sign of slowing down.
- No significant effect of the lockdown so far.

### UK

In [None]:
exponential_rate("UK",'2020-03-16',lockdown = True)

### France 

In [None]:
exponential_rate("France",'2020-03-17',lockdown = True)

- The growth rate seems to be stabilizing; hopefully this is a sign of a future slowdown.

### Germany

In [None]:
exponential_rate("Germany", '2020-03-22',lockdown = True)

### US

In [None]:
# '2020-03-18' is the date they locked down NY
exponential_rate("US", '2020-03-18',lockdown = True)

### Iran

In [None]:
exponential_rate("Iran", '2020-01-01',lockdown = False)

- It seems like a second wave of the outbreak is about to start.