<h1 align= 'center'><b>An attempt to analyze the global Corona virus outbreak.</b></h1>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_rows= None
pd.options.display.max_columns= None

- Printing the first five rows of the dataset, followed by its shape and structure, to get a basic overview of the data we are dealing with.

In [None]:
df= pd.read_csv('../input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv', parse_dates= True, index_col= 'Sno')
df['Date']= pd.to_datetime(df['Date']).dt.date
df['Last Update']= pd.to_datetime(df['Last Update']).dt.date
df.set_index('Date', inplace= True)
df.head(5)

In [None]:
df.shape

In [None]:
df.describe()

# Do we have any missing values?

In [None]:
# Checking for NA values
print('Column\t\t#Missing')
df.isna().sum()

# Data seems to be cumulative here.
- So to get the latest figures, we should take the latest entry for each country & each province.
- We cannot simply sum the figures to produce the numbers.
- We also notice China has been repeated twice: once as China, then again as Mainland China.
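
To see why summing is misleading here, consider a minimal sketch. The toy frame and its numbers are assumptions made up for illustration, not rows from the dataset. Each row holds a running total, so summing all rows counts earlier days again, while taking the last entry per group gives the latest figure.

```python
import pandas as pd

# Hypothetical cumulative counts for two provinces over three days.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-24'] * 2),
    'Province/State': ['A'] * 3 + ['B'] * 3,
    'Confirmed': [1, 3, 6, 2, 2, 5],
}).sort_values('Date')

# Summing every row counts the earlier running totals again.
naive_total = toy['Confirmed'].sum()

# Taking the last row per province keeps only the latest running total.
latest = toy.groupby('Province/State')['Confirmed'].last()
latest_total = latest.sum()

print(naive_total, latest_total)
```

The gap between the two totals grows with every extra day in the data.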

# What is the most recent situation?

- Our approach is to present the global situation in two ways:
- first, for the countries that have provinces in the dataset, so we get a province-level understanding;
- then, globally, country by country.

### Country & province wise situation (for countries having provinces in the dataset):

In [None]:
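# Assign the result back: an inplace replace on a selected column may not update df in newer pandas versions
df['Country']= df['Country'].replace({'Mainland China': 'China'})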
recent_cp_df= df.groupby(['Country', 'Province/State']).last()
recent_cp_df

### Country wise situation globally:

In [None]:
recent_cp_df_c= recent_cp_df.groupby('Country').agg({'Confirmed': 'sum', 'Deaths': 'sum', 'Recovered': 'sum'})
recent_cp_df_c['Recovery Rate']= recent_cp_df_c['Recovered']/recent_cp_df_c['Confirmed']
recent_cp_df_c['Mortality Rate']= recent_cp_df_c['Deaths']/recent_cp_df_c['Confirmed']
# Keep only the countries that have no province-level rows
recent_c_df= df[~df['Country'].isin(recent_cp_df_c.index)]
    
recent_c_df= recent_c_df.groupby(['Country']).last()
recent_c_df.drop(['Province/State', 'Last Update'], axis= 1, inplace= True)
recent_c_df['Recovery Rate']= recent_c_df['Recovered']/recent_c_df['Confirmed']
recent_c_df['Mortality Rate']= recent_c_df['Deaths']/recent_c_df['Confirmed']

recent_df= pd.concat([recent_cp_df_c, recent_c_df], axis= 0)
recent_df

- We notice NA values for Brazil, Ivory Coast & Mexico. We can drop them since they have no confirmed cases.
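
The NA values come from dividing zero by zero when a country has no confirmed cases. A minimal sketch of one way to handle this, using a made-up two-row frame (the country labels and counts are assumptions for illustration only), is to filter on `Confirmed` before computing the rates:

```python
import pandas as pd

# Hypothetical per-country totals; 'X' has no confirmed cases.
toy = pd.DataFrame(
    {'Confirmed': [10, 0], 'Deaths': [1, 0], 'Recovered': [4, 0]},
    index=['W', 'X'],
)

# 0 / 0 evaluates to NaN in pandas, which is where the NA rows come from.
rates = toy['Recovered'] / toy['Confirmed']

# Keep only countries with at least one confirmed case, then compute the rates.
has_cases = toy[toy['Confirmed'] > 0].copy()
has_cases['Recovery Rate'] = has_cases['Recovered'] / has_cases['Confirmed']
has_cases['Mortality Rate'] = has_cases['Deaths'] / has_cases['Confirmed']

print(has_cases)
```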

# Confirmed vs Deaths vs Recovered for all countries except China

In [None]:
for i in ['Brazil', 'Ivory Coast', 'Mexico']:
    df= df[(df['Country']!=i)]

recent_df_nc= recent_df.drop(['China']).sort_values(['Confirmed'], ascending= False)
recent_df_nc

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
sns.barplot(x= recent_df_nc["Confirmed"], y= recent_df_nc.index, label="Confirmed", color="yellow").set_title('Global Corona outbreak stats for all countries except China', size= 20)
sns.barplot(x= recent_df_nc["Recovered"], y= recent_df_nc.index, label="Recovered", color="green")
sns.barplot(x= recent_df_nc["Deaths"], y= recent_df_nc.index, label="Deaths", color="red")
sns.despine(left= True)
ax.legend(ncol=3, loc="lower right")
ax.set(ylabel="Countries", xlabel="Values")

- Mostly the neighboring countries of China in Southeast Asia have been affected.
- The US has also been affected strongly in comparison; we might attribute this to the large number of Chinese immigrants or residents in the US.
- Also, in most of the affected countries, deaths due to Corona virus have not occurred, or have not been reported yet.

# What is the global situation?

In [None]:
print('Globally, these are the total numbers reported yet: ')
recent_df.agg({'Confirmed': 'sum', 'Deaths': 'sum', 'Recovered': 'sum', 'Recovery Rate': 'mean', 'Mortality Rate': 'mean'}).to_frame()

# What are the countries affected? How are they affected? How are they recovering?

In [None]:
clist= df['Country'].unique().tolist()

print('Following ' + str(len(clist)) + ' countries were affected: ')
print(clist)

- We check how the countries have been affected by sorting the number of confirmed cases, recoveries & deaths in each of them.

In [None]:
print('Sorted by confirmed cases:')
recent_df.sort_values(['Confirmed'], ascending= False)

### In terms of confirmed cases, it seems the East and Southeast Asian countries were affected the most.
- After China, which is the centre of the outbreak, the neighboring countries Japan, Singapore, Thailand & Hong Kong were affected the worst in terms of all 3 parameters.
- Among them, Singapore & Thailand seem to be handling the situation best: they have the highest numbers of confirmed cases, yet zero deaths & the highest numbers of recoveries.
- Australia, at the southernmost end, was also affected, but has a high recovery rate.
- Surprisingly, France has a high mortality rate despite not being an Asian country. We might want to look into it.

In [None]:
print('Sorted by Mortality rate:')
recent_df.sort_values(['Mortality Rate'], ascending= False)

### In terms of mortality rate, China seems to be quite low. This is a good indication.
- The Philippines has the highest mortality rate; however, it has few confirmed cases.

In [None]:
print('Sorted by recovery rate:')
recent_df.sort_values(['Recovery Rate'], ascending= True)

# We should be concerned about the countries having low recovery rates.
- Hong Kong, Canada, Taiwan & US have very poor recovery rates while having a significant number of confirmed cases.
- However, most countries in our dataset have high recovery rates, and that is a good indication.
- Some countries have recovery rate 1; among them India has the highest number of confirmed cases.

In [None]:
!pip install folium
import folium

# Visualizations: Corona Outbreak numbers all over the world

In [None]:
coord_df= pd.read_csv('../input/corona-analysis-files/world_coordinates.csv', index_col= 'Country')
coord_df.head()

In [None]:
recent_df= recent_df.join(coord_df, how= 'inner')
recent_df.drop(['Brazil', 'Mexico'], inplace= True)
recent_df

In [None]:
world_map = folium.Map(location=[35.861660, 80.195397], zoom_start= 3, tiles='Stamen Toner')
outbreaks = folium.map.FeatureGroup()

for lt, ln, nm, cnfrm, rec, mor in zip(recent_df['latitude'], recent_df['longitude'], recent_df.index, recent_df['Confirmed'], recent_df['Recovery Rate'], recent_df['Mortality Rate']):
    ss= '<b>Country: </b>' + nm + '<br><b>#Confirmed: </b>' + str(int(cnfrm)) + '<br><b>Recovery rate: </b>' + str(round(rec, 2)) + '<br><b>Mortality rate: </b>' + str(round(mor, 2))
    folium.Marker([lt, ln], popup= ss).add_to(world_map) 
    folium.CircleMarker([lt, ln], radius= 0.05*int(cnfrm), color= 'red').add_to(world_map) 
    
world_map

# Visualizations: Corona confirmed cases using Choropleths

In [None]:
wc = r'../input/corona-analysis-files/world_countries.json' # geojson file
tscale= np.linspace(0, recent_df['Confirmed'].max()+1, 6, dtype=int).tolist()

world_map = folium.Map(location=[35.861660, 104.195397], zoom_start=2, tiles='Stamen Toner')
world_map.choropleth(
    geo_data= wc,
    data= recent_df,
    columns=[recent_df.index, 'Confirmed'],
    key_on='feature.properties.name',
    threshold_scale= tscale,
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Corona Outbreak strength',
)

world_map

- China has the highest number of confirmed cases, which we knew earlier from the dataframes.
- We see that the southern parts of the world, e.g. South America & Africa remain largely unaffected.
- Mostly the northern and the eastern parts of the world have been affected.
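
The colour bins in the choropleth come from `np.linspace`, which produces evenly spaced edges between 0 and the maximum. The sketch below uses made-up counts (assumptions, not dataset values) to show that, when one value is far above the rest, evenly spaced edges tend to place almost everything else in the lowest bin. This may be why most countries look similar on the map.

```python
import numpy as np

max_confirmed = 1000  # hypothetical maximum, not a value from the dataset

# Six evenly spaced edges give five colour bins; the +1 keeps the maximum inside the last bin.
tscale = np.linspace(0, max_confirmed + 1, 6, dtype=int).tolist()

# Hypothetical confirmed counts: three small values and one dominant value.
values = np.array([3, 15, 40, 1000])

# np.digitize returns, for each value, the 1-based index of the bin it falls into.
bins = np.digitize(values, tscale)

print(tscale, bins.tolist())
```

A log-spaced or quantile-based scale is one possible alternative when the values span several orders of magnitude.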

# How is China coping with this outbreak?

In [None]:
china_recent_p_df= recent_cp_df.loc[['China']].reset_index(level= 0, drop= True)

print('Following ' + str(china_recent_p_df.shape[0]) + ' Chinese provinces were affected: ')
print(china_recent_p_df.index.values)

- China, being a very large country, has many provinces. We try to figure out where it all started.

In [None]:
china_recent_p_df['Recovery Rate']= china_recent_p_df['Recovered']/china_recent_p_df['Confirmed']
china_recent_p_df['Mortality Rate']= china_recent_p_df['Deaths']/china_recent_p_df['Confirmed']

china_recent_p_df_s= china_recent_p_df.sort_values(['Confirmed'], ascending= False)
china_recent_p_df_s

# Corona outbreak effect on Hubei vs other provinces in China

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
sns.set(style="whitegrid")
sns.barplot(x= china_recent_p_df_s["Confirmed"], y= china_recent_p_df_s.index, label="Confirmed", color="yellow").set_title('Corona outbreak stats for all affected provinces in China', size= 20)
sns.barplot(x= china_recent_p_df_s["Recovered"], y= china_recent_p_df_s.index, label="Recovered", color="green")
sns.barplot(x= china_recent_p_df_s["Deaths"], y= china_recent_p_df_s.index, label="Deaths", color="red")
sns.despine(left= True, bottom= True)
ax.legend(ncol=3, loc="lower right")
ax.set(ylabel="Chinese provinces", xlabel="Values")

- This gives us an idea about how severely the province of Hubei has been affected, in comparison to other Chinese provinces, or even other countries.
- We notice how the number of deaths in Hubei alone outnumbers the confirmed cases in all the other provinces. This is very unfortunate, to say the least.

In [None]:
china_recent_p_df_s.drop('Hubei', inplace= True)

f, ax = plt.subplots(figsize=(20, 10))
sns.barplot(x= china_recent_p_df_s["Confirmed"], y= china_recent_p_df_s.index, label="Confirmed", color="yellow").set_title('Corona outbreak stats for all affected provinces in China except Hubei', size= 20)
sns.barplot(x= china_recent_p_df_s["Recovered"], y= china_recent_p_df_s.index, label="Recovered", color="green")
sns.barplot(x= china_recent_p_df_s["Deaths"], y= china_recent_p_df_s.index, label="Deaths", color="red")
sns.despine(left= True, bottom= True)
ax.legend(ncol=3, loc="lower right")
ax.set(ylabel="Provinces other than Hubei", xlabel="Values")

- We notice how, once Hubei is dropped, the right limit of the x-axis drops from 60,000 to almost 1,400.
- Deaths have been reported in quite a few Chinese provinces, but these figures are quite low.
- In almost all of the Chinese provinces, the number of recoveries is also quite strong, since the green portion takes up almost half of the bar length in each case. However, in the previous chart, the green portion for Hubei failed to cover even a quarter of the whole length.
- In almost all the parameters, Hubei has been affected the worst.

# Conclusion: The Hubei province can be assumed to be the origin of the virus. 
- Wuhan, as we know, the centre of the outbreak, is also the capital of the Hubei province, which has the most confirmed cases.
- Hubei has a worryingly low recovery rate, compared to the other provinces.
- Hong Kong, as listed under China in this dataset, has 0 confirmed cases; we may drop that row.

In [None]:
china_recent_p_df.drop(['Hong Kong'], inplace= True)
china_recent_p_df.sort_values(['Recovery Rate'])

- Among the provinces, Inner Mongolia, Hubei, Xinjiang, Heilongjiang & Guangxi have extremely low recovery rates (<0.25).
- Other provinces like Anhui, Jiangxi, Guangdong, Henan, Zhejiang, Hunan have almost 1000 or more confirmed cases, but have moderately low recovery rates (<0.5).
- Provinces Taiwan, Macau, Gansu & Qinghai seem relatively unaffected by this outbreak; they have high recovery rates.
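
The thresholds used above (<0.25 and <0.5) can also be applied programmatically. This is a minimal sketch with `pd.cut` on made-up rates; the province labels, the values and the band name 'high' for rates of 0.5 and above are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical recovery rates, not values from the dataset.
rates = pd.Series({'P1': 0.10, 'P2': 0.30, 'P3': 0.80}, name='Recovery Rate')

# right=False makes each band include its lower edge and exclude its upper edge,
# matching the "<0.25" and "<0.5" wording; np.inf keeps a rate of exactly 1 in the last band.
bands = pd.cut(
    rates,
    bins=[0, 0.25, 0.5, np.inf],
    labels=['extremely low', 'moderately low', 'high'],
    right=False,
)

print(bands)
```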

In [None]:
china_recent_p_df.sort_values(['Mortality Rate'], ascending= False)

- The province of Hubei has the highest mortality rate.
- We can assume the provinces having high mortality rates to be closer to Hubei, & those having low mortality rates away from Hubei. We shall check the same using Choropleth visualizations.
- The Zhejiang province, despite having a large number of confirmed cases, does not have a single confirmed death, and it has a high recovery rate.

# Visualizations: Corona confirmed cases using Choropleth in China:

In [None]:
wc = r'../input/corona-analysis-files/china.json' # geojson file
tscale= np.linspace(china_recent_p_df['Confirmed'].min(), china_recent_p_df['Confirmed'].max()+1, 6, dtype=int).tolist()

world_map = folium.Map(location=[35.861660, 105.195397], zoom_start= 4, tiles='Mapbox Bright')
world_map.choropleth(
    geo_data= wc,
    data= china_recent_p_df,
    columns=[china_recent_p_df.index, 'Confirmed'],
    key_on='feature.properties.name',
    threshold_scale= tscale,
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Corona Outbreak strength in China'
)

world_map

In [None]:
clatlon= pd.read_csv('../input/corona-analysis-files/China_Provinces_LatLon.csv', index_col= 'Province/State')
clatlon.drop(['Unnamed: 0'], axis= 1, inplace= True)

In [None]:
china_recent_p_df= china_recent_p_df.join(clatlon, how= 'inner')
china_recent_p_df

# Visualizations: Corona Outbreak numbers in China

In [None]:
world_map = folium.Map(location=[35.861660, 110.195397], zoom_start= 5, tiles='Stamen Toner')
outbreaks = folium.map.FeatureGroup()
    
for lt, ln, cd, cnfrm, rec, mor in zip(china_recent_p_df['LAT'], china_recent_p_df['LON'], china_recent_p_df.index, china_recent_p_df['Confirmed'], china_recent_p_df['Recovery Rate'], china_recent_p_df['Mortality Rate']):
    ss= '<b>Province:</b> ' + cd + '<br><b>#Confirmed: </b>' + str(int(cnfrm)) + '<br><b>Recovery rate: </b>' + str(round(rec, 2)) + '<br><b>Mortality rate: </b>' + str(round(mor, 2))
    folium.Marker([lt, ln], popup= ss).add_to(world_map)    
    folium.CircleMarker([lt, ln], radius= 0.05*int(cnfrm), color= 'red').add_to(world_map) 
    
world_map

- This suggests that our assumption about high mortality rates in provinces neighboring Hubei holds.
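
A map gives a visual impression only. One way to check the distance assumption numerically is to compute each province's great-circle distance from a reference point and compare the ranking of distances with the ranking of mortality rates. The sketch below is a rough illustration under stated assumptions: the coordinates, the rates and the reference point are all made up, and `haversine_km` is a helper defined here, not part of any library.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical coordinates and rates, for illustration only.
toy = pd.DataFrame(
    {'LAT': [30.0, 30.0, 30.0],
     'LON': [113.0, 116.0, 120.0],
     'Mortality Rate': [0.03, 0.01, 0.005]},
    index=['P1', 'P2', 'P3'],
)
ref_lat, ref_lon = 30.0, 112.0  # hypothetical reference point standing in for Hubei

toy['Distance km'] = haversine_km(ref_lat, ref_lon, toy['LAT'], toy['LON'])

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = toy['Distance km'].rank().corr(toy['Mortality Rate'].rank())

print(toy, rho)
```

A value of `rho` close to -1 would be consistent with mortality falling as distance grows; with only a few provinces, such a figure should be read with caution.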

# How did the virus spread over time globally?

In [None]:
spread_df= df.groupby(df.index)['Country'].nunique().to_frame()

f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
sns.lineplot(x= spread_df.index, y= spread_df['Country']).set_title('Number of Countries affected by Corona virus over time', size= 20)
sns.despine(left= True)
ax.set(ylabel="Values", xlabel="Timeline")

In [None]:
over_time_df= df.groupby(df.index).agg({'Confirmed': 'sum', 'Deaths': 'sum', 'Recovered': 'sum'})
over_time_df['Recovery rate']= over_time_df['Recovered']/over_time_df['Confirmed']
over_time_df['Mortality rate']= over_time_df['Deaths']/over_time_df['Confirmed']

f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
sns.lineplot(x= over_time_df.index, y=  over_time_df['Confirmed'], label= 'Confirmed', color= 'blue').set_title("#Confirmed Cases, Deaths & Recoveries over time all over the world", size= 20)
sns.lineplot(x= over_time_df.index, y=  over_time_df['Deaths'], label= 'Deaths', color= 'red')
sns.lineplot(x= over_time_df.index, y=  over_time_df['Recovered'], label= 'Recovered', color= 'green')
sns.despine(left= True)
ax.legend(ncol=3, loc="upper left")
ax.set(ylabel="Values", xlabel="Timeline", )

- All figures in this graph are very much influenced by the figures from China; we shall see the proof later.
- While we have not been able to contain the outbreak yet, it is a good indicator that the number of recoveries is increasing with time.
- The number of deaths is still increasing, but very slowly. The slope is very gradual. We will examine this further.
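
Since the series are cumulative, the slope is easier to judge from the daily increments than from the curve itself. This is a minimal sketch with made-up counts (assumptions, not dataset values) that uses `Series.diff` to recover the new deaths per day:

```python
import pandas as pd

# Hypothetical cumulative death counts by day.
cumulative = pd.Series(
    [10, 25, 45, 70],
    index=pd.to_datetime(['2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04']),
    name='Deaths',
)

# diff() subtracts the previous day's total; the first day has no predecessor, so it is dropped.
daily_new = cumulative.diff().dropna()

print(daily_new)
```

If the daily increments keep growing, the cumulative curve is still accelerating even when it looks flat next to the confirmed cases.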

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
sns.lineplot(x= over_time_df.index, y=  over_time_df['Recovery rate'], label= 'Recovery rate', color= 'green').set_title("Mortality rate & Recovery rate over time all over the world", size= 20)
sns.lineplot(x= over_time_df.index, y=  over_time_df['Mortality rate'], label= 'Mortality rate', color= 'red')
sns.despine(left= True)
ax.legend(ncol=2, loc="upper left")
ax.set(ylabel="Values", xlabel="Timeline")

- All figures in this graph are very much influenced by the figures from China; we shall see the proof later.
- This data is from all the affected countries including China; that is why we are not able to obtain high recovery rates despite some countries having 100% recovery rate.
- At the end of January, the recovery rate was overshadowed by the mortality rate.
- Around the beginning of February the recovery rates started to spike up, while the mortality rates show a gradual decline.
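
The global rate here is pooled: total recoveries divided by total confirmed cases. That differs from the average of the per-country rates, because the country with the most cases dominates the pooled figure. A minimal sketch with made-up numbers (the labels and counts are assumptions for illustration):

```python
import pandas as pd

# Hypothetical totals: one large outbreak with few recoveries, one small fully recovered one.
toy = pd.DataFrame(
    {'Confirmed': [1000, 2], 'Recovered': [100, 2]},
    index=['Large', 'Small'],
)

# Per-country rates, then their unweighted average.
per_country = toy['Recovered'] / toy['Confirmed']
mean_of_rates = per_country.mean()

# Pooled rate: every case counts equally, so the large outbreak dominates.
pooled_rate = toy['Recovered'].sum() / toy['Confirmed'].sum()

print(mean_of_rates, pooled_rate)
```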

# How did the virus spread over time in China?

In [None]:
# Take a copy so that columns added later do not trigger SettingWithCopyWarning
china_over_time_df= df[(df['Country']=='China')].copy()
np_china_over_time_df= china_over_time_df.groupby(china_over_time_df.index).agg({'Confirmed': 'sum', 'Deaths': 'sum', 'Recovered': 'sum'})
np_china_over_time_df['Recovery rate']= np_china_over_time_df['Recovered']/np_china_over_time_df['Confirmed']
np_china_over_time_df['Mortality rate']= np_china_over_time_df['Deaths']/np_china_over_time_df['Confirmed']

f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
sns.lineplot(x= np_china_over_time_df.index, y=  np_china_over_time_df['Confirmed'], label= 'Confirmed', color= 'blue').set_title("#Confirmed Cases, Deaths & Recoveries over time in China", size= 20)
sns.lineplot(x= np_china_over_time_df.index, y=  np_china_over_time_df['Deaths'], label= 'Deaths', color= 'red')
sns.lineplot(x= np_china_over_time_df.index, y=  np_china_over_time_df['Recovered'], label= 'Recovered', color= 'green')
sns.despine(left= True)
ax.legend(ncol=3, loc="upper left")
ax.set(ylabel="Values", xlabel="Timeline", )

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
sns.lineplot(x= np_china_over_time_df.index, y=  np_china_over_time_df['Recovery rate'], label= 'Recovery rate', color= 'green').set_title("Mortality rate & Recovery rate over time in China", size= 20)
sns.lineplot(x= np_china_over_time_df.index, y=  np_china_over_time_df['Mortality rate'], label= 'Mortality rate', color= 'red')
sns.despine(left= True)
ax.legend(ncol=2, loc="upper left")
ax.set(ylabel="Values", xlabel="Timeline")

- Both the above plots for China mirror the global trend in all three parameters that we plotted earlier. This again shows how strongly China's figures drive the global numbers.

## We try to plot the virus outbreak in Chinese provinces over time.
- We cannot plot all the 26 Chinese provinces into a single legible graph.
- We shall consider only those provinces having a significant number of confirmed cases: Hubei, Guangdong, Henan, Zhejiang, Hunan & Anhui.
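
The provinces above were picked by reading the sorted table. As a rough alternative, the same kind of shortlist can be derived with `Series.nlargest`. The sketch below uses made-up province labels and counts (assumptions, not dataset values):

```python
import pandas as pd

# Hypothetical latest confirmed counts per province.
latest = pd.Series({'P1': 500, 'P2': 40, 'P3': 1200, 'P4': 300}, name='Confirmed')

# Pick the three provinces with the most confirmed cases, largest first.
top = latest.nlargest(3).index.tolist()

print(top)
```

The resulting list could then drive a plotting loop, so that the chart follows the data if the ranking changes.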

In [None]:
# 'Province/State' is kept as a regular column because the plots below filter on it
china_over_time_df['Mortality rate']= china_over_time_df['Deaths']/china_over_time_df['Confirmed']
china_over_time_df['Recovery rate']= china_over_time_df['Recovered']/china_over_time_df['Confirmed']
china_over_time_df.fillna(0, inplace= True)

f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
# One line per province, each with its own color
provinces= {'Hubei': 'red', 'Guangdong': 'green', 'Henan': 'blue', 'Zhejiang': 'black', 'Hunan': 'pink'}
for province, color in provinces.items():
    province_df= china_over_time_df[china_over_time_df['Province/State']==province]
    sns.lineplot(data= province_df, x= province_df.index, y= 'Confirmed', label= province, color= color)
ax.set_title("Comparison of confirmed cases in Hubei vs other Chinese provinces", size= 20)

sns.despine(left= True)
ax.legend(ncol=5, loc="upper left")
ax.set(ylabel="Values", xlabel="Timeline", )

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
ax.grid(True)
# Same plot without Hubei, so the other provinces are readable
other_provinces= {'Guangdong': 'green', 'Henan': 'blue', 'Zhejiang': 'black', 'Hunan': 'pink', 'Anhui': 'red'}
for province, color in other_provinces.items():
    province_df= china_over_time_df[china_over_time_df['Province/State']==province]
    sns.lineplot(data= province_df, x= province_df.index, y= 'Confirmed', label= province, color= color)
ax.set_title("Confirmed cases over time in Chinese provinces except Hubei", size= 20)

sns.despine(left= True)
ax.legend(ncol=5, loc="upper left")
ax.set(ylabel="Values", xlabel="Timeline", )