## COVID-19 Analysis, Visualization and Forecasting

Coronaviruses are a family of viruses that cause illness ranging from the common cold and cough to more severe disease. Middle East Respiratory Syndrome (MERS-CoV) and Severe Acute Respiratory Syndrome (SARS-CoV) are severe cases that the world has already faced.

SARS-CoV-2 (n-coronavirus) is a new member of the coronavirus family. It was first discovered in 2019 and had not been identified in humans before. It is a contagious virus that started spreading from Wuhan in December 2019 and was later declared a pandemic by the WHO because of its rapid spread throughout the world. Currently (as of 29 Aug 2020), it has led to a total of 900K+ deaths across the globe.

As the pandemic spreads all over the world, it becomes more important to understand this spread. This notebook is an effort to analyze the cumulative data on confirmed, death, and recovered cases over time. The main focus is to analyze the trend of the virus's spread across the world and to forecast it.


In [None]:
### Downloading and Installing Prerequisite
!pip install pycountry_convert
!pip install folium
!pip install calmap
!pip install altair
!pip install fbprophet==0.6
!pip install pmdarima



In [2]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker 
import pycountry_convert as pc
import folium
from datetime import datetime, timedelta,date
import plotly.express as px
import json, requests
import calmap

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Sourcing and loading data

### 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE ([LINK](https://github.com/CSSEGISandData/COVID-19)) 
<hr>
The dataset consists of time-series data from 22 January 2020 to date, updated daily.<br>
**Three time-series datasets (deprecated):**
* time_series_19-covid-Confirmed.csv ([Link Raw File](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv))
* time_series_19-covid-Deaths.csv ([Link Raw File](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv))
* time_series_19-covid-Recovered.csv ([Link Raw File](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv))

**New Time-series dataset:**
* time_series_covid19_confirmed_global.csv ([Link Raw File](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv))
* time_series_covid19_deaths_global.csv ([Link Raw File](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv))

**New Dataset (Updated more frequently by web crawler of JHU):**
* cases_country.csv ([Link Raw File](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv))

In [3]:
# Retrieving Dataset
df_confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df_deaths = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')

# Deprecated
df_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")
#df_covid19 = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv")
#df_table = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_time.csv",parse_dates=['Last_Update'])

In [4]:
df_confirmed.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20,2/26/20,...,9/25/20,9/26/20,9/27/20,9/28/20,9/29/20,9/30/20,10/1/20,10/2/20,10/3/20,10/4/20,10/5/20,10/6/20,10/7/20,10/8/20,10/9/20,10/10/20,10/11/20,10/12/20,10/13/20,10/14/20,10/15/20,10/16/20,10/17/20,10/18/20,10/19/20,10/20/20,10/21/20,10/22/20,10/23/20,10/24/20,10/25/20,10/26/20,10/27/20,10/28/20,10/29/20,10/30/20,10/31/20,11/1/20,11/2/20,11/3/20
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,...,39186,39192,39227,39233,39254,39268,39285,39290,39297,39341,39422,39486,39548,39616,39693,39703,39799,39870,39928,39994,40026,40073,40141,40200,40287,40357,40510,40626,40687,40768,40833,40937,41032,41145,41268,41334,41425,41501,41633,41728
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,13045,13153,13259,13391,13518,13649,13806,13965,14117,14266,14410,14568,14730,14899,15066,15231,15399,15570,15752,15955,16212,16501,16774,17055,17350,17651,17948,18250,18556,18858,19157,19445,19729,20040,20315,20634,20875,21202,21523,21904
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,...,50754,50914,51067,51213,51368,51530,51690,51847,51995,52136,52270,52399,52520,52658,52804,52940,53072,53325,53399,53584,53777,53998,54203,54402,54616,54829,55081,55357,55630,55880,56143,56419,56706,57026,57332,57651,57942,58272,58574,58979
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1836,1836,1836,1966,1966,2050,2050,2110,2110,2110,2370,2370,2568,2568,2696,2696,2696,2995,2995,3190,3190,3377,3377,3377,3623,3623,3811,3811,4038,4038,4038,4325,4410,4517,4567,4665,4756,4825,4888,4910
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,4590,4672,4718,4797,4905,4972,5114,5211,5370,5402,5530,5725,5725,5958,6031,6246,6366,6488,6680,6846,7096,7222,7462,7622,7829,8049,8338,8582,8829,9026,9381,9644,9871,10074,10269,10558,10805,11035,11228,11577


## Preprocessing

In [5]:
df_confirmed = df_confirmed.rename(columns={"Province/State":"state","Country/Region": "country"})
df_deaths = df_deaths.rename(columns={"Province/State":"state","Country/Region": "country"})
df_recovered = df_recovered.rename(columns={"Province/State":"state","Country/Region": "country"})

## Changing the country names as required by the pycountry_convert library
country_renames = {
    "US": "USA",
    "Korea, South": "South Korea",
    "Congo (Kinshasa)": "Democratic Republic of the Congo",
    "Cote d'Ivoire": "Côte d'Ivoire",
    "Reunion": "Réunion",
    "Congo (Brazzaville)": "Republic of the Congo",
    "Bahamas, The": "Bahamas",
    "Gambia, The": "Gambia",
}

# Apply the same renames to all three tables
for df in (df_confirmed, df_deaths, df_recovered):
    df["country"] = df["country"].replace(country_renames)




## Merging confirmed, deaths and recovered

### Melting dataframes
1. Use `state`, `country`, `Lat`, `Long` (the renamed `Province/State` and `Country/Region` columns plus the coordinates) as identifier variables.
2. Unpivot the date columns (`df_confirmed.columns[4:]`) into a variable column `Date` and a value column `Confirmed`.
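
The two steps above can be sketched on a toy table first. This is only an illustration: the frame `wide` and its values are made up, and only the column layout mirrors the JHU files.

```python
import pandas as pd

# Toy wide-format table shaped like the JHU time series (values are made up).
wide = pd.DataFrame({
    "state": [None, None],
    "country": ["A", "B"],
    "Lat": [1.0, 2.0],
    "Long": [3.0, 4.0],
    "1/22/20": [0, 1],
    "1/23/20": [2, 5],
})

id_cols = ["state", "country", "Lat", "Long"]
date_cols = wide.columns[4:]  # everything after the identifier columns

# One row per (location, date) instead of one column per date.
long = wide.melt(id_vars=id_cols, value_vars=date_cols,
                 var_name="Date", value_name="Confirmed")
print(long.shape)  # (4, 6): 2 locations x 2 dates, 4 id columns + Date + Confirmed
```

The row count of the melted frame is always the number of locations times the number of date columns, which is a quick sanity check on the shapes printed below.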

In [6]:
dates = df_confirmed.columns[4:]
confirmed_df_melt = df_confirmed.melt(
    id_vars=['state', 'country', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)


deaths_df_melt = df_deaths.melt(
    id_vars=['state', 'country', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)


recovered_df_melt = df_recovered.melt(
    id_vars=['state', 'country', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Recovered'
)

print(confirmed_df_melt.shape)
print(deaths_df_melt.shape)
print(recovered_df_melt.shape)


(76916, 6)
(76916, 6)
(73185, 6)


In [7]:
confirmed_df_melt

Unnamed: 0,state,country,Lat,Long,Date,Confirmed
0,,Afghanistan,33.939110,67.709953,1/22/20,0
1,,Albania,41.153300,20.168300,1/22/20,0
2,,Algeria,28.033900,1.659600,1/22/20,0
3,,Andorra,42.506300,1.521800,1/22/20,0
4,,Angola,-11.202700,17.873900,1/22/20,0
...,...,...,...,...,...,...
76911,,West Bank and Gaza,31.952200,35.233200,11/3/20,55408
76912,,Western Sahara,24.215500,-12.885800,11/3/20,10
76913,,Yemen,15.552727,48.516388,11/3/20,2063
76914,,Zambia,-13.133897,27.849332,11/3/20,16661


In [8]:
recovered_df_melt

Unnamed: 0,state,country,Lat,Long,Date,Recovered
0,,Afghanistan,33.939110,67.709953,1/22/20,0
1,,Albania,41.153300,20.168300,1/22/20,0
2,,Algeria,28.033900,1.659600,1/22/20,0
3,,Andorra,42.506300,1.521800,1/22/20,0
4,,Angola,-11.202700,17.873900,1/22/20,0
...,...,...,...,...,...,...
73180,,West Bank and Gaza,31.952200,35.233200,11/3/20,47744
73181,,Western Sahara,24.215500,-12.885800,11/3/20,8
73182,,Yemen,15.552727,48.516388,11/3/20,1375
73183,,Zambia,-13.133897,27.849332,11/3/20,15763


In [9]:
## Removing Canada from the recovered data: Canada's recovered counts are reported country-wise rather than Province/State-wise, so they do not line up with the other tables
recovered_df_melt=recovered_df_melt[recovered_df_melt['country']!= 'Canada']
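
A granularity mismatch like the Canadian one could be detected by comparing how many rows each file holds per country. The sketch below uses two hypothetical miniature tables (`confirmed`, `recovered`) rather than the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the confirmed and recovered tables.
confirmed = pd.DataFrame({
    "state": ["Alberta", "Ontario", None],
    "country": ["Canada", "Canada", "Chile"],
})
recovered = pd.DataFrame({
    "state": [None, None],
    "country": ["Canada", "Chile"],
})

# Number of rows (locations) each table reports per country.
rows_confirmed = confirmed.groupby("country").size()
rows_recovered = recovered.groupby("country").size()

# Countries whose row counts differ are reported at different granularity.
counts = pd.concat([rows_confirmed, rows_recovered], axis=1,
                   keys=["confirmed", "recovered"]).fillna(0).astype(int)
mismatched = counts[counts["confirmed"] != counts["recovered"]].index.tolist()
print(mismatched)  # ['Canada']
```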

In [10]:
## merging  3 dataframes one after another
full_table = pd.merge(left=confirmed_df_melt, right=deaths_df_melt, how='left',
                      on=['state', 'country', 'Date', 'Lat', 'Long'])

full_table = pd.merge(left=full_table, right=recovered_df_melt, how='left',
                      on=['state', 'country', 'Date', 'Lat', 'Long'])
full_table.sample(4)

Unnamed: 0,state,country,Lat,Long,Date,Confirmed,Deaths,Recovered
20113,Tasmania,Australia,-42.8821,147.3272,4/6/20,86,2,26.0
44116,,Libya,26.3351,17.228331,7/4/20,989,27,258.0
17676,Montserrat,United Kingdom,16.742498,-62.187366,3/27/20,5,0,0.0
8629,,Central African Republic,6.6111,20.9394,2/23/20,0,0,0.0


In [11]:
# convert to proper date format
full_table['Date'] = pd.to_datetime(full_table['Date'])
# checking for missing values
full_table.isnull().sum()

state        53669
country          0
Lat              0
Long             0
Date             0
Confirmed        0
Deaths           0
Recovered     5453
dtype: int64
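
The date conversion in the cell above lets pandas infer the format. Since the JHU column labels look like `1/22/20`, passing the format explicitly is one way to rule out a day/month mix-up. A minimal sketch on two sample labels:

```python
import pandas as pd

raw_dates = pd.Series(["1/22/20", "11/3/20"])

# %m/%d/%y: month/day/two-digit year, matching the JHU column labels.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2020-01-22', '2020-11-03']
```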

In [12]:
#fill na with 0
full_table['Recovered']=full_table['Recovered'].fillna(0)
##Handling the missing values
full_table[['state']]=full_table[['state']].fillna('None')

In [13]:

#checking datatypes
full_table.dtypes

state                object
country              object
Lat                 float64
Long                float64
Date         datetime64[ns]
Confirmed             int64
Deaths                int64
Recovered           float64
dtype: object

In [14]:
#fixing dtypes
full_table['Recovered']=full_table['Recovered'].astype(int)


In [15]:
full_table.sample(6)

Unnamed: 0,state,country,Lat,Long,Date,Confirmed,Deaths,Recovered
38512,"Bonaire, Sint Eustatius and Saba",Netherlands,12.1784,-68.2385,2020-06-13,7,0,7
4802,,Ukraine,48.3794,31.1656,2020-02-08,0,0,0
64563,,Turkey,38.9637,35.2433,2020-09-18,299810,7377,264805
45879,Saskatchewan,Canada,52.9399,-106.4509,2020-07-11,815,15,0
1168,,Cuba,21.521757,-77.781167,2020-01-26,0,0,0
67047,Nova Scotia,Canada,44.682,-63.7443,2020-09-28,1087,65,0


In [16]:
# Grouped by day and country (column selection must be a list, not a bare tuple)
datewise = full_table.groupby(['Date', 'country'])[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()

In [17]:
#Calculating the Mortality Rate, Recovery Rate,active and closed cases
datewise["Mortality Rate"]=(datewise["Deaths"]/datewise["Confirmed"])*100
datewise["Recovery Rate"]=(datewise["Recovered"]/datewise["Confirmed"])*100
datewise["Active Cases"]=datewise["Confirmed"]-datewise["Recovered"]-datewise["Deaths"]
datewise["Closed Cases"]=datewise["Recovered"]+datewise["Deaths"]

print("Average Mortality Rate",datewise["Mortality Rate"].mean())
print("Median Mortality Rate",datewise["Mortality Rate"].median())
print("Average Recovery Rate",datewise["Recovery Rate"].mean())
print("Median Recovery Rate",datewise["Recovery Rate"].median())

datewise.sample(10)


Average Mortality Rate 3.063768225305565
Median Mortality Rate 1.941123537729785
Average Recovery Rate 53.102635885792715
Median Recovery Rate 59.37431634215707


Unnamed: 0,Date,country,Confirmed,Deaths,Recovered,Mortality Rate,Recovery Rate,Active Cases,Closed Cases
6896,2020-02-27,Estonia,1,0,0,0.0,0.0,1,0
28027,2020-06-17,Liberia,516,33,240,6.395349,46.511628,243,273
50074,2020-10-11,Malawi,5821,180,4647,3.092252,79.831644,994,4827
44943,2020-09-14,Madagascar,15769,213,14411,1.350751,91.388167,1145,14624
10007,2020-03-14,Norway,1090,3,1,0.275229,0.091743,1086,4
43457,2020-09-06,Qatar,120095,203,116998,0.169033,97.421208,2894,117201
24460,2020-05-29,Russia,387623,4374,159257,1.128416,41.085539,223992,163631
4332,2020-02-13,Singapore,58,0,15,0.0,25.862069,43,15
8608,2020-03-07,Ethiopia,0,0,0,,,0,0
33942,2020-07-18,New Zealand,1553,22,1506,1.416613,96.973599,25,1528


In [18]:
#filling missing value
colms=['Mortality Rate','Recovery Rate']
datewise[colms]=datewise[colms].fillna(0)
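
The missing rates come from rows where `Confirmed` is 0, so the division is 0/0. An alternative, sketched here on a made-up two-row frame, is to mask zero denominators before dividing; this would also cover a non-zero count divided by zero, which yields `inf` rather than `NaN` and is not touched by `fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Confirmed": [0, 200], "Deaths": [0, 50]})

# Plain division: 0/0 becomes NaN, which is why a fillna step is needed.
naive = toy["Deaths"] / toy["Confirmed"] * 100

# Alternative: turn zero denominators into NaN first, then fill once.
denom = toy["Confirmed"].replace(0, np.nan)
safe = (toy["Deaths"] / denom * 100).fillna(0)
print(safe.tolist())  # [0.0, 25.0]
```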

In [19]:
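# Aggregated number of cases per date.
# Note: the summed "Mortality Rate" and "Recovery Rate" columns are sums of per-country
# percentages, not global rates; global rates have to be derived from the summed counts.
datewise_agg=datewise.groupby(["Date"]).agg({"Confirmed":'sum',"Recovered":'sum',"Deaths":'sum',"Mortality Rate":'sum',
                                        "Recovery Rate":'sum',"Active Cases":'sum',"Closed Cases":'sum' })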

In [20]:
datewise_agg.tail()

Date,Confirmed,Recovered,Deaths,Mortality Rate,Recovery Rate,Active Cases,Closed Cases
2020-10-30,45594203,30364554,1188584,440.486659,13472.326376,14041065,31553138
2020-10-31,46070822,30607515,1195142,439.307437,13477.683791,14268165,31802657
2020-11-01,46502095,30863386,1200056,437.999617,13492.222278,14438653,32063442
2020-11-02,46959365,31140611,1206119,436.930225,13492.481864,14612635,32346730
2020-11-03,47405395,31388109,1213735,435.429517,13483.28849,14803551,32601844
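
The `Mortality Rate` and `Recovery Rate` columns in this table are sums of per-country percentages, which is why they exceed 100. A global rate can instead be computed from the summed counts. The sketch below does this for the last row of the table (2020-11-03); the frame name `totals` is only illustrative.

```python
import pandas as pd

# Last aggregated row shown above (Confirmed, Recovered, Deaths on 2020-11-03).
totals = pd.DataFrame(
    {"Confirmed": [47405395], "Recovered": [31388109], "Deaths": [1213735]},
    index=pd.to_datetime(["2020-11-03"]),
)

# Global rates are ratios of the summed counts, not sums of per-country rates.
totals["Mortality Rate"] = totals["Deaths"] / totals["Confirmed"] * 100
totals["Recovery Rate"] = totals["Recovered"] / totals["Confirmed"] * 100
print(totals[["Mortality Rate", "Recovery Rate"]].round(2))
```

For that date this gives a global mortality rate of roughly 2.6% and a recovery rate of roughly 66%.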


In [21]:
## average daily increase in cases
print("Average increase in number of Confirmed Cases every day: ",np.round(datewise_agg["Confirmed"].diff().fillna(0).mean()))
print("Average increase in number of Recovered Cases every day: ",np.round(datewise_agg["Recovered"].diff().fillna(0).mean()))
print("Average increase in number of Deaths Cases every day: ",np.round(datewise_agg["Deaths"].diff().fillna(0).mean()))

Average increase in number of Confirmed Cases every day:  165174.0
Average increase in number of Recovered Cases every day:  109366.0
Average increase in number of Deaths Cases every day:  4229.0


In [22]:
# new cases
temp = datewise.groupby(['country', 'Date'])[['Confirmed', 'Deaths', 'Recovered']]
temp = temp.sum().diff().reset_index()

mask = temp['country'] != temp['country'].shift(1)

temp.loc[mask, 'Confirmed'] = np.nan
temp.loc[mask, 'Deaths'] = np.nan
temp.loc[mask, 'Recovered'] = np.nan

# renaming columns
temp.columns = ['country', 'Date', 'New cases', 'New deaths', 'New recovered']

# merging new values

df_covid19 = pd.merge(datewise, temp, on=['country', 'Date'])

# filling na with 0

df_covid19 = df_covid19.fillna(0)

# fixing data types

cols = ['New cases', 'New deaths', 'New recovered']
df_covid19[cols] = df_covid19[cols].astype('int')
# set negative new-case values (e.g. from data corrections) to zero
df_covid19['New cases'] = df_covid19['New cases'].apply(lambda x: 0 if x<0 else x)

In [23]:
df_covid19.sample(4)

Unnamed: 0,Date,country,Confirmed,Deaths,Recovered,Mortality Rate,Recovery Rate,Active Cases,Closed Cases,New cases,New deaths,New recovered
42796,2020-09-03,Denmark,17800,626,15892,3.516854,89.280899,1282,16518,180,0,91
86,2020-01-22,Japan,2,0,0,0.0,0.0,2,0,0,0,0
38546,2020-08-11,Syria,1327,53,0,3.993971,0.0,1274,53,72,1,0
19572,2020-05-04,Algeria,4648,465,1998,10.004303,42.986231,2185,2463,174,2,62
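
The daily figures above are built with a global `diff()` followed by a mask at country boundaries. A more compact alternative, shown here only as a sketch on a made-up frame, is to take the difference within each country via `groupby(...).diff()` and clip negative values.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "Confirmed": [0, 3, 2, 10, 10, 15],  # country A has a downward correction on day 3
})
toy = toy.sort_values(["country", "Date"])

# diff() within each country; the first day of every country has no predecessor.
new_cases = toy.groupby("country")["Confirmed"].diff().fillna(0)

# Negative values come from cumulative counts going down; clip them to zero.
toy["New cases"] = new_cases.clip(lower=0).astype(int)
print(toy["New cases"].tolist())  # [0, 3, 0, 0, 0, 5]
```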


In [24]:
# Scrape population data from Worldometer; the table is saved to CSV further below
import requests
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/world-population/population-by-country/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
countries = soup.find_all("table")[0]
pop_data = pd.read_html(str(countries))[0]

# Keep only the columns needed downstream, in a fixed order
pop_columns = ['Country (or dependency)', 'Population (2020)', 'Yearly Change', 'Net Change', 'Density (P/Km²)',
               'Land Area (Km²)', 'Migrants (net)', 'Fert. Rate', 'Med. Age', 'Urban Pop %', 'World Share']
pop_data = pop_data[pop_columns].copy()



In [25]:
pwd

'/content'

In [26]:
pop_data_c =pop_data.to_csv(r'/content/global_pop_data.csv',index=False)

In [27]:
#loading the world population file
world_population=pd.read_csv('global_pop_data.csv')

#subsetting
world_population = world_population[['Country (or dependency)', 'Population (2020)', 'Density (P/Km²)', 'Land Area (Km²)', 'Med. Age', 'Urban Pop %']]
world_population.columns = ['Country (or dependency)', 'Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']

#Replace United States by USA
world_population.loc[world_population['Country (or dependency)']=='United States', 'Country (or dependency)'] = 'USA'

# Remove the % character from Urban Pop values
world_population['Urban Pop'] = world_population['Urban Pop'].str.rstrip('%')

## Replace Urban Pop and Med Age "N.A" by their respective modes, then transform to int
world_population.loc[world_population['Urban Pop']=='N.A.', 'Urban Pop'] = int(world_population.loc[world_population['Urban Pop']!='N.A.', 'Urban Pop'].mode()[0])
world_population['Urban Pop'] = world_population['Urban Pop'].astype('int16')
world_population.loc[world_population['Med Age']=='N.A.', 'Med Age'] = int(world_population.loc[world_population['Med Age']!='N.A.', 'Med Age'].mode()[0])
world_population['Med Age'] = world_population['Med Age'].astype('int16')

#now join dataset to previous data set
final_data=pd.merge(
    left=df_covid19,
    right=world_population,
    left_on='country',
    right_on='Country (or dependency)',
    how='left'
)

#dropping country(or dependency data)
final_dataset=final_data.drop('Country (or dependency)',axis=1)
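
Because this is a left join on country names, any name spelled differently in the two sources silently ends up with missing population values. One way to list such names is the `indicator` option of `merge`. The sketch below uses small made-up frames; the population figures are placeholders, not real data.

```python
import pandas as pd

covid = pd.DataFrame({"country": ["USA", "South Korea", "Chile"]})
population = pd.DataFrame({
    "Country (or dependency)": ["USA", "Chile", "Peru"],
    "Population (2020)": [100, 200, 300],  # placeholder values
})

# indicator=True adds a "_merge" column telling where each row was found.
merged = covid.merge(population, left_on="country",
                     right_on="Country (or dependency)",
                     how="left", indicator=True)

# Rows present only on the left side have no population match.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)  # ['South Korea']
```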


In [28]:
final_dataset.tail(15)

Unnamed: 0,Date,country,Confirmed,Deaths,Recovered,Mortality Rate,Recovery Rate,Active Cases,Closed Cases,New cases,New deaths,New recovered,Population (2020),Density,Land Area,Med Age,Urban Pop
54515,2020-11-03,Turkey,382118,10481,328824,2.74287,86.052999,42813,339305,2343,79,1817,84339067.0,110.0,769630.0,32.0,76.0
54516,2020-11-03,USA,9382617,232620,3705130,2.479266,39.489302,5444867,3937750,91530,1130,30149,331002651.0,36.0,9147420.0,38.0,83.0
54517,2020-11-03,Uganda,13099,115,7612,0.87793,58.111306,5372,7727,128,1,56,45741007.0,229.0,199810.0,17.0,26.0
54518,2020-11-03,Ukraine,423683,7749,176220,1.828962,41.592417,239714,183969,9116,165,5340,43733762.0,75.0,579320.0,41.0,69.0
54519,2020-11-03,United Arab Emirates,136149,503,133490,0.369448,98.046993,2156,133993,1008,6,1466,9890402.0,118.0,83600.0,33.0,86.0
54520,2020-11-03,United Kingdom,1077099,47340,2906,4.395139,0.269799,1026853,50246,20078,397,56,67886011.0,281.0,241930.0,40.0,83.0
54521,2020-11-03,Uruguay,3196,61,2727,1.908636,85.325407,408,2788,31,1,69,3473730.0,20.0,175020.0,36.0,96.0
54522,2020-11-03,Uzbekistan,67553,574,64815,0.849703,95.946886,2164,65389,299,3,349,33469203.0,79.0,425400.0,28.0,50.0
54523,2020-11-03,Venezuela,93100,810,87941,0.870032,94.458647,4349,88751,395,4,394,28435940.0,32.0,882050.0,30.0,57.0
54524,2020-11-03,Vietnam,1202,35,1069,2.911814,88.935108,98,1104,10,0,4,97338579.0,314.0,310070.0,32.0,38.0


I will use the `df_covid19` dataset for exploratory data analysis and `final_dataset` for modelling, as it has more features.

In [29]:
#getting all countries
countries = np.asarray(df_confirmed["country"])
countries1 = np.asarray(df_covid19["country"])

# Continent code to continent name
continents = {
    'NA': 'North America',
    'SA': 'South America',
    'AS': 'Asia',
    'OC': 'Australia',
    'AF': 'Africa',
    'EU': 'Europe',
    'na': 'Others'
}

# Function for getting the continent code of a country.
def country_to_continent_code(country):
    try:
        return pc.country_alpha2_to_continent_code(pc.country_name_to_country_alpha2(country))
    except Exception:
        # names that pycountry_convert cannot resolve fall into 'Others'
        return 'na'

# Collecting continent information
df_confirmed.insert(2,"continent", [continents[country_to_continent_code(country)] for country in countries[:]])
df_deaths.insert(2,"continent",  [continents[country_to_continent_code(country)] for country in countries[:]])
#df_recovered.insert(2,"continent",  [continents[country_to_continent_code(country)] for country in countries[:]])   
df_covid19.insert(1,"continent",  [continents[country_to_continent_code(country)] for country in countries1[:]])
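
The list comprehensions above call the lookup once per row, which for `df_covid19` means tens of thousands of calls for a few hundred distinct names. A possible alternative is to resolve each distinct country once and broadcast the result with `Series.map`. In the sketch below, `lookup_continent` is a hypothetical stand-in for the pycountry_convert lookup, and the frame is made up.

```python
import pandas as pd

def lookup_continent(country):
    """Hypothetical stand-in for the pycountry_convert based lookup."""
    table = {"Turkey": "Asia", "Uganda": "Africa"}
    return table.get(country, "Others")  # unresolved names fall into 'Others'

df = pd.DataFrame({"country": ["Turkey", "Uganda", "Turkey", "Diamond Princess"]})

# Resolve each distinct name once, then broadcast with Series.map.
continent_by_country = {c: lookup_continent(c) for c in df["country"].unique()}
df.insert(1, "continent", df["country"].map(continent_by_country))
print(df["continent"].tolist())  # ['Asia', 'Africa', 'Asia', 'Others']
```

Counting the rows labelled `Others` afterwards is a quick way to see which names the lookup could not resolve.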


In [30]:

# checking data with continent
df_covid19.tail(15)

Unnamed: 0,Date,continent,country,Confirmed,Deaths,Recovered,Mortality Rate,Recovery Rate,Active Cases,Closed Cases,New cases,New deaths,New recovered
54515,2020-11-03,Asia,Turkey,382118,10481,328824,2.74287,86.052999,42813,339305,2343,79,1817
54516,2020-11-03,North America,USA,9382617,232620,3705130,2.479266,39.489302,5444867,3937750,91530,1130,30149
54517,2020-11-03,Africa,Uganda,13099,115,7612,0.87793,58.111306,5372,7727,128,1,56
54518,2020-11-03,Europe,Ukraine,423683,7749,176220,1.828962,41.592417,239714,183969,9116,165,5340
54519,2020-11-03,Asia,United Arab Emirates,136149,503,133490,0.369448,98.046993,2156,133993,1008,6,1466
54520,2020-11-03,Europe,United Kingdom,1077099,47340,2906,4.395139,0.269799,1026853,50246,20078,397,56
54521,2020-11-03,South America,Uruguay,3196,61,2727,1.908636,85.325407,408,2788,31,1,69
54522,2020-11-03,Asia,Uzbekistan,67553,574,64815,0.849703,95.946886,2164,65389,299,3,349
54523,2020-11-03,South America,Venezuela,93100,810,87941,0.870032,94.458647,4349,88751,395,4,394
54524,2020-11-03,Asia,Vietnam,1202,35,1069,2.911814,88.935108,98,1104,10,0,4


In [31]:
df_covid19.to_csv(r'/content/drive/My Drive/capstone1/df_covid19.csv',index=False)
final_dataset.to_csv(r'/content/drive/My Drive/capstone1/final_dataset.csv',index=False)
df_confirmed.to_csv(r'/content/drive/My Drive/capstone1/df_confirmed.csv',index=False)
df_deaths.to_csv(r'/content/drive/My Drive/capstone1/df_deaths.csv',index=False)