
## **Table of Contents**
### **Step_0: Data Exploration**

### **Step_1: Data preparation**

### **Step_2: Data Analysis**

### **Step_3: Covid Data Visualization: bar charts, pie charts, flattened cumulative charts & world map showing the prevalence of covid-19**



In [None]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import folium
from keras.preprocessing import image


## **Step_0: Data Exploration**

In [None]:
covid_df = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')
covid_df.head()

In [None]:
df_confirmed= pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
df_deaths= pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv')
df_recovered= pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_recovered.csv')

In [None]:
df_confirmed.head(3)

In [None]:
df_deaths.head(3)

In [None]:
df_recovered.head(3)

#### **- Explore the differences between covid_19_data.csv and the other datasets**

In [None]:
countries_1= set(covid_df['Country/Region'])
countries_2= set(df_confirmed['Country/Region'])
print(len(countries_1))
print(len(countries_2))

In [None]:
diff1=[]
for item in countries_1:
    if item not in countries_2:
        diff1.append(item)
print(len(diff1))
diff1
        

In [None]:
diff2=[]
for item in countries_2:
    if item not in countries_1:
        diff2.append(item)
print(len(diff2))
diff2

=> **As we can see, the main difference lies in how the names are written. For example, "Cote d'Ivoire" is written in French in the other datasets, but in English ("Ivory Coast") in the first dataset. There are also other kinds of differences: for example, "Guadeloupe", "Martinique" and "Reunion" are overseas departments and territories of France that are listed separately in the first dataset, while in the other datasets they are included in France.**
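The comparison above can also be written with set operations, and the naming differences can be reconciled with a small lookup table. The following is only a minimal sketch on toy sets: the names `names_a`, `names_b` and `name_map` are illustrative assumptions, and the mapping shown is hypothetical, not a complete list for the real data.

```python
# Toy country sets standing in for countries_1 and countries_2 (illustrative values only)
names_a = {"Ivory Coast", "France", "Martinique", "Tunisia"}
names_b = {"Cote d'Ivoire", "France", "Tunisia"}

# Set difference replaces the explicit loops
only_in_a = sorted(names_a - names_b)
only_in_b = sorted(names_b - names_a)

# Hypothetical mapping: spelling variants and territories folded into one name
name_map = {"Ivory Coast": "Cote d'Ivoire", "Martinique": "France"}
harmonized = {name_map.get(name, name) for name in names_a}

print(only_in_a)
print(only_in_b)
print(harmonized == names_b)
```

Whether a territory should be folded into its parent country depends on the analysis, so the mapping is a choice to make explicitly rather than a fixed rule.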

#### **- Explore why the length of df_recovered is different from that of df_confirmed (and df_deaths)**

In [None]:
print(df_confirmed.shape)
print(df_deaths.shape)
print(df_recovered.shape)

In [None]:
set1= set(df_confirmed['Country/Region'])
set2= set(df_recovered['Country/Region'])

In [None]:
print(len(set1))
print(len(set2))
# As we see, the number of unique countries is the same in both, so this is not the source of the difference

In [None]:
liste1=[]
for item in set1:
    if len(df_confirmed.loc[df_confirmed['Country/Region']== item])!=1:
           liste1.append(item)
print(liste1)
print(len(liste1))

In [None]:
liste2=[]
for item in set2:
    if len(df_recovered.loc[df_recovered['Country/Region']== item])!=1:
           liste2.append(item)
print(liste2)
print(len(liste2))

In [None]:
for item in liste1:
    if item in liste2:
        continue
    else:
        print(item)

=> **As we see below, Canada has many Province/State entries in df_confirmed (and likewise in df_deaths), but only one entry in df_recovered: this is the only difference between the datasets.**

In [None]:
df_confirmed[df_confirmed['Country/Region']=='Canada']

In [None]:
df_recovered[df_recovered['Country/Region']=='Canada']
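The same check can be done for all countries at once by counting the rows each country occupies in each dataframe. This is a sketch on made-up toy frames (`confirmed_toy` and `recovered_toy` are assumed names, not the real data), meant only to show the idea.

```python
import pandas as pd

# Toy frames (made-up rows) mimicking the layout: one row per province, or one per country
confirmed_toy = pd.DataFrame({
    "Province/State": ["Alberta", "Ontario", None],
    "Country/Region": ["Canada", "Canada", "Tunisia"],
})
recovered_toy = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Canada", "Tunisia"],
})

# Number of rows each country occupies in each frame
rows = pd.DataFrame({
    "confirmed": confirmed_toy["Country/Region"].value_counts(),
    "recovered": recovered_toy["Country/Region"].value_counts(),
}).fillna(0).astype(int)

# Countries whose row counts differ between the two frames
mismatch = rows[rows["confirmed"] != rows["recovered"]]
print(mismatch)
```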

#### **- Explore the columns**

In [None]:
df_conf_col= df_confirmed.keys()[4:]
df_deaths_col = df_deaths.keys()[4:]
df_recov_col = df_recovered.keys()[4:]
if (df_conf_col.equals(df_deaths_col)) & (df_conf_col.equals(df_recov_col)):
    print("The same columns in the 3 datasets")


## **Step_1: Data preparation**

#### **- Transform the dataframes, so that they have the same shape**

In [None]:
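# Sum the daily counts per country, but average Lat/Long instead of summing them
agg_confirmed= {'Lat': 'mean', 'Long': 'mean', **{c: 'sum' for c in df_confirmed.columns[4:]}}
df_confirmed= df_confirmed.groupby('Country/Region').agg(agg_confirmed).reset_index()
df_confirmed.shape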

In [None]:
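# Sum the daily counts per country, but average Lat/Long instead of summing them
agg_deaths= {'Lat': 'mean', 'Long': 'mean', **{c: 'sum' for c in df_deaths.columns[4:]}}
df_deaths= df_deaths.groupby('Country/Region').agg(agg_deaths).reset_index()
df_deaths.shape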

In [None]:
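# Sum the daily counts per country, but average Lat/Long instead of summing them
agg_recovered= {'Lat': 'mean', 'Long': 'mean', **{c: 'sum' for c in df_recovered.columns[4:]}}
df_recovered= df_recovered.groupby('Country/Region').agg(agg_recovered).reset_index()
df_recovered.shape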

In [None]:
df_confirmed.head()

In [None]:
df_deaths.head()

In [None]:
df_recovered.head()

#### **- Combine the dataframes, into one, containing 3 columns: confirmed, deaths, recovered**

In [None]:
dates= df_confirmed.columns[3:]
dates

In [None]:
df_confirmed= df_confirmed.melt(id_vars= ['Country/Region','Lat','Long'],
                               value_vars= dates,
                               var_name= 'Date',
                               value_name= 'confirmed' )
df_confirmed

In [None]:
df_deaths= df_deaths.melt(id_vars= ['Country/Region','Lat','Long'],
                         value_vars = dates,
                         var_name ='Date',
                         value_name = 'deaths')
df_deaths

In [None]:
df_recovered= df_recovered.melt(id_vars= ['Country/Region','Lat','Long'],
                               value_vars= dates,
                                var_name= 'Date',
                                value_name='recovered')
df_recovered

In [None]:
#concatenate the three dataframes into one:
df_covid= pd.concat([df_confirmed,df_deaths['deaths'],df_recovered['recovered']], axis=1)
df_covid.head()
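The positional `pd.concat` above relies on the three melted dataframes having exactly the same rows in the same order. As an alternative sketch, the long tables can be merged on their keys, which does not depend on row order. The toy tables and the helper `to_long` below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy wide tables (made-up numbers): one row per country, one column per date
wide_conf = pd.DataFrame({"Country/Region": ["A", "B"], "1/22/20": [1, 0], "1/23/20": [3, 2]})
wide_dead = pd.DataFrame({"Country/Region": ["B", "A"], "1/22/20": [0, 0], "1/23/20": [1, 0]})

def to_long(wide, value_name):
    """Melt a wide table into one row per (country, date)."""
    return wide.melt(id_vars="Country/Region", var_name="Date", value_name=value_name)

# Merging on the keys does not depend on the two frames sharing a row order
long_df = to_long(wide_conf, "confirmed").merge(
    to_long(wide_dead, "deaths"), on=["Country/Region", "Date"], how="outer"
)
print(long_df)
```

With `how="outer"`, a country missing from one table shows up as a row with null values, which the null check can then reveal.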

In [None]:
#we check if we have null values
df_covid.isnull().sum()

In [None]:
df_covid.info()
# As we see in the output below, 'Date' has the type 'object' (string)

#### **- Convert "Date" column to datetime format, so that we could perform a sort based on dates**

In [None]:
# Vectorized parsing of dates such as '2/27/21'
df_covid['Date']= pd.to_datetime(df_covid['Date'], format='%m/%d/%y')

In [None]:
df_covid.head()

In [None]:
df_covid.info()
# The 'Date' column now has the type datetime64

## **Step_2: Data analysis**

#### **- Display the daily number of confirmed, recovered cases and deaths**

In [None]:
df_covid

In [None]:
# Choose one random day (picked once, before the lookup) and display all rows for it
df_covid_group= df_covid.groupby('Date')
random_date= df_covid['Date'].sample(1).iloc[0]
print(random_date)
print("****************************************************")
print(df_covid_group.get_group(random_date))
    

#### **- Display the daily number of confirmed, recovered cases and deaths per country**

In [None]:
df_covid_country = df_covid.groupby(['Date','Country/Region']).agg(
    {'confirmed': 'sum', 'deaths': 'sum', 'recovered': 'sum'})

In [None]:
df_covid_country

#### **- Get data for a specific day & a specific country**

In [None]:
df_covid_country.loc[(datetime.datetime.strptime('02/27/21','%m/%d/%y'),'Tunisia')]

In [None]:
df_covid_country.loc[(datetime.datetime.strptime('02/27/21','%m/%d/%y'), 'US')]

In [None]:
df_covid_country.loc[(datetime.datetime.strptime('02/27/21','%m/%d/%y'), 'China')]

In [None]:
df_covid_country.loc[(datetime.datetime.strptime('02/27/21','%m/%d/%y'), 'France')]

In [None]:
df_covid_country.loc[(datetime.datetime.strptime('02/27/21','%m/%d/%y'), 'Morocco')]
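The lookups above index a dataframe with a two-level index (date, country). The sketch below, on a made-up toy table, shows the same kind of lookup with `pd.Timestamp`, plus a cross-section that returns every date for one country. The names `toy` and `by_day_country` are assumptions for illustration.

```python
import pandas as pd

# Toy long table with made-up numbers
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-02-26", "2021-02-26", "2021-02-27", "2021-02-27"]),
    "Country/Region": ["Tunisia", "France", "Tunisia", "France"],
    "confirmed": [10, 100, 12, 110],
})
by_day_country = toy.groupby(["Date", "Country/Region"]).agg({"confirmed": "sum"})

# One (date, country) pair: pd.Timestamp parses the date string directly
one_cell = by_day_country.loc[(pd.Timestamp("2021-02-27"), "Tunisia"), "confirmed"]

# Every date for one country: take a cross-section on the second index level
tunisia = by_day_country.xs("Tunisia", level="Country/Region")
print(one_cell)
print(tunisia)
```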

#### **- Get the Top 10: highest number of deaths, confirmed & recovered cases**

In [None]:
# The counts are cumulative, so the maximum per country is used as its total
df_covid_2= df_covid.groupby('Country/Region').agg({'confirmed': 'max', 'deaths': 'max', 'recovered': 'max'})
df_covid_2

In [None]:
# Top 10 highest confirmed cases (Prevalence)
top_10_confirmed= df_covid_2.sort_values(by='confirmed',ascending=False)[0:10]
top_10_confirmed

In [None]:
# Top 10 highest recovered cases 
top_10_recovered= df_covid_2.sort_values(by='recovered',ascending=False)[0:10]
top_10_recovered

In [None]:
# Top 10 highest deaths
top_10_deaths= df_covid_2.sort_values(by='deaths',ascending=False)[0:10]
top_10_deaths
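Sorting and slicing, as in the cells above, can also be expressed with `DataFrame.nlargest`. The short sketch below uses a made-up `totals` table (an assumed name) and checks that both approaches select the same rows when there are no ties.

```python
import pandas as pd

# Toy per-country totals (made-up numbers)
totals = pd.DataFrame(
    {"confirmed": [50, 300, 120, 80], "deaths": [5, 9, 30, 1]},
    index=["A", "B", "C", "D"],
)

# Sort-and-slice, as in the notebook, next to the nlargest shortcut
top_sorted = totals.sort_values(by="deaths", ascending=False)[0:2]
top_nlargest = totals.nlargest(2, "deaths")

print(top_nlargest)
print(top_sorted.equals(top_nlargest))
```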

## **Step_3: Covid Data Visualization: bar charts, pie charts, flattened cumulative charts & world map showing the prevalence of covid-19**

In [None]:
#just for recall
df_covid_country

#### **1- bar charts for confirmed, recovered cases and deaths for some countries**

In [None]:
for country, df_country in df_covid_country.groupby(level=1):
    if country in ['Tunisia','Algeria','Morocco','France','US','China','India','Korea, South']:
        dates= list(df_country.index.get_level_values('Date'))
        confirmed= list(df_country.confirmed)
        recovered= list(df_country.recovered)
        deaths= list(df_country.deaths)
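        # Labels are needed so that plt.legend() has entries to show
        plt.bar(dates, confirmed, color='blue', label='Confirmed')
        plt.bar(dates, recovered, color='green', label='Recovered')
        plt.bar(dates, deaths, color='red', label='Deaths')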
        plt.xlabel('Dates')
        plt.ylabel('Number of people')
        plt.title(country)
        plt.legend()
        plt.show()

#### **2- Pie charts**

In [None]:
#Just for recall
top_10_confirmed.head(3)

In [None]:
def plot_pie_covid(df, column,title):
    
    labels_countries= list(df.index)
    values= df[column].values
    explode= [0] * len(df)  # one entry per slice, whatever the number of rows
    
    with plt.style.context({'axes.prop_cycle': plt.cycler('color',plt.cm.tab20.colors)}):
        fig,ax= plt.subplots(figsize=(12,6))
        ax.pie(values, explode= explode,labels= labels_countries, autopct='%1.0f%%')
        ax.axis('equal')
        plt.legend(loc=1)
        plt.title(title,fontsize=15)
    plt.show()
    

In [None]:
plot_pie_covid(top_10_confirmed, 'confirmed','Covid_19:Top_10 highest confirmed cases(Last update 2021)')

In [None]:
plot_pie_covid(top_10_recovered, 'recovered','Covid_19:Top_10 highest recovered cases\
(Last update 2021)')

In [None]:
plot_pie_covid(top_10_deaths, 'deaths','Covid_19:Top_10 highest deaths(Last update 2021)')

#### **3- Flatten the cumulative curves for some countries**

**Note:**

The recorded numbers (confirmed cases, recovered cases and deaths) are all **cumulative**, which makes it difficult to tell **whether the situation in a country is getting better or worse.** The code below therefore **"unrolls"** the cumulative numbers to get the new numbers **reported for each day**:

In [None]:
for country, df_country in df_covid_country.groupby(level=1): 
    
    if country in ['Tunisia','Algeria','Morocco','France','China','India','Korea, South']:
        
        dates = list(df_country.index.get_level_values('Date'))
        confirmed = list(df_country.confirmed)
        recovered = list(df_country.recovered)
        deaths = list(df_country.deaths)
    
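        # Build the frame from the numeric columns only, so diff() is not applied to the dates
        df = pd.DataFrame({'confirmed': confirmed, 'deaths': deaths, 'recovered': recovered})
        
        # Day-over-day difference; the first row has no predecessor, so it keeps its cumulative value
        df_unrolled = df.diff().fillna(df) 
        df_unrolled['Date'] = dates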
        
        plt.figure(figsize=(12,6))
        plt.plot(dates, df_unrolled['confirmed'], color='blue', 
             label='Confirmed cases per day')
        plt.plot(dates, df_unrolled['recovered'], color='green',
                label= 'Recovered cases per day')
        plt.plot(dates, df_unrolled['deaths'], color='red', 
             label='Deaths per day')
        plt.xlabel('Dates')
        plt.ylabel('Number of people')
        plt.title(country)
        plt.legend()
        plt.show()

##### ***Important note: the peaks that fall below 0***
We notice that some countries, such as France, have peaks that fall below 0. This is probably due to adjustments made to the numbers. For example, the recorded cumulative number may be 8000 one day but be adjusted down to 6000 the next day (due to errors in tests, records, etc.), which makes the daily difference negative.
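The effect can be reproduced on a small made-up series. The sketch below uses the 8000 to 6000 adjustment from the example; the other values are invented, and `cumulative`, `daily` and `corrections` are illustrative names.

```python
import pandas as pd

# Made-up cumulative counts; the drop from 8000 to 6000 imitates a downward correction
cumulative = pd.Series(
    [5000, 7000, 8000, 6000, 6500],
    index=pd.date_range("2021-01-01", periods=5),
)

# Day-over-day change; the first day keeps its cumulative value
daily = cumulative.diff().fillna(cumulative)

# Days where the cumulative total went down
corrections = daily[daily < 0]
print(daily.tolist())
print(corrections)
```

Listing the negative days this way makes it easy to inspect them before deciding how to treat them in a chart.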


#### **4- Show the confirmed, recovered cases and deaths on the world map**

In [None]:
df_covid.head()

In [None]:
most_recent_date= df_covid['Date'].max()
# Chain set_index on the filtered frame instead of modifying a slice in place
df_covid_3= df_covid[df_covid.Date==most_recent_date].set_index('Country/Region')

In [None]:
df_covid_3.head()

In [None]:
folium_map= folium.Map(location=[40.738,-73.98], zoom_start=4)
color= '#E37222'
scale= 50000
for place in df_covid_3.index:
    lat= df_covid_3.loc[place]['Lat']
    long= df_covid_3.loc[place]['Long']
    confirmed= df_covid_3.loc[place]['confirmed']
    recovered= df_covid_3.loc[place]['recovered']
    deaths= df_covid_3.loc[place]['deaths']
    
    marker_confirmed= folium.CircleMarker(location=[lat,long],
                                         radius= confirmed/scale,
                                         color='blue',
                                         fill=True)
    marker_confirmed.add_to(folium_map)
    
    marker_recovered= folium.CircleMarker(location=[lat,long],
                                         radius= recovered/scale,
                                         color='green',
                                         fill=True)
    marker_recovered.add_to(folium_map)
    
    radius_deaths= deaths/scale if deaths >0 else 0.000000001
    marker_deaths= folium.CircleMarker(location=[lat, long],
                                 radius=radius_deaths,
                                 color="red",
                                 fill=True)
    marker_deaths.add_to(folium_map)
                                    
folium_map.save('./Covid-19 Map.html')                                    

In [None]:
image_path= '../input/world-map-covid19/world_map covid19.jpg'
img= image.load_img(image_path)
img