<a href="https://www.kaggle.com/code/rodolphojustino/covid-eda-for-gds?scriptVersionId=122280706" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Author: Rodolpho Justino

This notebook is part of an EDA study of COVID-19 in Brazil for the year 2021. The case data is available [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports), split into one file per day, and the vaccination data is available [here](https://covid.ourworldindata.org/data/owid-covid-data.csv) as a single file.

The dashboard built from the data produced by this code is available [here](https://lookerstudio.google.com/reporting/814d5ca8-7dd2-45b3-bff9-ba5e494d2470).


In [1]:
import math 
from typing import Iterator
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

## Infection Data

Because the case data is split into multiple files, one per day, we need to iterate over a defined date range.

In [2]:
cases = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-12-2021.csv', sep=',')
cases

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2021-01-13 05:22:15,33.93911,67.709953,53584,2301,44608,6675,Afghanistan,137.647787,4.294192
1,,,,Albania,2021-01-13 05:22:15,41.15330,20.168300,64627,1252,38421,24954,Albania,2245.708527,1.937271
2,,,,Algeria,2021-01-13 05:22:15,28.03390,1.659600,102641,2816,69608,30217,Algeria,234.067409,2.743543
3,,,,Andorra,2021-01-13 05:22:15,42.50630,1.521800,8682,86,7930,666,Andorra,11236.653077,0.990555
4,,,,Angola,2021-01-13 05:22:15,-11.20270,17.873900,18343,422,15512,2409,Angola,55.811022,2.300605
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4007,,,Unknown,Ukraine,2021-01-13 05:22:15,,,0,0,0,0,"Unknown, Ukraine",0.000000,0.000000
4008,,,,Nauru,2021-01-13 05:22:15,-0.52280,166.931500,0,0,0,0,Nauru,0.000000,0.000000
4009,,,Niue,New Zealand,2021-01-13 05:22:15,-19.05440,-169.867200,0,0,0,0,"Niue, New Zealand",0.000000,0.000000
4010,,,,Tuvalu,2021-01-13 05:22:15,-7.10950,177.649300,0,0,0,0,Tuvalu,0.000000,0.000000


Why use an iterator for this case?

In this case there is no particular need for one. An iterator (here, a generator) produces one date at a time, so it uses less memory than a list, which holds every date at once. For a range of a few hundred days the difference is negligible either way.

The iterator is used here mainly to practice a different tool.
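
To make the memory trade-off concrete, here is a minimal sketch that builds the same dates lazily and eagerly. The generator has the same shape as the one used in this notebook; the names `lazy_dates` and `eager_dates` are only illustrative.

```python
import sys
from datetime import datetime, timedelta
from typing import Iterator


def date_range(start_date: datetime, end_date: datetime) -> Iterator[datetime]:
    """Yield one datetime per day, from start_date up to (not including) end_date."""
    for lag in range((end_date - start_date).days):
        yield start_date + timedelta(days=lag)


start, end = datetime(2021, 1, 1), datetime(2021, 12, 31)

lazy_dates = date_range(start, end)         # generator: no date computed yet
eager_dates = list(date_range(start, end))  # list: every date held in memory

# getsizeof reports only the container itself, not the datetime objects in it,
# but it is enough to show that the generator does not grow with the range.
print(sys.getsizeof(lazy_dates), sys.getsizeof(eager_dates))
print(len(eager_dates))
```

The exact byte counts depend on the Python version, so none are shown here; the point is only that the list grows with the number of days while the generator does not.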

In [3]:
def date_range(
    start_date: datetime,
    end_date: datetime) -> Iterator[datetime]:
    # Yields one date per day, starting at start_date; end_date itself is excluded
    date_range_days: int = (end_date - start_date).days
    for lag in range(date_range_days):
        yield start_date + timedelta(days=lag)

In [4]:
start_date = datetime(2021,1,1)
end_date = datetime(2021,12,31)
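
As a rough sketch of how a date range maps to daily file names, the hypothetical helper below (`daily_report_urls` is not part of the notebook) builds the URLs for a short range. It assumes the `MM-DD-YYYY.csv` naming used by the daily reports and the same end-exclusive behaviour as `date_range`.

```python
from datetime import datetime, timedelta
from typing import List

BASE_URL = (
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
    'csse_covid_19_data/csse_covid_19_daily_reports/{date}.csv'
)


def daily_report_urls(start_date: datetime, end_date: datetime) -> List[str]:
    """Hypothetical helper: one URL per day in [start_date, end_date)."""
    days = (end_date - start_date).days
    return [
        BASE_URL.format(date=(start_date + timedelta(days=lag)).strftime('%m-%d-%Y'))
        for lag in range(days)
    ]


urls = daily_report_urls(datetime(2021, 1, 1), datetime(2021, 1, 4))
for url in urls:
    print(url.rsplit('/', 1)[-1])  # file names only
```

Because the end date is excluded, a range ending on 2021-12-31 stops at the file for 2021-12-30, which is why the outputs in this notebook end on that day.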

Now, for each daily file, we keep only the rows for Brazil and drop the columns we do not need.

In [5]:
# Collect one DataFrame per day, then concatenate them once at the end
frames = []

for date in date_range(start_date = start_date, end_date = end_date):

    date_str = date.strftime('%m-%d-%Y')
    data_source_url = f'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{date_str}.csv'

    case = pd.read_csv(data_source_url, sep = ',')

    case = case.drop([
        'FIPS',
        'Admin2',
        'Last_Update',
        'Lat',
        'Long_',
        'Recovered',
        'Active',
        'Combined_Key',
        'Case_Fatality_Ratio'],
        axis = 1)

    case = case.query('Country_Region == "Brazil"').reset_index(drop = True)
    case['Date'] = pd.to_datetime(date.strftime('%Y-%m-%d'))

    frames.append(case)

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
cases = pd.concat(frames, ignore_index = True)

In [6]:
cases

Unnamed: 0,Province_State,Country_Region,Confirmed,Deaths,Incident_Rate,Date
0,Acre,Brazil,41689,796,4726.992352,2021-01-01
1,Alagoas,Brazil,105091,2496,3148.928928,2021-01-01
2,Amapa,Brazil,68361,926,8083.066602,2021-01-01
3,Amazonas,Brazil,201574,5295,4863.536793,2021-01-01
4,Bahia,Brazil,494684,9159,3326.039611,2021-01-01
...,...,...,...,...,...,...
9823,Roraima,Brazil,128793,2078,21261.355551,2021-12-30
9824,Santa Catarina,Brazil,1242654,20183,17343.904663,2021-12-30
9825,Sao Paulo,Brazil,4455011,155186,9701.879932,2021-12-30
9826,Sergipe,Brazil,278507,6057,12115.869171,2021-12-30


In [7]:
cases.query('Province_State == "Paraiba"').head()

Unnamed: 0,Province_State,Country_Region,Confirmed,Deaths,Incident_Rate,Date
14,Paraiba,Brazil,167062,3680,4157.708305,2021-01-01
41,Paraiba,Brazil,167615,3692,4171.470937,2021-01-02
68,Paraiba,Brazil,168044,3706,4182.147553,2021-01-03
95,Paraiba,Brazil,168179,3722,4185.507327,2021-01-04
122,Paraiba,Brazil,168545,3740,4194.616049,2021-01-05


In [8]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9828 entries, 0 to 9827
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Province_State  9828 non-null   object        
 1   Country_Region  9828 non-null   object        
 2   Confirmed       9828 non-null   int64         
 3   Deaths          9828 non-null   int64         
 4   Incident_Rate   9828 non-null   float64       
 5   Date            9828 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 460.8+ KB


There is no missing data in the DataFrame, so we can proceed with the analysis.

In [9]:
cases = cases.rename(
    columns = {'Province_State': 'State', 'Country_Region': 'Country'})
# Lowercase every column name in one step
cases.columns = cases.columns.str.lower()

Now we replace the state names with their correct spellings, including accents.

In [10]:
states_map = {
    'Amapa':'Amapá',
    'Ceara':'Ceará',
    'Espirito Santo': 'Espírito Santo',
    'Goias': 'Goiás',
    'Para': 'Pará',
    'Paraiba':'Paraíba',
    'Piaui':'Piauí',
    'Rondonia':'Rondônia',
    'Sao Paulo':'São Paulo'
}

cases['state'] = cases['state'].apply(lambda state: states_map.get(state, state))
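
For comparison, the same renaming can be written without a lambda. The sketch below uses a shortened, illustrative mapping and a made-up Series; it shows two vectorized pandas options that leave unmapped names untouched.

```python
import pandas as pd

# Shortened mapping and sample values, for illustration only
states_map = {'Amapa': 'Amapá', 'Sao Paulo': 'São Paulo'}
states = pd.Series(['Acre', 'Amapa', 'Sao Paulo'])

# map turns unmapped names into NaN; fillna restores the original value
via_map = states.map(states_map).fillna(states)

# replace only changes values that appear as keys in the mapping
via_replace = states.replace(states_map)

print(via_replace.tolist())
```

Both forms give the same result as the `apply` call; which one to use is mostly a matter of taste on a table this small.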

In [11]:
cases

Unnamed: 0,state,country,confirmed,deaths,incident_rate,date
0,Acre,Brazil,41689,796,4726.992352,2021-01-01
1,Alagoas,Brazil,105091,2496,3148.928928,2021-01-01
2,Amapá,Brazil,68361,926,8083.066602,2021-01-01
3,Amazonas,Brazil,201574,5295,4863.536793,2021-01-01
4,Bahia,Brazil,494684,9159,3326.039611,2021-01-01
...,...,...,...,...,...,...
9823,Roraima,Brazil,128793,2078,21261.355551,2021-12-30
9824,Santa Catarina,Brazil,1242654,20183,17343.904663,2021-12-30
9825,São Paulo,Brazil,4455011,155186,9701.879932,2021-12-30
9826,Sergipe,Brazil,278507,6057,12115.869171,2021-12-30


Adding month and year keys to the DataFrame

In [12]:
cases['month'] = cases['date'].apply(lambda date: date.strftime('%Y-%m'))
cases['year'] = cases['date'].apply(lambda date: date.strftime('%Y'))

To estimate the population of each state, we apply the rule of three to the incident rate, which is the number of confirmed cases per 100,000 people.

In [13]:
cases['population'] = round(100000 * (cases['confirmed'] / cases['incident_rate']))
cases = cases.drop('incident_rate', axis = 1)
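
The rule of three can be checked by hand on a single row. The sketch below uses the Acre figures for 2021-01-01 shown earlier in this notebook; `estimate_population` is a hypothetical helper, and it assumes `incident_rate` really is cases per 100,000 inhabitants and is never zero.

```python
def estimate_population(confirmed: int, incident_rate: float) -> int:
    """Solve incident_rate = 100000 * confirmed / population for population.

    Hypothetical helper; raises ZeroDivisionError if incident_rate is 0.
    """
    return round(100_000 * confirmed / incident_rate)


# Acre on 2021-01-01, values taken from the table above
acre_population = estimate_population(confirmed=41689, incident_rate=4726.992352)
print(acre_population)
```

The result should agree with the `population` column computed for Acre, up to rounding.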

In [14]:
cases

Unnamed: 0,state,country,confirmed,deaths,date,month,year,population
0,Acre,Brazil,41689,796,2021-01-01,2021-01,2021,881935.0
1,Alagoas,Brazil,105091,2496,2021-01-01,2021-01,2021,3337357.0
2,Amapá,Brazil,68361,926,2021-01-01,2021-01,2021,845731.0
3,Amazonas,Brazil,201574,5295,2021-01-01,2021-01,2021,4144597.0
4,Bahia,Brazil,494684,9159,2021-01-01,2021-01,2021,14873064.0
...,...,...,...,...,...,...,...,...
9823,Roraima,Brazil,128793,2078,2021-12-30,2021-12,2021,605761.0
9824,Santa Catarina,Brazil,1242654,20183,2021-12-30,2021-12,2021,7164788.0
9825,São Paulo,Brazil,4455011,155186,2021-12-30,2021-12,2021,45919049.0
9826,Sergipe,Brazil,278507,6057,2021-12-30,2021-12,2021,2298696.0


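To calculate the 7-day moving average of new cases and deaths and classify their trend, we perform the following calculation for each state. The trend compares the current moving average with the one from 14 days earlier: a ratio below 0.85 is 'downward', a ratio above 1.15 is 'upward', and anything in between is 'stable'.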

In [15]:
def get_trend(rate: float) -> str:
    # Keep missing rates missing instead of labelling them
    if np.isnan(rate):
        return np.nan
    if rate < 0.85:
        status = 'downward'
    elif rate > 1.15:
        status = 'upward'
    else:
        status = 'stable'

    return status

# Collect one DataFrame per state, then concatenate them once at the end
frames = []

for state in cases['state'].drop_duplicates():
    cases_per_state = cases.query(f'state == "{state}"').reset_index(drop=True)
    cases_per_state = cases_per_state.sort_values(by = ['date'])

    # Aggregations for confirmed cases
    # diff subtracts the previous row's value, turning cumulative totals into daily new cases
    cases_per_state['conf_1d'] = cases_per_state['confirmed'].diff(periods = 1)
    # rolling aggregates a window of rows ending at the current one (here, 7 days)
    cases_per_state['conf_moving_avg_7d'] = np.ceil(cases_per_state['conf_1d'].rolling(window = 7).mean())
    # ratio between the current 7-day moving average and the one from 14 days earlier
    cases_per_state['conf_moving_avg_7d_rate_14d'] = cases_per_state['conf_moving_avg_7d'] / cases_per_state['conf_moving_avg_7d'].shift(periods = 14)
    # classifying the ratio with get_trend
    cases_per_state['conf_trend'] = cases_per_state['conf_moving_avg_7d_rate_14d'].apply(get_trend)

    # Same aggregations for deaths
    cases_per_state['deaths_1d'] = cases_per_state['deaths'].diff(periods = 1)
    cases_per_state['deaths_moving_avg_7d'] = np.ceil(cases_per_state['deaths_1d'].rolling(window = 7).mean())
    cases_per_state['deaths_moving_avg_7d_rate_14d'] = cases_per_state['deaths_moving_avg_7d'] / cases_per_state['deaths_moving_avg_7d'].shift(periods = 14)
    cases_per_state['deaths_trend'] = cases_per_state['deaths_moving_avg_7d_rate_14d'].apply(get_trend)

    frames.append(cases_per_state)

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
cases = pd.concat(frames, ignore_index = True)
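
The same pipeline can also be expressed with `groupby` instead of an explicit loop over states. The sketch below is an alternative, not the notebook's method: it runs on made-up data for two hypothetical states, shortens the column name to `conf_rate_14d`, and omits the `np.ceil` rounding for simplicity.

```python
import numpy as np
import pandas as pd


def get_trend(rate: float):
    """Same thresholds as the notebook: <0.85 downward, >1.15 upward, else stable."""
    if np.isnan(rate):
        return np.nan
    if rate < 0.85:
        return 'downward'
    if rate > 1.15:
        return 'upward'
    return 'stable'


# Toy data (made up): state 'A' doubles its daily new cases after day 15, 'B' stays flat
dates = pd.date_range('2021-01-01', periods=30, freq='D')
toy = pd.concat([
    pd.DataFrame({'state': 'A', 'date': dates,
                  'confirmed': np.cumsum([10] * 15 + [20] * 15)}),
    pd.DataFrame({'state': 'B', 'date': dates,
                  'confirmed': np.cumsum([5] * 30)}),
], ignore_index=True)

toy = toy.sort_values(['state', 'date']).reset_index(drop=True)

# Every step is computed within a state, so values never leak across states
toy['conf_1d'] = toy.groupby('state')['confirmed'].diff()
toy['conf_moving_avg_7d'] = (
    toy.groupby('state')['conf_1d']
       .transform(lambda s: s.rolling(window=7).mean())
)
toy['conf_rate_14d'] = (
    toy['conf_moving_avg_7d']
    / toy.groupby('state')['conf_moving_avg_7d'].shift(14)
)
toy['conf_trend'] = toy['conf_rate_14d'].apply(get_trend)

last_day = toy.groupby('state').tail(1)[['state', 'conf_rate_14d', 'conf_trend']]
print(last_day)
```

On this toy data, state A ends the period classified as 'upward' and state B as 'stable'. Note that when the moving average from 14 days earlier is 0 the ratio is not finite (`inf`, or NaN when both averages are 0), so it may be worth handling that case explicitly before classifying.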

In [16]:
cases['population'] = cases['population'].astype('Int64')
cases['conf_1d'] = cases['conf_1d'].astype('Int64')
cases['conf_moving_avg_7d'] = cases['conf_moving_avg_7d'].astype('Int64')
cases['deaths_1d'] = cases['deaths_1d'].astype('Int64')
cases['deaths_moving_avg_7d'] = cases['deaths_moving_avg_7d'].astype('Int64')

Reordering the columns of the DataFrame

In [17]:
cases = cases[['date', 'country', 'state', 'population','confirmed', 'conf_1d', 'conf_moving_avg_7d', 'conf_moving_avg_7d_rate_14d', 'conf_trend', 'deaths', 'deaths_1d', 'deaths_moving_avg_7d', 'deaths_moving_avg_7d_rate_14d', 'deaths_trend', 'month', 'year']]

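We see that there are missing values in the daily differences, the 7-day moving averages, and the 14-day rates. This is expected: each state has no daily difference on its first day, no 7-day moving average until 7 daily differences exist, and no 14-day rate until a moving average from 14 days earlier exists. That leaves 1, 7, and 21 missing values per state, which for the 27 states in the data gives 27, 189, and 567 missing values, matching the non-null counts of 9801, 9639, and 9261 out of 9828 rows.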
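
The per-state counts can be reproduced on a single made-up series. This is a minimal sketch with 30 days of toy cumulative values; the variable names are illustrative.

```python
import pandas as pd

# Toy cumulative series (made up): 30 days for a single state
cumulative = pd.Series(range(10, 310, 10), dtype='float64')

daily = cumulative.diff()                # needs 1 earlier day
avg_7d = daily.rolling(window=7).mean()  # needs 7 daily values
rate_14d = avg_7d / avg_7d.shift(14)     # needs an average from 14 days earlier

missing = {
    'daily': int(daily.isna().sum()),
    'avg_7d': int(avg_7d.isna().sum()),
    'rate_14d': int(rate_14d.isna().sum()),
}
print(missing)  # {'daily': 1, 'avg_7d': 7, 'rate_14d': 21}
```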

In [18]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9828 entries, 0 to 9827
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   date                           9828 non-null   datetime64[ns]
 1   country                        9828 non-null   object        
 2   state                          9828 non-null   object        
 3   population                     9828 non-null   Int64         
 4   confirmed                      9828 non-null   int64         
 5   conf_1d                        9801 non-null   Int64         
 6   conf_moving_avg_7d             9639 non-null   Int64         
 7   conf_moving_avg_7d_rate_14d    9261 non-null   float64       
 8   conf_trend                     9261 non-null   object        
 9   deaths                         9828 non-null   int64         
 10  deaths_1d                      9801 non-null   Int64         
 11  deaths_moving_avg

In [19]:
cases.head()

Unnamed: 0,date,country,state,population,confirmed,conf_1d,conf_moving_avg_7d,conf_moving_avg_7d_rate_14d,conf_trend,deaths,deaths_1d,deaths_moving_avg_7d,deaths_moving_avg_7d_rate_14d,deaths_trend,month,year
0,2021-01-01,Brazil,Acre,881935,41689,,,,,796,,,,,2021-01,2021
1,2021-01-02,Brazil,Acre,881935,41941,252.0,,,,798,2.0,,,,2021-01,2021
2,2021-01-03,Brazil,Acre,881935,42046,105.0,,,,802,4.0,,,,2021-01,2021
3,2021-01-04,Brazil,Acre,881935,42117,71.0,,,,806,4.0,,,,2021-01,2021
4,2021-01-05,Brazil,Acre,881935,42170,53.0,,,,808,2.0,,,,2021-01,2021


Now we save the dataset to a CSV file so that it can be used in Looker Studio.

In [20]:
cases.to_csv('./covid-cases.csv', sep = ',', index = False)

## Vaccination Data

In [21]:
# infer_datetime_format is deprecated since pandas 2.0; parsing the 'date' column by name is enough
vaccines = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv', sep = ',', parse_dates = ['date'])

In [22]:
vaccines.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-01-03,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
1,AFG,Asia,Afghanistan,2020-01-04,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
2,AFG,Asia,Afghanistan,2020-01-05,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
3,AFG,Asia,Afghanistan,2020-01-06,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
4,AFG,Asia,Afghanistan,2020-01-07,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,


Keeping only the rows for Brazil and the columns related to vaccination

In [23]:
vaccines = vaccines.query('location == "Brazil"').reset_index(drop = True)
vaccines = vaccines[['location', 'population', 'total_vaccinations', 'people_vaccinated', "people_fully_vaccinated","total_boosters", "date"]]
vaccines.head()

Unnamed: 0,location,population,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,date
0,Brazil,215313504.0,,,,,2020-01-03
1,Brazil,215313504.0,,,,,2020-01-04
2,Brazil,215313504.0,,,,,2020-01-05
3,Brazil,215313504.0,,,,,2020-01-06
4,Brazil,215313504.0,,,,,2020-01-07


In [24]:
vaccines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   location                 1167 non-null   object        
 1   population               1167 non-null   float64       
 2   total_vaccinations       690 non-null    float64       
 3   people_vaccinated        686 non-null    float64       
 4   people_fully_vaccinated  670 non-null    float64       
 5   total_boosters           450 non-null    float64       
 6   date                     1167 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 63.9+ KB


We see that the vaccination columns have many missing values, so we fill them in next.

In [25]:
# ffill fills each missing value with the most recent non-missing value before it
# (fillna(method = 'ffill') is deprecated in recent pandas versions)
vaccines = vaccines.ffill()
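
A small made-up example shows what forward fill does and does not do. The column name mirrors the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): cumulative totals reported only on some days
toy = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=5, freq='D'),
    'total_vaccinations': [np.nan, 100.0, np.nan, np.nan, 250.0],
})

filled = toy.ffill()
print(filled['total_vaccinations'].tolist())  # [nan, 100.0, 100.0, 100.0, 250.0]
```

Gaps after the first reported value are carried forward, which is reasonable for cumulative totals. Values before the first report have nothing to carry forward and stay missing, which is why the first rows of 2021 still show no vaccination figures.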

To make sure that both datasets cover the same time period, we filter the vaccination data by date.

In [26]:
vaccines = vaccines[(vaccines['date'] >= "2021-01-01") & (vaccines['date'] <= "2021-12-31")].reset_index(drop = True)

In [27]:
vaccines = vaccines.rename(
    columns = {
        'location' : 'country',
        'total_vaccinations': 'total',
        'people_vaccinated' : 'one_shot',
        'people_fully_vaccinated' : 'two_shots',
        'total_boosters' : 'three_shots',
    }
)

In [28]:
vaccines['month'] = vaccines['date'].apply(lambda date: date.strftime('%Y-%m'))
vaccines['year'] = vaccines['date'].apply(lambda date: date.strftime('%Y'))

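Now we compute the share of the population with one, two, and three shots. The `_perc` columns hold fractions between 0 and 1, rounded to four decimal places, so 0.7718 means 77.18%.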

In [29]:
vaccines['one_shot_perc'] = round(vaccines['one_shot'] / vaccines['population'],4)
vaccines['two_shots_perc'] = round(vaccines['two_shots'] / vaccines['population'],4)
vaccines['three_shots_perc'] = round(vaccines['three_shots'] / vaccines['population'],4)
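
As a quick worked check of one of these shares, the sketch below uses the one-shot count and population for 2021-12-30 shown at the end of this notebook. The variable names are illustrative.

```python
# Figures for Brazil on 2021-12-30, taken from the final vaccines table
one_shot = 166_185_628
population = 215_313_504

share = round(one_shot / population, 4)  # fraction between 0 and 1
print(share)           # 0.7718
print(f'{share:.2%}')  # 77.18%
```

Keeping the stored value as a fraction leaves the choice of percentage formatting to the dashboard tool.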

In [30]:
vaccines['population'] = vaccines['population'].astype('Int64')
vaccines['total'] = vaccines['total'].astype('Int64')
vaccines['one_shot'] = vaccines['one_shot'].astype('Int64')
vaccines['two_shots'] = vaccines['two_shots'].astype('Int64')
vaccines['three_shots'] = vaccines['three_shots'].astype('Int64')

In [31]:
vaccines = vaccines[['date', 'country', 'population', 'total', 'one_shot', 'one_shot_perc', 'two_shots','two_shots_perc','three_shots', 'three_shots_perc','month', 'year']]
vaccines.head()

Unnamed: 0,date,country,population,total,one_shot,one_shot_perc,two_shots,two_shots_perc,three_shots,three_shots_perc,month,year
0,2021-01-01,Brazil,215313504,,,,,,,,2021-01,2021
1,2021-01-02,Brazil,215313504,,,,,,,,2021-01,2021
2,2021-01-03,Brazil,215313504,,,,,,,,2021-01,2021
3,2021-01-04,Brazil,215313504,,,,,,,,2021-01,2021
4,2021-01-05,Brazil,215313504,,,,,,,,2021-01,2021


Saving the data to a CSV file for use in Looker Studio

In [32]:
vaccines.to_csv('./covid-vaccines.csv', sep = ',', index = False)

In [33]:
cases

Unnamed: 0,date,country,state,population,confirmed,conf_1d,conf_moving_avg_7d,conf_moving_avg_7d_rate_14d,conf_trend,deaths,deaths_1d,deaths_moving_avg_7d,deaths_moving_avg_7d_rate_14d,deaths_trend,month,year
0,2021-01-01,Brazil,Acre,881935,41689,,,,,796,,,,,2021-01,2021
1,2021-01-02,Brazil,Acre,881935,41941,252,,,,798,2,,,,2021-01,2021
2,2021-01-03,Brazil,Acre,881935,42046,105,,,,802,4,,,,2021-01,2021
3,2021-01-04,Brazil,Acre,881935,42117,71,,,,806,4,,,,2021-01,2021
4,2021-01-05,Brazil,Acre,881935,42170,53,,,,808,2,,,,2021-01,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9823,2021-12-26,Brazil,Tocantins,1572866,234113,0,0,0.000000,downward,3927,0,0,0.0,downward,2021-12,2021
9824,2021-12-27,Brazil,Tocantins,1572866,234113,0,0,0.000000,downward,3927,0,0,0.0,downward,2021-12,2021
9825,2021-12-28,Brazil,Tocantins,1572866,234964,851,122,2.837209,upward,3933,6,1,1.0,stable,2021-12,2021
9826,2021-12-29,Brazil,Tocantins,1572866,235340,376,176,inf,upward,3936,3,2,inf,upward,2021-12,2021


In [34]:
vaccines

Unnamed: 0,date,country,population,total,one_shot,one_shot_perc,two_shots,two_shots_perc,three_shots,three_shots_perc,month,year
0,2021-01-01,Brazil,215313504,,,,,,,,2021-01,2021
1,2021-01-02,Brazil,215313504,,,,,,,,2021-01,2021
2,2021-01-03,Brazil,215313504,,,,,,,,2021-01,2021
3,2021-01-04,Brazil,215313504,,,,,,,,2021-01,2021
4,2021-01-05,Brazil,215313504,,,,,,,,2021-01,2021
...,...,...,...,...,...,...,...,...,...,...,...,...
360,2021-12-27,Brazil,215313504,329011365,165952037,0.7707,142764283,0.6631,25218893,0.1171,2021-12,2021
361,2021-12-28,Brazil,215313504,329861730,166062249,0.7713,142965728,0.6640,25758909,0.1196,2021-12,2021
362,2021-12-29,Brazil,215313504,330718457,166143380,0.7716,143282084,0.6655,26219623,0.1218,2021-12,2021
363,2021-12-30,Brazil,215313504,331164041,166185628,0.7718,143398692,0.6660,26507937,0.1231,2021-12,2021
