# Brazil Covid-19 Data Analysis

#### Author: Romullo Ferreira

The aim of this project is to analyze Covid-19 data for Brazil. The analysis includes the following steps:

1. Defining the list of questions to be answered with data analysis.
2. Data wrangling 
    - 2.1. Gather
    - 2.2. Assess
    - 2.3. Clean
3. Data exploration and effective communication of the exploratory steps.
4. Conclusions.

Useful facts about the data:
- The data describe the epidemiological situation of COVID-19 in Brazil
- The data are updated daily by the Ministry of Health, using official information provided by the State Health Secretariats of the 27 Brazilian Federative Units
- 27 Federative Units (26 states plus the Federal District)
- 5 regions of Brazil (Central-West, North, Northeast, South and Southeast)
- Data were collected from 2021-03-09 to 2021-03-19
- Data Source: State Health Secretariats. Brazil, 2020
- Detailed data can be downloaded from [OpenDataSUS](https://opendatasus.saude.gov.br/)
- You can also access the [Brazilian Open Data Portal](https://dados.gov.br/dataset) to find more coronavirus data

## 1. Defining the list of questions to be answered with data analysis

The dataset holds a lot of information and could answer many questions. To keep the analysis from getting too long, I picked a few questions that I consider a good starting point. After looking at my analysis, feel free to ask other questions and explore the data further.

It is important to remember that all the questions below refer to Covid-19 data from Brazil:

- a. - What is the total number of accumulated cases?
- b. - What is the total number of new cases?
- c. - What is the total number of recovered cases?
- d. - What is the total number of follow-up cases?
- e. - What is the total number of Covid-19 deaths?
- f. - What is the total number of new deaths?
- g. - What is the Incidence Coefficient of Covid-19?
- h. - What is the Mortality Coefficient for Covid-19?
- i. - What is the Covid-19 Lethality Rate?
- j. - Which region (zone) has the highest number of Covid-19 cases?
- k. - Which states have the highest number of Covid-19 cases?
- l. - Which cities have the highest number of Covid-19 cases?
- m. - Which region (zone) has the highest number of Covid-19 deaths?
- n. - Which states have the highest number of Covid-19 deaths?
- o. - Which cities have the highest number of Covid-19 deaths?
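
Questions g, h and i rely on three derived indicators whose formulas are not spelled out here. The sketch below assumes the usual epidemiological definitions: incidence and mortality expressed per 100,000 inhabitants, and lethality as a percentage of confirmed cases. The function names and the input figures are hypothetical and serve only as illustration; in the raw file the population would come from a column such as `populacaoTCU2019`.

```python
def incidence_coefficient(cases, population, per=100_000):
    """Confirmed cases per `per` inhabitants (assumed definition)."""
    return cases * per / population


def mortality_coefficient(deaths, population, per=100_000):
    """Deaths per `per` inhabitants (assumed definition)."""
    return deaths * per / population


def lethality_rate(deaths, cases):
    """Deaths as a percentage of confirmed cases (assumed definition)."""
    return deaths * 100 / cases


# Hypothetical figures, for illustration only (not taken from the dataset).
cases, deaths, population = 5_000, 100, 1_000_000

incidence = incidence_coefficient(cases, population)
mortality = mortality_coefficient(deaths, population)
lethality = lethality_rate(deaths, cases)

print(incidence, mortality, lethality)
```

With these made-up inputs, the incidence is 500 cases and the mortality 10 deaths per 100,000 inhabitants, and the lethality is 2%.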

## 2. Data wrangling

## 2.1. Gather

Firstly, let's import all the libraries necessary for this project

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import CSV File

In [5]:
df_covid19br = pd.read_csv('COVIDBR.csv', sep=';')
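
As a small, self-contained illustration of the `sep=';'` argument, the sketch below reads a made-up two-row sample in the same semicolon-separated layout from an in-memory string. The `parse_dates` argument is an optional extra that the notebook does not use (it keeps `data` as plain strings); the sample rows are invented.

```python
import io

import pandas as pd

# Tiny made-up sample in the same layout: semicolon-separated, ISO dates.
raw = io.StringIO(
    "regiao;estado;municipio;data;casosAcumulado\n"
    "Sul;RS;Bom Jesus;2021-03-18;490\n"
    "Sul;RS;Bom Jesus;2021-03-19;490\n"
)

# sep=";" tells pandas that fields are split on semicolons, not commas.
# parse_dates converts the "data" column to datetime64 while reading.
sample = pd.read_csv(raw, sep=";", parse_dates=["data"])

print(sample.dtypes)
```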

## 2.2. Assess

Let's take a look at the dataframe

In [6]:
df_covid19br.sample(4)

Unnamed: 0,regiao,estado,municipio,coduf,codmun,codRegiaoSaude,nomeRegiaoSaude,data,semanaEpi,populacaoTCU2019,casosAcumulado,casosNovos,obitosAcumulado,obitosNovos,Recuperadosnovos,emAcompanhamentoNovos,interior/metropolitana
836923,Sudeste,MG,Astolfo Dutra,31,310460.0,31044.0,LEOPOLDINA / CATAGUASES,2020-07-28,31,14179.0,60,0,2,0,,,0.0
257850,Nordeste,PI,Baixa Grande do Ribeiro,22,220115.0,22007.0,TABULEIROS DO ALTO PARNAIBA,2021-01-15,2,11586.0,1571,0,15,0,,,0.0
45390,Norte,AM,Eirunepé,13,130140.0,13007.0,REGIONAL JURUA,2020-08-02,32,35273.0,1327,0,4,0,,,0.0
1203511,Sudeste,SP,Auriflama,35,350420.0,35021.0,CENTRAL DO DRS II,2020-07-24,30,15189.0,144,11,6,0,,,0.0


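We can already see missing values: in this sample the Recuperadosnovos and emAcompanhamentoNovos columns are empty, and the summaries further down show that estado, municipio, codmun and other columns also have gaps.

Here we can see the size of the original dataset: 2,012,472 rows and 17 columns.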

In [7]:
df_covid19br.shape

(2012472, 17)

Let's count the unique values

In [8]:
df_covid19br.nunique()

regiao                        6
estado                       27
municipio                  5297
coduf                        28
codmun                     5591
codRegiaoSaude              450
nomeRegiaoSaude             440
data                        389
semanaEpi                    53
populacaoTCU2019           5104
casosAcumulado            33078
casosNovos                 3779
obitosAcumulado            7273
obitosNovos                 641
Recuperadosnovos            336
emAcompanhamentoNovos       336
interior/metropolitana        2
dtype: int64

##### Let's create a copy of the original dataframe and keep only the rows for the most recent date in the file (2021-03-19) in a new dataframe (filtered_df), which is where I'm going to focus my analysis.

In [9]:
filtered_df = df_covid19br.copy()
filtered_df = filtered_df.query("data > '2021-03-18'")
filtered_df.sample(3)

Unnamed: 0,regiao,estado,municipio,coduf,codmun,codRegiaoSaude,nomeRegiaoSaude,data,semanaEpi,populacaoTCU2019,casosAcumulado,casosNovos,obitosAcumulado,obitosNovos,Recuperadosnovos,emAcompanhamentoNovos,interior/metropolitana
410779,Nordeste,RN,Doutor Severiano,24,240320.0,24006.0,6ª REGIAO DE SAUDE - PAU DOS FERROS,2021-03-19,11,7076.0,450,-10,12,0,,,0.0
137267,Norte,TO,Combinado,17,170555.0,17003.0,SUDESTE,2021-03-19,11,4852.0,331,2,4,0,,,0.0
1251721,Sudeste,SP,Floreal,35,351590.0,35157.0,VOTUPORANGA,2021-03-19,11,2917.0,92,1,2,0,,,0.0
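
The cut-off date above is written by hand. A minimal sketch of an alternative, assuming the `data` column holds ISO-formatted strings (YYYY-MM-DD) as in the samples shown, is to let pandas find the latest date. The toy frame and its values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "municipio": ["A", "A", "B", "B"],
    "data": ["2021-03-18", "2021-03-19", "2021-03-18", "2021-03-19"],
    "casosAcumulado": [10, 12, 7, 7],
})

# ISO-8601 date strings sort lexicographically in chronological order,
# so max() on the string column returns the most recent date.
latest = toy["data"].max()

# Keep only the rows for that date; copy() avoids chained-assignment warnings later.
latest_rows = toy[toy["data"] == latest].copy()

print(latest, len(latest_rows))
```

This keeps working when a newer file is loaded, at the cost of no longer pinning the analysis to one known date.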


Dimensions of the dataframe. Here we can already see that our filtered dataframe is much smaller than the original file, with 5619 rows and 17 columns. 

In [10]:
filtered_df.shape

(5619, 17)

- Which columns have missing values?
- Here we can see that the "municipio" column has only 5570 non-null values out of the 5619 rows, which means 49 values are missing. The "codmun" column has 5591 non-null values out of 5619, so 28 are missing. Let's keep that information in mind.

In [11]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5619 entries, 390 to 2012471
Data columns (total 17 columns):
regiao                    5619 non-null object
estado                    5618 non-null object
municipio                 5570 non-null object
coduf                     5619 non-null int64
codmun                    5591 non-null float64
codRegiaoSaude            5570 non-null float64
nomeRegiaoSaude           5570 non-null object
data                      5619 non-null object
semanaEpi                 5619 non-null int64
populacaoTCU2019          5598 non-null float64
casosAcumulado            5619 non-null int64
casosNovos                5619 non-null int64
obitosAcumulado           5619 non-null int64
obitosNovos               5619 non-null int64
Recuperadosnovos          1 non-null float64
emAcompanhamentoNovos     1 non-null float64
interior/metropolitana    5570 non-null float64
dtypes: float64(6), int64(6), object(5)
memory usage: 790.2+ KB


Another way to see the missing values is to count them per column.

In [12]:
filtered_df.isnull().sum()

regiao                       0
estado                       1
municipio                   49
coduf                        0
codmun                      28
codRegiaoSaude              49
nomeRegiaoSaude             49
data                         0
semanaEpi                    0
populacaoTCU2019            21
casosAcumulado               0
casosNovos                   0
obitosAcumulado              0
obitosNovos                  0
Recuperadosnovos          5618
emAcompanhamentoNovos     5618
interior/metropolitana      49
dtype: int64
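
Raw counts are easier to judge next to the share of rows they represent (for instance, 5618 of 5619 rows are missing in the two recovery columns above). The sketch below shows one possible way to build such a summary on a made-up toy frame; the `missing` and `percent` column names are my own choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "estado": ["SP", "MG", None, "RS"],
    "municipio": ["Floreal", None, None, "Bom Jesus"],
    "casosAcumulado": [92, 60, 100, 490],
})

# isnull().sum() counts missing values per column;
# isnull().mean() gives the same information as a fraction of all rows.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing)
```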

## 2.3. Clean

Make a copy of the filtered_df dataframe to clean

In [13]:
df_clean = filtered_df.copy()

### #Delete columns 

* The columns 'coduf', 'codRegiaoSaude', 'nomeRegiaoSaude', 'populacaoTCU2019', 'semanaEpi' and 'interior/metropolitana' are not required, because I do not use them in the analysis so far.

#### Define

* We will delete the columns 'coduf', 'codRegiaoSaude', 'nomeRegiaoSaude', 'populacaoTCU2019', 'semanaEpi' and 'interior/metropolitana'

#### Code

In [14]:
colunas = ['coduf', 'codRegiaoSaude', 'nomeRegiaoSaude', 'populacaoTCU2019', 'semanaEpi', 'interior/metropolitana']
df_clean.drop(columns = colunas, inplace = True)
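
For comparison, here is a minimal sketch of the same operation without `inplace=True`, on a made-up one-row frame. In this form `drop` returns a new dataframe and leaves the original untouched, which can make it easier to re-run a cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "regiao": ["Sul"],
    "coduf": [43],
    "semanaEpi": [11],
    "casosAcumulado": [490],
})
unused = ["coduf", "semanaEpi"]

# Without inplace=True, drop() returns a new DataFrame; `toy` keeps all columns.
trimmed = toy.drop(columns=unused)

print(list(trimmed.columns))
```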

#### Test

In [15]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5619 entries, 390 to 2012471
Data columns (total 11 columns):
regiao                   5619 non-null object
estado                   5618 non-null object
municipio                5570 non-null object
codmun                   5591 non-null float64
data                     5619 non-null object
casosAcumulado           5619 non-null int64
casosNovos               5619 non-null int64
obitosAcumulado          5619 non-null int64
obitosNovos              5619 non-null int64
Recuperadosnovos         1 non-null float64
emAcompanhamentoNovos    1 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 526.8+ KB


In [16]:
df_clean.sample(3)

Unnamed: 0,regiao,estado,municipio,codmun,data,casosAcumulado,casosNovos,obitosAcumulado,obitosNovos,Recuperadosnovos,emAcompanhamentoNovos
653145,Nordeste,SE,Japaratuba,280330.0,2021-03-19,587,2,32,0,,
1855667,Centro-Oeste,MS,Douradina,500350.0,2021-03-19,330,0,4,0,,
171635,Norte,TO,Tocantinópolis,172120.0,2021-03-19,1708,11,29,0,,


In [17]:
df_clean.shape

(5619, 11)

### #Rename columns

* I decided to rename the columns to English so that the file is easier to read for everyone who analyzes it, and to improve the visualizations.

#### Define

* Let's rename the columns

#### Code

In [18]:
renamecolumns = {
    'regiao': 'region',
    'estado': 'state',
    'municipio': 'city',
    'codmun': 'codcity',
    'data': 'date',
    'casosAcumulado': 'totalInfected',
    'casosNovos': 'newInfected',
    'obitosAcumulado': 'totalDeaths',
    'obitosNovos': 'newdeaths',
    'Recuperadosnovos': 'recovered',
    'emAcompanhamentoNovos': 'followUp',
}
df_clean = df_clean.rename(columns=renamecolumns)
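
By default, `rename` silently skips mapping keys that do not match any column, so a typo in the mapping goes unnoticed. The sketch below shows one way to check the mapping first; the toy frame and the deliberately misspelled key are made up.

```python
import pandas as pd

toy = pd.DataFrame({"regiao": ["Sul"], "estado": ["RS"], "data": ["2021-03-19"]})

# "dta" is a deliberate typo for "data", used to show the silent skip.
mapping = {"regiao": "region", "estado": "state", "dta": "date"}

# Keys of the mapping that are not actual column names.
unknown = sorted(set(mapping) - set(toy.columns))

# rename() ignores the unknown key, so the "data" column keeps its old name.
renamed = toy.rename(columns=mapping)

print(unknown, list(renamed.columns))
```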

#### Test

In [19]:
df_clean.sample(3)

Unnamed: 0,region,state,city,codcity,date,totalInfected,newInfected,totalDeaths,newdeaths,recovered,followUp
1095275,Sudeste,MG,Senador Amaral,316557.0,2021-03-19,265,0,5,0,,
1754711,Sul,RS,Mariana Pimentel,431198.0,2021-03-19,149,0,3,0,,
1542775,Sul,PR,São José dos Pinhais,412550.0,2021-03-19,16099,156,411,7,,


### #Missing values (Fixing NaN data values)

* Some rows have no city name. Let's disregard them.

#### Define

* Let's exclude these rows from the analysis, because we have no way of knowing which cities they refer to.

* The "city" column is essential for this analysis, so rows that cannot be tied to a city must be removed. We use the "codcity" column for this: rows without a city code are dropped.

#### Code

In [20]:
# Keep only the rows whose codcity is not null (NaN)
df_clean_notnull = df_clean.dropna(subset=['codcity'])

#### Test

In [22]:
df_clean_notnull.shape

(5591, 11)

In [23]:
df_clean_notnull.count()

region           5591
state            5591
city             5570
codcity          5591
date             5591
totalInfected    5591
newInfected      5591
totalDeaths      5591
newdeaths        5591
recovered           0
followUp            0
dtype: int64
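
The counts above show 5591 rows but only 5570 city names, so some rows keep a city code without a name. The sketch below reproduces that situation on a made-up frame and also converts the code to an integer, which only becomes possible once the NaN rows are gone. The integer conversion is an optional extra that the notebook does not perform, and the code 350000 is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "city": ["Floreal", None, None],
    "codcity": [351590.0, 350000.0, np.nan],  # 350000 is a hypothetical code
    "totalInfected": [92, 15, 300],
})

# Drop only the rows where the city code itself is missing.
kept = toy.dropna(subset=["codcity"]).copy()

# With no NaN left, the float codes can be stored as integers.
kept["codcity"] = kept["codcity"].astype(int)

# Rows that survive the filter but still have no city name.
unnamed_kept = int(kept["city"].isnull().sum())

print(len(kept), unnamed_kept)
```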

### #Duplicated Data

##### Another important point is that Brazil has cities that share the same name but are in different states.

Is there really duplicate data, or just cities with the same name in different states?

For this reason, we will not use the 'city' column (the name of the city) to identify cities in our analysis. Instead, we use the city code, codcity (originally codmun). That is also why we are not going to delete these apparent duplicates.

Let's look at the duplicate names below to understand the problem.

##### Total number of cities

In [24]:
df_clean_notnull['city'].count()

5570

##### Counting the unique values for the "city" column

In [25]:
df_clean_notnull['city'].nunique()

5297

##### Counting the occurrence of the value of cities

In [26]:
df_clean_notnull['city'].value_counts().sort_values(ascending=False).head(10)

Bom Jesus          5
São Domingos       5
Santa Inês         4
Santa Helena       4
Vera Cruz          4
Santa Terezinha    4
São Francisco      4
Planalto           4
Santa Luzia        4
Bonito             4
Name: city, dtype: int64

##### List of times that the city Bom Jesus appears

For example, the city name Bom Jesus appears 5 times in the dataset, each time in a different state, because these really are different cities.

In [27]:
df_clean_notnull[df_clean_notnull['city'] == 'Bom Jesus'].head(5)

Unnamed: 0,region,state,city,codcity,date,totalInfected,newInfected,totalDeaths,newdeaths,recovered,followUp
262567,Nordeste,PI,Bom Jesus,220190.0,2021-03-19,2338,5,27,0,,
405051,Nordeste,RN,Bom Jesus,240170.0,2021-03-19,516,-4,10,0,,
469849,Nordeste,PB,Bom Jesus,250220.0,2021-03-19,96,1,3,0,,
1575353,Sul,SC,Bom Jesus,420253.0,2021-03-19,297,1,4,0,,
1684185,Sul,RS,Bom Jesus,430230.0,2021-03-19,490,0,12,0,,
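
The same check can be run for every name at once. The sketch below is one possible approach: it counts the distinct states per city name and confirms that the (state, city) pair is unique. It uses a few rows copied from the outputs shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["PI", "RN", "RS", "SP"],
    "city": ["Bom Jesus", "Bom Jesus", "Bom Jesus", "Floreal"],
    "codcity": [220190, 240170, 430230, 351590],
})

# Number of distinct states in which each city name occurs.
states_per_name = toy.groupby("city")["state"].nunique()

# Names shared by cities in more than one state.
shared_names = states_per_name[states_per_name > 1]

# In this toy frame the (state, city) pair identifies each row, as codcity does.
pair_is_unique = not toy.duplicated(subset=["state", "city"]).any()

print(shared_names)
```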


##### Total cities with duplicate names

In [28]:
sum(df_clean_notnull.duplicated(subset='city', keep='first'))

293
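
One caveat when reading this number: `duplicated` treats missing values as equal to each other, so rows without a city name are counted as duplicates too. The counts above appear consistent with that: 5570 named rows minus 5297 unique names gives 273 repeats, and the remaining 20 would be the 21 unnamed rows minus the first one. A minimal sketch of the effect, on a made-up series:

```python
import numpy as np
import pandas as pd

names = pd.Series(["Bom Jesus", "Bom Jesus", "Floreal", np.nan, np.nan, np.nan])

# NaN values are treated as equal, so the 2nd and 3rd NaN count as duplicates.
with_nan = int(names.duplicated(keep="first").sum())

# Dropping missing names first counts only real repeated names.
named_only = int(names.dropna().duplicated(keep="first").sum())

print(with_nan, named_only)
```

Here the first count is 3 (one repeated name plus two repeated NaN values) and the second is 1.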