# Brazil Covid-19 Data Analysis

#### Author: Romullo Ferreira

This project analyzes Covid-19 data for Brazil in the following steps:

1. Defining the list of questions to be answered with data analysis.
2. Data wrangling 
    - 2.1. Gather
    - 2.2. Assess
    - 2.3. Clean
3. Data exploration and effective communication of the exploratory steps.
4. Conclusions.

Useful facts about the data:
- The data describe the epidemiological situation of COVID-19 in Brazil
- The Ministry of Health updates the data daily, based on official information provided by the State Health Secretariats of the 27 Brazilian Federative Units
- 27 federative units (26 states and the Federal District)
- 5 regions of Brazil (Central-West, North, Northeast, South and Southeast)
- Data were collected from 2021-03-09 to 2021-03-18
- Data Source: State Health Secretariats. Brazil, 2020
- Detailed data can be downloaded from [OpenDataSUS](https://opendatasus.saude.gov.br/)
- You can also access the [Brazilian Open Data Portal](https://dados.gov.br/dataset) to find more coronavirus data

## 1. Defining the list of questions to be answered with data analysis

The dataset contains a lot of information and supports many questions. To keep the analysis from getting too long, I chose a set of questions that I consider a good starting point. After reading the analysis, feel free to ask other questions and explore the data further.

Note that all the questions below refer to Covid-19 data from Brazil:

- a. - What is the total number of accumulated cases?
- b. - What is the total number of new cases?
- c. - What is the total number of recovered cases?
- d. - What is the total number of cases under follow-up?
- e. - What is the total number of Covid-19 deaths?
- f. - What is the total number of new deaths?
- g. - What is the Incidence Coefficient of Covid-19?
- h. - What is the Mortality Coefficient for Covid-19?
- i. - What is the Covid-19 Lethality Rate?
- j. - Which region has the highest number of Covid-19 cases?
- k. - Which states have the highest number of Covid-19 cases?
- l. - Which cities have the highest number of Covid-19 cases?
- m. - Which region has the highest number of Covid-19 deaths?
- n. - Which states have the highest number of Covid-19 deaths?
- o. - Which cities have the highest number of Covid-19 deaths?
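
Questions g, h and i rely on three standard epidemiological indicators. As a minimal sketch, assuming the usual definitions (incidence and mortality per 100,000 inhabitants, lethality as a percentage of confirmed cases), they could be computed as below. The function name `covid_indicators` and the figures in the usage example are hypothetical and are not taken from the dataset.

```python
def covid_indicators(cases, deaths, population):
    """Return (incidence, mortality, lethality) for one area.

    Assumed definitions (not stated in the text above):
      incidence = cases  / population * 100,000
      mortality = deaths / population * 100,000
      lethality = deaths / cases      * 100  (percent)
    """
    incidence = cases / population * 100_000
    mortality = deaths / population * 100_000
    # Avoid dividing by zero when an area has no confirmed cases.
    lethality = deaths / cases * 100 if cases else float("nan")
    return incidence, mortality, lethality


# Hypothetical figures, for illustration only.
incidence, mortality, lethality = covid_indicators(
    cases=5_000, deaths=100, population=1_000_000
)
print(f"Incidence: {incidence:.1f} per 100k inhabitants")
print(f"Mortality: {mortality:.1f} per 100k inhabitants")
print(f"Lethality: {lethality:.1f}%")
```

In the dataset, the inputs would presumably come from `casosAcumulado`, `obitosAcumulado` and `populacaoTCU2019`.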

## 2. Data wrangling

## 2.1. Gather

First, let's import all the libraries needed for this project.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import the CSV file.

In [2]:
df_covid19br = pd.read_csv('COVIDBR.csv', sep=';')
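
The `sep=';'` argument tells pandas that fields are separated by semicolons, not commas. The self-contained sketch below illustrates this on an in-memory sample with invented rows instead of `COVIDBR.csv`. It also parses the `data` column as dates through `parse_dates`, which is an optional addition and not part of the original notebook.

```python
import io

import pandas as pd

# Invented sample rows that mimic a few columns of the file.
sample_csv = io.StringIO(
    "regiao;estado;data;casosAcumulado\n"
    "Sudeste;SP;2021-03-18;100\n"
    "Sudeste;SP;2021-03-19;120\n"
)

# sep=';' splits fields on semicolons; parse_dates converts 'data' to datetime.
sample_df = pd.read_csv(sample_csv, sep=";", parse_dates=["data"])
print(sample_df.dtypes)
```

Without `parse_dates`, the `data` column is read as plain strings, which is what the rest of this analysis works with.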

## 2.2. Assess

Let's take a look at the dataframe

In [3]:
df_covid19br.sample(3)

|  | regiao | estado | municipio | coduf | codmun | codRegiaoSaude | nomeRegiaoSaude | data | semanaEpi | populacaoTCU2019 | casosAcumulado | casosNovos | obitosAcumulado | obitosNovos | Recuperadosnovos | emAcompanhamentoNovos | interior/metropolitana |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 955025 | Sudeste | MG | Itatiaiuçu | 31 | 313370.0 | 31031.0 | ITAUNA | 2020-06-20 | 25 | 11146.0 | 37 | -3 | 1 | 0 | NaN | NaN | 1.0 |
| 895825 | Sudeste | MG | Coronel Pacheco | 31 | 311960.0 | 31097.0 | JUIZ DE FORA | 2021-02-03 | 5 | 3086.0 | 101 | 0 | 4 | 0 | NaN | NaN | 0.0 |
| 1352189 | Sudeste | SP | Presidente Epitácio | 35 | 354130.0 | 35114.0 | EXTREMO OESTE PAULISTA | 2020-11-09 | 46 | 44200.0 | 612 | 0 | 20 | 0 | NaN | NaN | 0.0 |


The sample already shows missing values in the Recuperadosnovos and emAcompanhamentoNovos columns. Other columns, such as estado, municipio and codmun, also have missing values.

The original dataset has 2,012,472 rows and 17 columns.

In [4]:
df_covid19br.shape

(2012472, 17)

Let's count the unique values in each column.

In [5]:
df_covid19br.nunique()

regiao                        6
estado                       27
municipio                  5297
coduf                        28
codmun                     5591
codRegiaoSaude              450
nomeRegiaoSaude             440
data                        389
semanaEpi                    53
populacaoTCU2019           5104
casosAcumulado            33078
casosNovos                 3779
obitosAcumulado            7273
obitosNovos                 641
Recuperadosnovos            336
emAcompanhamentoNovos       336
interior/metropolitana        2
dtype: int64
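
Two counts stand out: `regiao` has 6 distinct values although Brazil has 5 regions, and `coduf` has 28 while `estado` has 27. One plausible explanation, which is an assumption here and is not confirmed by the output above, is that the file also contains country-level rows with no state. Listing the distinct values is a quick way to check. The sketch below uses an invented toy dataframe with the same column names; the "Brasil" row is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; the "Brasil" row is a hypothetical country-level aggregate.
toy = pd.DataFrame({
    "regiao": ["Brasil", "Sudeste", "Sudeste", "Norte"],
    "estado": [np.nan, "SP", "MG", "AM"],
})

# nunique() ignores NaN by default, so the row without a state is not counted.
print(toy.nunique())

# Show the distinct values themselves, including missing ones.
print(toy["regiao"].unique())
print(toy["estado"].value_counts(dropna=False))
```

Running the same calls on `df_covid19br` would show which extra values are behind the counts.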

##### Let's create a copy of the original dataframe and build another dataframe (filtered_df) with only the rows dated after 2021-03-18, that is, the most recent day in the data. This is where I'm going to focus my analysis.

In [6]:
filtered_df = df_covid19br.copy()
filtered_df = filtered_df.query("data > '2021-03-18'")
filtered_df.sample(3)

|  | regiao | estado | municipio | coduf | codmun | codRegiaoSaude | nomeRegiaoSaude | data | semanaEpi | populacaoTCU2019 | casosAcumulado | casosNovos | obitosAcumulado | obitosNovos | Recuperadosnovos | emAcompanhamentoNovos | interior/metropolitana |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198485 | Nordeste | MA | Feira Nova do Maranhão | 21 | 210407.0 | 21003.0 | BALSAS | 2021-03-19 | 11 | 8504.0 | 1288 | 0 | 5 | 0 | NaN | NaN | 0.0 |
| 1994571 | Centro-Oeste | GO | Rio Quente | 52 | 521878.0 | 52005.0 | ESTRADA DE FERRO | 2021-03-19 | 11 | 4493.0 | 283 | 4 | 1 | 0 | NaN | NaN | 0.0 |
| 346339 | Nordeste | CE | Carnaubal | 23 | 230340.0 | 23013.0 | 13ª REGIAO TIANGUA | 2021-03-19 | 11 | 17606.0 | 972 | 0 | 10 | 0 | NaN | NaN | 0.0 |
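
The filter above hard-codes the cutoff date. A possible alternative, shown as a sketch on invented toy data, is to select the latest value of the `data` column. This assumes `data` holds ISO-formatted strings (YYYY-MM-DD), as in the samples, so that string comparison matches chronological order. The names `toy`, `latest_date` and `latest_df` are illustrative only.

```python
import pandas as pd

# Toy data with the same 'data' column name; values are invented.
toy = pd.DataFrame({
    "municipio": ["A", "A", "B", "B"],
    "data": ["2021-03-18", "2021-03-19", "2021-03-18", "2021-03-19"],
    "casosAcumulado": [10, 12, 5, 5],
})

# ISO date strings sort chronologically, so max() returns the latest date.
latest_date = toy["data"].max()

# Keep only the rows for that date; copy() avoids modifying a view later on.
latest_df = toy[toy["data"] == latest_date].copy()
print(latest_date, latest_df.shape)
```

This keeps the analysis pointed at the newest snapshot when the file is refreshed, at the cost of results changing between runs.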


Dimensions of the dataframe. The filtered dataframe is much smaller than the original file: 5619 rows and 17 columns.

In [7]:
filtered_df.shape

(5619, 17)

- Which columns have missing values?
- The "municipio" column has only 5570 non-null values out of the 5619 rows in the filtered dataframe, so 49 values are missing. The "codmun" column has 5591 non-null values out of 5619, so 28 values are missing. Let's keep that information in mind.

In [8]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5619 entries, 390 to 2012471
Data columns (total 17 columns):
regiao                    5619 non-null object
estado                    5618 non-null object
municipio                 5570 non-null object
coduf                     5619 non-null int64
codmun                    5591 non-null float64
codRegiaoSaude            5570 non-null float64
nomeRegiaoSaude           5570 non-null object
data                      5619 non-null object
semanaEpi                 5619 non-null int64
populacaoTCU2019          5598 non-null float64
casosAcumulado            5619 non-null int64
casosNovos                5619 non-null int64
obitosAcumulado           5619 non-null int64
obitosNovos               5619 non-null int64
Recuperadosnovos          1 non-null float64
emAcompanhamentoNovos     1 non-null float64
interior/metropolitana    5570 non-null float64
dtypes: float64(6), int64(6), object(5)
memory usage: 790.2+ KB


Another way to see the missing values is to count them per column.

In [9]:
filtered_df.isnull().sum()

regiao                       0
estado                       1
municipio                   49
coduf                        0
codmun                      28
codRegiaoSaude              49
nomeRegiaoSaude             49
data                         0
semanaEpi                    0
populacaoTCU2019            21
casosAcumulado               0
casosNovos                   0
obitosAcumulado              0
obitosNovos                  0
Recuperadosnovos          5618
emAcompanhamentoNovos     5618
interior/metropolitana      49
dtype: int64
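
The raw counts are easier to interpret next to the share of rows they represent. The following sketch, on an invented toy dataframe, builds a small summary with the count and percentage of missing values per column and keeps only the columns that have any. The names `toy` and `summary` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data with invented values and a few missing entries.
toy = pd.DataFrame({
    "municipio": ["A", "B", np.nan, "D"],
    "codmun": [1.0, np.nan, np.nan, 4.0],
    "casosAcumulado": [10, 20, 30, 40],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})

# Keep only columns that actually have missing values, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to `filtered_df`, a summary like this would make it clear that Recuperadosnovos and emAcompanhamentoNovos are almost entirely empty, with 5618 of 5619 values missing.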