<a href="https://colab.research.google.com/github/rodrigodemend/Previsao_Covid/blob/main/Notebooks/Importa%C3%A7%C3%A3o_e_Limpeza_dos_dados_de_Covid_19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time Series Forecasting Project Using Prophet - Alura Data Science Bootcamp
Author: Rodrigo de Mendonça

Email: rodrigodemend@gmail.com

[Feel free to visit my LinkedIn](https://www.linkedin.com/in/rodrigomendonça/)

This project forecasts the number of Covid-19 deaths in Santa Catarina and will be developed with Prophet, a library dedicated to time series forecasting.

This notebook only imports and cleans the data. To see the analyses and forecasts carried out after the data-cleaning phase, feel free to go to this [Notebook](https://github.com/rodrigodemend/Previsao_Covid/blob/main/Notebooks/Previsão_de_Series_Temporais_usando_Prophet.ipynb).

# Importing the libraries

In [24]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np

# Importing the raw data

In this section we import the data that will be used to build the model.

The dataset records the number of new Covid-19 deaths from February 25, 2020 to December 20, 2021. Our forecast covers the state of Santa Catarina, but to make future work on other states easier, I decided to import the data for every Brazilian state.

The data was obtained from [Brasil.IO](https://brasil.io/dataset/covid19/caso_full/), which publishes bulletins on coronavirus cases.

In [25]:
# Read the data from GitHub
raw_data = pd.read_csv('https://raw.githubusercontent.com/rodrigodemend/Previsao_Covid/main/Dados/Raw/Covid_per_state.csv')
raw_data.head()

Unnamed: 0,epidemiological_week,date,order_for_place,state,city,city_ibge_code,place_type,last_available_confirmed,last_available_confirmed_per_100k_inhabitants,new_confirmed,last_available_deaths,new_deaths,last_available_death_rate,estimated_population,is_last,is_repeated
0,202151,2021-12-20,644,AC,,12,state,88335,9875.68057,0,1850,0,0.0209,894470,False,True
1,202151,2021-12-20,653,AL,,27,state,241901,7217.60097,6,6379,2,0.0264,3351543,True,False
2,202151,2021-12-20,648,AM,,13,state,432712,10283.77879,47,13820,0,0.0319,4207714,True,False
3,202151,2021-12-20,641,AP,,16,state,126200,14644.22766,60,2009,1,0.0159,861773,True,False
4,202151,2021-12-20,655,BA,,29,state,1264804,8471.20089,0,27438,6,0.0217,14930634,True,False
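
The date range stated above can be checked directly once the `date` column is parsed as a datetime. The following is a minimal sketch, assuming a miniature stand-in for the raw file built from a few rows copied from the outputs in this notebook; the names `sample_csv`, `sample`, `first_day`, `last_day`, and `n_states` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical miniature of the raw file: four rows copied from the
# outputs shown in this notebook, reduced to the columns needed here.
sample_csv = io.StringIO(
    "date,state,new_confirmed,new_deaths\n"
    "2021-12-20,AL,6,2\n"
    "2021-12-20,AM,47,0\n"
    "2020-02-26,SP,0,0\n"
    "2020-02-25,SP,1,0\n"
)

# parse_dates turns the 'date' strings into datetime64 values,
# so min() and max() compare dates rather than text.
sample = pd.read_csv(sample_csv, parse_dates=["date"])

first_day = sample["date"].min()
last_day = sample["date"].max()
n_states = sample["state"].nunique()

print(first_day.date(), last_day.date(), n_states)
```

On the full `raw_data` frame, the same three expressions would be expected to report the first and last bulletin dates and the number of distinct states.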


Now let's look at what each column means:

In [26]:
# Read the metadata
metadata = pd.read_html('https://raw.githubusercontent.com/rodrigodemend/Previsao_Covid/main/Dados/Raw/Metadatos', encoding='utf-8')[0]
metadata

Unnamed: 0,Coluna,Tipo,Título,Descrição
0,epidemiological_week,integer,Semana epidemiológica,Número da semana epidemiológica.
1,date,string (max_length=10),Data,Data de coleta dos dados no formato YYYY-MM-DD.
2,order_for_place,integer,Dias a partir do 1o caso,Número que identifica a ordem do registro para...
3,state,string (max_length=2),UF,"Sigla da unidade federativa, exemplo: SP."
4,city,string (max_length=64),Município,Nome do município (pode estar em branco quando...
5,city_ibge_code,integer,Cód. IBGE,Código IBGE do local.
6,place_type,string (max_length=5),Tipo de local,"Tipo de local que esse registro descreve, pode..."
7,last_available_date,string (max_length=10),Data da informação,Data da qual o dado se refere.
8,last_available_confirmed,integer,Confirmados acum.,Número de casos confirmados do último dia disp...
9,last_available_confirmed_per_100k_inhabitants,float,Confirmados acum./100k hab.,Número de casos confirmados por 100.000 habita...


We will not work with every column, so let's keep only a few of them:

In [27]:
# Select the desired columns
filter_columns = ['epidemiological_week', 'date', 'state', 'last_available_confirmed', 'new_confirmed', 'last_available_deaths', 'new_deaths']
data = raw_data[filter_columns]
data.tail()

Unnamed: 0,epidemiological_week,date,state,last_available_confirmed,new_confirmed,last_available_deaths,new_deaths
17495,202009,2020-02-29,SP,2,0,0,0
17496,202009,2020-02-28,SP,2,1,0,0
17497,202009,2020-02-27,SP,1,0,0,0
17498,202009,2020-02-26,SP,1,0,0,0
17499,202009,2020-02-25,SP,1,1,0,0


Now let's pivot the table to get the new Covid-19 cases per state:

In [28]:
# Build the pivot table of new confirmed cases per state
data_new_confirmed = data.pivot_table(index=['date'], columns=['state'], values=['new_confirmed'])
data_new_confirmed.columns = data_new_confirmed.columns.droplevel()
data_new_confirmed.tail()

date,AC,AL,AM,AP,BA,CE,DF,ES,GO,MA,MG,MS,MT,PA,PB,PE,PI,PR,RJ,RN,RO,RR,RS,SC,SE,SP,TO
2021-12-16,8.0,14.0,179.0,110.0,0.0,121.0,30.0,270.0,0.0,234.0,401.0,0.0,152.0,799.0,0.0,18.0,45.0,585.0,5.0,137.0,265.0,7.0,163.0,109.0,9.0,0.0,0.0
2021-12-17,6.0,15.0,319.0,104.0,0.0,212.0,39.0,311.0,0.0,386.0,393.0,0.0,149.0,865.0,0.0,13.0,0.0,501.0,0.0,92.0,327.0,5.0,215.0,164.0,11.0,490.0,0.0
2021-12-18,15.0,17.0,87.0,72.0,0.0,106.0,0.0,173.0,0.0,416.0,176.0,0.0,64.0,138.0,34.0,7.0,145.0,410.0,30.0,0.0,100.0,0.0,151.0,137.0,1.0,102.0,0.0
2021-12-19,6.0,3.0,77.0,51.0,0.0,91.0,0.0,9.0,0.0,17.0,94.0,0.0,11.0,143.0,0.0,9.0,6.0,339.0,7.0,146.0,79.0,4.0,132.0,80.0,4.0,0.0,0.0
2021-12-20,0.0,6.0,47.0,60.0,0.0,106.0,73.0,471.0,0.0,86.0,99.0,0.0,104.0,666.0,4.0,8.0,19.0,407.0,2.0,0.0,240.0,0.0,125.0,54.0,4.0,0.0,0.0
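
The counts in the pivoted table appear as floats because `pivot_table` aggregates with the mean by default. The sketch below is a variation rather than the notebook's own code: it assumes one row per date and state, uses a few values copied from the output above, passes `values` as a plain string so no extra column level is created, and names the aggregation explicitly with `aggfunc="sum"`. The names `long_df` and `wide` are illustrative.

```python
import pandas as pd

# Illustrative long-format data: values copied from the pivoted output
# above for AL and AM on the last two days.
long_df = pd.DataFrame(
    {
        "date": ["2021-12-19", "2021-12-19", "2021-12-20", "2021-12-20"],
        "state": ["AL", "AM", "AL", "AM"],
        "new_confirmed": [3, 77, 6, 47],
    }
)

# values given as a string (not a list) yields single-level columns,
# so no droplevel() call is needed afterwards.
wide = long_df.pivot_table(
    index="date",
    columns="state",
    values="new_confirmed",
    aggfunc="sum",  # explicit choice; the default would be the mean
)

print(wide)
```

With a single row per date and state, the sum and the mean give the same numbers, so the choice mainly documents intent and guards against silently averaging duplicated rows.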


As with the new cases, we also pivot the table to get the new Covid-19 deaths per state:

In [29]:
# Build the pivot table of new confirmed deaths per state
data_new_deaths = data.pivot_table(index=['date'], columns=['state'], values=['new_deaths'])
data_new_deaths.columns = data_new_deaths.columns.droplevel()
data_new_deaths.tail()

date,AC,AL,AM,AP,BA,CE,DF,ES,GO,MA,MG,MS,MT,PA,PB,PE,PI,PR,RJ,RN,RO,RR,RS,SC,SE,SP,TO
2021-12-16,0.0,2.0,3.0,0.0,4.0,4.0,3.0,2.0,0.0,2.0,24.0,0.0,2.0,12.0,0.0,6.0,2.0,4.0,5.0,7.0,4.0,2.0,17.0,9.0,0.0,0.0,0.0
2021-12-17,0.0,2.0,0.0,0.0,7.0,0.0,7.0,13.0,0.0,3.0,19.0,0.0,5.0,6.0,0.0,10.0,6.0,1.0,0.0,2.0,2.0,0.0,12.0,10.0,0.0,259.0,0.0
2021-12-18,0.0,1.0,0.0,0.0,2.0,7.0,0.0,1.0,0.0,2.0,17.0,0.0,0.0,1.0,7.0,5.0,0.0,2.0,30.0,0.0,2.0,0.0,5.0,4.0,1.0,51.0,0.0
2021-12-19,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,2.0,15.0,0.0,0.0,1.0,0.0,4.0,4.0,0.0,7.0,3.0,1.0,0.0,4.0,5.0,0.0,0.0,0.0
2021-12-20,0.0,2.0,0.0,1.0,6.0,10.0,3.0,11.0,0.0,3.0,1.0,0.0,3.0,7.0,2.0,4.0,4.0,2.0,2.0,0.0,7.0,0.0,2.0,2.0,1.0,0.0,0.0
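
Since the forecast targets Santa Catarina and Prophet expects a two-column frame named `ds` (dates) and `y` (values), one possible way to carve that frame out of the wide deaths table is sketched below. This is an assumption about how the follow-up notebook might consume the data, not a copy of it; `deaths_wide` and `sc` are illustrative names, and the values are the `SC` column from the output above.

```python
import pandas as pd

# Illustrative stand-in for data_new_deaths, holding only the SC column
# with the five values shown in the output above.
deaths_wide = pd.DataFrame(
    {"SC": [9.0, 10.0, 4.0, 5.0, 2.0]},
    index=pd.Index(
        ["2021-12-16", "2021-12-17", "2021-12-18", "2021-12-19", "2021-12-20"],
        name="date",
    ),
)

# Pick one state, name the values 'y' and the index 'ds',
# then move the index into a regular column.
sc = deaths_wide["SC"].rename("y").rename_axis("ds").reset_index()

# Prophet works with real datetimes, so convert the date strings.
sc["ds"] = pd.to_datetime(sc["ds"])

print(sc)
```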


With the data ready, let's save it for future analyses:


In [30]:
# Save the cleaned data, ready for per-state analyses
data_new_deaths.to_csv('/content/data_new_deaths.csv', encoding='utf-8', index=True)
data_new_confirmed.to_csv('/content/data_new_confirmed.csv', encoding='utf-8', index=True)
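
Note that `/content/` is the working directory of a Colab session, so the path needs adjusting when running elsewhere. Because the files are written with `index=True`, the dates come back as an ordinary column unless the reader asks otherwise. The following is a minimal round-trip sketch, assuming a temporary directory in place of `/content/` and a two-row stand-in for the deaths table; `wide`, `path`, and `restored` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# Illustrative two-row stand-in for data_new_deaths (values from the
# SC and SP columns of the output shown earlier).
wide = pd.DataFrame(
    {"SC": [5.0, 2.0], "SP": [0.0, 0.0]},
    index=pd.Index(["2021-12-19", "2021-12-20"], name="date"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_new_deaths.csv")

    # Same call shape as the notebook: the index is written as the 'date' column.
    wide.to_csv(path, encoding="utf-8", index=True)

    # Restore 'date' as the index and parse it into datetimes.
    restored = pd.read_csv(path, index_col="date", parse_dates=["date"])

print(restored.index)
```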