# Process Italian Covid-19 Data

v2 20200316

Addressing review comments:
* *Diff should be the difference since yesterday in case the total case is a running sum. I do not know if this data is a daily snapshot or a running sum. But the goal is to have columns where we can see for instance in case of "Recovered" how many reported recoveries happened since yesterday.* --> This notebook calculates DIFFs as today's snapshot data minus yesterday's snapshot data for Hospitalized, Intensive Care, Total Hospitalized (sum of Hospitalized and Intensive Care), Home Isolation, Total Positive, Discharged Healed, Deceased, Total Cases and Tested. New Positive is not calculated here because the source data already provides it as the change since yesterday. v2 also corrects the total cases vs. active cases issue.
* *please call the workbook as PCM_DPS_COVID19 to know who is the data provider* --> notebook renamed to PCM_DPS_COVID19
* *we move S3 upload to a different place (out from the notebook).* --> S3 upload removed (changed to markup)
* *Please output PCM_DPS_COVID19.csv as output (same as the basename of the notebook)* --> output file renamed (easily configurable in Parameters section)


v1 20200313

* Load latest Covid-19 data from [https://github.com/pcm-dpc/COVID-19](https://github.com/pcm-dpc/COVID-19)
* Transform for easy reporting (calculate day-to-day changes, rename columns)
* Create summary file, similar to international data
* Upload to S3 bucket


## Imports

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import os

import boto3
from botocore.exceptions import ClientError

In [None]:
# papermill parameters
output_folder = '../output/'

## Parameters

In [None]:
INPUT_FILE = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv'
OUTPUT_FILE_FULL = 'PCM_DPS_COVID19-DETAILS.csv'
OUTPUT_FILE_SUMMARY = 'PCM_DPS_COVID19.csv'

## Input data

In [None]:
data_ita = pd.read_csv(INPUT_FILE)

In [None]:
# data_ita.columns: 
# ['data', 'stato', 'codice_regione', 'denominazione_regione', 'lat', 'long', 'ricoverati_con_sintomi', 'terapia_intensiva', 'totale_ospedalizzati', 
# 'isolamento_domiciliare', 'totale_attualmente_positivi', 'nuovi_attualmente_positivi', 'dimessi_guariti', 'deceduti', 'totale_casi', 'tamponi']

data_ita.columns = ['Date', 'State', 'Region_Code', 'Region', 'Lat', 'Long', 
                    'Hospitalized', 'Intensive_Care', 'Total_Hospitalized', 
                    'Home_Isolation', 'Total_Positive', 'New_Positive', 
                    'Discharged_Healed', 'Deceased', 'Total_Cases', 'Tested']
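
Assigning `data_ita.columns` by position only works while the source file has exactly these 16 columns in this order. A minimal sketch of a name-based alternative, assuming the Italian column names listed in the comment above; `COLUMN_MAP`, `rename_columns` and the toy frame are illustrative names and data, not part of the notebook:

```python
import pandas as pd

# Hypothetical mapping built from the column names listed in the comment above.
COLUMN_MAP = {
    'data': 'Date', 'stato': 'State', 'codice_regione': 'Region_Code',
    'denominazione_regione': 'Region', 'lat': 'Lat', 'long': 'Long',
    'ricoverati_con_sintomi': 'Hospitalized', 'terapia_intensiva': 'Intensive_Care',
    'totale_ospedalizzati': 'Total_Hospitalized', 'isolamento_domiciliare': 'Home_Isolation',
    'totale_attualmente_positivi': 'Total_Positive', 'nuovi_attualmente_positivi': 'New_Positive',
    'dimessi_guariti': 'Discharged_Healed', 'deceduti': 'Deceased',
    'totale_casi': 'Total_Cases', 'tamponi': 'Tested',
}

def rename_columns(df):
    """Rename known columns by name and fail loudly if any expected column is missing."""
    missing = [c for c in COLUMN_MAP if c not in df.columns]
    if missing:
        raise KeyError(f'missing expected columns: {missing}')
    return df.rename(columns=COLUMN_MAP)

# Toy frame with the expected columns plus one unexpected extra column.
toy = pd.DataFrame([[0] * (len(COLUMN_MAP) + 1)], columns=list(COLUMN_MAP) + ['extra'])
renamed = rename_columns(toy)
print(list(renamed.columns))
```

With this approach an added column passes through under its original name, while a removed or renamed column raises an error instead of silently shifting the labels.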


In [None]:
data_ita.info()

In [None]:
# number of regions, number of dates
r = data_ita.Region.nunique()
d = data_ita.Date.nunique()
r, d, r*d
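
The product `r*d` is informative mainly when compared with the row count: if every region reports exactly once per date, the number of rows equals `r*d`. A small sketch of that check on made-up data; the `is_complete_panel` helper is hypothetical and not part of the notebook:

```python
import pandas as pd

def is_complete_panel(df, region_col='Region', date_col='Date'):
    """Return True if there is exactly one row per (region, date) combination."""
    n_regions = df[region_col].nunique()
    n_dates = df[date_col].nunique()
    no_duplicates = not df.duplicated([region_col, date_col]).any()
    return no_duplicates and len(df) == n_regions * n_dates

# Made-up example: two regions, two dates; the second frame is missing one row.
full = pd.DataFrame({
    'Region': ['Lazio', 'Lazio', 'Veneto', 'Veneto'],
    'Date': ['2020-03-15', '2020-03-16', '2020-03-15', '2020-03-16'],
})
partial = full.iloc[:3]

print(is_complete_panel(full))     # True
print(is_complete_panel(partial))  # False
```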

## Transform data

In [None]:
# parse the report timestamp and drop the time of day ('D' is the day frequency alias)
data_ita.Date = pd.to_datetime(data_ita.Date).dt.floor('D')
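
A short illustration of what this step does, assuming the raw values carry a time of day (as the flooring suggests). The timestamps below are made up, and `dt.normalize()` is used as an equivalent way to set the time to midnight for timezone-naive values:

```python
import pandas as pd

# Made-up timestamps in the style of a daily report published in the evening.
raw = pd.Series(['2020-03-15 17:00:00', '2020-03-16 17:00:00'])

# Parse to datetime64 and drop the time of day, keeping the datetime dtype.
dates = pd.to_datetime(raw).dt.normalize()
print(dates.dt.strftime('%Y-%m-%d').tolist())  # ['2020-03-15', '2020-03-16']
```

Keeping a datetime dtype, rather than converting to strings, means the later sort by `Date` is chronological.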


In [None]:
# calculate day-to-day changes for all figures
# (except New_Positive, which the source already provides as a daily change)
data_ita = data_ita.sort_values(by=['Region_Code', 'Date'])
diff_columns = ['Hospitalized', 'Intensive_Care', 'Total_Hospitalized',
                'Home_Isolation', 'Total_Positive', 'Discharged_Healed',
                'Deceased', 'Total_Cases', 'Tested']
for col in diff_columns:
    # the first date of each region has no previous day, so its change is set to 0
    data_ita[col + '_Since_Prev_Day'] = data_ita.groupby('Region_Code')[col].diff().fillna(0).astype(int)
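
To make the day-to-day calculation concrete, here is a minimal sketch on made-up cumulative counts. It assumes, as the notebook does, that each `Region_Code` identifies exactly one region; if two regions shared a code, their rows would be interleaved and the differences would be wrong.

```python
import pandas as pd

# Made-up cumulative counts for two regions over three days.
toy = pd.DataFrame({
    'Region_Code': [1, 1, 1, 2, 2, 2],
    'Date': pd.to_datetime(['2020-03-14', '2020-03-15', '2020-03-16'] * 2),
    'Deceased': [1, 4, 9, 0, 2, 2],
})

toy = toy.sort_values(by=['Region_Code', 'Date'])
# diff() restarts within each region; the first day has no predecessor and becomes 0.
toy['Deceased_Since_Prev_Day'] = (
    toy.groupby('Region_Code')['Deceased'].diff().fillna(0).astype(int)
)
print(toy['Deceased_Since_Prev_Day'].tolist())  # [0, 3, 5, 0, 2, 0]
```

Note that `fillna(0)` also turns a missing value in the source into a zero change, which may or may not be the desired reporting behavior.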


In [None]:
data_ita.tail(10)

In [None]:
# os.path.join also works when output_folder is passed without a trailing slash
data_ita.to_csv(os.path.join(output_folder, OUTPUT_FILE_FULL), index=False)

In [None]:
data_ita.info()

In [None]:
columns_summary = ['Country/Region', 'Province/State', 'Date', 'Cases', 'Long', 'Lat', 'Difference']

data_ita_confirmed = data_ita[['State', 'Region', 'Date', 'Total_Cases' , 'Long', 'Lat', 'Total_Cases_Since_Prev_Day']].copy()
data_ita_confirmed.columns = columns_summary
data_ita_confirmed['Case_Type'] = 'Confirmed'

data_ita_deceased = data_ita[['State', 'Region', 'Date', 'Deceased' , 'Long', 'Lat', 'Deceased_Since_Prev_Day']].copy()
data_ita_deceased.columns = columns_summary
data_ita_deceased['Case_Type'] = 'Deceased'

data_ita_recovered = data_ita[['State', 'Region', 'Date', 'Discharged_Healed' , 'Long', 'Lat', 'Discharged_Healed_Since_Prev_Day']].copy()
data_ita_recovered.columns = columns_summary
data_ita_recovered['Case_Type'] = 'Recovered'

data_ita_active = data_ita[['State', 'Region', 'Date', 'Total_Positive' , 'Long', 'Lat', 'Total_Positive_Since_Prev_Day']].copy()
data_ita_active.columns = columns_summary
data_ita_active['Case_Type'] = 'Active'

In [None]:
data_ita_summary = pd.concat([data_ita_confirmed, data_ita_deceased, data_ita_recovered, data_ita_active], ignore_index = True)
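
The four case-type blocks above follow one pattern: select the same seven columns, rename them to the summary layout, and tag the rows with a `Case_Type`. A sketch of the same idea driven by a mapping, run on a made-up one-row frame; `CASE_TYPES`, `SUMMARY_COLUMNS` and `build_summary` are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Hypothetical mapping: Case_Type label -> source column in the detailed frame.
CASE_TYPES = {
    'Confirmed': 'Total_Cases',
    'Deceased': 'Deceased',
    'Recovered': 'Discharged_Healed',
    'Active': 'Total_Positive',
}
SUMMARY_COLUMNS = ['Country/Region', 'Province/State', 'Date', 'Cases', 'Long', 'Lat', 'Difference']

def build_summary(df):
    """Stack one block of rows per case type into a single long-format frame."""
    parts = []
    for case_type, col in CASE_TYPES.items():
        part = df[['State', 'Region', 'Date', col, 'Long', 'Lat', col + '_Since_Prev_Day']].copy()
        part.columns = SUMMARY_COLUMNS
        part['Case_Type'] = case_type
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Made-up single-row detailed frame.
toy = pd.DataFrame({
    'State': ['ITA'], 'Region': ['Lazio'], 'Date': pd.to_datetime(['2020-03-16']),
    'Long': [12.5], 'Lat': [41.9],
    'Total_Cases': [10], 'Total_Cases_Since_Prev_Day': [3],
    'Deceased': [1], 'Deceased_Since_Prev_Day': [0],
    'Discharged_Healed': [2], 'Discharged_Healed_Since_Prev_Day': [1],
    'Total_Positive': [7], 'Total_Positive_Since_Prev_Day': [2],
})
summary = build_summary(toy)
print(summary[['Case_Type', 'Cases', 'Difference']])
```

Each input row yields four output rows, one per case type, so the summary has four times as many rows as the detailed frame.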

In [None]:
# os.path.join also works when output_folder is passed without a trailing slash
data_ita_summary.to_csv(os.path.join(output_folder, OUTPUT_FILE_SUMMARY), index=False)