# 2019 Novel Coronavirus COVID-19 (2019-nCoV) Unpivoted Data

The following script takes data from the repository of the 2019 Novel Coronavirus Visual Dashboard operated by 
the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It applies the cleansing and 
reformatting needed to make the data usable in traditional relational databases and data visualization tools.


In [303]:
import pandas as pd
import pygsheets as gsheet  # the Google Sheets client is published as "pygsheets"
from datetime import datetime


The data is downloaded directly from the Johns Hopkins GitHub repository at https://github.com/CSSEGISandData/COVID-19. The repository provides three separate CSV files for `confirmed`, `deaths` and `recovered` data.

In [309]:
confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv",keep_default_na=False)
deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv",keep_default_na=False)
recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv",keep_default_na=False)

confirmed['Case_Type'] = 'Confirmed'
deaths['Case_Type'] = 'Deaths'
recovered['Case_Type'] = 'Recovered'

key_columns = ['Country/Region','Province/State','Lat','Long','Case_Type']

data = [confirmed, deaths, recovered]
    
#list(map( lambda df: len(df.index), data))

The original dataset stores the number of `Cases` for each day in a separate column. 
This wide layout is awkward for reporting, so we move the date columns to rows:

In [310]:
def unpivot(df):
    # unpivot all non-key columns
    melted = df.melt(id_vars=key_columns, var_name='Date', value_name='Cases')
    # change our new Date field to Date type
    melted['Date'] = pd.to_datetime(melted['Date'])

    return melted

unpivoted_data = list(map(unpivot, data))

#unpivoted_data[0]["Date"].describe()
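
To make the reshaping concrete, here is a minimal sketch on a made-up two-day sample. The frame `wide_sample`, its values and the explicit `format` argument are assumptions for illustration only; they are not taken from the JHU files.

```python
import pandas as pd

# Hypothetical sample in the same wide layout: one column per day.
wide_sample = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Province/State": ["", "X"],
    "Lat": [1.0, 2.0],
    "Long": [3.0, 4.0],
    "Case_Type": ["Confirmed", "Confirmed"],
    "1/22/20": [1, 0],
    "1/23/20": [3, 2],
})

sample_keys = ["Country/Region", "Province/State", "Lat", "Long", "Case_Type"]

# Every column that is not a key becomes a (Date, Cases) pair on its own row.
long_sample = wide_sample.melt(id_vars=sample_keys, var_name="Date", value_name="Cases")
# Assumption: the date headers follow month/day/two-digit-year.
long_sample["Date"] = pd.to_datetime(long_sample["Date"], format="%m/%d/%y")

print(long_sample[["Country/Region", "Date", "Cases"]])
```

Two regions with two date columns each yield four rows, one per region and day.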

We sort the data by the primary keys and `Date` so that a `Difference` column can be added easily. 

Because `Cases` holds cumulative snapshots (running totals), we introduce a new column called `Difference` that shows the change since the previous day.

In [311]:
sorted_data = list( map(lambda df: df.sort_values(by=key_columns + ['Date'], ascending=True), unpivoted_data) )

#sorted_data[0].tail(5)

`Difference` is today's `Cases` minus yesterday's `Cases` for each region/state.

In [312]:
for df in sorted_data:
    df["Difference"] = df["Cases"] - df.groupby( key_columns )["Cases"].shift(1, fill_value = 0) 

concated = pd.concat(sorted_data)

#concated.tail(5)
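
The per-group subtraction can be checked on a small made-up frame. This is only a sketch: `toy` and its single `Region` key are hypothetical stand-ins for the real key columns.

```python
import pandas as pd

# Hypothetical running totals for two regions over three days.
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "Cases": [1, 3, 6, 0, 2, 2],
}).sort_values(["Region", "Date"])

# shift(1) moves each region's series down by one row, so every row sees
# the previous day's total; fill_value=0 covers the first day of a region.
previous_day = toy.groupby("Region")["Cases"].shift(1, fill_value=0)
toy["Difference"] = toy["Cases"] - previous_day

print(toy)
```

Because the shift happens inside each group, the first row of region `B` is compared with 0 rather than with the last row of region `A`.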

We also want to show the number of active cases. In our definition, `Active` is calculated as:

```
Active = Confirmed - Deaths - Recovered
```

As a first step, we merge the different types of cases into a single row for each `Country/Province/Date` key:

In [313]:
confirmed = concated[concated["Case_Type"].eq("Confirmed")]
deaths = concated[concated["Case_Type"].eq("Deaths")]
recovered = concated[concated["Case_Type"].eq("Recovered")]

active = confirmed.merge(deaths, validate= "one_to_one", suffixes =["","_d"], on=["Country/Region","Province/State","Date"]) \
         .merge(recovered, validate= "one_to_one", suffixes =["","_r"], on=["Country/Region","Province/State","Date"])

#active.head()

Then apply the calculation to both `Cases` and `Difference`:

In [314]:
active["Case_Type"] = 'Active'
active["Cases"] = active["Cases"] - active["Cases_r"] - active["Cases_d"]
active["Difference"] = active["Difference"] - active["Difference_r"] - active["Difference_d"]

#active.tail()
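
The merge and the `Active` formula can be traced on a tiny made-up example. All frames and the `Region` key below are hypothetical; the suffix convention (`_d`, `_r`) mirrors the one used in the cells above.

```python
import pandas as pd

join_keys = ["Region", "Date"]
days = ["2020-01-22", "2020-01-23"]

# Hypothetical cumulative counts for a single region.
confirmed_toy = pd.DataFrame({"Region": ["A", "A"], "Date": days, "Cases": [10, 15]})
deaths_toy = pd.DataFrame({"Region": ["A", "A"], "Date": days, "Cases": [1, 2]})
recovered_toy = pd.DataFrame({"Region": ["A", "A"], "Date": days, "Cases": [0, 4]})

# An empty left suffix keeps the confirmed column named "Cases";
# validate="one_to_one" raises if a key appears twice on either side.
active_toy = (
    confirmed_toy
    .merge(deaths_toy, on=join_keys, suffixes=("", "_d"), validate="one_to_one")
    .merge(recovered_toy, on=join_keys, suffixes=("", "_r"), validate="one_to_one")
)

# Active = Confirmed - Deaths - Recovered
active_toy["Active"] = active_toy["Cases"] - active_toy["Cases_d"] - active_toy["Cases_r"]

print(active_toy)
```

On the first day this gives 10 - 1 - 0 = 9 active cases, and on the second day 15 - 2 - 4 = 9.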

Then merge the `Active` dataset with the original one. 

In [317]:
data = pd.concat([concated,active], join="inner")

#data["Case_Type"].unique()
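
A short sketch of why `join="inner"` is useful here: it keeps only the columns that both frames share, so helper columns such as `Cases_d` and `Cases_r` do not leak into the result. The frames `base` and `extra` are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: "extra" carries helper columns that "base" lacks.
base = pd.DataFrame({"Case_Type": ["Confirmed"], "Cases": [10]})
extra = pd.DataFrame({
    "Case_Type": ["Active"],
    "Cases": [9],
    "Cases_d": [1],
    "Cases_r": [0],
})

# join="inner" intersects the column sets instead of taking their union.
combined = pd.concat([base, extra], join="inner")

print(combined)
```

With the default `join="outer"`, the helper columns would be kept and filled with `NaN` for the rows coming from `base`.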

Before saving the file locally, we add a `Last_Update_Date` column holding the current time in UTC.

In [318]:
data["Last_Update_Date"] = datetime.utcnow()
data.to_csv("./JHU_COVID-19_active.csv", index=False)
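
Note that `datetime.utcnow()` returns a naive timestamp (no time zone attached) and is deprecated as of Python 3.12. If an explicit UTC marker is wanted in the output, one possible alternative is sketched below; the frame `export_sample` is a hypothetical stand-in for `data`.

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical stand-in for the final dataset.
export_sample = pd.DataFrame({"Cases": [1, 2]})

# A timezone-aware timestamp carries its UTC offset with it.
stamp = datetime.now(timezone.utc)
export_sample["Last_Update_Date"] = stamp

print(export_sample)
```

Written to CSV, such a value includes the `+00:00` offset, which may or may not suit the downstream database.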