# 2019 Novel Coronavirus (SARS-CoV-2) and COVID-19 Unpivoted Data

The following script takes data from the repository of the 2019 Novel Coronavirus Visual Dashboard operated by Johns Hopkins University's Center for Systems Science and Engineering (JHU CSSE). It applies the necessary cleansing and reformatting to make the data usable in traditional relational databases and data visualization tools.

In [None]:
import pandas as pd
import gc
import os
import datetime
import pycountry
import numpy
from copy import deepcopy
import re
from urllib.error import HTTPError

In [None]:
# papermill parameters
output_folder = "../output/"

Data until 22MAR2020 is stored in a cache. This is collated and reshaped data from previous days.

In [None]:
pre_2203_data = pd.read_csv("https://s3-us-west-1.amazonaws.com/starschema.covid/CSSEGISandData_COVID-19_until_0322.csv",keep_default_na=False)
pre_2203_data["Date"] = pd.to_datetime(pre_2203_data["Date"])

Daily reports from and including 23MAR2020 are downloaded from the JHU CSSE GIS and Data GitHub repository.
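
As a rough illustration of how the per-day report URLs are formed, the sketch below expands the same `MM-DD-YYYY.csv` naming pattern for a short, arbitrary date range. The helper name `daily_report_urls` and the constant `TEMPLATE` are hypothetical and not part of the notebook; nothing is downloaded here.

```python
import datetime

# Same URL pattern as the notebook's download template.
TEMPLATE = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/{month:02d}-{day:02d}-{year}.csv"
)


def daily_report_urls(start, end):
    """Return one URL per day from start (inclusive) to end (exclusive)."""
    days = (end - start).days
    return [
        TEMPLATE.format(month=d.month, day=d.day, year=d.year)
        for d in (start + datetime.timedelta(n) for n in range(days))
    ]


# Three consecutive days, chosen only for illustration.
urls = daily_report_urls(datetime.date(2020, 3, 23), datetime.date(2020, 3, 26))
```

Because the end date is exclusive, passing today's date as `end` leaves out the current day's report.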

In [None]:
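def urls(template, dates):
    # Build one download URL per date; `template` needs {month}, {day} and {year} fields.
    return [template.format(month=dt.month, day=dt.day, year=dt.year) for dt in dates]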

In [None]:
def retrieve_and_merge():
    # CI runs start from a later date than regular runs.
    if os.getenv("ENVIRONMENT") == "CI":
        start = datetime.date(year=2020, month=12, day=23)
    else:
        start = datetime.date(year=2020, month=3, day=23)
    # One date per day from the start date up to, but not including, today.
    dates = [start + datetime.timedelta(n) for n in range((datetime.date.today() - start).days)]

    template = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{month:02d}-{day:02d}-{year}.csv"

    frames = []

    for dt in dates:
        try:
            df = pd.read_csv(template.format(year=dt.year,
                                             month=dt.month,
                                             day=dt.day), keep_default_na=False)
            df["Date"] = dt
            frames.append(df)
        except HTTPError:
            print(f"HTTP error for {dt.year}-{dt.month}-{dt.day} acquisition skipped.")
    # DataFrame.append was removed in pandas 2.0, so the daily frames are concatenated once.
    res = pd.concat(frames, ignore_index=True)
    # Unpivot the case-count columns into Case_Type/Cases pairs.
    return res.melt(id_vars=[col for col in res.columns if col not in ["Confirmed", "Deaths", "Recovered", "Active"]],
                    var_name="Case_Type",
                    value_name="Cases").drop(["Last_Update"], axis=1).rename(columns={"Long_": "Long",
                                                                                      "Country_Region": "Country/Region",
                                                                                      "Province_State": "Province/State"})

Active and recovered cases are [no longer reported with confidence](https://github.com/starschema/COVID-19-data/issues/78). However, [due to popular demand](https://github.com/starschema/COVID-19-data/issues/78), these will continue to be reported.

In [None]:
df = retrieve_and_merge()
df["Date"] = pd.to_datetime(df["Date"])
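
The unpivot performed at the end of `retrieve_and_merge` may be easier to picture on a toy frame. The sketch below uses made-up numbers and only two of the four case columns; `wide` and `unpivoted` are illustrative names, not notebook variables.

```python
import pandas as pd

# Toy stand-in for one daily report; the values are made up for illustration.
wide = pd.DataFrame({
    "Country_Region": ["US", "Italy"],
    "Date": pd.to_datetime(["2020-03-23", "2020-03-23"]),
    "Confirmed": [10, 20],
    "Deaths": [1, 2],
})

value_columns = ["Confirmed", "Deaths", "Recovered", "Active"]

# Every column that is not a case count is kept as an identifier;
# each case-count column becomes one row per original record.
unpivoted = wide.melt(
    id_vars=[col for col in wide.columns if col not in value_columns],
    var_name="Case_Type",
    value_name="Cases",
)
```

Each input row yields one output row per case column present, so two rows with two case columns become four rows.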

Add leading zeroes where needed

In [None]:
df.loc[df['FIPS'] != '', 'FIPS'] = df['FIPS'].str.zfill(5)

Fix County-Equivalent Entities

In [None]:
df['FIPS'] = df['FIPS'].replace(r'^(0{3,})(\d{2})$', r'\g<2>\g<1>', regex=True)
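
The combined effect of the two FIPS steps can be checked on plain strings. This is a minimal sketch that applies the same padding and the same regular expression with the standard library; the function name `normalise_fips` is hypothetical.

```python
import re


def normalise_fips(raw):
    """Pad a FIPS string to five digits, then move a bare two-digit code to the front."""
    if raw == "":
        return raw  # blank codes are left untouched
    padded = raw.zfill(5)
    # A value that is only padding zeroes followed by two digits, such as "00072",
    # is rearranged so that the two digits come first: "72000".
    return re.sub(r'^(0{3,})(\d{2})$', r'\g<2>\g<1>', padded)


examples = {code: normalise_fips(code) for code in ["1001", "72", "53071", ""]}
```

A four-digit county code only gains its leading zero, while a two-digit code ends up followed by three zeroes.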

In [None]:
df = df.astype({
    'FIPS': 'object'
})

In [None]:
df['Admin2'] = df['Admin2'].replace(r'(?i)unassigned', 'unassigned', regex=True) 

In [None]:
df = df.rename(columns={"Admin2": "County"})

In [None]:
cldf_us = df.loc[df["Country/Region"] == "US"]
cldf_nonus = df.loc[df["Country/Region"] != "US"]
del df
gc.collect()

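We keep only the rows of the county-level data set whose `Province/State` is a recognised US subdivision. This avoids data quality issues in the JHU inputs, which list 'Recovered' and 'Active' as if they were states. The national 'Recovered' rows are retained with a blank `Province/State`.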

In [None]:
# Take an explicit copy so the assignment below does not write into a view of cldf_us.
cldf_us_recovered = cldf_us[(cldf_us["Province/State"] == 'Recovered') & cldf_us.Case_Type.isin(['Recovered'])].copy()
cldf_us_recovered['Province/State'] = ''

In [None]:
cldf_us = cldf_us[cldf_us["Province/State"].isin([s.name for s in pycountry.subdivisions.get(country_code = "US")])]

In [None]:
cldf_us = pd.concat([cldf_us, cldf_us_recovered], ignore_index=True)

In [None]:
cldf_us[ cldf_us["Cases"] != 0 ]

## Data Quality

We use `pycountry` to resolve geographies.

A number of countries and regions have inconsistent naming or special characters, such as `Taiwan*`. These are normalised through a replacement `dict` with ISO3166-1 compliant names. Data is then aggregated for each division by date and case type.
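
To make the effect concrete, the sketch below applies three entries of the replacement mapping to a toy frame and then aggregates. The case counts are made up, and `replacements`, `raw` and `aggregated` are illustrative names only.

```python
import pandas as pd

# Three of the replacements used in this notebook.
replacements = {
    "Taiwan*": "Taiwan, Province of China",
    "South Korea": "Korea, Republic of",
    "Korea, South": "Korea, Republic of",
}

# Made-up counts: the same country appears under two spellings on one date.
raw = pd.DataFrame({
    "Country/Region": ["South Korea", "Korea, South", "Taiwan*"],
    "Province/State": ["", "", ""],
    "Date": pd.to_datetime(["2020-03-23"] * 3),
    "Case_Type": ["Confirmed"] * 3,
    "Cases": [5, 7, 3],
})

raw["Country/Region"] = raw["Country/Region"].replace(replacements)

# After renaming, rows that now share a key are summed into one record.
aggregated = raw.groupby(
    ["Country/Region", "Province/State", "Date", "Case_Type"], as_index=False
).agg({"Cases": "sum"})
```

The two Korean spellings collapse into a single row, which is why the aggregation has to run after the renaming rather than before it.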

In [None]:
changed_names = {
    "Holy See": "Holy See (Vatican City State)",
    "Vatican City": "Holy See (Vatican City State)",
    "Hong Kong SAR": "Hong Kong",
    "Iran (Islamic Republic of)": "Iran, Islamic Republic of",
    "Iran": "Iran, Islamic Republic of",
    "Macao SAR": "Macao",
    "Macau": "Macao",
    "Republic of Korea": "Korea, Republic of",
    "South Korea": "Korea, Republic of",
    "Korea, South": "Korea, Republic of",
    "Republic of Moldova": "Moldova, Republic of",
    "Russia": "Russian Federation",
    "Saint Martin": "Sint Maarten (Dutch part)",
    "St. Martin": "Sint Maarten (Dutch part)",
    "Taipei and environs": "Taiwan, Province of China",
    "Vietnam": "Viet Nam",
    "occupied Palestinian territory": "Palestine, State of",
    "West Bank and Gaza": "Palestine, State of",
    "Taiwan*": "Taiwan, Province of China",
    "Congo (Brazzaville)": "Congo",
    "Congo (Kinshasa)": "Congo, The Democratic Republic of the",
    "Gambia, The": "Gambia",
    "The Gambia": "Gambia",
    "Tanzania": "Tanzania, United Republic of",
    "US": "United States",
    "Curacao": "Curaçao",
    "Brunei": "Brunei Darussalam",
    "Cote d'Ivoire": "Côte d'Ivoire",
    "Moldova": "Moldova, Republic of",
    "The Bahamas": "Bahamas",
    "Venezuela": "Venezuela, Bolivarian Republic of",
    "Bolivia": "Bolivia, Plurinational State of",
    "East Timor": "Timor-Leste",
    "Cape Verde": "Cabo Verde",
    "US": "United States",
    "Laos": "Lao People's Democratic Republic",
    "Burma": "Myanmar"
}

def normalize_names(df):
    # Work on a copy so the caller's frame (a slice of the original) is not mutated.
    df = df.copy()
    df["Country/Region"] = df["Country/Region"].replace(changed_names)
    df["Cases"] = df["Cases"].replace('', 0).astype(int)

    return df.groupby(by=["Country/Region", "Province/State", "Date", "Case_Type"], as_index=False).agg({"Cases": "sum", "Long": "first", "Lat": "first"})

In [None]:
cldf_nonus = normalize_names(cldf_nonus)

In [None]:
cldf_us["Country/Region"] = "United States"

## Normalize cruise ship names

In [None]:
cldf_nonus.loc[cldf_nonus["Country/Region"] == "Diamond Princess", "Province/State"] = "Diamond Princess"

In [None]:
cldf_nonus.loc[cldf_nonus["Country/Region"] == "Diamond Princess", "Country/Region"] = "Cruise Ship"

## Adding ISO3166-1 and ISO3166-2 identifiers

To facilitate easy recognition, ISO3166-1 identifiers are added to all countries and ISO3166-2 identifiers are added where appropriate. This is the case where subregional data exists:

* Australia
* Canada
* France (`France` for metropolitan France, separate regions for DOM/TOMs)
* PRC
* US
* UK (the `UK` province identifier encompasses only Great Britain and Northern Ireland; other dependencies reporting to the UK authorities are separate subdivisions)
* The Kingdom of the Netherlands (`Netherlands` encompasses the constituent country of the Netherlands, and the other constituent countries register cases as separate provinces of the Kingdom of the Netherlands)
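
The two-level lookup described above can be sketched with plain dictionaries. In this sketch, `COUNTRY_CODES` is a hand-written stand-in for the `pycountry` lookup and `SUBDIVISIONS` holds only a few entries; both names, and the function `resolve`, are hypothetical.

```python
# Hand-written stand-ins for the country lookup and the subdivision tables.
COUNTRY_CODES = {"Netherlands": "NL", "China": "CN", "Hungary": "HU"}
SUBDIVISIONS = {
    "NL": {"Netherlands": "NL", "Aruba": "AW", "Curacao": "CW"},
    "CN": {"Hubei": "CN-HB"},
}


def resolve(country, province):
    """Return (ISO3166-1, ISO3166-2) for a country/province pair."""
    # Unknown names, including cruise ships, get an empty country code.
    iso1 = COUNTRY_CODES.get(country, "")
    if iso1 in SUBDIVISIONS:
        # None when the province is not in that country's table.
        iso2 = SUBDIVISIONS[iso1].get(province)
    else:
        # Countries without subregional data get an empty subdivision code.
        iso2 = ""
    return iso1, iso2


pairs = [resolve("Netherlands", "Aruba"), resolve("Hungary", "")]
```

Note the difference between the two "missing" cases: a country without a subdivision table yields an empty string, whereas an unknown province inside a listed country yields `None`.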

In [None]:
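def resolve_iso3166_1(row):
    # Default to an empty code; cruise ships and unresolved names have no ISO3166-1 entry.
    row["ISO3166-1"] = ""
    # Compare by value: `is not` tests object identity and is unreliable for strings.
    if row["Country/Region"] != "Cruise Ship":
        country = pycountry.countries.get(name=row["Country/Region"])
        if country:
            row["ISO3166-1"] = country.alpha_2
    return row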

In [None]:
cldf_nonus = cldf_nonus.apply(resolve_iso3166_1, axis=1)
cldf_us["ISO3166-1"] = "US"

We then encode level 2 IDs:

In [None]:
fr_subdivisions = {"France": "FR",
                   "French Guiana": "GF",
                   "French Polynesia": "PF",
                   "Guadeloupe": "GUA",
                   "Mayotte": "YT",
                   "Reunion": "RE",
                   "Saint Barthelemy": "BL",
                   "St Martin": "MF"}

nl_subdivisions = {"Netherlands": "NL",
                   "Aruba": "AW",
                   "Curacao": "CW"}

cn_subdivisions = {'Jilin': 'CN-JL',
 'Xizang': 'CN-XZ',
 'Anhui': 'CN-AH',
 'Jiangsu': 'CN-JS',
 'Yunnan': 'CN-YN',
 'Beijing': 'CN-BJ',
 'Jiangxi': 'CN-JX',
 'Zhejiang': 'CN-ZJ',
 'Chongqing': 'CN-CQ',
 'Liaoning': 'CN-LN',
 'Fujian': 'CN-FJ',
 'Guangdong': 'CN-GD',
 'Inner Mongolia': 'CN-NM',
 'Gansu': 'CN-GS',
 'Ningxia': 'CN-NX',
 'Guangxi': 'CN-GX',
 'Qinghai': 'CN-QH',
 'Guizhou': 'CN-GZ',
 'Sichuan': 'CN-SC',
 'Henan': 'CN-HA',
 'Shandong': 'CN-SD',
 'Hubei': 'CN-HB',
 'Shanghai': 'CN-SH',
 'Hebei': 'CN-HE',
 'Shaanxi': 'CN-SN',
 'Hainan': 'CN-HI',
 'Shanxi': 'CN-SX',
 'Tianjin': 'CN-TJ',
 'Heilongjiang': 'CN-HL',
 'Hunan': 'CN-HN',
 'Xinjiang': 'CN-XJ',
 'Tibet': "CN-XZ"}

uk_subdivisions = {"United Kingdom": "UK",
                   "Cayman Islands": "KY",
                   "Channel Islands": "CHA",
                   "Gibraltar": "GI",
                   "Montserrat": "MS"}

subdivisions = {
    "AU": {subdivision.name: subdivision.code.replace("AU-", "") for subdivision in pycountry.subdivisions.get(country_code="AU")},
    "CA": {subdivision.name: subdivision.code.replace("CA-", "") for subdivision in pycountry.subdivisions.get(country_code="CA")},
    "US": {subdivision.name: subdivision.code.replace("US-", "") for subdivision in pycountry.subdivisions.get(country_code="US")},
    "GB": uk_subdivisions,
    "CN": cn_subdivisions,
    "NL": nl_subdivisions,
    "FR": fr_subdivisions
}

In [None]:
countries_with_subdivisions = list(subdivisions.keys())

def resolve_iso3166_2(row):
    if row["ISO3166-1"] in countries_with_subdivisions:
        row["ISO3166-2"] = subdivisions[row["ISO3166-1"]].get(row["Province/State"])
    else:
        row["ISO3166-2"] = ""
    return row

In [None]:
cldf_us = cldf_us.apply(resolve_iso3166_2, axis=1)
cldf_nonus = cldf_nonus.apply(resolve_iso3166_2, axis=1)

## Fixing county name inconsistencies

See [Issue #128](https://github.com/starschema/COVID-19-data/issues/128#issue-590293662) and [Issue #145](https://github.com/starschema/COVID-19-data/issues/145) for details.

In [None]:
county_remappings = {
    "Walla Walla County": "Walla Walla",
    "Doña Ana": "Dona Ana",
    "Elko County": "Elko",
    "Washington County": "Washington"
}

In [None]:
cldf_us["County"] = cldf_us["County"].replace(county_remappings)

In [None]:
fips_mapping = pd.read_csv("https://s3-us-west-1.amazonaws.com/starschema.covid/US_County_FIPS_Mapping.csv", 
                           index_col=["ISO3166_2","COUNTY"])

def add_missing_fips(row):
    if row["FIPS"] == "" or row["Lat"] == "" or row["Long"] == "":
        if (row['ISO3166-2'], row["County"]) in fips_mapping.index:
            row["FIPS"] =  fips_mapping.loc[row['ISO3166-2'], row["County"]]["FIPS"]
            row["Lat"] =  fips_mapping.loc[row['ISO3166-2'], row["County"]]["LATITUDE"]
            row["Long"] =  fips_mapping.loc[row['ISO3166-2'], row["County"]]["LONGITUDE"]
    return row
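
The membership test and lookup used in `add_missing_fips` rely on a two-level index. The sketch below shows the same pattern on a tiny hand-built table; the table contents are illustrative only, and `mapping` and `lookup_fips` are hypothetical names.

```python
import pandas as pd

# Tiny stand-in for the county mapping file, indexed by (state code, county name).
mapping = pd.DataFrame({
    "ISO3166_2": ["NM", "WA"],
    "COUNTY": ["Dona Ana", "Walla Walla"],
    "FIPS": ["35013", "53071"],
}).set_index(["ISO3166_2", "COUNTY"])


def lookup_fips(state, county):
    """Return the FIPS code for a (state, county) pair, or "" when it is not listed."""
    key = (state, county)
    if key in mapping.index:
        return mapping.loc[key, "FIPS"]
    return ""


found = lookup_fips("WA", "Walla Walla")
missing = lookup_fips("TX", "Nowhere")
```

Checking `key in mapping.index` first avoids a `KeyError` for counties that the mapping does not cover.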



In [None]:
cldf_us = cldf_us.apply(add_missing_fips, axis=1)

In [None]:
cldf_us["Lat"] = pd.to_numeric(cldf_us["Lat"])
cldf_us["Long"] = pd.to_numeric(cldf_us["Long"])

In [None]:
# State-level rows (blank County) are kept as-is; county rows are aggregated per key.
cldf_us = pd.concat([
    cldf_us[cldf_us["County"] == ""],
    cldf_us[cldf_us["County"] != ""].groupby([
        "County", "Province/State", "Country/Region",
        "Date", "Case_Type", "ISO3166-1", "ISO3166-2"
    ]).agg({
        "Cases": "sum",
        "Lat": "mean",
        "Long": "mean",
        "FIPS": "first"
    }).reset_index()
], sort=True)

In [None]:
# Unassigned cases have no meaningful county code or coordinates.
cldf_us.loc[cldf_us['County'] == 'unassigned', ['FIPS', 'Lat', 'Long']] = numpy.nan

## Calculating case changes

In [None]:
cldf_nonus = cldf_nonus.sort_values(by=["Country/Region", "Province/State", "Case_Type", "Date"], ascending=True)
cldf_us = cldf_us.sort_values(by=["Country/Region", "Province/State", "County", "Case_Type", "Date"], ascending=True)

In [None]:
cldf_nonus = pd.concat([pre_2203_data[pre_2203_data["Country/Region"] != "US"], cldf_nonus], sort=True)
cldf_nonus["Difference"] = cldf_nonus["Cases"] - cldf_nonus.groupby(["Country/Region", "Province/State", "Case_Type"])["Cases"].shift(periods=1)

In [None]:
cldf_us["Difference"] = pd.to_numeric(cldf_us["Cases"],errors='coerce') - pd.to_numeric(cldf_us.groupby(["Country/Region", "Province/State", "County", "Case_Type"])["Cases"].shift(periods=1),errors='coerce')
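
The day-over-day change is the current value minus the previous value within the same group. A minimal sketch with made-up counts follows; `cases` and `previous` are illustrative names.

```python
import pandas as pd

# Made-up cumulative counts for two regions, sorted by group and date.
cases = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B"],
    "Case_Type": ["Confirmed"] * 5,
    "Date": pd.to_datetime(
        ["2020-03-23", "2020-03-24", "2020-03-25", "2020-03-23", "2020-03-24"]
    ),
    "Cases": [1, 4, 9, 10, 10],
}).sort_values(["Country/Region", "Case_Type", "Date"])

# shift(1) within each group yields the previous day's count,
# and NaN for the first record of every group.
previous = cases.groupby(["Country/Region", "Case_Type"])["Cases"].shift(periods=1)
cases["Difference"] = cases["Cases"] - previous
```

The frame must be sorted by date within each group before shifting, otherwise "previous row" does not mean "previous day". The first record of each group has no predecessor, so its difference is missing.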

In [None]:
result = pd.concat([cldf_nonus, cldf_us])
result['County'] = result['County'].replace('', numpy.nan)
result['FIPS'] = result['FIPS'].replace('', numpy.nan)
result['Lat'] = result['Lat'].replace(r'^0?$', numpy.nan, regex=True)
result['Lat'] = result['Lat'].replace(0, numpy.nan)
result['Long'] = result['Long'].replace(r'^0?$', numpy.nan, regex=True)
result['Long'] = result['Long'].replace(0, numpy.nan)
result['Province/State'] = result['Province/State'].replace('', numpy.nan)

In [None]:
result.loc[(result["Date"] == "2020-01-22") & (result["Country/Region"] == "United States"), "Difference"] = result[(result["Date"] == "2020-01-22") & (result["Country/Region"] == "United States")]["Cases"]

Drop all records with 0 cases and a 0 difference. We do not need records that precede the first event of a given `Case_Type`.

In [None]:
result = result[ ~(result.Cases.eq(0) & result.Difference.eq(0))]
result.groupby(["Date","Case_Type"]).sum(numeric_only=True)
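
The filter removes a row only when both values are exactly zero. The sketch below, with made-up values, shows which rows survive; `records` and `kept` are illustrative names.

```python
import pandas as pd

# Made-up rows covering the four interesting combinations.
records = pd.DataFrame({
    "Cases": [0, 0, 5, 5],
    "Difference": [0.0, float("nan"), 5.0, 0.0],
})

# eq(0) is False for NaN, so a zero-case row with a missing difference is kept.
kept = records[~(records["Cases"].eq(0) & records["Difference"].eq(0))]
```

Only the first row is dropped. A zero-case row whose difference is missing, such as the first record of a group, passes through the filter.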

## Set Last_Reported_Flag

In [None]:
result['Last_Reported_Flag'] = result["Date"].max() == result["Date"]

## Adding timestamp

Before we save the file locally, we add the `Last_Update_Date` in `UTC` time zone.

In [None]:
result["Last_Update_Date"] = datetime.datetime.utcnow()

## Output

Finally, we store the output in the `output` folder as `JHU_COVID-19.csv`, an unindexed CSV file.

In [None]:
result.to_csv(output_folder + "JHU_COVID-19.csv", index=False, columns=["Country/Region",
                                                                          "Province/State",
                                                                          "County",
                                                                          "FIPS",
                                                                          "Date",
                                                                          "Case_Type",
                                                                          "Cases",
                                                                          "Long",
                                                                          "Lat", 
                                                                          "ISO3166-1",
                                                                          "ISO3166-2",
                                                                          "Difference",
                                                                          "Last_Update_Date",
                                                                          "Last_Reported_Flag"
                                                                       ])