Load covid cases per population group from CDC (https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf)
The data list all individual cases (de-identified), with information on sex, age group, race and ethnicity, and date of case, plus some additional information (hospitalisation, intensive care, death ...).
As the dataset is huge (20.6 million rows as of mid-March), and since there are many identical rows (because of de-identification), we query the data through the API, grouped by the categories we are interested in. The downloaded dataset thus remains relatively small.

In [1]:
#!pip install sodapy

In [1]:
import pandas as pd
from sodapy import Socrata
import configparser
import json
import os
from datetime import datetime

Without an application token, the size of the downloaded data is limited (see https://dev.socrata.com/docs/app-tokens.html). 

CDC uses the [Socrata Open Data API](https://dev.socrata.com/) to manage access to its data. To increase download capabilities, you have to create an AppToken.

First, [create a Socrata account](https://data.cdc.gov/signup). 

Then, [sign in](https://data.cdc.gov/login) to Socrata, using the Socrata ID field. Go to 'My Profile', then 'Edit profile', then tab 'Developer settings' (https://data.cdc.gov/profile/edit/developer_settings). Create an AppToken by following [this guide](https://support.socrata.com/hc/en-us/articles/210138558-Generating-an-App-Token).

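Then store the AppToken in the config file `capstone.cfg`. The notebook also reads the local output directory from the same file, under `[PATH]`:
```
[CDC]
APPTOKEN=<MyAppToken>

[PATH]
LOCAL_DATA_DIR=<MyLocalDataDir>
```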
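
A minimal sketch of reading such a file and failing early when a key is missing. It assumes the file carries both the `[CDC]` section above and the `[PATH]` section with `LOCAL_DATA_DIR` that the notebook reads further down; the helper `load_settings` and the example values are illustrative only.

```python
import configparser

# Placeholder content: the token and directory are illustrative values,
# not real credentials or paths.
EXAMPLE_CFG = """
[CDC]
APPTOKEN = my-app-token

[PATH]
LOCAL_DATA_DIR = ./data
"""

# Keys the notebook reads from the config file.
REQUIRED_KEYS = {"CDC": ["APPTOKEN"], "PATH": ["LOCAL_DATA_DIR"]}


def load_settings(cfg_text):
    """Parse config text and return the required values as a flat dict."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    settings = {}
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            # has_option is False when either the section or the key is absent.
            if not config.has_option(section, key):
                raise KeyError(f"missing [{section}] {key} in config")
            settings[key] = config[section][key]
    return settings


settings = load_settings(EXAMPLE_CFG)
print(settings)
```

With a real file, `config.read("capstone.cfg")` would replace `read_string`; checking the keys up front avoids a failure only after the (slow) download has finished.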


## Connect to the CDC data server

In [2]:
config = configparser.ConfigParser()
config.read("capstone.cfg")
local_data_dir = config["PATH"]["LOCAL_DATA_DIR"]

['capstone.cfg']

In [3]:
# Connect to CDC using the AppToken for the Socrata API.
# Set the timeout to 100 seconds, since querying such a large dataset may take longer than the default timeout.
app_token = config["CDC"]["APPTOKEN"]
client = Socrata("data.cdc.gov",
                 app_token,
                 timeout=100)

In [4]:
# identifier of CDC dataset with covid case surveillance (https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf)
cdc_dataset_identifier = "vbim-akqf"

Column names of the dataset. We will use only the date, sex, age group, race/ethnicity and death status columns.

In [5]:
metadata = client.get_metadata(cdc_dataset_identifier)
[x['name'] for x in metadata['columns']]

['cdc_case_earliest_dt ',
 'cdc_report_dt',
 'pos_spec_dt',
 'onset_dt',
 'current_status',
 'sex',
 'age_group',
 'race_ethnicity_combined',
 'hosp_yn',
 'icu_yn',
 'death_yn',
 'medcond_yn']

Documentation about the Socrata API is available [here](https://dev.socrata.com/docs/queries/). Some example uses of the Python client library [sodapy](https://github.com/xmunoz/sodapy) can be found [here](https://github.com/xmunoz/sodapy/blob/master/examples/soql_queries.ipynb).
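
The keyword arguments that sodapy's `client.get` accepts (`select`, `group`, `limit`, ...) correspond to the `$select`, `$group`, `$limit` parameters of the underlying SoQL HTTP query. As a rough sketch, the URL for a small grouped query can be assembled with the standard library. The helper name `build_soql_url` is ours, and the `/resource/<id>.json` endpoint layout is taken from the Socrata documentation rather than from this notebook.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_soql_url(domain, dataset_id, **clauses):
    """Build a SODA query URL; each keyword becomes a `$`-prefixed SoQL parameter."""
    params = {f"${name}": value for name, value in clauses.items()}
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


url = build_soql_url(
    "data.cdc.gov",
    "vbim-akqf",
    select="sex, count(*)",
    group="sex",
    limit=10,
)
print(url)

# Decode the query string again to show the SoQL clauses it carries.
print(parse_qs(urlparse(url).query))
```

Opening such a URL in a browser can be a quick way to try a query before running it through sodapy.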

## Download data: cases per date, sex, age and race

In [9]:
# By default, at most 1000 records are returned. We have to raise the limit to 200000 to get the full results.
res_sex_age_race_date = client.get(cdc_dataset_identifier,
                                   group="cdc_case_earliest_dt, sex, age_group, race_ethnicity_combined",
                                   select="cdc_case_earliest_dt, sex, age_group, race_ethnicity_combined, count(*)",
                                   limit=200000)
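
A fixed limit works as long as the number of groups stays below it. As an alternative sketch, the result can be fetched page by page with the `limit` and `offset` arguments of `client.get`. The helper `fetch_all` and the offline `PagedStub` class are hypothetical; with the real client, an `order` clause would be needed so that pages do not overlap.

```python
def fetch_all(client, dataset_id, page_size=50000, **query):
    """Fetch every row of a query by paging with `limit` and `offset`."""
    rows = []
    offset = 0
    while True:
        page = client.get(dataset_id, limit=page_size, offset=offset, **query)
        rows.extend(page)
        # A short (or empty) page means the result set is exhausted.
        if len(page) < page_size:
            return rows
        offset += page_size


class PagedStub:
    """Offline stand-in for sodapy.Socrata, used only so the sketch runs."""

    def __init__(self, rows):
        self._rows = rows

    def get(self, dataset_id, limit=1000, offset=0, **query):
        return self._rows[offset:offset + limit]


# Made-up rows, shaped like the grouped results of the query above.
stub = PagedStub([{"sex": "Female", "count": str(i)} for i in range(7)])
rows = fetch_all(stub, "vbim-akqf", page_size=3,
                 group="sex", select="sex, count(*)", order="sex")
print(len(rows))
```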

In [51]:
# Number of distinct groups, and the total number of cases they add up to
# (this total must match the number of rows of the whole dataset).
nb_rows = len(res_sex_age_race_date)
total_cases = sum(int(row["count"]) for row in res_sex_age_race_date)
print(f"nb rows : {nb_rows}")
print(f"total cases : {total_cases}")

103590


24441351
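
The reference value for this check can itself be requested from the server with an ungrouped `count(*)`. The sketch below assumes that the ungrouped result uses the same `count` key as the grouped one. The helper names and the offline `CountStub` class are hypothetical.

```python
def grouped_total(rows):
    """Sum the per-group counts returned by a grouped query."""
    return sum(int(row["count"]) for row in rows)


def dataset_row_count(client, dataset_id):
    """Ask the server for the ungrouped row count of the dataset."""
    result = client.get(dataset_id, select="count(*)")
    # Assumption: the single result row carries the value under "count".
    return int(result[0]["count"])


class CountStub:
    """Offline stand-in for sodapy.Socrata; returns a made-up count."""

    def get(self, dataset_id, **query):
        return [{"count": "6"}]


# Made-up grouped rows whose counts add up to the stub's total.
grouped = [{"sex": "Female", "count": "4"}, {"sex": "Male", "count": "2"}]
matches = grouped_total(grouped) == dataset_row_count(CountStub(), "vbim-akqf")
print(f"totals match : {matches}")
```

Since the dataset is updated over time, the two numbers can only be expected to agree when both queries run against the same version of the data.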

In [10]:
# output it to json for further use
with open(os.path.join(local_data_dir, "covid_by_pop_group.json"), "w") as fs :
    json.dump(res_sex_age_race_date, fs, indent = 2)
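
For later use, the saved file can be read back and aggregated over any one category. This is a small sketch on made-up rows written to a temporary directory. The helper `cases_by` is hypothetical, and the `"NA"` fallback assumes that a field may be absent from a row when its value is null.

```python
import json
import os
import tempfile
from collections import Counter


def cases_by(rows, key):
    """Sum the `count` field of grouped rows over one category column."""
    totals = Counter()
    for row in rows:
        # Fall back to "NA" when the column is absent from a row.
        totals[row.get(key, "NA")] += int(row["count"])
    return dict(totals)


# Made-up sample, shaped like the rows saved by the cell above.
sample = [
    {"sex": "Female", "age_group": "20 - 29 Years", "count": "3"},
    {"sex": "Male", "age_group": "20 - 29 Years", "count": "2"},
    {"sex": "Male", "count": "5"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "covid_by_pop_group.json")
    with open(path, "w") as fs:
        json.dump(sample, fs, indent=2)
    with open(path) as fs:
        rows = json.load(fs)

totals = cases_by(rows, "age_group")
print(totals)
```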

## Death status per sex, age, race and ethnicity

In [28]:
res_sex_age_ethnicity_death = client.get(cdc_dataset_identifier,
                                         group="sex, age_group, race_ethnicity_combined, death_yn",
                                         select="sex, age_group, race_ethnicity_combined, death_yn, count(*)",
                                         limit=200000)

In [29]:
# The counts cover every value of death_yn, so their sum is the total number of cases, not of deaths.
nb_rows = len(res_sex_age_ethnicity_death)
total_cases = sum(int(row["count"]) for row in res_sex_age_ethnicity_death)
print(f"nb rows : {nb_rows}")
print(f"total cases : {total_cases}")

1327


22507139
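
Each row of this result counts the cases for one value of `death_yn`, so the number of deaths is the sum over the rows whose `death_yn` marks a death. The sketch below assumes that this value is `"Yes"` (worth checking against the actual data). The helper `death_summary` and the sample counts are made up.

```python
def death_summary(rows, death_value="Yes"):
    """Return (deaths, all cases) from rows grouped by death_yn."""
    deaths = 0
    cases = 0
    for row in rows:
        n = int(row["count"])
        cases += n
        # Assumption: `death_value` is how the dataset flags a death.
        if row.get("death_yn") == death_value:
            deaths += n
    return deaths, cases


# Made-up sample, shaped like the rows returned by the query above.
sample = [
    {"sex": "Female", "death_yn": "Yes", "count": "2"},
    {"sex": "Female", "death_yn": "No", "count": "10"},
    {"sex": "Male", "death_yn": "Yes", "count": "3"},
    {"sex": "Male", "death_yn": "Missing", "count": "5"},
]

deaths, cases = death_summary(sample)
print(f"deaths : {deaths}")
print(f"cases : {cases}")
```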

In [10]:
with open(os.path.join(local_data_dir, "covid_death_by_pop_group.json"), "w") as fs :
    json.dump(res_sex_age_ethnicity_death, fs)
    