Load COVID-19 cases per population group from the CDC (https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf).
The data list all individual cases (de-identified), with information on sex, age group, race and ethnicity, and date of case, plus some additional information (hospitalisation, intensive care, death ...).
As the dataset is huge (20.6 million rows as of mid-March) and contains many identical rows (because of de-identification), we query the data through the API, grouped by the categories we are interested in. The downloaded dataset therefore remains relatively small.
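
To illustrate why grouping shrinks the download, the sketch below collapses a few de-identified rows into one entry per category combination with a count, which is roughly what the `group` and `count(*)` clauses of the API query do on the server. The sample rows are invented for illustration and are not CDC data.

```python
from collections import Counter

# Hypothetical de-identified case rows (invented values, not CDC data).
rows = [
    ("2021-04-01", "Female", "20 - 29 Years"),
    ("2021-04-01", "Female", "20 - 29 Years"),
    ("2021-04-01", "Male", "50 - 59 Years"),
    ("2021-04-01", "Female", "20 - 29 Years"),
]

# Identical rows collapse into a single (key, count) pair.
grouped = Counter(rows)

for (case_date, sex, age_group), count in grouped.items():
    print(case_date, sex, age_group, count)
```

Four input rows become two grouped rows here; on the real dataset the reduction is far larger because so many cases share the same categories.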

In [1]:
#!pip install sodapy



In [1]:
import pandas as pd
from sodapy import Socrata
import configparser
import json
import os
from datetime import datetime

Without an application token, the size of the downloaded data is limited (see https://dev.socrata.com/docs/app-tokens.html). 

CDC uses the [Socrata Open Data API](https://dev.socrata.com/) to manage access to its data. To increase download capabilities, you have to create an AppToken.

First, [create a Socrata account](https://data.cdc.gov/signup). 

Then, [sign in](https://data.cdc.gov/login) to Socrata, using the Socrata ID field. Go to 'My Profile', then 'Edit profile', then tab 'Developer settings' (https://data.cdc.gov/profile/edit/developer_settings). Create an AppToken by following [this guide](https://support.socrata.com/hc/en-us/articles/210138558-Generating-an-App-Token).

Then store the AppToken in the config file `capstone.cfg`:
```
[CDC]
APPTOKEN=<MyAppToken>
```
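
Reading the token can be made tolerant of a missing file or entry. The helper below is a minimal sketch; the name `read_app_token` is hypothetical, and it assumes the `[CDC]` section and `APPTOKEN` key shown above.

```python
import configparser

def read_app_token(path="capstone.cfg"):
    """Return the CDC AppToken from the config file, or None if absent."""
    config = configparser.ConfigParser()
    config.read(path)  # silently skips files that do not exist
    # fallback=None avoids an exception when the section or key is missing
    return config.get("CDC", "APPTOKEN", fallback=None)

app_token = read_app_token()
if app_token is None:
    print("No AppToken found; downloads will be limited.")
```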


## Connect to the CDC data server

In [2]:
config = configparser.ConfigParser()
config.read("capstone.cfg")

['capstone.cfg']

In [3]:
# Connect to CDC using the AppToken for the Socrata API.
# Set the timeout to 100 seconds, since querying such a large dataset may take longer than the default timeout.
app_token = config["CDC"]["APPTOKEN"]
client = Socrata("data.cdc.gov",
                 app_token,
                 timeout=100)

In [4]:
# identifier of CDC dataset with covid case surveillance (https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf)
cdc_dataset_identifier = "vbim-akqf"

Column names of the dataset. We will use only sex, age group, race/ethnicity and date.

In [5]:
metadata = client.get_metadata(cdc_dataset_identifier)
[x['name'] for x in metadata['columns']]

['cdc_case_earliest_dt ',
 'cdc_report_dt',
 'pos_spec_dt',
 'onset_dt',
 'current_status',
 'sex',
 'age_group',
 'race_ethnicity_combined',
 'hosp_yn',
 'icu_yn',
 'death_yn',
 'medcond_yn']
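
The `name` entries listed above are display names (note the trailing space in the first one). Socrata metadata usually also carries a `fieldName` (the identifier used in queries) and a `dataTypeName` for each column. The sketch below assumes those two keys are present and runs on an invented metadata excerpt, not on a live response; `describe_columns` is a hypothetical helper.

```python
def describe_columns(metadata):
    """Map each API field name to its Socrata data type.

    Assumes each column entry has 'fieldName' and 'dataTypeName' keys;
    falls back to the stripped display name if 'fieldName' is missing.
    """
    return {
        col.get("fieldName", col["name"].strip()): col.get("dataTypeName")
        for col in metadata["columns"]
    }

# Hypothetical excerpt shaped like the result of client.get_metadata(...)
sample_metadata = {
    "columns": [
        {"name": "cdc_case_earliest_dt ", "fieldName": "cdc_case_earliest_dt",
         "dataTypeName": "calendar_date"},
        {"name": "sex", "fieldName": "sex", "dataTypeName": "text"},
    ]
}

print(describe_columns(sample_metadata))
```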

Documentation for the Socrata API is available [here](https://dev.socrata.com/docs/queries/). Some example uses of the Python client library [sodapy](https://github.com/xmunoz/sodapy) can be found [here](https://github.com/xmunoz/sodapy/blob/master/examples/soql_queries.ipynb).

## Download cases data for a given date

In [7]:
selected_date = datetime(2021, 4,1)
str_date = selected_date.isoformat() + ".000"
str_date

'2021-04-01T00:00:00.000'

In [8]:
# iso format with millisecond set to .000
str_date = selected_date.strftime( "%Y-%m-%d") + "T00:00:00.000"
str_date

'2021-04-01T00:00:00.000'
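
Both cells above produce the same string, but appending `".000"` to `isoformat()` is only correct when the datetime has no microseconds. A minimal sketch of a helper that always yields midnight with three fractional digits; the name `to_soql_timestamp` is hypothetical.

```python
from datetime import date, datetime

def to_soql_timestamp(day):
    """Format a date or datetime as midnight in 'YYYY-MM-DDTHH:MM:SS.mmm'."""
    midnight = datetime(day.year, day.month, day.day)
    # timespec="milliseconds" always emits exactly three fractional digits
    return midnight.isoformat(timespec="milliseconds")

str_date = to_soql_timestamp(date(2021, 4, 1))
where_clause = f"cdc_case_earliest_dt = '{str_date}'"
print(where_clause)
```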

In [42]:
f"cdc_case_earliest_dt = '{str_date}'"

"cdc_case_earliest_dt = '2021-04-01T00:00:00.000'"

In [24]:
cases_per_date = client.get(cdc_dataset_identifier,
                            group="cdc_case_earliest_dt, sex, age_group, race_ethnicity_combined",
                            select="cdc_case_earliest_dt, sex, age_group, race_ethnicity_combined, count(*)",
                            where=f"cdc_case_earliest_dt = '{str_date}'",
                            content_type="json")
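
A single-date query stays small, but the Socrata API returns at most 1000 rows per request unless a limit is given. For larger grouped queries, one option is to page through the results with `limit` and `offset`. The sketch below assumes a client that behaves like `sodapy.Socrata`; `fetch_all` is a hypothetical helper, not part of sodapy.

```python
def fetch_all(client, dataset_id, page_size=1000, **query):
    """Collect every row of a query by paging with limit/offset.

    `client` is expected to behave like sodapy.Socrata: client.get(...)
    returns a list of dicts and accepts `limit` and `offset` keywords.
    """
    rows, offset = [], 0
    while True:
        page = client.get(dataset_id, limit=page_size, offset=offset, **query)
        rows.extend(page)
        if len(page) < page_size:  # a short page means we reached the end
            return rows
        offset += page_size
```

It would be called as `fetch_all(client, cdc_dataset_identifier, group=..., select=..., order=...)`. Passing an `order` clause on the grouped columns keeps the pages stable between requests.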


In [10]:
len(cases_per_date)

242

In [45]:
cases_per_date

[{'cdc_case_earliest_dt': '2021-04-01T00:00:00.000',
  'sex': 'Male',
  'age_group': '30 - 39 Years',
  'race_ethnicity_combined': 'American Indian/Alaska Native, Non-Hispanic',
  'count': '9'},
 {'cdc_case_earliest_dt': '2021-04-01T00:00:00.000',
  'sex': 'Male',
  'age_group': '50 - 59 Years',
  'race_ethnicity_combined': 'Hispanic/Latino',
  'count': '371'},
 {'cdc_case_earliest_dt': '2021-04-01T00:00:00.000',
  'sex': 'Male',
  'age_group': '20 - 29 Years',
  'race_ethnicity_combined': 'Black, Non-Hispanic',
  'count': '385'},
 {'cdc_case_earliest_dt': '2021-04-01T00:00:00.000',
  'sex': 'Female',
  'age_group': '20 - 29 Years',
  'race_ethnicity_combined': 'Unknown',
  'count': '1624'},
 {'cdc_case_earliest_dt': '2021-04-01T00:00:00.000',
  'sex': 'Female',
  'age_group': '0 - 9 Years',
  'race_ethnicity_combined': 'White, Non-Hispanic',
  'count': '490'},
 {'cdc_case_earliest_dt': '2021-04-01T00:00:00.000',
  'sex': 'Male',
  'age_group': '60 - 69 Years',
  'race_ethnicity_combin

In [13]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

In [14]:
spark = SparkSession \
    .builder \
    .appName("spark_to_postgres") \
    .getOrCreate()

In [28]:
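# Serialise each record to a JSON string: spark.read.json expects an RDD of JSON strings, not of Python dicts
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(row) for row in cases_per_date]))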

In [29]:
df.printSchema()

root
 |-- age_group: string (nullable = true)
 |-- cdc_case_earliest_dt: string (nullable = true)
 |-- count: string (nullable = true)
 |-- race_ethnicity_combined: string (nullable = true)
 |-- sex: string (nullable = true)



In [30]:
df.count()

242
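
The schema above shows every field as a string, including `count` and the date, because that is how the API returns them. A minimal sketch of typing the records in plain Python before further processing; `normalise` is a hypothetical helper, and the two sample records are copied from the `cases_per_date` output above.

```python
from datetime import datetime

def normalise(record):
    """Return a copy of one API record with typed count and date fields."""
    typed = dict(record)
    typed["count"] = int(record["count"])
    typed["cdc_case_earliest_dt"] = datetime.strptime(
        record["cdc_case_earliest_dt"], "%Y-%m-%dT%H:%M:%S.%f"
    ).date()
    return typed

# Two records copied from the output of `cases_per_date`
sample = [
    {"cdc_case_earliest_dt": "2021-04-01T00:00:00.000", "sex": "Male",
     "age_group": "30 - 39 Years",
     "race_ethnicity_combined": "American Indian/Alaska Native, Non-Hispanic",
     "count": "9"},
    {"cdc_case_earliest_dt": "2021-04-01T00:00:00.000", "sex": "Male",
     "age_group": "50 - 59 Years",
     "race_ethnicity_combined": "Hispanic/Latino",
     "count": "371"},
]

typed_records = [normalise(r) for r in sample]
print(sum(r["count"] for r in typed_records))
```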

## Quick analysis

In [46]:
set( [a["sex"] for a in res_sex_age_race_date])

{'Female', 'Male', 'Missing', 'NA', 'Other', 'Unknown'}

In [47]:
set( [a["age_group"] for a in res_sex_age_race_date])

{'0 - 9 Years',
 '10 - 19 Years',
 '20 - 29 Years',
 '30 - 39 Years',
 '40 - 49 Years',
 '50 - 59 Years',
 '60 - 69 Years',
 '70 - 79 Years',
 '80+ Years',
 'Missing',
 'NA'}

In [49]:
set( [a["race_ethnicity_combined"] for a in res_sex_age_race_date])

{'American Indian/Alaska Native, Non-Hispanic',
 'Asian, Non-Hispanic',
 'Black, Non-Hispanic',
 'Hispanic/Latino',
 'Missing',
 'Multiple/Other, Non-Hispanic',
 'NA',
 'Native Hawaiian/Other Pacific Islander, Non-Hispanic',
 'Unknown',
 'White, Non-Hispanic'}

In [50]:
set( [a["death_yn"] for a in res_sex_age_ethnicity_death])

NameError: name 'res_sex_age_ethnicity_death' is not defined
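
The `NameError` arises because `res_sex_age_ethnicity_death` is never assigned in the cells shown. A minimal sketch of how such a result could be obtained, assuming it mirrors the per-date query but groups on `death_yn` without a date filter; the function name and the `limit` value are assumptions, not the original cell.

```python
# Hypothetical helper: reconstructs a plausible query, not the original cell.
def query_deaths_by_group(client, dataset_id, limit=10000):
    """Count cases per sex, age group, race/ethnicity and death status."""
    columns = "sex, age_group, race_ethnicity_combined, death_yn"
    return client.get(
        dataset_id,
        select=f"{columns}, count(*)",
        group=columns,
        limit=limit,  # raise the per-request row cap above the default
        content_type="json",
    )
```

With the client created earlier, this would be used as `res_sex_age_ethnicity_death = query_deaths_by_group(client, cdc_dataset_identifier)`.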

In [18]:
df_death = pd.DataFrame(res_sex_age_ethnicity_death)
df_death = df_death.astype({"count" : "int"})

In [19]:
df_death.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1320 entries, 0 to 1319
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   sex                      1320 non-null   object
 1   age_group                1320 non-null   object
 2   race_ethnicity_combined  1320 non-null   object
 3   death_yn                 1320 non-null   object
 4   count                    1320 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 51.7+ KB


In [35]:
df_death.groupby(["sex", "death_yn"])[["count"]].sum()

sex,death_yn,count
Female,Missing,3846666
Female,No,5489794
Female,Unknown,1114997
Female,Yes,169984
Male,Missing,3534902
Male,No,4939349
Male,Unknown,1019183
Male,Yes,201449
Missing,Missing,35643
Missing,No,12761


In [34]:
# The trailing ", :" tells .loc that the tuple indexes the rows of the MultiIndex, not (rows, columns)
df_death.groupby(["sex", "death_yn"])[["count"]].sum().loc[(["Female", "Male"], ["Yes", "No"]), :]
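
As a rough comparison between sexes, the share of deaths among cases with a recorded Yes/No outcome can be computed from the counts in the groupby output above. This is only a sketch: rows with `Missing` or `Unknown` death status are left out, so the result should not be read as a fatality rate. The helper name `death_share` is hypothetical.

```python
# Counts copied from the groupby output above (Yes/No outcomes only)
counts = {
    ("Female", "Yes"): 169984, ("Female", "No"): 5489794,
    ("Male", "Yes"): 201449, ("Male", "No"): 4939349,
}

def death_share(counts, sex):
    """Share of deaths among cases of `sex` with a recorded Yes/No outcome."""
    yes, no = counts[(sex, "Yes")], counts[(sex, "No")]
    return yes / (yes + no)

for sex in ("Female", "Male"):
    print(f"{sex}: {death_share(counts, sex):.2%}")
```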