# Transmission and Disposition

Page: https://data.sfgov.org/COVID-19/COVID-19-Cases-Summarized-by-Date-Transmission-and/tvq9-ec9w
Download link: https://data.sfgov.org/api/views/tvq9-ec9w/rows.csv?accessType=DOWNLOAD
Filename: COVID-19_Cases_Summarized_by_Date__Transmission_and_Case_Disposition.csv

There are multiple entries per day: one for each combination of disposition and transmission category.

Disposition
* confirmed - cases where lab testing was positive
* deaths - those cases of infection resulting in death

Transmission
Each subject is asked whether they have been in close contact with a known COVID-19 case.
* From Contact - known close contact with known COVID-19 case
* Community - no known close contact with a known COVID-19 case
* Unknown - question was not asked
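
Because each day is split across disposition and transmission combinations, daily totals have to be summed back together. The sketch below shows one way to do that with `pivot_table`. The rows are made up for illustration, and the exact spelling of the category values in the real CSV is an assumption; only the column names come from the file.

```python
import pandas as pd

# Made-up rows in the same long layout as the CSV: one row per
# (date, disposition, transmission) combination. Category spellings are assumed.
rows = [
    ("2020-04-01", "Confirmed", "Community", 10),
    ("2020-04-01", "Confirmed", "From Contact", 4),
    ("2020-04-01", "Death", "Community", 1),
    ("2020-04-02", "Confirmed", "Community", 7),
    ("2020-04-02", "Confirmed", "Unknown", 2),
]
sample_df = pd.DataFrame(
    rows,
    columns=[
        "Specimen Collection Date",
        "Case Disposition",
        "Transmission Category",
        "Case Count",
    ],
)
sample_df["Specimen Collection Date"] = pd.to_datetime(
    sample_df["Specimen Collection Date"]
)

# Sum over transmission categories: one row per day, one column per disposition.
# Days with no rows for a disposition are filled with 0.
daily_totals = sample_df.pivot_table(
    index="Specimen Collection Date",
    columns="Case Disposition",
    values="Case Count",
    aggfunc="sum",
    fill_value=0,
)
print(daily_totals)
```

Swapping `columns="Case Disposition"` for `columns="Transmission Category"` would give the per-day breakdown by transmission instead.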




# Age Group
Page: https://data.sfgov.org/COVID-19/COVID-19-Cases-Summarized-by-Age-Group/sunc-2t3k
Download link: https://data.sfgov.org/api/views/sunc-2t3k/rows.csv?accessType=DOWNLOAD
Filename: COVID-19_Cases_Summarized_by_Age_Group

Multiple entries per day - one per age group

Age groups:

* 0-4
* 5-10
* 11-13
* 14-17
* 18-20
* 21-24
* 25-29
* 30-39
* 40-49
* 50-59
* 60-69
* 70-79
* 80+
* Unknown

Unknowns may become known over time as more information is gathered.

Age groups resulting in 5 or fewer cumulative cases are dropped for privacy reasons.
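
The age-group labels are strings, so a plain sort puts `5-10` after `40-49`. One possible fix is an ordered categorical, sketched below. The column names `age_group` and `new_confirmed_cases` and the counts are hypothetical; only the labels come from the list above. Groups dropped for privacy simply do not appear in the data, which an ordered categorical tolerates.

```python
import pandas as pd

# Age-group labels in their natural order, as listed in the notes.
AGE_GROUP_ORDER = [
    "0-4", "5-10", "11-13", "14-17", "18-20", "21-24", "25-29",
    "30-39", "40-49", "50-59", "60-69", "70-79", "80+", "Unknown",
]

# Hypothetical column names and made-up counts, for illustration only.
age_sample = pd.DataFrame({
    "age_group": ["40-49", "5-10", "Unknown", "0-4", "80+"],
    "new_confirmed_cases": [12, 3, 1, 2, 6],
})

# An ordered categorical makes sorting (and plot axes) follow AGE_GROUP_ORDER
# rather than alphabetical order.
age_sample["age_group"] = pd.Categorical(
    age_sample["age_group"], categories=AGE_GROUP_ORDER, ordered=True
)
ordered = age_sample.sort_values("age_group")
print(ordered)
```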


# Race and Ethnicity

Page: https://data.sfgov.org/COVID-19/COVID-19-Cases-Summarized-by-Race-and-Ethnicity/vqqm-nsqg
Download link: https://data.sfgov.org/api/views/vqqm-nsqg/rows.csv?accessType=DOWNLOAD
Filename: COVID-19__Cases_Summarized_by_Race_and_Ethnicity.csv

Multiple entries per day - one per race/ethnicity group



# Gender
Page: https://data.sfgov.org/COVID-19/COVID-19-Cases-Summarized-by-Gender/nhy6-gqam
Download link: https://data.sfgov.org/api/views/nhy6-gqam/rows.csv?accessType=DOWNLOAD
Filename: COVID-19__Cases_Summarized_by_Gender.csv
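
All four download links share one pattern and differ only in the dataset ID. A small helper can build them, as sketched below. The dictionary keys and the function name `download_url` are made up for this example; the IDs are copied from the links above.

```python
# Dataset IDs taken from the download links above; the keys are arbitrary labels.
DATASET_IDS = {
    "transmission": "tvq9-ec9w",
    "age_group": "sunc-2t3k",
    "race_ethnicity": "vqqm-nsqg",
    "gender": "nhy6-gqam",
}


def download_url(dataset_id: str) -> str:
    """Build the CSV download URL for a data.sfgov.org dataset ID."""
    return f"https://data.sfgov.org/api/views/{dataset_id}/rows.csv?accessType=DOWNLOAD"


for name, dataset_id in DATASET_IDS.items():
    print(name, download_url(dataset_id))
```

`pd.read_csv` accepts a URL, so `pd.read_csv(download_url(DATASET_IDS["gender"]))` should load a dataset directly when the network is available, instead of reading a saved file.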




In [20]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import PrettyPrinter as pp

sns.set()

# Transmission and Disposition

In [21]:
TRANSMISSION_DATAFILE="COVID-19_Cases_Summarized_by_Date__Transmission_and_Case_Disposition.csv"

transmission_df = pd.read_csv(TRANSMISSION_DATAFILE, parse_dates = ['Specimen Collection Date', 'Last Updated At'])

transmission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Specimen Collection Date  918 non-null    datetime64[ns]
 1   Case Disposition          918 non-null    object        
 2   Transmission Category     918 non-null    object        
 3   Case Count                918 non-null    int64         
 4   Last Updated At           918 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 36.0+ KB


In [22]:
transmission_df.describe()

       Case Count
count  918.0
mean   26.246187
std    31.075915
min    1.0
25%    2.0
50%    17.0
75%    37.0
max    207.0


### Rename Columns

Use a list comprehension to lowercase the column names and replace spaces with underscores.

In [24]:
new_column_names = [col.lower().replace(' ', '_') for col in transmission_df.columns]
transmission_df.columns = new_column_names
transmission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   specimen_collection_date  918 non-null    datetime64[ns]
 1   case_disposition          918 non-null    object        
 2   transmission_category     918 non-null    object        
 3   case_count                918 non-null    int64         
 4   last_updated_at           918 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 36.0+ KB
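
The same renaming is needed for every dataset, and some column names contain characters other than spaces (for example a slash). A reusable helper is one option, sketched here. The name `to_snake_case` is made up, and the regex approach is a swapped-in alternative to the chained `replace` calls, not what the notebook itself does.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column label and collapse each run of
    non-alphanumeric characters into a single underscore."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


labels = ["Specimen Collection Date", "Race/Ethnicity", "Last Updated At"]
print([to_snake_case(label) for label in labels])
# → ['specimen_collection_date', 'race_ethnicity', 'last_updated_at']
```

It could then be applied with `df.columns = [to_snake_case(col) for col in df.columns]`.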


### Data Discovery

In [25]:
transmission_df.groupby('case_disposition')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f94a6782880>
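
A bare `groupby` only returns a lazy `DataFrameGroupBy` object; nothing is computed until an aggregation is applied. The sketch below uses made-up rows with the renamed columns (category spellings are assumed) to show totals per disposition and a disposition-by-transmission table.

```python
import pandas as pd

# Made-up rows using the renamed column names from the notebook.
sample_df = pd.DataFrame({
    "case_disposition": ["Confirmed", "Confirmed", "Death", "Confirmed"],
    "transmission_category": ["Community", "From Contact", "Community", "Unknown"],
    "case_count": [10, 4, 1, 2],
})

# Total cases per disposition.
totals = sample_df.groupby("case_disposition")["case_count"].sum()
print(totals)

# Cross-tabulation: dispositions as rows, transmission categories as columns.
# Combinations that never occur are filled with 0.
by_both = (
    sample_df
    .groupby(["case_disposition", "transmission_category"])["case_count"]
    .sum()
    .unstack(fill_value=0)
)
print(by_both)
```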

In [None]:
transmission_df.transmission_category.unique()

# Race and Ethnicity

In [10]:
RACE_DATAFILE = 'COVID-19__Cases_Summarized_by_Race_and_Ethnicity.csv'
race_df = pd.read_csv(RACE_DATAFILE, parse_dates = ['Specimen Collection Date', 'Last Updated at'])
new_columns = [col.lower().replace(' ','_').replace('/', '_') for col in race_df.columns]
race_df.columns = new_columns
race_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2553 entries, 0 to 2552
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   specimen_collection_date    2553 non-null   datetime64[ns]
 1   race_ethnicity              2553 non-null   object        
 2   new_confirmed_cases         2553 non-null   int64         
 3   cumulative_confirmed_cases  2553 non-null   int64         
 4   last_updated_at             2553 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(1)
memory usage: 99.9+ KB


In [7]:
race_df.describe()

       New Confirmed Cases  Cumulative Confirmed Cases
count  2553.0               2553.0
mean   9.350176             929.321191
std    17.431538            1615.75078
min    0.0                  5.0
25%    0.0                  35.0
50%    3.0                  264.0
75%    10.0                 1142.0
max    154.0                10328.0
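
The cumulative column is a running total, so its mean and quartiles mix early and late dates and are hard to interpret. The most recent cumulative value per group is often more useful. A minimal sketch with made-up rows and placeholder group names, using the renamed columns:

```python
import pandas as pd

# Made-up rows; "Group A" and "Group B" are placeholders, not real categories.
race_sample = pd.DataFrame({
    "specimen_collection_date": pd.to_datetime(
        ["2020-04-02", "2020-04-01", "2020-04-01", "2020-04-02"]
    ),
    "race_ethnicity": ["Group A", "Group A", "Group B", "Group B"],
    "new_confirmed_cases": [3, 5, 6, 0],
    "cumulative_confirmed_cases": [8, 5, 6, 6],
})

# Sort by date so that .last() picks the most recent row in each group.
latest = (
    race_sample
    .sort_values("specimen_collection_date")
    .groupby("race_ethnicity")["cumulative_confirmed_cases"]
    .last()
)
print(latest)
```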


In [19]:
# pp is the PrettyPrinter class (see the import cell), so create an instance before calling pprint
pp().pprint(race_df.race_ethnicity.unique())