# Covid19 Health Disparity


- `data` from [Li et al. (2021). Identifying US County-level characteristics associated with high COVID-19 burden](https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-11060-9) - [GitHub](https://github.com/lin-lab/COVID-Health-Disparities/blob/main/Data/covariates.csv)

### From paper:

We obtained demographic, socioeconomic and comorbidity data from a COVID-19 GitHub repository that drew from the US Department of Agriculture, Area Health Resources Files, County Health Rankings and Roadmaps, Centers for Disease Control and Prevention, and Kaiser News Health [24]. We obtained COVID-19 county cases and deaths from 1/22/20–12/21/20 from USA Facts [25] and additional demographic data from the US Census Bureau [26]. USA Facts is a non-profit organization providing data about government tax revenues, expenditures, and outcomes [25]. Area Health Resources Files is a part of the federal government’s Health Resources & Services administrations that includes data on population characteristics, economics, hospital utilization, and more [27]. County Health Rankings & Roadmaps is a collaboration between the Robert Wood Johnson Foundation and University of Wisconsin that provides local community health data [28]. All data used in analyses are publicly available and can be found on our lab GitHub page (https://github.com/lin-lab/COVID-Health-Disparities).

County-level cumulative and weekly COVID-19 cases and deaths as of 12/21/20 were directly obtained from USA Facts [29]. USA Facts aggregates data from the Centers of Disease Control and state and local public health agencies. County-level data were confirmed by referencing state and local agencies.

Demographic variables were obtained from the US Census Bureau and US Department of Agriculture and included county percentage ages 20–29 years, percentage ages 60+ years, percentage male, and metro/nonmetro status classification. US Department of Agriculture rural-urban continuum codes were grouped into three categories for the metro/nonmetro categorical variable: metro, population ≥ 1 million (code 1); metro or near metro, population 20,000 to 1 million (codes 2–4); nonmetro, population < 20,000 (codes 5–9) [30].
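The three-way grouping of rural-urban continuum codes described in the paragraph above can be sketched as a small lookup. This is only an illustration: the function name `group_rucc` and the category labels are hypothetical, and the encoding actually used in the dataset's `RuralCont` column may differ.

```python
def group_rucc(code: int) -> str:
    """Map a USDA rural-urban continuum code (1-9) to one of three categories.

    The labels returned here are illustrative, not the dataset's own values.
    """
    if code == 1:
        return "metro"                # metro, population >= 1 million
    if 2 <= code <= 4:
        return "metro_or_near_metro"  # population 20,000 to 1 million
    if 5 <= code <= 9:
        return "nonmetro"             # population < 20,000
    raise ValueError(f"unexpected rural-urban continuum code: {code}")


# Hypothetical codes for five counties
example_codes = [1, 3, 4, 7, 9]
grouped = [group_rucc(c) for c in example_codes]
print(grouped)
```

Raising on codes outside 1 to 9 makes unexpected or missing values visible instead of silently assigning them to a category.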

`County-level population distribution by race/ethnicity, including Black/African American, Hispanic/Latino, American Indian/Native Alaskan, Asian, Native Hawaiian/Pacific Islander proportions, were directly obtained from 2019 US Census Bureau estimates [26]. County residential racial segregation indices of dissimilarity were obtained from County Health Rankings & Roadmaps [31]. These indices were originally calculated from data from US Census tracts from the American Community Survey 2014–2018. Counties with less than 100 Black/non-white residents had the index of dissimilarity set to be equal to 1.`

- White Black Segregation Index 2014-2018 County Health Rankings & Roadmaps
- White non-White Segregation Index 2014-2018 County Health Rankings & Roadmaps
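For orientation, here is a minimal sketch of the textbook index of dissimilarity, D = 0.5 * Σ |a_i/A − b_i/B| over census tracts, together with the small-count rule quoted above. It is not necessarily how County Health Rankings & Roadmaps computed `WBSeg` and `WNWSeg`: the function name, the `min_residents` parameter, and the tract counts are hypothetical, and the scale of the published values may differ.

```python
def dissimilarity_index(group_a, group_b, min_residents=100):
    """Textbook index of dissimilarity between two groups across tracts.

    group_a, group_b: per-tract resident counts for the two groups.
    Following the rule quoted from the paper, a county whose second group
    has fewer than `min_residents` residents is assigned an index of 1.
    """
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_b < min_residents:
        return 1.0
    if total_a == 0:
        raise ValueError("group_a has no residents; index is undefined")
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


# Hypothetical county with two tracts
white_by_tract = [800, 200]
black_by_tract = [200, 800]
print(dissimilarity_index(white_by_tract, black_by_tract))
```

A value of 0 means both groups are spread identically across tracts, and a value of 1 means they share no tract.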

`Socioeconomic variables were obtained from Area Health Resource Files [27] and included average household size, percentage of individuals between 18 and 64 years old without health insurance, percentage in poverty, percentage of people aged > 25 years without a high school diploma, and percentage of people working in education/health care/social assistance.`

*Socioeconomic*
- Average Household Size 2010 Area Health Resources Files
- No Health Insurance, 18-64 years(%) 2017 Area Health Resources Files
- Poverty (%) 2017 Area Health Resources Files
- No High School Diploma, 25+ years (%) 2013-17 Area Health Resources Files
- Education, Health Care, Social Assistance Workers (%) 2013-17 Area Health Resources Files

`Prevalence rates for several comorbidities were obtained from County Health Rankings & Roadmap [28]. Comorbidities included county-level percentages for: smoking, obesity, asthma, cancer, chronic obstructive pulmonary disease, diabetes, heart failure, hypertension, kidney disease, and stroke. Kaiser News provided total intensive care unit beds and nursing home beds.` Note that the health data listed below are from 2017.


*Health*
- Smokers (%) 2017 County Health Rankings & Roadmaps
- Obesity (%) 2017 County Health Rankings & Roadmaps
- Asthma (%) 2017 County Health Rankings & Roadmaps
- Cancer (%) 2017 County Health Rankings & Roadmaps
- COPD (%) 2017 County Health Rankings & Roadmaps
- Diabetes (%) 2017 County Health Rankings & Roadmaps
- Heart Failure (%) 2017 County Health Rankings & Roadmaps
- Hypertension (%) 2017 County Health Rankings & Roadmaps
- Kidney Disease (%) 2017 County Health Rankings & Roadmaps
- Stroke (%) 2017 County Health Rankings & Roadmaps
- ICU Beds 2017 log(x+1) Kaiser Health News
- Nursing Home Beds 2017 log(x+1) Kaiser Health News


Log transformations were applied to heavily skewed variables. Additional covariate information is available in Additional Table 1.
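The `log(x+1)` transform noted for ICU beds and nursing home beds can be sketched as follows. The bed counts are made up for illustration. On a pandas column the vectorized equivalent would be `np.log1p(...)`.

```python
import math

# Hypothetical ICU bed counts for five counties, including a county with none
icu_beds = [0, 4, 12, 250, 3000]

# log(x + 1) keeps zero counts defined (log1p(0) == 0) and compresses the long right tail
log_icu_beds = [math.log1p(x) for x in icu_beds]

# expm1 inverts the transform, which is useful when reporting results in original units
recovered = [math.expm1(v) for v in log_icu_beds]

print(log_icu_beds)
```

Adding 1 before taking the log matters here because many counties may have zero beds, and `log(0)` is undefined.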




## Interesting variables to include:

- `PopDensity`, `RuralCont`, `EconArea`
- `HouseholdSize`, `noHealthInsurance`, `Poverty`, `noHighSchool`, `PercentEduHealthSoc`
- `SES`, `Household`, `HousingType`
- `Smoking`, `Obesity`, `Asthma`, `Cancer`, `COPD`, `Diabetes`, `Stroke`, `HF`, `HTN`, `KD`
- `WBSeg`, `WNWSeg`



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [10]:
data['HousingType']

0       0.3741
1       0.3359
2       0.9889
3       0.7189
4       0.1741
         ...  
3137    0.4120
3138    0.6266
3139    0.6657
3140    0.2751
3141    0.6581
Name: HousingType, Length: 3142, dtype: float64

In [3]:
data = pd.read_csv('../data/raw/covid19_health_disparities.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142 entries, 0 to 3141
Data columns (total 51 columns):
fips                       3142 non-null int64
state                      3142 non-null object
county                     3142 non-null object
stateName                  3142 non-null object
CountyNamew.StateAbbrev    3113 non-null object
CensusRegionName           3142 non-null object
CensusDivisionName         3113 non-null object
tot_deaths                 3142 non-null int64
tot_cases                  3142 non-null int64
rate_cases                 3142 non-null float64
rate_deaths                3142 non-null float64
PopSize                    3142 non-null int64
PopDensity                 3113 non-null float64
logPopDensity              3113 non-null float64
Pop2029                    3142 non-null float64
Pop6099                    3142 non-null float64
Male                       3142 non-null float64
logWhite                   3142 non-null float64
logBlack                 

In [6]:
#data.isnull().sum()
data.shape

(3142, 51)

In [11]:
disp = data[['fips','PopDensity', 'Male', 'RuralCont' ,'EconArea','HouseholdSize','noHealthInsurance',
            'PercentEduHealthSoc', 'SES', 'Household' , 'HousingType', 'Asthma', 'Cancer', 'COPD',
            'Stroke', 'HF', 'HTN', 'KD', 'WBSeg', 'WNWSeg']]

In [13]:
disp.to_csv('../data/clean/covid19_heath_disparity.csv', index = False)
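As a quick sanity check, the columns written to the clean file can be compared with the "Interesting variables to include" list. The sketch below only compares the two lists as they appear in this notebook. Whether the differences are intentional is not stated, and the variable names `listed`, `selected`, `not_selected` and `not_listed` are introduced here for illustration.

```python
# Variables named under "Interesting variables to include"
listed = [
    'PopDensity', 'RuralCont', 'EconArea',
    'HouseholdSize', 'noHealthInsurance', 'Poverty', 'noHighSchool', 'PercentEduHealthSoc',
    'SES', 'Household', 'HousingType',
    'Smoking', 'Obesity', 'Asthma', 'Cancer', 'COPD', 'Diabetes', 'Stroke', 'HF', 'HTN', 'KD',
    'WBSeg', 'WNWSeg',
]

# Columns actually selected into `disp`
selected = [
    'fips', 'PopDensity', 'Male', 'RuralCont', 'EconArea', 'HouseholdSize', 'noHealthInsurance',
    'PercentEduHealthSoc', 'SES', 'Household', 'HousingType', 'Asthma', 'Cancer', 'COPD',
    'Stroke', 'HF', 'HTN', 'KD', 'WBSeg', 'WNWSeg',
]

# Listed as interesting but absent from the saved file
not_selected = [c for c in listed if c not in selected]
# Saved to the file but not in the list
not_listed = [c for c in selected if c not in listed]

print(not_selected)  # ['Poverty', 'noHighSchool', 'Smoking', 'Obesity', 'Diabetes']
print(not_listed)    # ['fips', 'Male']
```

If the five unselected variables are meant to be in the clean file, they would need to be added to the column list before `to_csv` is called.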