# Closing the Gap Study Revisited
By: Andrew Clark, Rahn Lieberman, Ryan Shuhart, Thomas Rogers

The passage below describes how the original study, "Closing the Gap: Reducing Racial and Ethnic Disparities", was conducted:

   This brief draws on the 2012–2013 Behavioral Risk Factor Surveillance System (BRFSS), an annual survey conducted by the Centers for Disease Control and Prevention in partnership with state governments. The surveys included landline and cellular telephone interviews with more than 400,000 adults age 18 and older across all 50 states. In performing our analysis, we combined two years of data to ensure an adequate sample size in each of the socioeconomic strata, including income, race and ethnicity, and insurance status. <font color="red">We restricted our analysis to adults under age 65. </font>
   BRFSS asks adults whether they did not visit a doctor when needed within the previous 12 months because of costs, and whether they have one or more than one person they think of as their personal doctor or health care provider.
    Our analysis classifies respondents’ socioeconomic (SES) characteristics as follows:
    
    • Race/ethnicity: white (non-Hispanic), black (non-Hispanic), or Hispanic (any race).
    • Income in three income groups:
      1. Low income: below 200 percent of the federal poverty level (income in 2012 of less than $22,340 if single, or 
      less than $46,100 for a family of four).
      2. Middle income: 200 percent to 399 percent of poverty (income in 2012 of $22,340 up to $44,680 if single, or 
      $46,100 to $92,200 for a family of four).
      3. Higher income: 400 percent of poverty or higher (income in 2012 at or above $44,680 if single, or $92,200 for 
      a family of four).
    • Insurance status: insured or not at the time of the questionnaire.
    
        Exhibit 2 reports unadjusted point estimates, stratified by race/ethnicity. Exhibits 3 and 4 report adjusted means, to account for differences in respondents’ age, sex, income, and health status. We adjusted estimates using survey-design adjusted logistic regressions in Stata (v.12.1).
    
        Unadjusted point estimates were still subject to uncertainty because of the sample design. Each estimate has survey design–adjusted 95 percent confidence intervals of about 1 to 2 percentage points. Statistical significance associated with SES-adjusted point estimates is noted in Exhibits 3 and 4.
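
The three income groups in the quoted methodology reduce to comparing income against multiples of the poverty level. The sketch below is only an illustration of that arithmetic, not the study's code: `FPL_2012` and `income_group` are hypothetical names, and the poverty-level figures are simply half of the 200 percent thresholds quoted above ($22,340 and $46,100).

```python
# Poverty level by household size, inferred from the quoted thresholds
# (an assumption: 200 percent of poverty is $22,340 single, $46,100 for four).
FPL_2012 = {1: 22340 // 2, 4: 46100 // 2}


def income_group(income, household_size):
    """Classify an annual income (dollars) as 'low', 'middle', or 'higher'."""
    fpl = FPL_2012[household_size]
    if income < 2 * fpl:   # below 200 percent of poverty
        return "low"
    if income < 4 * fpl:   # 200 to 399 percent of poverty
        return "middle"
    return "higher"        # 400 percent of poverty or higher


for income, size in [(20000, 1), (30000, 1), (44680, 1), (60000, 4), (92200, 4)]:
    print(income, size, income_group(income, size))
```

Integer comparisons are used instead of a ratio so that the boundary values ($22,340 and $44,680 for a single adult) land in the group the text assigns them to.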

## Identifying the Variables of Interest
We are most interested to know how the variables in our dataset relate to self-reported health quality.

We'll reduce the dataset and derive a binary variable from the self-reported measure of health.

The question of interest is, "Would you say that in general your health is: (1) excellent, (2) very good, (3) good, (4) fair, (5) poor?" Codes 7 and 9 mean "don't know/not sure" and "refused", respectively, and a blank means the question was not asked or the answer is missing. The responses are stored in the GENHLTH variable.
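
As a small illustration of how these response codes can be collapsed into two classes, here is a sketch on toy data. The helper name `dichotomize_genhlth` and the toy values are assumptions for illustration only. Unlike a row filter, this version keeps every row and marks non-substantive answers as missing.

```python
import numpy as np
import pandas as pd


def dichotomize_genhlth(genhlth):
    """Map GENHLTH codes 1-3 to 1.0 and codes 4-5 to 0.0; all else becomes NaN."""
    genhlth = pd.Series(genhlth, dtype="float64")
    result = pd.Series(np.nan, index=genhlth.index, dtype="float64")
    result.loc[genhlth.isin([1, 2, 3])] = 1.0  # excellent, very good, good
    result.loc[genhlth.isin([4, 5])] = 0.0     # fair, poor
    return result                              # codes 7, 9, and blanks stay NaN


# Toy responses: four substantive answers, codes 7 and 9, and one blank.
toy = pd.Series([1, 3, 4, 5, 7, 9, np.nan])
health = dichotomize_genhlth(toy)
print(health.tolist())
```

Dropping the rows where the result is missing afterwards gives the same subset as excluding codes 7, 9, and blanks up front.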

'_IMPMRTL' - Imputed Marital Status: the reported marital status or, if the respondent refused to give one, a value imputed from the sample.

In [169]:
# Python Modules
import pandas as pd

In [177]:
brfss = pd.read_csv('brfss2012_2014.zip', encoding="ISO-8859-1", compression='zip')
print("Starting length is: ", len(brfss))

# Age 18 to 64: excludes 65 or older, refused, or missing
brfss = brfss[brfss['_AGE65YR'] == 1]

# Exclude blank, "Don't know/Not sure" (7), and "Refused" (9)
brfss = brfss[brfss['GENHLTH'].notnull() & ~brfss['GENHLTH'].isin([7, 9])]

# Restrict race/ethnicity to White, Black, or Hispanic
# (drops Asian 2%, American Indian/Alaskan Native 1.55%, other 2.8%)
brfss = brfss[brfss['_IMPRACE'].isin([1, 2, 5])]
# Has health plan: excludes "Don't know/Not sure" and "Refused" (drops 0.6%)
# .copy() makes the filtered frame independent so the assignments below are safe
brfss = brfss[brfss['HLTHPLN1'].isin([1, 2])].copy()

# Translate GENHLTH into a binary classification:
# "excellent", "very good", and "good" become "good or better" (1),
# and "fair" and "poor" become "fair or poor" (0).
brfss.loc[brfss['GENHLTH'] < 4, '_Health'] = 1
brfss.loc[brfss['GENHLTH'] >= 4, '_Health'] = 0

brfss.info()

Starting length is:  464664
<class 'pandas.core.frame.DataFrame'>
Int64Index: 275270 entries, 2 to 464663
Data columns (total 39 columns):
IDATE       275270 non-null int64
SEQNO       275270 non-null int64
GENHLTH     275270 non-null float64
HLTHPLN1    275270 non-null int64
PERSDOC2    275269 non-null float64
MEDCOST     275269 non-null float64
CHECKUP1    275269 non-null float64
EXERANY2    275268 non-null float64
SLEPTIM1    275270 non-null int64
CVDINFR4    275270 non-null int64
CVDCRHD4    275270 non-null int64
CVDSTRK3    275270 non-null int64
ASTHMA3     275270 non-null int64
ASTHNOW     38486 non-null float64
CHCSCNCR    275269 non-null float64
CHCOCNCR    275270 non-null float64
CHCCOPD1    275266 non-null float64
HAVARTH3    275269 non-null float64
ADDEPEV2    275269 non-null float64
CHCKIDNY    275269 non-null float64
LASTDEN3    275270 non-null int64
RMVTETH3    275270 non-null int64
INCOME2     272707 non-null float64
USEEQUIP    266888 non-null float64
BLIND       266477

In [84]:
# Count the rows that would remain after excluding missing marital status and refusals (9)
len(brfss[~(brfss['MARITAL'].isnull() | (brfss['MARITAL'] == 9))])

302345

In [102]:
# brfss['_INC_LEVEL']=0
# # income in 2012 of less than $22,340 if single, or less than $46,100 for a family of four
# brfss.loc[((brfss['INCOME2'].isin([1,2,3,4])) & (brfss['_IMPMRTL'] != 1)), '_INC_LEVEL'] = 1
# brfss.loc[((brfss['INCOME2'].isin([1,2,3,4,5,6])) & (brfss['_IMPMRTL'] == 1)), '_INC_LEVEL'] = 1

# # income in 2012 of $22,340 up to $44,680 if single, or  $46,100 to $92,200 for a family of four
# brfss.loc[((brfss['INCOME2'].isin([5,6])) & (brfss['_IMPMRTL'] != 1)), '_INC_LEVEL'] = 2
# brfss.loc[((brfss['INCOME2'].isin([7,8])) & (brfss['_IMPMRTL'] == 1)), '_INC_LEVEL'] = 2

# # income in 2012 at or above $44,680 if single, or $92,200 for a family of four
# brfss.loc[((brfss['INCOME2'].isin([7,8])) & (brfss['_IMPMRTL'] != 1)), '_INC_LEVEL'] = 3
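
The commented-out draft above appears to approximate the study's income groups from the INCOME2 bracket codes, using married status ('_IMPMRTL' == 1) as a rough stand-in for a family of four. The sketch below restates that idea on toy data. It rests on several assumptions that should be checked against the BRFSS codebook: that codes 1 to 8 are ascending income brackets with upper bounds near $10k, $15k, $20k, $25k, $35k, $50k, $75k, and an open-ended top bracket; that any other code is a non-answer; and that the bracket edges are close enough to the study's dollar cutoffs. The function name `income_level` is hypothetical.

```python
import pandas as pd

UNKNOWN, LOW, MIDDLE, HIGHER = 0, 1, 2, 3


def income_level(income2, married):
    """Approximate the study's income group from an INCOME2 bracket code."""
    if income2 not in range(1, 9):
        return UNKNOWN  # non-answer codes and missing values
    if married:
        # Family-of-four cutoffs ($46,100 and $92,200). The top bracket is
        # open-ended, so the "higher" group cannot be separated here.
        return LOW if income2 <= 6 else MIDDLE
    # Single-adult cutoffs ($22,340 and $44,680).
    if income2 <= 4:
        return LOW
    if income2 <= 6:
        return MIDDLE
    return HIGHER


# Toy rows only; these are not BRFSS records.
toy = pd.DataFrame({
    "INCOME2": [3, 5, 7, 6, 8, 99],
    "_IMPMRTL": [2, 5, 3, 1, 1, 1],
})
toy["_INC_LEVEL"] = [
    income_level(code, marital == 1)
    for code, marital in zip(toy["INCOME2"], toy["_IMPMRTL"])
]
print(toy)
```

Because the brackets do not line up exactly with the poverty thresholds, some respondents near a cutoff will be assigned to the wrong group under any mapping of this kind.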

In [187]:
len(x)

21000

In [156]:
brfss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 275270 entries, 2 to 464663
Data columns (total 37 columns):
GENHLTH     275270 non-null float64
HLTHPLN1    275270 non-null int64
PERSDOC2    275269 non-null float64
MEDCOST     275269 non-null float64
CHECKUP1    275269 non-null float64
EXERANY2    275268 non-null float64
SLEPTIM1    275270 non-null int64
CVDINFR4    275270 non-null int64
CVDCRHD4    275270 non-null int64
CVDSTRK3    275270 non-null int64
ASTHMA3     275270 non-null int64
ASTHNOW     38486 non-null float64
CHCSCNCR    275269 non-null float64
CHCOCNCR    275270 non-null float64
CHCCOPD1    275266 non-null float64
HAVARTH3    275269 non-null float64
ADDEPEV2    275269 non-null float64
CHCKIDNY    275269 non-null float64
LASTDEN3    275270 non-null int64
RMVTETH3    275270 non-null int64
INCOME2     272707 non-null float64
USEEQUIP    266888 non-null float64
BLIND       266477 non-null float64
DECIDE      266150 non-null float64
DIFFWALK    265895 non-null float64
DIFFDR