# Closing the Gap Study Revisited
By: Andrew Clark, Rahn Lieberman, Ryan Shuhart, Thomas Rogers

The following excerpt describes how the original study, "Closing the Gap: Reducing Racial and Ethnic Disparities", was conducted:

   This brief draws on the 2012–2013 Behavioral Risk Factor Surveillance System (BRFSS), an annual survey conducted by the Centers for Disease Control and Prevention in partnership with state governments. The surveys included landline and cellular telephone interviews with more than 400,000 adults age 18 and older across all 50 states. In performing our analysis, we combined two years of data to ensure an adequate sample size in each of the socioeconomic strata, including income, race and ethnicity, and insurance status. <font color="red">We restricted our analysis to adults under age 65. </font>
   BRFSS asks adults whether they did not visit a doctor when needed within the previous 12 months because of costs, and whether they have one or more than one person they think of as their personal doctor or health care provider.
    Our analysis classifies respondents’ socioeconomic (SES) characteristics as follows:
    
    • Race/ethnicity: white (non-Hispanic), black (non-Hispanic), or Hispanic (any race).
    • Income in three income groups:
      1. Low income: below 200 percent of the federal poverty level (income in 2012 of less than $22,340 if single, or 
      less than $46,100 for a family of four).
      2. Middle income: 200 percent to 399 percent of poverty (income in 2012 of $22,340 up to $44,680 if single, or 
      $46,100 to $92,200 for a family of four).
      3. Higher income: 400 percent of poverty or higher (income in 2012 at or above $44,680 if single, or $92,200 for 
      a family of four).
    • Insurance status: insured or not at the time of the questionnaire.
    
        Exhibit 2 reports unadjusted point estimates, stratified by race/ethnicity. Exhibits 3 and 4 report adjusted means, to account for differences in respondents’ age, sex, income, and health status. We adjusted estimates using survey-design adjusted logistic regressions in Stata (v.12.1).
    
        Unadjusted point estimates were still subject to uncertainty because of the sample design. Each estimate has survey design–adjusted 95 percent confidence intervals of about 1 to 2 percentage points. Statistical significance associated with SES-adjusted point estimates is noted in Exhibits 3 and 4.
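As a rough illustration of the income grouping quoted above, the sketch below classifies a household by its income as a percentage of the federal poverty level. The helper `income_group` and the table `POVERTY_LINE_2012` are hypothetical names, not part of the study or of BRFSS. The poverty lines are an assumption, backed out of the quoted 200 percent thresholds ($22,340 if single, $46,100 for a family of four), and only those two household sizes are covered.

```python
# Hypothetical helper, not part of the original study or of BRFSS.
# Poverty lines are inferred from the quoted 200% thresholds:
# $22,340 / 2 for a single adult, $46,100 / 2 for a family of four.
POVERTY_LINE_2012 = {1: 11170, 4: 23050}


def income_group(income, household_size):
    """Return 'low', 'middle', or 'higher' for a 2012 household income."""
    line = POVERTY_LINE_2012[household_size]
    pct_of_poverty = 100 * income / line
    if pct_of_poverty < 200:
        return "low"
    if pct_of_poverty < 400:
        return "middle"
    return "higher"


print(income_group(20000, 1))  # below $22,340 for a single adult
print(income_group(46100, 4))  # exactly 200% for a family of four
print(income_group(92200, 4))  # exactly 400% for a family of four
```

The boundaries follow the quoted definitions: 200 percent falls in the middle group and 400 percent falls in the higher group.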

## Question of Interest
"Would You Say in General That Your Health is: (1) excellent, (2) very good, (3) good, (4) fair, (5) poor." Codes 7 and 9 mean "Don't know/Not sure" and "Refused", respectively. Responses are stored in the GENHLTH variable.

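A minimal sketch of how these response codes might be labeled and collapsed into two groups, using a small made-up sample rather than the BRFSS file. The names `GENHLTH_LABELS`, `sample`, `valid`, and `good_or_better` are illustrative only and are not BRFSS variables.

```python
import pandas as pd

# Illustrative labels for the five substantive GENHLTH response codes.
GENHLTH_LABELS = {1: "excellent", 2: "very good", 3: "good", 4: "fair", 5: "poor"}

# Made-up responses, including the non-substantive codes 7 and 9.
sample = pd.DataFrame({"GENHLTH": [1, 3, 4, 7, 5, 9, 2]})

# Keep only the five substantive answers.
valid = sample[sample["GENHLTH"].isin(list(GENHLTH_LABELS))].copy()
valid["label"] = valid["GENHLTH"].map(GENHLTH_LABELS)

# 1 = good or better (codes 1-3), 0 = fair or poor (codes 4-5).
valid["good_or_better"] = (valid["GENHLTH"] <= 3).astype(int)

print(valid)
```
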
In [26]:
# Python Modules
import pandas as pd
# seaborn and matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
brfss = pd.read_csv('.\\data\\brfss2012_2014.zip', encoding = "ISO-8859-1", compression='zip')
print("Starting length is %.f " % len(brfss))

# Age 18 to 64: excludes 65 or older, refused, or missing
brfss = brfss[brfss['_AGE65YR'] == 1]

# Exclude blank, "Don't know/Not sure" (7), and "Refused" (9)
brfss = brfss[((brfss['GENHLTH'].notnull()) & (~brfss['GENHLTH'].isin([7,9])))]

# Reduce race/ethnicity to White, Black, or Hispanic (excludes Asian 2%, American Indian/Alaskan Native 1.55%, other 2.8%)
brfss = brfss[brfss['_IMPRACE'].isin([1,2,5])]
# Has health plan: excludes "Don't know/Not sure" and "Refused"; drops 0.6%
brfss = brfss[brfss['HLTHPLN1'].isin([1,2])]

# Translate GENHLTH to a binary classification:
# "excellent", "very good", and "good" become "good or better" health (1),
# and "fair" and "poor" become "fair or poor" health (0).
brfss.loc[(brfss['GENHLTH'] < 4), '_Health'] = 1
brfss.loc[(brfss['GENHLTH'] >= 4), '_Health'] = 0

# Extract the survey year from the sequence number, because IYEAR sometimes runs into the next year.
# This is one way to assign every record to its release year.
brfss['Rec_Year'] = brfss['SEQNO'].astype(str).str[:4].astype(int)

brfss.info()

Starting length is 1432124 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 867987 entries, 0 to 1432123
Data columns (total 29 columns):
_AGE65YR    867987 non-null float64
GENHLTH     867987 non-null float64
HLTHPLN1    867987 non-null float64
PERSDOC2    867986 non-null float64
MEDCOST     867984 non-null float64
CHECKUP1    867985 non-null float64
EXERANY2    848814 non-null float64
CVDINFR4    867985 non-null float64
CVDCRHD4    867987 non-null float64
CVDSTRK3    867987 non-null float64
ASTHMA3     867987 non-null float64
ASTHNOW     120875 non-null float64
CHCSCNCR    867983 non-null float64
CHCOCNCR    867987 non-null float64
CHCCOPD1    867981 non-null float64
HAVARTH3    867984 non-null float64
ADDEPEV2    867985 non-null float64
CHCKIDNY    867983 non-null float64
USEEQUIP    851069 non-null float64
SMOKE100    847083 non-null float64
SMOKDAY2    359144 non-null float64
STOPSMK2    161152 non-null float64
LASTSMK2    197171 non-null float64
USENOW3     846261 non-null floa
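
The year extraction at the end of the cell above can be checked on a few made-up records. This is only a sketch: the SEQNO and IYEAR values are invented, and it assumes, as the slicing in the notebook does, that SEQNO begins with the four-digit survey year. The names `toy` and `spilled` are illustrative.

```python
import pandas as pd

# Made-up records. SEQNO is assumed to begin with the four-digit survey year,
# which is what the slicing in the notebook relies on.
toy = pd.DataFrame({
    "SEQNO": [2012000001, 2012000002, 2013000001],
    "IYEAR": [2012, 2013, 2013],  # second interview ran into the next year
})

# Same extraction as in the notebook: first four characters of SEQNO.
toy["Rec_Year"] = toy["SEQNO"].astype(str).str[:4].astype(int)

# Count records whose interview year differs from the survey year.
spilled = int((toy["IYEAR"] != toy["Rec_Year"]).sum())

print(toy)
print("Interviews completed in a later calendar year:", spilled)
```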

In [78]:
df_hlth_race_yr = brfss.groupby(['_Health','_IMPRACE', 'Rec_Year'])
df_hlth_yr = brfss.groupby(['Rec_Year','_Health'])
df_yr = brfss.groupby(['Rec_Year'])

In [163]:
#brfss.groupby(['Rec_Year','_Health']).size().unstack()
# df_hlth_yr['Count'].last() / df_hlth_yr['Count'].sum()
df_hlth_yr = pd.DataFrame({'Count':brfss.groupby(['Rec_Year','_Health']).size()}).reset_index()
grouped1 = df_hlth_yr.groupby(['Rec_Year','_Health'])
grouped2 = df_hlth_yr.groupby(['Rec_Year'])
#grouped['Count'].last() #/ 
print(grouped1['Count'].sum())
print(grouped2['Count'].sum())
(grouped1['Count'].sum() / grouped2['Count'].sum())#.unstack()

Rec_Year  _Health
2012      0.0         48227
          1.0        244486
2013      0.0         49297
          1.0        250707
2014      0.0         43798
          1.0        231472
Name: Count, dtype: int64
Rec_Year
2012    292713
2013    300004
2014    275270
Name: Count, dtype: int64


Rec_Year  _Health
2012      0.0        0.164759
          1.0        0.835241
2013      0.0        0.164321
          1.0        0.835679
2014      0.0        0.159109
          1.0        0.840891
Name: Count, dtype: float64
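
The same within-year shares can be reproduced from the counts alone. The sketch below rebuilds the count table from the output above and divides each count by its year total using a grouped `transform`. The variable names `counts`, `year_totals`, and `share` are illustrative.

```python
import pandas as pd

# Counts copied from the output above, indexed by (Rec_Year, _Health).
counts = pd.Series(
    [48227, 244486, 49297, 250707, 43798, 231472],
    index=pd.MultiIndex.from_product(
        [[2012, 2013, 2014], [0.0, 1.0]], names=["Rec_Year", "_Health"]
    ),
    name="Count",
)

# Broadcast each year's total back onto its rows, then divide.
year_totals = counts.groupby(level="Rec_Year").transform("sum")
share = counts / year_totals

print(share.round(6))
```

Because `transform` returns a result aligned with the original index, the division needs no index juggling. On the row-level frame, `brfss.groupby('Rec_Year')['_Health'].value_counts(normalize=True)` should give the same shares in one step.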

In [161]:
x = pd.DataFrame({'Count':brfss.groupby(['Rec_Year','_Health']).size()})
y = pd.DataFrame({'Count':brfss.groupby(['Rec_Year','_Health']).size()}).reset_index().groupby(['Rec_Year'])
#(x['Count'].sum() / y['Count'].sum()).unstack()
print(x)
print(grouped1['Count'].sum())

                   Count
Rec_Year _Health        
2012     0.0       48227
         1.0      244486
2013     0.0       49297
         1.0      250707
2014     0.0       43798
         1.0      231472
Rec_Year  _Health
2012      0.0         48227
          1.0        244486
2013      0.0         49297
          1.0        250707
2014      0.0         43798
          1.0        231472
Name: Count, dtype: int64


In [153]:
print(grouped1['Count'].sum())
print(df_hlth_yr['Count'])

Rec_Year  _Health
2012      0.0         48227
          1.0        244486
2013      0.0         49297
          1.0        250707
2014      0.0         43798
          1.0        231472
Name: Count, dtype: int64
Rec_Year  _Health
2012      0.0         48227
          1.0        244486
2013      0.0         49297
          1.0        250707
2014      0.0         43798
          1.0        231472
Name: Count, dtype: int64


In [115]:
df_yr.size()#.transform(lambda x: x/sum(x))#.reset_index().transpose()

Rec_Year
2012    292713
2013    300004
2014    275270
dtype: int64

In [99]:
df_hlth_yr.size().unstack().reset_index() / df_yr.size().reset_index()



   Rec_Year       0.0  1.0
0       1.0  0.164759  NaN
1       1.0  0.164321  NaN
2       1.0  0.159109  NaN


In [102]:
df_hlth_yr.size().apply(lambda x: float(x) / df_hlth_yr.size().sum()*100)

Rec_Year  _Health
2012      0.0         5.556189
          1.0        28.167012
2013      0.0         5.679463
          1.0        28.883728
2014      0.0         5.045928
          1.0        26.667681
dtype: float64

In [73]:
# Share of each race/ethnicity group within each health status and year
(df_hlth_race_yr.size().unstack() / brfss.groupby(['_Health','Rec_Year']).size().unstack())#.reset_index().plot()

Rec_Year              2012      2013      2014
_Health _IMPRACE                              
0.0     1.0       0.696353  0.705824  0.705603
        2.0       0.146868  0.137615  0.130394
        5.0       0.156779  0.156561  0.164003
1.0     1.0       0.820382  0.823128  0.823344
        2.0       0.092013  0.087078  0.083937
        5.0       0.087604  0.089794  0.092720
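
A table of this kind, the race/ethnicity mix within each health status and year, can also be sketched with `pd.crosstab` and `normalize="columns"`. The rows below are made up for illustration and are not BRFSS data. The names `toy` and `race_share` are hypothetical.

```python
import pandas as pd

# Made-up respondent rows standing in for the filtered BRFSS frame.
toy = pd.DataFrame({
    "Rec_Year": [2012] * 6 + [2013] * 4,
    "_Health":  [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    "_IMPRACE": [1, 1, 2, 1, 5, 5, 1, 5, 1, 2],
})

# Rows are race/ethnicity codes and columns are (health status, year) pairs.
# normalize="columns" makes every column sum to 1.
race_share = pd.crosstab(
    toy["_IMPRACE"],
    [toy["_Health"], toy["Rec_Year"]],
    normalize="columns",
)

print(race_share)
```

The layout is transposed relative to the table above, but each (health status, year) column again sums to 1, and combinations that do not occur are filled with 0.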


In [74]:
%matplotlib inline
