# Data

The data come from the [CDC Behavioral Risk Factor Surveillance System](https://www.cdc.gov/brfss/about/index.htm) (BRFSS), specifically the [2017 Survey](https://www.cdc.gov/brfss/annual_data/annual_2017.html).

The [codebook (PDF)](https://www.cdc.gov/brfss/annual_data/2017/pdf/codebook17_llcp-v2-508.pdf) has the details for each field.

The data are in a fixed-width [ASCII file](https://www.cdc.gov/brfss/annual_data/2017/files/LLCP2017ASC.zip), which is a royal pain to read. Here's the [file layout](https://www.cdc.gov/brfss/annual_data/2017/llcp_varlayout_17_onecolumn.html) in tabular form.


In [1]:
import numpy as np
import pandas as pd

In [2]:
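FF = pd.read_csv('/home/vpoduri/DataFiles/field_format.csv')
# The layout uses 1-based starting columns; convert to 0-based, half-open
# Python slice bounds so that line[PyStart:PyStop] extracts the field
FF['PyStart'] = FF['Starting Column'] - 1
FF['PyStop'] = FF['PyStart'] + FF['Field Length']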

FF[FF['Variable Name'].str.contains('WEIGHT2|WTKG3|HEIGHT3|HTIN4|HTM4')]

# WTKG3 = WEIGHT2 / 2.2046      (weight in kg, from weight in lb)
# HTM4 = HTIN4 * 0.0254         (height in m, from height in inches)


Unnamed: 0,Starting Column,Variable Name,Field Length,PyStart,PyStop
69,183,WEIGHT2,4,182,186
70,187,HEIGHT3,4,186,190
288,2034,HTIN4,3,2033,2036
289,2037,HTM4,3,2036,2039
290,2040,WTKG3,5,2039,2044
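As an alternative to slicing each line by hand, `pandas.read_fwf` accepts half-open `(start, stop)` column intervals, which is exactly what `PyStart` and `PyStop` hold. The sketch below is only an illustration: the three-field layout and the two records are made up and do not use the real BRFSS offsets.

```python
import io
import pandas as pd

# Hypothetical miniature layout in the same shape as the CDC layout table
# (1-based starting column plus field length); NOT the real BRFSS offsets.
layout = pd.DataFrame({
    'Variable Name': ['WEIGHT2', 'HTIN4', 'WTKG3'],
    'Starting Column': [1, 5, 8],
    'Field Length': [4, 3, 5],
})
layout['PyStart'] = layout['Starting Column'] - 1
layout['PyStop'] = layout['PyStart'] + layout['Field Length']

# Two made-up fixed-width records, 12 characters each.
raw = "016206507348\n021107109571\n"

# read_fwf wants a list of (start, stop) tuples, 0-based and half-open.
colspecs = [(int(a), int(b)) for a, b in zip(layout['PyStart'], layout['PyStop'])]

demo = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    header=None,                          # the file has no header row
    names=list(layout['Variable Name']),
    dtype=str,                            # keep raw strings so cleaning can happen later
)
print(demo)
```

Reading everything as strings keeps the special codes intact until they have been cleaned out, which is also how the list-comprehension approach behaves.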


In [3]:
# Read only the columns we are interested in:
# weight in lb and kg, and height in inches and m
with open('/home/vpoduri/DataFiles/BRFSS_2017') as f:
    WH = pd.DataFrame([(line[182:186], line[2039:2044], line[2033:2036], line[2036:2039])
                       for line in f])

WH.columns = ['WEIGHT2', 'WTKG3', 'HTIN4', 'HTM4']  # Weight(lb), Weight(kg), Height(in), Height(m)

# Clean some data: mark as missing any field that is blank, starts with 777
# (e.g. 7777), or starts with 9 (e.g. 9999), then drop those rows
WH = WH.replace(r'^\s*$|^777|^9', np.nan, regex=True)
WH = WH.dropna()
WH.reset_index(drop=True, inplace=True)

# Transform to sensible numeric values
WH = WH.astype(int)
# WTKG3 and HTM4 are stored as integers with two implied decimal places
WH[['WTKG3', 'HTM4']] = WH[['WTKG3', 'HTM4']] / 100


WH.info()
WH[0:5]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414032 entries, 0 to 414031
Data columns (total 4 columns):
WEIGHT2    414032 non-null int64
WTKG3      414032 non-null float64
HTIN4      414032 non-null int64
HTM4       414032 non-null float64
dtypes: float64(2), int64(2)
memory usage: 12.6 MB


,WEIGHT2,WTKG3,HTIN4,HTM4
0,162,73.48,65,1.65
1,211,95.71,71,1.8
2,195,88.45,74,1.88
3,170,77.11,67,1.7
4,140,63.5,65,1.65
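To see what the cleaning pattern does, it helps to run it on a handful of made-up raw strings. When the replacement value is `NaN`, `replace(..., regex=True)` swaps out the whole cell if the pattern matches, so `^777` and `^9` only look at the leading characters. The values below are illustrative. Reading `7777` as "don't know" and a leading `9` as a metric or refused response is my understanding of the codebook, not something this notebook states.

```python
import numpy as np
import pandas as pd

# Made-up raw field values, kept as strings, in the style of WEIGHT2.
raw = pd.DataFrame({'WEIGHT2': ['0162', '7777', '9080', '9999', '    ', '0095']})

# Same pattern as the cleaning step: blank, starts with 777, or starts with 9.
cleaned = raw.replace(r'^\s*$|^777|^9', np.nan, regex=True)
kept = cleaned.dropna().reset_index(drop=True)
print(kept)
```

Note that `'0095'` survives because its first character is `0`, not `9`. The pattern therefore relies on the numeric fields being padded on the left, which is worth keeping in mind before reusing it on other columns.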


In [4]:
WH.describe()

,WEIGHT2,WTKG3,HTIN4,HTM4
count,414032.0,414032.0,414032.0,414032.0
mean,180.109873,81.69618,66.93452,1.700209
std,45.597974,20.683077,4.16689,0.106127
min,50.0,22.68,36.0,0.91
25%,150.0,68.04,64.0,1.63
50%,175.0,79.38,67.0,1.7
75%,203.0,92.08,70.0,1.78
max,604.0,273.97,93.0,2.36
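The derived columns can be spot-checked against the reported ones. The sketch below re-enters the five rows shown by `WH[0:5]` by hand and assumes conversion factors of 2.2046 lb per kg and 0.0254 m per inch, rounded to two decimals. It is a sanity check under those assumptions, not the CDC's documented procedure.

```python
import pandas as pd

# The first five cleaned rows, copied from the output above.
sample = pd.DataFrame({
    'WEIGHT2': [162, 211, 195, 170, 140],
    'WTKG3': [73.48, 95.71, 88.45, 77.11, 63.50],
    'HTIN4': [65, 71, 74, 67, 65],
    'HTM4': [1.65, 1.80, 1.88, 1.70, 1.65],
})

# Recompute the metric columns from the imperial ones.
kg_from_lb = (sample['WEIGHT2'] / 2.2046).round(2)
m_from_in = (sample['HTIN4'] * 0.0254).round(2)

# Allow a small tolerance for floating-point noise.
weight_ok = bool(((kg_from_lb - sample['WTKG3']).abs() < 0.005).all())
height_ok = bool(((m_from_in - sample['HTM4']).abs() < 0.005).all())
print(weight_ok, height_ok)
```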


In [5]:
WH.to_pickle('/home/vpoduri/DataFiles/WH.pkl')
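A later notebook can restore the frame with `pd.read_pickle`. Here is a minimal round-trip sketch, using a temporary directory and a two-row stand-in frame rather than the path above:

```python
import os
import tempfile
import pandas as pd

# Two-row stand-in for WH; the real frame has four columns and ~414k rows.
stand_in = pd.DataFrame({'WEIGHT2': [162, 211], 'WTKG3': [73.48, 95.71]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'WH.pkl')  # hypothetical location for the demo
    stand_in.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle preserves dtypes and the index, so the frames compare equal.
same = restored.equals(stand_in)
print(same)
```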