# CSC 593

## Week 7

### Merge Errors

#### Not resolved automatically
`git checkout --ours PATH/FILE`

#### Resolved automatically (vim)
`:wq`

### Pandas 2

In [None]:
import numpy as np
import pandas as pd

The BRFSS (Behavioral Risk Factor Surveillance System) data is too big to put into GitHub. This cell downloads it from the CDC's website and unzips it into your `data` folder.

In [None]:
#Setup for examples.
from urllib.request import urlretrieve
import zipfile
from pathlib import Path

zf = '../data/brfss/LLCP2018ASC.zip'
if not Path(zf).exists():
    #parents=True also creates ../data if it does not exist yet.
    Path('../data/brfss').mkdir(parents=True, exist_ok=True)

    urlretrieve('https://www.cdc.gov/brfss/annual_data/2018/pdf/overview-2018-508.pdf', '../data/brfss/overview-2018-508.pdf')
    urlretrieve('https://www.cdc.gov/brfss/annual_data/2018/pdf/codebook18_llcp-v2-508.pdf', '../data/brfss/codebook18_llcp-v2-508.pdf')
    
    urlretrieve('https://www.cdc.gov/brfss/annual_data/2018/files/LLCP2018ASC.zip', zf)

fwff = '../data/brfss/LLCP2018.ASC'
if not Path(fwff).exists():
    with zipfile.ZipFile(zf) as z:
        z.extractall('../data/brfss')

Load the BRFSS data and set a couple of data types explicitly. (More supported data types are listed at https://docs.scipy.org/doc/numpy/user/basics.types.html)

In [None]:
names= ['state', 'imonth', 'iday', 
        'iyear', 'dispcode','genhlth', 
        'physhlth',
        'menthlth', 'poorhlth', 'hlthpln1',
        'persdoc2', 'medcost', 'checkup1',
        'WEIGHT2', 'HEIGHT3']
cols = [
    (1, 3),
    (18, 20),
    (20, 22),
    (22, 27),
    (31, 35),
    (89, 90),
    (90, 92),
    (92, 94),
    (94, 96),
    (96, 97),
    (97, 98),
    (98, 99),
    (99, 100),
    (176, 180),
    (180, 184)
]
types= {
    'WEIGHT2': str, 
    'HEIGHT3': str,
}
brfss = pd.read_fwf(fwff, 
                    names=names,
                    colspecs=cols,
                    dtype=types)
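
`colspecs` takes zero-based, half-open `(start, end)` intervals, so `(18, 20)` reads the 19th and 20th characters of each line. Here is a minimal sketch on made-up data; the three-field layout and the sample values are invented for illustration and are not taken from the BRFSS file:

```python
import io
import pandas as pd

# Two made-up fixed-width records: state (2 chars), month (2 chars), weight (4 chars).
raw = "01020150\n36119070\n"

# (0, 2) covers the first two characters; the end position is excluded.
demo = pd.read_fwf(io.StringIO(raw),
                   names=['state', 'imonth', 'WEIGHT2'],
                   colspecs=[(0, 2), (2, 4), (4, 8)],
                   dtype={'WEIGHT2': str})  # str keeps the leading zero in '0150'
print(demo)
print(demo.dtypes)
```

Reading `WEIGHT2` as `str` matters because the first character is a unit flag; if the column were parsed as a number, the leading zero would be lost.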

#### Searching

In [None]:
#Get an individual column.
brfss['WEIGHT2']

In [None]:
#For multiple columns, use a list as a subscript.
brfss[['WEIGHT2', 'HEIGHT3']]

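The `loc` and `iloc` indexers select rows and columns by label and by integer position, respectively. Both are used with square brackets rather than called like methods (see the table on pp. 144-5 of *Python for Data Analysis*).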

In [None]:
#Get the first row.
brfss.loc[0]

In [None]:
#Get WEIGHT2 from the third row.
brfss.loc[2, 'WEIGHT2']

In [None]:
#Same thing, but using the integer index instead of the column name.
brfss.iloc[2,-2]

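In [None]:
#at gets a single value by row label and column name.
brfss.at[2, 'WEIGHT2']

In [None]:
#iat gets a single value by integer position.
brfss.iat[2, -2]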

In [None]:
#loc slices include the end label, so this returns rows 0 through 100.
brfss.loc[:100, ['HEIGHT3', 'WEIGHT2']]
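
The difference between label-based and position-based slicing is easy to miss, so here is a small sketch on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with the default integer index (0, 1, 2, 3).
toy = pd.DataFrame({'HEIGHT3': ['0506', '0601', '9170', '0511'],
                    'WEIGHT2': ['0150', '0200', '9070', '0135']})

by_label = toy.loc[:2]      # labels 0, 1 and 2: the end label is included
by_position = toy.iloc[:2]  # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))

# at and iat return one value, by label and by position respectively.
print(toy.at[2, 'WEIGHT2'], toy.iat[2, -1])
```

With the default index the labels and positions look identical, but `loc[:2]` still returns one more row than `iloc[:2]`.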

In [None]:
#Find rows based on a value
brfss[brfss['WEIGHT2']=='9999']

The [`shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute gives you the number of rows and columns in your DataFrame as a tuple. It is an attribute, not a method, so it takes no parentheses.

In [None]:
print(brfss.shape)
#Drop any rows without weight
brfss.dropna(subset=['WEIGHT2'], inplace=True)
print(brfss.shape)
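
To see how `subset` affects `dropna()` without loading the full file, here is a minimal sketch on a toy DataFrame (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'WEIGHT2': ['0150', np.nan, '9070'],
                    'HEIGHT3': ['0506', '0601', np.nan]})

before = toy.shape                      # (rows, columns)
kept = toy.dropna(subset=['WEIGHT2'])   # only WEIGHT2 is checked for NaN
after = kept.shape

print(before, after)
```

The row with a missing `HEIGHT3` survives because only the columns listed in `subset` are checked.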

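[`np.isin()`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.isin.html) tests each value for membership in a list. With `invert=True` it flags the values that are *not* in the list.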

In [None]:
brfss[np.isin(brfss['WEIGHT2'], ['7777', '9999'], invert=True)].shape
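
pandas has its own `Series.isin()` method, which can be negated with `~` to get the same mask. A short sketch comparing the two on made-up values:

```python
import numpy as np
import pandas as pd

weights = pd.Series(['0150', '7777', '9070', '9999'])
codes = ['7777', '9999']

mask_np = np.isin(weights, codes, invert=True)  # NumPy boolean array
mask_pd = ~weights.isin(codes)                  # pandas boolean Series

print(weights[mask_np].tolist())
print(weights[mask_pd].tolist())
```

Either mask can be used as a subscript; which one to use is mostly a matter of taste.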

#### Derived Fields

In [None]:
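#Drop "Don't know/Not sure" (7777) and "Refused" (9999).
brfss = brfss[np.isin(brfss['WEIGHT2'], ['7777', '9999'], invert=True)]
#Copy the filtered rows so that adding columns below does not raise a SettingWithCopyWarning.
brfss = brfss[~brfss.WEIGHT2.str.startswith('1')].copy()

In [None]:
#The first character is the unit flag (9 means kilograms); the remaining three digits are the weight.
brfss['wtunit'] = brfss.WEIGHT2.str[0].astype(np.uint8)
brfss['wt'] = brfss.WEIGHT2.str[1:].astype(np.uint16)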
brfss

[`np.where()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html) provides the equivalent of an *if-then-else* statement on each observation in a DataFrame:

In [None]:
lbsperkg = 2.205
brfss['wtlbs'] = np.where(brfss.wtunit==9, brfss.wt*lbsperkg, brfss.wt).astype(np.int16)

In [None]:
brfss[brfss.wtunit==9]
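
The same steps can be tried on a toy DataFrame to check the arithmetic by hand (the three weights are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'WEIGHT2': ['0150', '9070', '0200']})
toy['wtunit'] = toy.WEIGHT2.str[0].astype(np.uint8)   # unit flag
toy['wt'] = toy.WEIGHT2.str[1:].astype(np.uint16)     # three-digit weight

lbsperkg = 2.205
# Where wtunit is 9, convert kilograms to pounds; otherwise keep the value as is.
toy['wtlbs'] = np.where(toy.wtunit == 9, toy.wt * lbsperkg, toy.wt).astype(np.int16)
print(toy)
```

Note that `astype(np.int16)` truncates rather than rounds, so 70 kg (154.35 lbs) becomes 154.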

##### Practice

Create a `htinches` column from the `brfss.HEIGHT3` column. 

1. Remove rows where `HEIGHT3` is 7777 ("Don't know/Not sure"), 9999 ("Refused"), or NaN.
2. If the first character of `HEIGHT3` is '9', multiply the remaining three digits by `cmtoin` (defined below) to get height in inches.
3. If the first character of `HEIGHT3` is '0', the second character is feet, and the third and fourth are inches ('0601' means six feet, one inch). Convert this to inches.

See page 36 of the codebook for details on the `HEIGHT3` field.

In [None]:
cmtoin = 0.3937

#### Summary statistics and aggregation

In [None]:
brfss.groupby(['persdoc2', 'poorhlth']).size() #or .mean()

In [None]:
# | means 'or', applied row by row
# ph=1 if you were sick more than 5 days, 0 otherwise
# (values above 30 are codes, not day counts, so they also get 0):
brfss['ph'] = np.where((brfss['poorhlth'] > 30) | (brfss['poorhlth'] <= 5) | (brfss['poorhlth'].isnull()), 0, 1)
brfss

In [None]:
#brfss.groupby(['persdoc2', 'ph']).describe()
brfss.ph.groupby(brfss.persdoc2).size()

In [None]:
#Calculate percentages instead of raw numbers.
docph = brfss.groupby(['persdoc2', 'ph']).size()
docph.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
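
An alternative sketch for the same calculation uses `transform('sum')`, which returns the group total aligned to every row, so the division needs no `lambda`. The toy data below is made up:

```python
import pandas as pd

toy = pd.DataFrame({'persdoc2': [1, 1, 1, 2, 2, 3],
                    'ph':       [0, 0, 1, 0, 1, 1]})

counts = toy.groupby(['persdoc2', 'ph']).size()
# Divide each count by the total for its persdoc2 group (index level 0).
pct = 100 * counts / counts.groupby(level=0).transform('sum')
print(pct)
```

The percentages within each `persdoc2` group add up to 100, which is a quick way to check the result.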

We can bin or categorize numeric variables with [`pd.cut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html).

In [None]:
## 88 means 'none'; want to bin it separately from "Don't know" (77) and "refused" (99)
brfss.loc[brfss.poorhlth==88, 'poorhlth']=51
#pd.cut bins exclude their left edge: start at 0 so that 1 is kept, end at 99 so that 77 and 99 are kept.
bins = [0, 5, 10, 20, 30, 51, 99]
brfss['phcats'] = pd.cut(brfss.poorhlth, bins, 
                         labels=['5 or fewer', '6 to 10', 
                                 '11 to 20', '21 to 30', 'none', "don't know/refused"])
brfss
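
By default `pd.cut()` builds intervals that are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. A small sketch on made-up day counts:

```python
import pandas as pd

days = pd.Series([1, 5, 6, 30, 51])

# Bins are (0, 5], (5, 10], (10, 30], (30, 51].
cats = pd.cut(days, bins=[0, 5, 10, 30, 51],
              labels=['1-5', '6-10', '11-30', 'none'])
print(cats.tolist())

# With 1 as the lowest edge, the value 1 is not in (1, 5] and becomes NaN.
missing = pd.cut(days, bins=[1, 5, 10, 30, 51]).isna()
print(missing.tolist())
```

Passing `include_lowest=True` is another way to keep the lowest edge in the first bin.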

##### Practice

1. Create a new column that divides `iday` into 3 bins (1-10, 11-20, 21+).
2. Group `brfss` by `hlthpln1` and `medcost` and create a table like the one above (for `persdoc2` and `ph`) with percentages for each subgroup.