##### Week 5:
Building a Classifier
- Overview of Machine Learning
- Feature Engineering
- One hot encoding to encode categorical variables for use in a model
- Creating training and test data

Coding tasks:
The file brfss.csv contains a subset of the responses and variables from the [2019 Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/). This dataset can be downloaded using this link: [https://drive.google.com/file/d/1acJKmT2aFf2nZl_VYLE897yx0LPNajoY/view?usp=sharing](https://drive.google.com/file/d/1acJKmT2aFf2nZl_VYLE897yx0LPNajoY/view?usp=sharing).

A detailed Codebook can be found [here](https://www.cdc.gov/brfss/annual_data/2019/pdf/codebook19_llcp-v2-508.HTML).

Our target variable is the CHECKUP1 column, which contains the responses to the question "About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]" Specifically, we want to predict whether someone gives an answer other than "Within past year (anytime less than 12 months ago)".

First, create a new column, "target", by converting this to a binary outcome. After you do this, drop the CHECKUP1 column from your dataframe so that you don't accidentally make predictions based on it.
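
One way to build the binary column is a vectorized comparison, sketched here on a small made-up frame. The `demo` DataFrame and its rows are illustrative and not taken from brfss.csv. The comparison yields True for any answer other than "Within past year", which is the outcome the task description asks us to predict.

```python
import pandas as pd

WITHIN_YEAR = "Within past year (anytime less than 12 months ago)"

# Hypothetical miniature stand-in for the BRFSS data
demo = pd.DataFrame({
    "MEDCOST": ["No", "Yes", "No", "No"],
    "CHECKUP1": [WITHIN_YEAR, "5 or more years ago", WITHIN_YEAR, "Never"],
})

# True when the respondent gave any answer other than "within past year"
demo["target"] = demo["CHECKUP1"] != WITHIN_YEAR

# Drop the source column so it cannot leak into the predictors
demo = demo.drop(columns="CHECKUP1")
print(demo)
```

Judging by the `head()` output in the notebook, the `target` column already present in the loaded file appears to use the opposite polarity (True for "Within past year"), so it is worth checking which convention a given column follows before interpreting model output.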

Then, experiment with building a logistic regression model to predict the target variable using one or more of the other columns. Note that you will need to convert the predictor columns into dummy variables prior to fitting a model. What do you find?
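
As a small illustration of the dummy-variable step, `pd.get_dummies` can expand each categorical column into indicator columns. The frame below is invented for illustration. `drop_first=True` is an optional choice, not something the assignment requires; it removes one redundant indicator per variable, since the dropped category is implied when all the others are 0.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror BRFSS but the rows are invented
demo = pd.DataFrame({
    "GENHLTH": ["Good", "Fair", "Good", "Poor"],
    "HLTHPLN1": ["Yes", "Yes", "No", "Yes"],
})

# One indicator column per category
full = pd.get_dummies(demo, columns=["GENHLTH", "HLTHPLN1"])

# Same encoding, minus the first (alphabetically sorted) category of each variable
reduced = pd.get_dummies(demo, columns=["GENHLTH", "HLTHPLN1"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```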

In [1]:
import pandas as pd

In [41]:
BRFSS = pd.read_csv("../data/brfss.csv")

In [42]:
BRFSS.head(5)

Unnamed: 0,GENHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,_RFHYPE5,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,...,EXERANY2,_METSTAT,_URBSTAT,_IMPRACE,_RFBMI5,_RFSMOK3,_RFBING5,_RFDRHV7,_TOTINDA,target
0,Good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,Yes,No,No,No,...,No,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","Black, Non-Hispanic",Yes,No,No,No,No physical activity or exercise in last 30 days,True
1,Fair,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,Yes,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True
2,Good,Yes,More than one,No,Within past year (anytime less than 12 months ...,Yes,No,No,No,No,...,Yes,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","Black, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True
3,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,Yes,"Nonmetropolitan counties (_URBNRRL = 5,6)",Rural counties (_URBNRRL = 6),"White, Non-Hispanic",Yes,Yes,No,No,Had physical activity or exercise,True
4,Poor,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,Yes,No,No,No,...,No,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,No physical activity or exercise in last 30 days,True


In [43]:
BRFSS.shape

(262049, 40)

In [44]:
BRFSS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262049 entries, 0 to 262048
Data columns (total 40 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   GENHLTH   262049 non-null  object
 1   HLTHPLN1  262049 non-null  object
 2   PERSDOC2  262049 non-null  object
 3   MEDCOST   262049 non-null  object
 4   CHECKUP1  262049 non-null  object
 5   _RFHYPE5  262049 non-null  object
 6   TOLDHI2   262049 non-null  object
 7   CVDINFR4  262049 non-null  object
 8   CVDCRHD4  262049 non-null  object
 9   CVDSTRK3  262049 non-null  object
 10  ASTHMA3   262049 non-null  object
 11  CHCSCNCR  262049 non-null  object
 12  CHCOCNCR  262049 non-null  object
 13  CHCCOPD2  262049 non-null  object
 14  ADDEPEV3  262049 non-null  object
 15  CHCKDNY2  262049 non-null  object
 16  DIABETE4  262049 non-null  object
 17  HAVARTH4  262049 non-null  object
 18  MARITAL   262049 non-null  object
 19  EDUCA     262049 non-null  object
 20  RENTHOM1  262049 non-null 

In [45]:
BRFSS['CHECKUP1'].value_counts()

Within past year (anytime less than 12 months ago)         215875
Within past 2 years (1 year but less than 2 years ago)      24212
Within past 5 years (2 years but less than 5 years ago)     11880
5 or more years ago                                          9325
Never                                                         757
Name: CHECKUP1, dtype: int64
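
These counts show a strong class imbalance, which sets a baseline worth keeping in mind when judging a classifier. As a quick sketch using only the counts printed above (the shortened dictionary keys are our own labels), the share of respondents in the largest category is the accuracy a model would reach by always predicting that category.

```python
# Counts copied from the value_counts() output above
counts = {
    "Within past year": 215875,
    "Within past 2 years": 24212,
    "Within past 5 years": 11880,
    "5 or more years ago": 9325,
    "Never": 757,
}

total = sum(counts.values())
majority_share = counts["Within past year"] / total

print(total)                     # 262049, matching the row count from BRFSS.shape
print(round(majority_share, 4))  # 0.8238
```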

In [46]:
BRFSS['MEDCOST'].value_counts()

No     237293
Yes     24756
Name: MEDCOST, dtype: int64

In [47]:
pd.crosstab(BRFSS['MEDCOST'], BRFSS['CHECKUP1'])
#crosstab useful for comparing two categorical variables 

CHECKUP1,5 or more years ago,Never,Within past 2 years (1 year but less than 2 years ago),Within past 5 years (2 years but less than 5 years ago),Within past year (anytime less than 12 months ago)
MEDCOST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
No,7224,611,20611,9430,199417
Yes,2101,146,3601,2450,16458


In [48]:
pd.crosstab(BRFSS['MEDCOST'], BRFSS['CHECKUP1'], normalize = True)
# normalize = True reports each cell as a proportion of all respondents

CHECKUP1,5 or more years ago,Never,Within past 2 years (1 year but less than 2 years ago),Within past 5 years (2 years but less than 5 years ago),Within past year (anytime less than 12 months ago)
MEDCOST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
No,0.027567,0.002332,0.078653,0.035986,0.760991
Yes,0.008018,0.000557,0.013742,0.009349,0.062805
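
With `normalize = True`, each cell is a share of all respondents, which makes the two MEDCOST rows hard to compare directly. A possible alternative is `normalize='index'`, which makes each row sum to 1. The sketch below uses a tiny invented frame rather than the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for two BRFSS columns
demo = pd.DataFrame({
    "MEDCOST": ["No", "No", "No", "No", "Yes", "Yes"],
    "CHECKUP1": ["Within past year", "Within past year", "Within past year",
                 "Never", "Within past year", "Never"],
})

# Row-wise proportions: the distribution of CHECKUP1 within each MEDCOST group
row_shares = pd.crosstab(demo["MEDCOST"], demo["CHECKUP1"], normalize="index")
print(row_shares)
```

Applied to the raw counts in the earlier crosstab, the "Within past year" share would be roughly 0.84 for the MEDCOST "No" row (199417 of 237293) and roughly 0.66 for the "Yes" row (16458 of 24756).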


In [49]:
# Label each respondent "NO" if the last checkup was within the past year, otherwise "YES"
within_year = BRFSS['CHECKUP1'] == "Within past year (anytime less than 12 months ago)"
BRFSS["target_info"] = within_year.map({True: "NO", False: "YES"})

In [50]:
BRFSS.shape

(262049, 41)

In [51]:
BRFSS.head()

Unnamed: 0,GENHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,_RFHYPE5,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,...,_METSTAT,_URBSTAT,_IMPRACE,_RFBMI5,_RFSMOK3,_RFBING5,_RFDRHV7,_TOTINDA,target,target_info
0,Good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","Black, Non-Hispanic",Yes,No,No,No,No physical activity or exercise in last 30 days,True,NO
1,Fair,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True,NO
2,Good,Yes,More than one,No,Within past year (anytime less than 12 months ...,Yes,No,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","Black, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
3,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)",Rural counties (_URBNRRL = 6),"White, Non-Hispanic",Yes,Yes,No,No,Had physical activity or exercise,True,NO
4,Poor,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,No physical activity or exercise in last 30 days,True,NO


In [52]:
BRFSS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262049 entries, 0 to 262048
Data columns (total 41 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   GENHLTH      262049 non-null  object
 1   HLTHPLN1     262049 non-null  object
 2   PERSDOC2     262049 non-null  object
 3   MEDCOST      262049 non-null  object
 4   CHECKUP1     262049 non-null  object
 5   _RFHYPE5     262049 non-null  object
 6   TOLDHI2      262049 non-null  object
 7   CVDINFR4     262049 non-null  object
 8   CVDCRHD4     262049 non-null  object
 9   CVDSTRK3     262049 non-null  object
 10  ASTHMA3      262049 non-null  object
 11  CHCSCNCR     262049 non-null  object
 12  CHCOCNCR     262049 non-null  object
 13  CHCCOPD2     262049 non-null  object
 14  ADDEPEV3     262049 non-null  object
 15  CHCKDNY2     262049 non-null  object
 16  DIABETE4     262049 non-null  object
 17  HAVARTH4     262049 non-null  object
 18  MARITAL      262049 non-null  object
 19  ED

In [53]:
BRFSS.tail(15)

Unnamed: 0,GENHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,_RFHYPE5,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,...,_METSTAT,_URBSTAT,_IMPRACE,_RFBMI5,_RFSMOK3,_RFBING5,_RFDRHV7,_TOTINDA,target,target_info
262034,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,No,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
262035,Very good,Yes,"Yes, only one",No,5 or more years ago,No,No,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,False,YES
262036,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
262037,Very good,Yes,No,No,Within past year (anytime less than 12 months ...,No,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True,NO
262038,Very good,No,More than one,No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)","Urban counties (_URBNRRL = 1,2,3,4,5)","Other race, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
262039,Excellent,Yes,No,No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)",Rural counties (_URBNRRL = 6),"White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True,NO
262040,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True,NO
262041,Good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
262042,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,Yes,No,No,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)",Rural counties (_URBNRRL = 6),"White, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
262043,Good,Yes,More than one,No,Within past year (anytime less than 12 months ...,Yes,Yes,No,Yes,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)",Rural counties (_URBNRRL = 6),"White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True,NO


In [54]:
#categorical_variables = ['CHECKUP1']
#BRFSS = pd.get_dummies(BRFSS, columns = categorical_variables)

In [55]:
BRFSS.head()

Unnamed: 0,GENHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,_RFHYPE5,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,...,_METSTAT,_URBSTAT,_IMPRACE,_RFBMI5,_RFSMOK3,_RFBING5,_RFDRHV7,_TOTINDA,target,target_info
0,Good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,Yes,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","Black, Non-Hispanic",Yes,No,No,No,No physical activity or exercise in last 30 days,True,NO
1,Fair,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,Had physical activity or exercise,True,NO
2,Good,Yes,More than one,No,Within past year (anytime less than 12 months ...,Yes,No,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","Black, Non-Hispanic",Yes,No,No,No,Had physical activity or exercise,True,NO
3,Very good,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,No,No,No,No,...,"Nonmetropolitan counties (_URBNRRL = 5,6)",Rural counties (_URBNRRL = 6),"White, Non-Hispanic",Yes,Yes,No,No,Had physical activity or exercise,True,NO
4,Poor,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,Yes,No,No,No,...,"Metropolitan counties (_URBNRRL = 1,2,3,4)","Urban counties (_URBNRRL = 1,2,3,4,5)","White, Non-Hispanic",No,No,No,No,No physical activity or exercise in last 30 days,True,NO


In [56]:
BRFSS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262049 entries, 0 to 262048
Data columns (total 41 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   GENHLTH      262049 non-null  object
 1   HLTHPLN1     262049 non-null  object
 2   PERSDOC2     262049 non-null  object
 3   MEDCOST      262049 non-null  object
 4   CHECKUP1     262049 non-null  object
 5   _RFHYPE5     262049 non-null  object
 6   TOLDHI2      262049 non-null  object
 7   CVDINFR4     262049 non-null  object
 8   CVDCRHD4     262049 non-null  object
 9   CVDSTRK3     262049 non-null  object
 10  ASTHMA3      262049 non-null  object
 11  CHCSCNCR     262049 non-null  object
 12  CHCOCNCR     262049 non-null  object
 13  CHCCOPD2     262049 non-null  object
 14  ADDEPEV3     262049 non-null  object
 15  CHCKDNY2     262049 non-null  object
 16  DIABETE4     262049 non-null  object
 17  HAVARTH4     262049 non-null  object
 18  MARITAL      262049 non-null  object
 19  ED

In [57]:
BRFSS = BRFSS.drop(columns="CHECKUP1")


In [58]:
BRFSS.shape

(262049, 40)

In [59]:
BRFSS['GENHLTH'].value_counts()

Very good    90955
Good         81809
Excellent    42566
Fair         34316
Poor         12403
Name: GENHLTH, dtype: int64

In [60]:
BRFSS['HLTHPLN1'].value_counts()

Yes    244656
No      17393
Name: HLTHPLN1, dtype: int64

In [69]:
categorical_variables = ['GENHLTH', 'TOLDHI2', '_RFHYPE5', 'HLTHPLN1']
BRFSS = pd.get_dummies(BRFSS, columns = categorical_variables)

In [70]:
X = BRFSS[['HLTHPLN1_No', 'HLTHPLN1_Yes']]   # Predictor variables (as a DataFrame)
y = BRFSS['target']                          # Target variable (as a Series)

In [85]:
X.head(10)

Unnamed: 0,HLTHPLN1_No,HLTHPLN1_Yes
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
5,0,1
6,0,1
7,0,1
8,0,1
9,0,1


In [72]:
from sklearn.model_selection import train_test_split

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify = y,     # Keep the same proportions of the target in the training and test data
                                                    test_size = 0.25,
                                                    random_state = 321)
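
To see what `stratify` does, here is a small sketch on made-up labels (80 True, 20 False). It assumes scikit-learn is installed; `X_demo` and `y_demo` are invented placeholders, not BRFSS data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80% True, 20% False
y_demo = [True] * 80 + [False] * 20
X_demo = [[i] for i in range(100)]   # placeholder single-feature rows

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    stratify=y_demo,      # preserve the True/False ratio in both splits
    test_size=0.25,
    random_state=321,
)

print(sum(y_tr) / len(y_tr))   # share of True in the training split
print(sum(y_te) / len(y_te))   # share of True in the test split
```

Both shares should match the overall 80% rate, which is the property that matters when the target is as imbalanced as it is here.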

In [74]:
from sklearn.linear_model import LogisticRegression

In [75]:
logreg = LogisticRegression()         # Create a logistic regression model
logreg.fit(X_train, y_train)          # Fit it to the training data

LogisticRegression()

In [76]:
logreg.intercept_

array([0.61749877])

In [77]:
logreg.coef_

array([[-0.45281777,  1.06994915]])
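
The intercept and coefficients above are on the log-odds scale. As a worked check, passing them through the logistic function should reproduce the probability the fitted model assigns to `target == True`. The sketch below hard-codes the printed values. The helper name `sigmoid` is our own, and the coefficient order is assumed to follow the column order of `X` (`HLTHPLN1_No`, then `HLTHPLN1_Yes`).

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Values copied from logreg.intercept_ and logreg.coef_ above
intercept = 0.61749877
coef_no, coef_yes = -0.45281777, 1.06994915

# Each respondent has exactly one of the two dummy columns set to 1
p_with_plan = sigmoid(intercept + coef_yes)
p_without_plan = sigmoid(intercept + coef_no)

print(round(p_with_plan, 4))     # 0.8439
print(round(p_without_plan, 4))
```

The value for respondents with a health plan agrees with the second column of the `predict_proba` output in this notebook. Both probabilities come out above 0.5, so with this single predictor the model would assign the same class to every respondent at the default threshold.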

In [86]:
y_pred_prob = logreg.predict_proba(X_test)

In [87]:
y_pred_prob

array([[0.15611176, 0.84388824],
       [0.15611176, 0.84388824],
       [0.15611176, 0.84388824],
       ...,
       [0.15611176, 0.84388824],
       [0.15611176, 0.84388824],
       [0.15611176, 0.84388824]])

In [88]:
i = 35

print('Patient Information:\n{}'.format(X_test.iloc[i]))
print('---------------------------------')
print('Predicted Probability of Doctor visit: {}'.format(y_pred_prob[i]))
print('Actual: {}'.format(y_test.iloc[i]))

Patient Information:
HLTHPLN1_No     0
HLTHPLN1_Yes    1
Name: 231294, dtype: uint8
---------------------------------
Predicted Probability of Doctor visit: [0.15611176 0.84388824]
Actual: False


In [83]:
BRFSS.head(10)

Unnamed: 0,PERSDOC2,MEDCOST,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,CHCSCNCR,CHCOCNCR,CHCCOPD2,ADDEPEV3,...,GENHLTH_Fair,GENHLTH_Good,GENHLTH_Poor,GENHLTH_Very good,TOLDHI2_No,TOLDHI2_Yes,_RFHYPE5_No,_RFHYPE5_Yes,HLTHPLN1_No,HLTHPLN1_Yes
0,"Yes, only one",No,No,No,No,No,No,No,No,No,...,0,1,0,0,0,1,0,1,0,1
1,"Yes, only one",No,No,No,No,No,No,No,No,No,...,1,0,0,0,1,0,1,0,0,1
2,More than one,No,No,No,No,No,No,No,No,No,...,0,1,0,0,1,0,0,1,0,1
3,"Yes, only one",No,No,No,No,Yes,No,No,Yes,No,...,0,0,0,1,1,0,1,0,0,1
4,"Yes, only one",No,No,No,No,Yes,No,No,Yes,No,...,0,0,1,0,0,1,1,0,0,1
5,"Yes, only one",No,No,No,No,No,No,No,No,No,...,0,0,0,1,1,0,1,0,0,1
6,"Yes, only one",No,No,No,No,No,No,No,No,No,...,0,0,0,0,0,1,1,0,0,1
7,No,No,No,No,No,No,No,No,No,No,...,0,0,0,0,1,0,1,0,0,1
8,No,No,No,No,No,No,No,No,No,No,...,0,1,0,0,1,0,1,0,0,1
9,"Yes, only one",No,No,No,No,No,No,No,No,No,...,0,0,0,1,1,0,1,0,0,1


In [None]:
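# CHECKUP1 was already dropped above; exclude the outcome columns from the predictors
X = BRFSS.drop(columns = ['target', 'target_info'])
y = BRFSS['target']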