# Final Project
# DSC 540: Advanced Machine Learning
**Author**: Alex Teboul

**Professor**: Casey Bennett

**Data Source**: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system#2015.csv



## About this Project

**Objective:** The goal of this project was to build predictive models for coronary heart disease using the 2015 BRFSS dataset. This project follows the processes and algorithms explored in the DePaul University graduate course DSC 540: Advanced Machine Learning with Professor Casey Bennett. In this Google Colab notebook, I go through getting the 2015 BRFSS dataset, selecting features for my predictive models based on risk factors identified in past heart disease research, exploratory data analysis, model testing, and reporting on results. The methods explored in this notebook are Random Forests, Gradient Boosting, AdaBoost, and Neural Networks.


1.   **Part 1:** Getting and Cleaning the Data
*   Get the BRFSS dataset from my local Google Drive
*   Select a Relevant Subset of Features
*   Cleaning the Data (Missing Values, Modifying Values, Make Feature Names More Readable, Save Finalized Dataset to CSV)
2.   **Part 2:** Model Building

Random Forests
*   Random Forest - w/ Feature Selection - Full Dataset
*   Random Forest - w/o Feature Selection - Full Dataset
*   Random Forest - w/ and w/o Feature Selection - 50-50 Balanced Dataset
*   Random Forest - w/ Feature Selection - 60-40 Balanced Dataset

AdaBoost, GradientBoost, and Neural Networks
*   AdaBoost, GradientBoost, and Neural Network - w/o Feature Selection - Full Dataset
*   AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - Full Dataset
*   AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - 50-50 Dataset
*   AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - 60-40 Dataset

Support Vector Machines (too slow, never finishes)
*   RBF-SVM - w/ Feature Selection 50-50 Dataset




# Part 1: Getting and Cleaning the Data

In [1]:
#imports
import pandas as pd
import numpy as np
import random
random.seed(1)

### Get the BRFSS dataset from my local Google Drive

In [2]:
#connect to my local google drive
#from google.colab import drive
#drive.mount('/content/drive')

In [3]:
#read in the dataset
brfss_2015_dataset_source = 'brfss_for_bda_2021.csv'
brfss_2015_dataset = pd.read_csv(brfss_2015_dataset_source)

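Reading a wide survey file like this one can make pandas emit a `DtypeWarning` when a column mixes types across chunks. Below is a minimal sketch of one way to avoid that, assuming it is acceptable to let pandas parse the file in a single pass (`low_memory=False`). The helper name `load_survey_csv` and the toy CSV are hypothetical and not part of the original notebook:

```python
import io
import pandas as pd

def load_survey_csv(source):
    """Read a CSV in one pass so column dtypes are inferred from all rows."""
    # low_memory=False disables chunked type inference, which is what
    # triggers pandas' mixed-type DtypeWarning on large files.
    return pd.read_csv(source, low_memory=False)

# Toy stand-in for the real file (hypothetical data, for illustration only).
toy_csv = io.StringIO(
    "_STATE,_MICHD\n"
    "New York,Did not report having MI or CHD\n"
    "New York,\n"
)
toy_df = load_survey_csv(toy_csv)
print(toy_df.shape)
```

On the real data, the same call would take the path stored in `brfss_2015_dataset_source`. Single-pass parsing uses more memory, which may matter on a small Colab instance.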


In [4]:
#check that all the data loaded in
brfss_2015_dataset.shape

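#The full 2015 BRFSS is described here as 444,456 records and 330 features; the file loaded in this run has 12,338 records and 414 columns (see the output below).
#Each record is an individual's responses to the survey.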

(12338, 414)

In [5]:
#check that the data loaded in is in the correct format
pd.set_option('display.max_columns', 500)
brfss_2015_dataset.head()

Unnamed: 0,_STATE,_GEOSTR,_DENSTR2,PRECALL,SECSCRFL,REPNUM,REPDEPTH,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,NATTMPTS,NRECSEL,NRECSTR,PVTRESD1,COLGHOUS,STATERES,CELLFON3,LADULT,NUMADULT,CADULT,CCLGHOUS,CSTATE,RSPSTATE,LANDLINE,HHADULT,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,BPHIGH4,BPMEDS,BLOODCHO,CHOLCHK,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNCR,CHCOCNCR,CHCCOPD1,HAVARTH3,ADDEPEV2,CHCKIDNY,DIABETE3,DIABAGE2,SEX,AGE,HISPANC3,MRACE1,ORACE3,MARITAL,EDUCA,RENTHOM1,CTYCODE1,ZIPCODE,NUMHHOL2,NUMPHON2,CPDEMO1,VETERAN3,EMPLOY1,CHILDREN,INCOME2,INTERNET,WEIGHT2,HEIGHT3,PREGNANT,QLACTLM2,USEEQUIP,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,SMOKE100,SMOKDAY2,STOPSMK2,LASTSMK2,USENOW3,ALCDAY5,AVEDRNK2,DRNK3GE5,MAXDRNKS,FRUITJU1,FRUIT1,FVBEANS,FVGREEN,FVORANG,VEGETAB1,EXERANY2,EXRACT11,EXEROFT1,EXERHMM1,EXRACT21,EXEROFT2,EXERHMM2,STRENGTH,LMTJOIN3,ARTHDIS2,ARTHSOCL,JOINPAIN,SEATBELT,FLUSHOT6,FLSHTMY2,IMFVPLAC,PNEUVAC3,HIVTST6,HIVTSTD3,WHRTST10,PDIABTST,PREDIAB1,INSULIN,BLDSUGAR,FEETCHK2,DOCTDIAB,CHKHEMO3,FEETCHK,EYEEXAM,DIABEYE,DIABEDU,CAREGIV1,CRGVREL1,CRGVLNG1,CRGVHRS1,CRGVPRB1,CRGVPERS,CRGVHOUS,CRGVMST2,CRGVEXPT,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,ARTTODAY,ARTHWGT,ARTHEXER,ARTHEDU,BLDSTOOL,LSTBLDS3,HADSIGM3,HADSGCO1,LASTSIG3,TYPEWORK,TYPEINDS,SXORIENT,TRNSGNDR,RCSBIRTH,RCSGENDR,RCHISLA1,RCSRACE1,RCSBRAC2,RCSRLTN2,CASTHDX2,CASTHNO2,ADHISPA,CHHISPA,QSTVER,QSTLANG,EXACTOT1,EXACTOT2,_MSACODE,MSCODE,_STSTR,_STRWT,_RAW,_WT2,_RAWRAKE,_WT2RAKE,_REGION,_IMPAGE,_IMPRACE,_IMPNPH,_IMPEDUC,_IMPMRTL,_IMPHOME,O_STATE,_CHISPNC,_CRACE1,_CPRACE,_IMPCAGE,_IMPCRAC,_IMPCSEX,_RAWCH,_WT2CH,_CLCM1V1,_CLCM2V1,_CLCM3V1,_CLCM4V1,_CLCM5V1,_CLCWTV1,_DUALUSE,_DUALCOR,_LLCPM01,_LLCPM02,_LLCPM03,_LLCPM04,_LLCPM05,_LLCPM06,_LLCPM07,_LLCPM08,_LLCPM09,_LLCPM10,_LLCPM11,_LLCPM12,_LLCPM13,_LLCPM14,_LLCPM15,_LLCPM16,_LLCPWT,_LCM01V1,_LCM02V1,_LCM03V1,_LCM04V1,_LCM05V1,_LCM06V1,_LCM07V1,_LCM08V1,_LCPWTV1,_LCM01V2,_LCM02V2,_LCM03V2,_LCM04V2,_LCM0
5V2,_LCM06V2,_LCM07V2,_LCM08V2,_LCM09V2,_LCM10V2,_LCM11V2,_LCM12V2,_LCPWTV2,_RFHLTH,_HCVU651,_RFHYPE5,_CHOLCHK,_RFCHOL,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR1,_MRACE1,_M_RACE,_HISPANC,_RACE,_RACEG21,_RACEGR3,_RACE_G1,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG,_SMOKER3,_RFSMOK3,DRNKANY5,DROCDY3_,_RFBING5,_DRNKWEK,_RFDRHV5,FTJUDA1_,FRUTDA1_,BEANDAY_,GRENDAY_,ORNGDAY_,VEGEDA1_,_MISFRTN,_MISVEGN,_FRTRESP,_VEGRESP,_FRUTSUM,_VEGESUM,_FRTLT1,_VEGLT1,_FRT16,_VEG23,_FRUITEX,_VEGETEX,_TOTINDA,METVL11_,METVL21_,MAXVO2_,FC60_,ACTIN11_,ACTIN21_,PADUR1_,PADUR2_,PAFREQ1_,PAFREQ2_,_MINAC11,_MINAC21,STRFREQ_,PAMISS1_,PAMIN11_,PAMIN21_,PA1MIN_,PAVIG11_,PAVIG21_,PA1VIGM_,_PACAT1,_PAINDX1,_PA150R2,_PA300R2,_PA30021,_PASTRNG,_PAREC1,_PASTAE1,_LMTACT1,_LMTWRK1,_LMTSCL1,_RFSEAT2,_RFSEAT3,_FLSHOT6,_PNEUMO2,_AIDTST3,MEDICARE,HLTHCVR1,DELAYMED,NOCOV121,LSTCOVRG,DRVISITS,MEDSCOST,CARERCVD,MEDBILL1,EMPLSTYR,JOBINJMT,DAYSRTRN,WHOPAIDT,OTHRPAID,EMPAWARE,MISNERVS,MISHOPLS,MISRSTLS,MISDEPRD,MISEFFRT,MISWTLES,MISNOWRK,MISTMNT,MISTRHLP,MISPHLPF,SSBSUGR1,SSBFRUT2,HCVHEAR,HCVTEST,HCVLASTT,HCVINPTR,HCVINPTO,HCVINPTA,HCVPRIMR,HCVPRIMO,HCVPRIMA,HEALTHCL1,LIFECHG,LASTDENT1,RMVTEETH1,DIFFHEAR,FRUITVEG,NOVEGFRU,NOVFOTHR,STRSRENT,STRSMEAL,dsripreg,REGION,PPS_1,PPS_3,PPS_8,PPS_9,PPS_14,PPS_16,PPS_19,PPS_20,PPS_21,PPS_22,PPS_23,PPS_25,PPS_27,PPS_32,PPS_33,PPS_34,PPS_36,PPS_39,PPS_40,PPS_43,PPS_44,PPS_45,PPS_46,PPS_48,PPS_52,childage,cracorg1,_prace1,mracasc1,_impcty,mracorg1
0,New York,207,D,To be called,,40187,5,April,4092015,April,9,2015,1200,2015000012,2015000012,1,24887,13469550.0,Missing,Missing,,Missing,Missing,,"Yes, Male Respondent",Missing,No,New York,No,2.0,Very good,,,Not asked or Missing,Yes,No,No,Within past 2 years (1 year but less than 2 ye...,Yes,No,Yes,Within the past 5 years (2 years but less than...,No,No,No,No,No,Not asked or Missing,No,No,No,No,No,No,No,Not asked or Missing,Male,Age 25 - 34,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Never married,College 4 years or more (College graduate),Rent,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Not asked or Missing,Not asked or Missing,Yes,Employed for wages,,"$75,000 or more",Yes,D,D,Not asked or Missing,No,No,No,No,No,No,No,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not at all,Days per week,Number of drinks,Number of Times,Number of drinks,Times per week,Times per week,Times per month,Times per month,Times per month,Times per week,Yes,Running,Times per week,Hours and Minutes,Bicycling machine exercise,Times per month,Hours and Minutes,Times per week,Not asked or Missing,Missing,Missing,Missing,Nearly always,No,Not asked or Missing,Not asked or Missing,No,Yes,Don't know/Not sure,Clinic,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or 
Missing,Not asked or Missing,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,Not asked or Missing,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Not asked or Missing,Not asked or Missing,Not asked or Missing,D,D,Core only cellphone (collected out of state),English,,,Data do not meet the criteria for statistical ...,D,362079,541.228186,,541.228186,1.0,541.228186,7,Age 25 to 34,D,,College 4 years or more (College graduate),Never married,Rent,Alabama,D,D,D,D,D,Missing,,,,,,,,,No Dual Phone Use,,2,1,4,2,2,1,1,1,7,42,13,18,6,16,33,11,2117.541967,,,,,,,,,,,,,,,,,,,,,,,Good or Better Health,Have health care coverage,Yes,Had cholesterol checked in past 5 years,No,Did not report having MI or CHD,No,No,Never,Not diagnosed with arthritis,D,D,D,D,D,D,D,Age 30 to 34,Age 18 to 64,Imputed Age 30 to 34,Age 25 to 34,D,D,D,1 or greater,Normal Weight,No,No children in household,Graduated from College or Technical School,"$50,000 or more",Never smoked,No,Yes,Drink-Occasions per day,Yes,600,1,29.0,29.0,3.0,33.0,10.0,71.0,No missing fruit responses,No missing vegetable responses,Included - Not Missing Fruit Responses,Included - Not Missing Vegetable Responses,58.0,117.0,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,Included - Values are in accepted range,Included - Values are in accepted range,No missing values and in accepted range,No missing values and in accepted range,Had physical activity or exercise,60.0,68.0,4130,708,1.0,1.0,30.0,45.0,1000.0,467.0,30.0,21.0,1000.0,0,30.0,21.0,51.0,0.0,0.0,0.0,Insufficiently Active,Did Not Meet Aerobic Recommendations,1-149 minutes (or vigorous equivalent minutes...,1-300 minutes (or vigorous equivalent minutes...,0-300 minutes (or vigorous equivalent minutes...,Did not meet muscle strengthening 
recommendations,Did not meet Either Guideline,Did Not Meet Both Guidelines,Not told they have arthritis,Not told they have arthritis,Not told they have arthritis,Always or Almost Always Wear Seat Belt,Don't Always Wear Seat Belt,Age Less Than 65,Age Less Than 65,Yes,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,,Not asked or Missing,Not asked or Missing,New York City,New York City (NYC),No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,Yes,No,Yes,No,No,No,No,No,No,Yes,D,Data do not meet the criteria for statistical ...,D,D,D,Data do not meet the criteria for statistical ...
1,New York,207,D,To be called,,60025,21,June,6232015,June,23,2015,1200,2015000013,2015000013,4,24887,13469550.0,Missing,Missing,,Missing,Missing,,"Yes, Female Respondent",Missing,No,New York,No,2.0,Good,,Number of days,,Yes,"Yes, only one",Yes,Within past year (anytime less than 12 months ...,Yes,Yes,Yes,Within the past year (anytime less than 12 mon...,No,Don't know/Not sure,No,Yes,No,Not asked or Missing,No,No,No,No,No,No,No,Not asked or Missing,Female,Age 45 - 54,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Separated,College 1 year to 3 years (Some college or tec...,Other arrangement,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Not asked or Missing,Not asked or Missing,No,Employed for wages,Number of children,Don't know/Not sure,Yes,D,D,Not asked or Missing,No,No,No,No,No,No,No,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not at all,No drinks in past 30 days,Not asked or Missing,Not asked or Missing,Not asked or Missing,Times per month,Times per month,Times per month,Times per month,Times per month,Times per month,Yes,Running,Times per month,Don't know/Not sure,Walking,Times per month,Don't know/Not sure,Times per month,Not asked or Missing,Missing,Missing,Missing,Always,Yes,Month / Year,A hospital (Example: inpatient),Yes,Yes,Don't know/Not sure,Clinic,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not 
asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,Not asked or Missing,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Not asked or Missing,Not asked or Missing,Not asked or Missing,D,D,Core only cellphone (collected out of state),English,,,Data do not meet the criteria for statistical ...,D,362079,541.228186,,541.228186,1.0,541.228186,7,Age 45 to 54,D,,College 1 year to 3 years (Some college or tec...,Separated,Other arrangement,Alabama,D,D,D,D,D,Missing,,,,,,,,,No Dual Phone Use,,11,2,3,3,2,6,5,1,7,44,14,19,6,17,35,12,2230.075513,,,,,,,,,,,,,,,,,,,,,,,Good or Better Health,Have health care coverage,Yes,Had cholesterol checked in past 5 years,No,Not asked or Missing,No,No,Never,Not diagnosed with arthritis,D,D,D,D,D,D,D,Age 45 to 49,Age 18 to 64,Imputed Age 45 to 49,Age 45 to 54,D,D,D,1 or greater,Obese,Yes,One child in household,Attended College or Technical School,Don't know/Not sure/Missing,Never smoked,No,No,No Drink-Occasions per day,No,0,1,13.0,83.0,67.0,50.0,33.0,50.0,No missing fruit responses,No missing vegetable responses,Included - Not Missing Fruit Responses,Included - Not Missing Vegetable Responses,96.0,200.0,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,Included - Values are in accepted range,Included - Values are in accepted range,No missing values and in accepted range,No missing values and in accepted range,Had physical activity or exercise,60.0,35.0,3061,525,2.0,1.0,,,5833.0,5833.0,,,2333.0,1,,,,,0.0,0.0,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Meet muscle strengthening 
recommendations,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Not told they have arthritis,Not told they have arthritis,Not told they have arthritis,Always or Almost Always Wear Seat Belt,Always Wear Seat Belt,Age Less Than 65,Age Less Than 65,Yes,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,,Not asked or Missing,Not asked or Missing,New York City,New York City (NYC),No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,Yes,No,Yes,No,No,No,No,No,No,Yes,D,Data do not meet the criteria for statistical ...,D,D,D,Data do not meet the criteria for statistical ...
2,New York,203,D,To be called,,120050,3,December,12282015,December,28,2015,1200,2015000014,2015000014,3,3593,1568383.0,Missing,Missing,,Missing,Missing,,"Yes, Female Respondent",Missing,No,New York,No,1.0,Very good,,,Not asked or Missing,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,Not asked or Missing,Yes,Within the past year (anytime less than 12 mon...,Yes,No,No,No,No,Not asked or Missing,No,No,No,No,No,No,No,Not asked or Missing,Female,Age 25 - 34,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Never married,College 4 years or more (College graduate),Rent,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Not asked or Missing,Not asked or Missing,No,Employed for wages,,Don't know/Not sure,Yes,D,D,No,No,No,No,No,No,No,No,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not at all,Days in past 30 days,Number of drinks,,Number of drinks,Never,Times per week,Never,Times per week,Times per week,Times per week,Yes,Running,Times per month,Hours and Minutes,Calisthenics,Times per month,Hours and Minutes,Never,Not asked or Missing,Missing,Missing,Missing,Always,No,Not asked or Missing,Not asked or Missing,Yes,Yes,Code month and year,Private doctor or HMO,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or 
Missing,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,Not asked or Missing,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Not asked or Missing,Not asked or Missing,Not asked or Missing,D,D,Core only cellphone (collected out of state),English,,,Data do not meet the criteria for statistical ...,D,362039,436.510631,,436.510631,1.0,436.510631,3,Age 25 to 34,D,,College 4 years or more (College graduate),Never married,Rent,Alabama,D,D,D,D,D,Missing,,,,,,,,,No Dual Phone Use,,9,2,4,2,2,6,3,1,3,19,6,6,4,11,25,8,1205.094503,,,,,,,,,,,,,,,,,,,,,,,Good or Better Health,Have health care coverage,No,Had cholesterol checked in past 5 years,Yes,Did not report having MI or CHD,No,No,Never,Not diagnosed with arthritis,D,D,D,D,D,D,D,Age 30 to 34,Age 18 to 64,Imputed Age 30 to 34,Age 25 to 34,D,D,D,1 or greater,Overweight,Yes,No children in household,Graduated from College or Technical School,Don't know/Not sure/Missing,Never smoked,No,Yes,Drink-Occasions per day,No,70,1,0.0,43.0,0.0,43.0,29.0,43.0,No missing fruit responses,No missing vegetable responses,Included - Not Missing Fruit Responses,Included - Not Missing Vegetable Responses,43.0,115.0,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,Included - Values are in accepted range,Included - Values are in accepted range,No missing values and in accepted range,No missing values and in accepted range,Had physical activity or exercise,60.0,38.0,3690,633,1.0,1.0,90.0,15.0,467.0,467.0,42.0,7.0,0.0,0,42.0,7.0,49.0,0.0,0.0,0.0,Insufficiently Active,Did Not Meet Aerobic Recommendations,1-149 minutes (or vigorous equivalent minutes...,1-300 minutes (or vigorous equivalent minutes...,0-300 minutes (or vigorous equivalent minutes...,Did not meet muscle strengthening recommendations,Did not meet Either 
Guideline,Did Not Meet Both Guidelines,Not told they have arthritis,Not told they have arthritis,Not told they have arthritis,Always or Almost Always Wear Seat Belt,Always Wear Seat Belt,Age Less Than 65,Age Less Than 65,Yes,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,,Not asked or Missing,Not asked or Missing,Finger Lakes,NYS exclusive of NYC,No,No,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,D,Data do not meet the criteria for statistical ...,D,D,D,Data do not meet the criteria for statistical ...
3,New York,206,D,To be called,,30066,28,March,3182015,March,18,2015,1200,2015000015,2015000015,3,4465,876699.8,Missing,Missing,,Missing,Missing,,"Yes, Male Respondent",Missing,No,New York,No,3.0,Excellent,,,Not asked or Missing,No,No,No,Within past year (anytime less than 12 months ...,No,Not asked or Missing,Don't know/Not Sure,Not asked or Missing,Not asked or Missing,No,No,No,No,Not asked or Missing,No,No,No,No,No,No,No,Not asked or Missing,Male,Age 18 - 24,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Never married,Grade 12 or GED (High school graduate),Rent,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Not asked or Missing,Not asked or Missing,No,Employed for wages,Number of children,Don't know/Not sure,Yes,D,D,Not asked or Missing,No,No,No,No,No,No,No,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not at all,No drinks in past 30 days,Not asked or Missing,Not asked or Missing,Not asked or Missing,Never,Times per month,Don't know/Not sure,Don't know/Not sure,Don't know/Not sure,Don't know/Not sure,Yes,Walking,Don't know/Not sure,Hours and Minutes,No other activity,Not asked or Missing,Not asked or Missing,Never,Not asked or Missing,Missing,Missing,Missing,Always,No,Not asked or Missing,Not asked or Missing,Don't know/Not Sure,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or 
Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,Not asked or Missing,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Not asked or Missing,Not asked or Missing,Not asked or Missing,D,D,Core only cellphone (collected out of state),English,,,Data do not meet the criteria for statistical ...,D,362069,196.349347,,196.349347,1.0,196.349347,6,Age 18 to 24,D,,Grade 12 or GED (High school graduate),Never married,Rent,Alaska,D,D,D,D,D,Missing,,,,,,,,,No Dual Phone Use,,1,1,2,2,2,1,1,1,6,34,11,16,10,30,58,19,1776.4061,,,,,,,,,,,,,,,,,,,,,,,Good or Better Health,Do not have health care coverage,No,Don't know/Not Sure Or Refused/Missing,Missing,Did not report having MI or CHD,No,No,Never,Not diagnosed with arthritis,D,D,D,D,D,D,D,Age 18 to 24,Age 18 to 64,Imputed Age 18 to 24,Age 18 to 24,D,D,D,1 or greater,Overweight,Yes,One child in household,Graduated High School,Don't know/Not sure/Missing,Never smoked,No,No,No Drink-Occasions per day,No,0,1,0.0,50.0,,,,,No missing fruit responses,"Has 1, 2, 3, or 4 missing vegetable responses",Included - Not Missing Fruit Responses,Not Included - Missing Vegetable Responses,50.0,,Consumed fruit less than one time per day,"Don´t know, refused or missing values",Included - Values are in accepted range,Included - Values are in accepted range,No missing values and in accepted range,Missing Vegetable responses,Had physical activity or exercise,35.0,0.0,5010,859,1.0,0.0,60.0,,,,,0.0,0.0,1,,0.0,0.0,0.0,0.0,0.0,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Did not meet muscle strengthening 
recommendations,Don't know/Not Sure/Refused/Missing,Don't know/Not Sure/Refused/Missing,Not told they have arthritis,Not told they have arthritis,Not told they have arthritis,Always or Almost Always Wear Seat Belt,Always Wear Seat Belt,Age Less Than 65,Age Less Than 65,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,,Not asked or Missing,Not asked or Missing,North Country,NYS exclusive of NYC,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,D,Data do not meet the criteria for statistical ...,D,D,D,Data do not meet the criteria for statistical ...
4,New York,203,D,To be called,,110026,23,November,11292015,November,29,2015,1200,2015000016,2015000016,8,3593,1568383.0,Missing,Missing,,Missing,Missing,,"Yes, Female Respondent",Missing,No,New York,No,1.0,Excellent,Number of days,Number of days,Number of days,Yes,"Yes, only one",No,Within past year (anytime less than 12 months ...,No,Not asked or Missing,No,Not asked or Missing,Not asked or Missing,No,No,No,No,Not asked or Missing,No,No,No,No,Yes,No,No,Not asked or Missing,Female,Age 25 - 34,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Never married,College 4 years or more (College graduate),Rent,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Not asked or Missing,Not asked or Missing,No,Employed for wages,,"Less than $50,000 ($35,000 to less than $50,000)",Yes,D,D,No,No,No,No,No,No,No,No,No,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not at all,Days per week,Number of drinks,Number of Times,Number of drinks,Never,Times per day,Times per day,Times per day,Times per week,Times per day,Yes,Running,Times per week,Hours and Minutes,Elliptical/EFX machine exercise,Times per week,Hours and Minutes,Times per week,Not asked or Missing,Missing,Missing,Missing,Always,No,Not asked or Missing,Not asked or Missing,Yes,Yes,Unknown month and known year,Private doctor or HMO,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not 
asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,Not asked or Missing,D,Data do not meet the criteria for statistical ...,Not asked or Missing,Data do not meet the criteria for statistical ...,Data do not meet the criteria for statistical ...,D,Not asked or Missing,Not asked or Missing,Not asked or Missing,D,D,Core only cellphone (collected out of state),English,,,Data do not meet the criteria for statistical ...,D,362039,436.510631,,436.510631,1.0,436.510631,3,Age 25 to 34,D,,College 4 years or more (College graduate),Never married,Rent,Alaska,D,D,D,D,D,Missing,,,,,,,,,No Dual Phone Use,,9,3,4,2,2,8,3,1,3,19,6,7,4,11,25,8,1273.934863,,,,,,,,,,,,,,,,,,,,,,,Good or Better Health,Have health care coverage,No,Have never had cholesterol checked,Missing,Did not report having MI or CHD,No,No,Never,Not diagnosed with arthritis,D,D,D,D,D,D,D,Age 25 to 29,Age 18 to 64,Imputed Age 25 to 29,Age 25 to 34,D,D,D,1 or greater,Normal Weight,No,No children in household,Graduated from College or Technical School,"$35,000 to less than $50,000",Never smoked,No,Yes,Drink-Occasions per day,Yes,400,1,0.0,300.0,100.0,200.0,71.0,100.0,No missing fruit responses,No missing vegetable responses,Included - Not Missing Fruit Responses,Included - Not Missing Vegetable Responses,300.0,471.0,Consumed fruit one or more times per day,Consumed vegetables one or more times per day,Included - Values are in accepted range,Included - Values are in accepted range,No missing values and in accepted range,No missing values and in accepted range,Had physical activity or exercise,60.0,50.0,3764,645,1.0,1.0,30.0,40.0,5000.0,2000.0,150.0,80.0,2000.0,0,150.0,80.0,230.0,0.0,0.0,0.0,Active,Meet Aerobic Recommendations,150+ minutes (or vigorous equivalent minutes) ...,1-300 minutes (or vigorous equivalent minutes...,0-300 minutes (or vigorous 
equivalent minutes...,Meet muscle strengthening recommendations,Met Both Guidelines,Met Both Guidelines,Not told they have arthritis,Not told they have arthritis,Not told they have arthritis,Always or Almost Always Wear Seat Belt,Always Wear Seat Belt,Age Less Than 65,Age Less Than 65,Yes,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,Not asked or Missing,,Not asked or Missing,Not asked or Missing,Finger Lakes,NYS exclusive of NYC,No,No,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,D,Data do not meet the criteria for statistical ...,D,D,D,Data do not meet the criteria for statistical ...
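The preview above shows that many cells hold placeholder strings such as `Missing` or `Not asked or Missing` rather than true nulls. Below is a rough sketch of how the share of such cells per column could be measured before cleaning, assuming those two labels are the ones to treat as missing. The label list, the helper `missing_share`, and the toy frame are assumptions, not the author's code:

```python
import pandas as pd

# Assumed placeholder labels, taken from values visible in the preview.
MISSING_LABELS = ["Missing", "Not asked or Missing"]

def missing_share(df, labels=MISSING_LABELS):
    """Fraction of cells per column that are NaN or a placeholder label."""
    is_missing = df.isna() | df.isin(labels)
    return is_missing.mean().sort_values(ascending=False)

# Toy frame for illustration only (hypothetical values).
toy = pd.DataFrame({
    "BPMEDS": ["Yes", "Missing", None, "No"],
    "SEX": ["Male", "Female", "Female", "Not asked or Missing"],
})
share = missing_share(toy)
print(share)
```

Columns with a very high share are candidates for dropping, while the rest need an explicit rule for how placeholder answers are recoded.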


### Select Relevant Subset of Features

The dataset originally has 330 features (columns). Based on published research into the factors that influence heart disease, only a selected subset of those features is included in this analysis.



#### Important Risk Factors
Research in the field has identified the following as **important risk factors** for heart disease (not in strict order of importance):

*   blood pressure (high)
*   cholesterol (high)
*   smoking
*   diabetes
*   obesity
*   age
*   sex
*   race
*   diet
*   exercise
*   alcohol consumption
*   BMI
*   Household Income
*   Marital Status
*   Sleep
*   Time since last checkup
*   Education
*   Health care coverage
*   Mental Health



#### Selected Subset of Features from BRFSS 2015
Given these risk factors, I selected features (columns/questions) in the BRFSS that relate to them. To understand what each column means, I consulted the BRFSS 2015 Codebook, which lists every question along with information about it, and matched the variable names in the codebook to the variable names in the dataset I downloaded from Kaggle. I also reference some of the same features chosen by Zidian Xie et al. in their research paper *Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques*, which uses the 2014 BRFSS. Diabetes and heart disease outcomes are strongly correlated, with heart disease complications being the primary cause of death for diabetics, so that paper is a useful starting point.

**BRFSS 2015 Codebook:** https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

**Relevant Research Paper using BRFSS for Diabetes ML:** https://www.cdc.gov/pcd/issues/2019/19_0109.htm


The **selected features** from the BRFSS 2015 dataset are:

**Response Variable / Dependent Variable:**
*   Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) --> _MICHD


**Independent Variables:**

**High Blood Pressure**
*   Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> _RFHYPE5

**High Cholesterol**
*   Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> TOLDHI2
*   Cholesterol check within past five years --> _CHOLCHK

**BMI**
*   Body Mass Index (BMI) --> _BMI5

**Smoking**
*   Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] --> SMOKE100

**Other Chronic Health Conditions**
*   (Ever told) you had a stroke. --> CVDSTRK3
*   (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> DIABETE3

**Physical Activity**
*   Adults who reported doing physical activity or exercise during the past 30 days other than their regular job --> _TOTINDA

**Diet**
*   Consume Fruit 1 or more times per day --> _FRTLT1
*   Consume Vegetables 1 or more times per day --> _VEGLT1

**Alcohol Consumption**
*   Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) --> _RFDRHV5

**Health Care**
*   Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?  --> HLTHPLN1
*   Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? --> MEDCOST

**Health General and Mental Health**
*   Would you say that in general your health is: --> GENHLTH
*   Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? --> MENTHLTH
*   Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? --> PHYSHLTH
*   Do you have serious difficulty walking or climbing stairs? --> DIFFWALK

**Demographics**
*   Indicate sex of respondent. --> SEX
*   Fourteen-level age category --> _AGEG5YR
*   What is the highest grade or year of school you completed? --> EDUCA
*   Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") --> INCOME2
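
Because the variable names above are matched by hand between the codebook and the Kaggle CSV, a quick existence check before subsetting can catch a misspelled name early (selecting a missing column raises a `KeyError`). The following is only a minimal sketch on a toy frame; the helper `missing_columns` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the requested column names that are absent from df (hypothetical helper)."""
    return [col for col in wanted if col not in df.columns]

# Toy stand-in for the BRFSS frame, holding only three of the real column names.
toy = pd.DataFrame({'_MICHD': [1], '_RFHYPE5': [2], 'SEX': [1]})

wanted = ['_MICHD', '_RFHYPE5', 'TOLDHI2', 'SEX']
absent = missing_columns(toy, wanted)
print(absent)
```

An empty list means every requested column is present and the subset selection can proceed.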

####Get Subset of Features

In [6]:
# select specific columns
brfss_df_selected = brfss_2015_dataset[['_MICHD', 
                                         '_RFHYPE5',  
                                         'TOLDHI2', '_CHOLCHK', 
                                         '_BMI5','_BMI5CAT', 
                                         'SMOKE100', 
                                         'CVDSTRK3', 'DIABETE3', 
                                         '_TOTINDA', 
                                         '_FRTLT1', '_VEGLT1', 
                                         '_RFDRHV5', 
                                         'HLTHPLN1', 'MEDCOST', 
                                         'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', 
                                         'SEX', '_AGEG5YR', 'EDUCA', 'INCOME2' ]]

In [7]:
brfss_df_selected.shape

(12338, 23)

In [8]:
brfss_df_selected.head()

Unnamed: 0,_MICHD,_RFHYPE5,TOLDHI2,_CHOLCHK,_BMI5,_BMI5CAT,SMOKE100,CVDSTRK3,DIABETE3,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,HLTHPLN1,MEDCOST,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR,EDUCA,INCOME2
0,Did not report having MI or CHD,Yes,No,Had cholesterol checked in past 5 years,1 or greater,Normal Weight,No,No,No,Had physical activity or exercise,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,1,Yes,No,Very good,,,No,Male,Age 30 to 34,College 4 years or more (College graduate),"$75,000 or more"
1,Not asked or Missing,Yes,No,Had cholesterol checked in past 5 years,1 or greater,Obese,No,Yes,No,Had physical activity or exercise,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,1,Yes,Yes,Good,Number of days,,No,Female,Age 45 to 49,College 1 year to 3 years (Some college or tec...,Don't know/Not sure
2,Did not report having MI or CHD,No,Yes,Had cholesterol checked in past 5 years,1 or greater,Overweight,No,No,No,Had physical activity or exercise,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,1,Yes,No,Very good,,,No,Female,Age 30 to 34,College 4 years or more (College graduate),Don't know/Not sure
3,Did not report having MI or CHD,No,Not asked or Missing,Don't know/Not Sure Or Refused/Missing,1 or greater,Overweight,No,No,No,Had physical activity or exercise,Consumed fruit less than one time per day,"Don´t know, refused or missing values",1,No,No,Excellent,,,No,Male,Age 18 to 24,Grade 12 or GED (High school graduate),Don't know/Not sure
4,Did not report having MI or CHD,No,Not asked or Missing,Have never had cholesterol checked,1 or greater,Normal Weight,No,No,No,Had physical activity or exercise,Consumed fruit one or more times per day,Consumed vegetables one or more times per day,1,Yes,No,Excellent,Number of days,Number of days,No,Female,Age 25 to 29,College 4 years or more (College graduate),"Less than $50,000 ($35,000 to less than $50,000)"


### Cleaning the Data

####Missing Values

In [9]:
brfss_df_selected.to_csv('export_report_v1.csv')


In [10]:
brfss_df_diabet_selected = brfss_2015_dataset[['GENHLTH', 
                                         '_AGEG5YR',  
                                         '_BMI5CAT', 'CHECKUP1', 
                                         'INCOME2', 
                                         'EMPLOY1', 
                                         'SEX', 'MARITAL', 
                                         '_EDUCAG', 
                                         'CVDCRHD4', 'HLTHCVR1', 
                                         'MENTHLTH', 
                                         'CHCKIDNY', 'USEEQUIP', 
                                         '_TOTINDA', 'ADDEPEV2', 'RENTHOM1', 'EXERANY2', 
                                         'BLIND', 'DECIDE', 'HLTHPLN1', 'DIABETE3','_SMOKER3' ]]


In [11]:
brfss_df_diabet_selected.head()

Unnamed: 0,GENHLTH,_AGEG5YR,_BMI5CAT,CHECKUP1,INCOME2,EMPLOY1,SEX,MARITAL,_EDUCAG,CVDCRHD4,HLTHCVR1,MENTHLTH,CHCKIDNY,USEEQUIP,_TOTINDA,ADDEPEV2,RENTHOM1,EXERANY2,BLIND,DECIDE,HLTHPLN1,DIABETE3,_SMOKER3
0,Very good,Age 30 to 34,Normal Weight,Within past 2 years (1 year but less than 2 ye...,"$75,000 or more",Employed for wages,Male,Never married,Graduated from College or Technical School,No,Not asked or Missing,,No,No,Had physical activity or exercise,No,Rent,Yes,No,No,Yes,No,Never smoked
1,Good,Age 45 to 49,Obese,Within past year (anytime less than 12 months ...,Don't know/Not sure,Employed for wages,Female,Separated,Attended College or Technical School,No,Not asked or Missing,Number of days,No,No,Had physical activity or exercise,No,Other arrangement,Yes,No,No,Yes,No,Never smoked
2,Very good,Age 30 to 34,Overweight,Within past year (anytime less than 12 months ...,Don't know/Not sure,Employed for wages,Female,Never married,Graduated from College or Technical School,No,Not asked or Missing,,No,No,Had physical activity or exercise,No,Rent,Yes,No,No,Yes,No,Never smoked
3,Excellent,Age 18 to 24,Overweight,Within past year (anytime less than 12 months ...,Don't know/Not sure,Employed for wages,Male,Never married,Graduated High School,No,Not asked or Missing,,No,No,Had physical activity or exercise,No,Rent,Yes,No,No,No,No,Never smoked
4,Excellent,Age 25 to 29,Normal Weight,Within past year (anytime less than 12 months ...,"Less than $50,000 ($35,000 to less than $50,000)",Employed for wages,Female,Never married,Graduated from College or Technical School,No,Not asked or Missing,Number of days,No,No,Had physical activity or exercise,Yes,Rent,Yes,No,No,Yes,No,Never smoked


In [12]:
brfss_df_diabet_selected.to_csv('export_report_v2.csv')

# STOP HERE

In [13]:
#Drop rows with missing values (NaN)
brfss_df_selected = brfss_df_selected.dropna()
brfss_df_selected.shape

(12338, 23)
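
`dropna()` only removes rows that contain real missing values (`NaN`/`None`). Placeholder labels such as `'Not asked or Missing'` are ordinary strings and survive it, which may be one reason the shape above does not change. The sketch below uses a made-up three-row frame to show the difference; treating that label as missing is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    '_MICHD': ['Did not report having MI or CHD',
               'Not asked or Missing',
               'Reported having MI or CHD'],
    'TOLDHI2': ['No', 'Yes', np.nan],
})

# Only the row with a true NaN is dropped; the placeholder label survives.
after_dropna = toy.dropna()

# Assumption: this label should count as missing for the analysis.
placeholders = ['Not asked or Missing']
after_both = toy.replace(placeholders, np.nan).dropna()

print(len(toy), len(after_dropna), len(after_both))
```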

In [14]:
brfss_df_selected.head()

Unnamed: 0,_MICHD,_RFHYPE5,TOLDHI2,_CHOLCHK,_BMI5,_BMI5CAT,SMOKE100,CVDSTRK3,DIABETE3,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,HLTHPLN1,MEDCOST,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR,EDUCA,INCOME2
0,Did not report having MI or CHD,Yes,No,Had cholesterol checked in past 5 years,1 or greater,Normal Weight,No,No,No,Had physical activity or exercise,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,1,Yes,No,Very good,,,No,Male,Age 30 to 34,College 4 years or more (College graduate),"$75,000 or more"
1,Not asked or Missing,Yes,No,Had cholesterol checked in past 5 years,1 or greater,Obese,No,Yes,No,Had physical activity or exercise,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,1,Yes,Yes,Good,Number of days,,No,Female,Age 45 to 49,College 1 year to 3 years (Some college or tec...,Don't know/Not sure
2,Did not report having MI or CHD,No,Yes,Had cholesterol checked in past 5 years,1 or greater,Overweight,No,No,No,Had physical activity or exercise,Consumed fruit less than one time per day,Consumed vegetables one or more times per day,1,Yes,No,Very good,,,No,Female,Age 30 to 34,College 4 years or more (College graduate),Don't know/Not sure
3,Did not report having MI or CHD,No,Not asked or Missing,Don't know/Not Sure Or Refused/Missing,1 or greater,Overweight,No,No,No,Had physical activity or exercise,Consumed fruit less than one time per day,"Don´t know, refused or missing values",1,No,No,Excellent,,,No,Male,Age 18 to 24,Grade 12 or GED (High school graduate),Don't know/Not sure
4,Did not report having MI or CHD,No,Not asked or Missing,Have never had cholesterol checked,1 or greater,Normal Weight,No,No,No,Had physical activity or exercise,Consumed fruit one or more times per day,Consumed vegetables one or more times per day,1,Yes,No,Excellent,Number of days,Number of days,No,Female,Age 25 to 29,College 4 years or more (College graduate),"Less than $50,000 ($35,000 to less than $50,000)"


####Modifying Values

In [15]:
# _MICHD
#Map 'Did not report having MI or CHD' to 0 and 'Reported having MI or CHD' to 1, then drop rows labeled 'Not asked or Missing'
brfss_df_selected['_MICHD'] = brfss_df_selected['_MICHD'].replace(['Did not report having MI or CHD','Reported having MI or CHD'], [0,1])
brfss_df_selected = brfss_df_selected[brfss_df_selected._MICHD != 'Not asked or Missing']
brfss_df_selected._MICHD.unique()

array([0, 1], dtype=object)

In [16]:
#1 _RFHYPE5
#Map 'No' to 0 (no high blood pressure) and 'Yes' to 1 (high blood pressure)
#Remove the don't know / refused / missing label
brfss_df_selected['_RFHYPE5'] = brfss_df_selected['_RFHYPE5'].replace({'No':0, 'Yes':1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFHYPE5 != "Don't know/Not Sure/Refused/Missing"]
brfss_df_selected._RFHYPE5.unique()

array([1, 0], dtype=object)

In [17]:
#2 TOLDHI2
# Map 'No' to 0 and 'Yes' to 1
# Remove the "Don't know/Not Sure", 'Not asked or Missing', and 'Refused' labels
brfss_df_selected['TOLDHI2'] = brfss_df_selected['TOLDHI2'].replace(["No","Yes"],[0,1])
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != "Don't know/Not Sure"]
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != 'Not asked or Missing']
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != 'Refused']
brfss_df_selected.TOLDHI2.unique()

array([0, 1], dtype=object)

In [18]:
#3 _CHOLCHK
# Map "Did not have cholesterol checked in past 5 years" and "Have never had cholesterol checked" to 0, and "Had cholesterol checked in past 5 years" to 1
# Remove the don't know / refused / missing label (straight apostrophe, matching the label as it appears in the data)
brfss_df_selected['_CHOLCHK'] = brfss_df_selected['_CHOLCHK'].replace(["Did not have cholesterol checked in past 5 years","Have never had cholesterol checked","Had cholesterol checked in past 5 years"],[0,0,1])
brfss_df_selected = brfss_df_selected[brfss_df_selected._CHOLCHK != "Don't know/Not Sure Or Refused/Missing"]
brfss_df_selected._CHOLCHK.unique()

array([1, 0], dtype=object)
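
Label filters like the ones in these cells compare against exact strings, so a typographic apostrophe (`’` or `´`) in either the data or the filter string makes the comparison silently match nothing. One way to guard against this, shown here as a sketch with a hypothetical helper `normalize_apostrophes`, is to normalize the apostrophes before comparing.

```python
import pandas as pd

def normalize_apostrophes(text):
    """Replace typographic apostrophe variants with a plain ASCII apostrophe."""
    for variant in ('\u2019', '\u2018', '\u00b4'):
        text = text.replace(variant, "'")
    return text

# Made-up labels: the first uses a curly apostrophe, the second a straight one.
labels = pd.Series(["Don\u2019t know/Not Sure", "Don't know/Not Sure", "Yes"])
cleaned = labels.map(normalize_apostrophes)

# After normalization a single comparison string matches both spellings.
mask = cleaned != "Don't know/Not Sure"
print(cleaned[mask].tolist())
```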

In [19]:
#4 _BMI5CAT
# Encode the BMI category as ordinal: 1 Underweight, 2 Normal Weight, 3 Overweight, 4 Obese
# (_BMI5 itself is left unchanged; in the raw data it is BMI * 100, so for example a BMI of 4018 is really 40.18)
brfss_df_selected['_BMI5CAT'] = brfss_df_selected['_BMI5CAT'].replace(["Underweight","Normal Weight","Overweight","Obese"],[1,2,3,4])
# Keep only rows that mapped to a category code, which drops any don't know / refused / missing label
brfss_df_selected = brfss_df_selected[brfss_df_selected._BMI5CAT.isin([1,2,3,4])]
#brfss_df_selected['_BMI5CAT'] = brfss_df_selected['_BMI5CAT'].div(100).round(0)
#brfss_df_selected._BMI5CAT.unique()
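
As a worked example of the BMI note in this cell: the raw `_BMI5` value carries two implied decimal places, and the four weight categories can be derived from the rescaled number. This is a sketch only. The cutoffs below are the standard adult BMI cutoffs (18.5, 25, 30), which I am assuming match how `_BMI5CAT` is defined in the codebook, and both helper names are hypothetical.

```python
def bmi_from_brfss(raw):
    """Convert a raw _BMI5 value (BMI with two implied decimals) to a float."""
    return raw / 100

def bmi_category(bmi):
    """Ordinal category: 1 Underweight, 2 Normal Weight, 3 Overweight, 4 Obese."""
    if bmi < 18.5:
        return 1
    if bmi < 25:
        return 2
    if bmi < 30:
        return 3
    return 4

bmi = bmi_from_brfss(4018)
print(bmi, bmi_category(bmi))
```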

In [None]:
#### You are here

#5 SMOKE100
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['SMOKE100'] = brfss_df_selected['SMOKE100'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 9]
brfss_df_selected.SMOKE100.unique()

In [None]:
#6 CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['CVDSTRK3'] = brfss_df_selected['CVDSTRK3'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 9]
brfss_df_selected.CVDSTRK3.unique()

In [None]:
#7 DIABETE3
# going to make this ordinal. 0 is for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes, 2 is for yes diabetes
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['DIABETE3'] = brfss_df_selected['DIABETE3'].replace({2:0, 3:0, 1:2, 4:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE3 != 9]
brfss_df_selected.DIABETE3.unique()
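
The ordinal recode described in this cell can also be written with `Series.map`, which sends any code that is not in the mapping to `NaN`, so the "don't know" and "refused" codes can be dropped in the same step. This is an alternative sketch on made-up values, not the notebook's own method.

```python
import pandas as pd

# Raw codes as used in the cell above: 1 = yes, 2 and 3 = only during pregnancy or no,
# 4 = pre-diabetes or borderline, 7 = don't know, 9 = refused.
raw = pd.Series([1, 2, 3, 4, 7, 9])

# Target scale from the comment above: 0 = no diabetes, 1 = pre-diabetes, 2 = diabetes.
mapping = {1: 2, 2: 0, 3: 0, 4: 1}

# Codes missing from the mapping (7 and 9) become NaN and are removed by dropna().
recoded = raw.map(mapping).dropna().astype(int)
print(recoded.tolist())
```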

In [None]:
#8 _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)
brfss_df_selected['_TOTINDA'] = brfss_df_selected['_TOTINDA'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._TOTINDA != 9]
brfss_df_selected._TOTINDA.unique()

In [None]:
#9 _FRTLT1
# Change 2 to 0. this means no fruit consumed per day. 1 will mean consumed 1 or more pieces of fruit per day 
# remove all dont knows and missing 9
brfss_df_selected['_FRTLT1'] = brfss_df_selected['_FRTLT1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._FRTLT1 != 9]
brfss_df_selected._FRTLT1.unique()

In [None]:
#10 _VEGLT1
# Change 2 to 0. this means no vegetables consumed per day. 1 will mean consumed 1 or more pieces of vegetable per day 
# remove all dont knows and missing 9
brfss_df_selected['_VEGLT1'] = brfss_df_selected['_VEGLT1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._VEGLT1 != 9]
brfss_df_selected._VEGLT1.unique()

In [None]:
#11 _RFDRHV5
# Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking)
# remove all dont knows and missing 9
brfss_df_selected['_RFDRHV5'] = brfss_df_selected['_RFDRHV5'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFDRHV5 != 9]
brfss_df_selected._RFDRHV5.unique()

In [None]:
#12 HLTHPLN1
# 1 is yes, change 2 to 0 because it is No health care access
# remove 7 and 9 for don't know or refused
brfss_df_selected['HLTHPLN1'] = brfss_df_selected['HLTHPLN1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.HLTHPLN1 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.HLTHPLN1 != 9]
brfss_df_selected.HLTHPLN1.unique()

In [None]:
#13 MEDCOST
# Change 2 to 0 for no, 1 is already yes
# remove 7 for don't know and 9 for refused
brfss_df_selected['MEDCOST'] = brfss_df_selected['MEDCOST'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST != 9]
brfss_df_selected.MEDCOST.unique()

In [None]:
#14 GENHLTH
# This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 9]
brfss_df_selected.GENHLTH.unique()

In [None]:
#15 MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['MENTHLTH'] = brfss_df_selected['MENTHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 99]
brfss_df_selected.MENTHLTH.unique()

In [None]:
#16 PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['PHYSHLTH'] = brfss_df_selected['PHYSHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 99]
brfss_df_selected.PHYSHLTH.unique()
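
MENTHLTH and PHYSHLTH share the same coding (88 means none, 77 means don't know, 99 means refused), so the two cells above could share one function. The helper `clean_days` below is a hypothetical sketch demonstrated on made-up values.

```python
import pandas as pd

def clean_days(series):
    """Drop 77 (don't know) and 99 (refused), then recode 88 ('none') to 0 days."""
    kept = series[~series.isin([77, 99])]
    return kept.replace({88: 0})

raw = pd.Series([88, 5, 30, 77, 99, 88])
cleaned_days = clean_days(raw)
print(cleaned_days.tolist())
```

In the notebook the row filter would be applied to the whole DataFrame rather than to a single Series, so that the dropped rows disappear from every column.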

In [None]:
#17 DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 and 9 for don't know not sure and refused
brfss_df_selected['DIFFWALK'] = brfss_df_selected['DIFFWALK'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 9]
brfss_df_selected.DIFFWALK.unique()

In [None]:
#18 SEX
# in other words - is respondent male (somewhat arbitrarily chose this change because men are at higher risk for heart disease)
# change 2 to 0 (female as 0). Male is 1
brfss_df_selected['SEX'] = brfss_df_selected['SEX'].replace({2:0})
brfss_df_selected.SEX.unique()

In [None]:
#19 _AGEG5YR
# already ordinal. 1 is 18-24 all the way up to 13, which is 80 and older. 5 year increments.
# remove 14 because it is don't know or missing
brfss_df_selected = brfss_df_selected[brfss_df_selected._AGEG5YR != 14]
brfss_df_selected._AGEG5YR.unique()

In [None]:
#20 EDUCA
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6
# Remove 9 for refused:
brfss_df_selected = brfss_df_selected[brfss_df_selected.EDUCA != 9]
brfss_df_selected.EDUCA.unique()

In [None]:
#21 INCOME2
# Variable is already ordinal with 1 being less than $10,000 all the way up to 8 being $75,000 or more
# Remove 77 and 99 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME2 != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME2 != 99]
brfss_df_selected.INCOME2.unique()

In [None]:
#Check the shape of the dataset now: We have 253,680 cleaned rows and 22 columns (1 of which is our dependent variable)
brfss_df_selected.shape

In [None]:
#Let's see what the data looks like after Modifying Values
brfss_df_selected.head(10)

In [None]:
#Check Class Sizes
brfss_df_selected.groupby(['_MICHD']).size()

####Make Feature Names More Readable

In [None]:
#Rename the columns to make them more readable
brfss = brfss_df_selected.rename(columns = {'_MICHD':'HeartDiseaseorAttack', 
                                         '_RFHYPE5':'HighBP',  
                                         'TOLDHI2':'HighChol', '_CHOLCHK':'CholCheck', 
                                         '_BMI5':'BMI', 
                                         'SMOKE100':'Smoker', 
                                         'CVDSTRK3':'Stroke', 'DIABETE3':'Diabetes', 
                                         '_TOTINDA':'PhysActivity', 
                                         '_FRTLT1':'Fruits', '_VEGLT1':"Veggies", 
                                         '_RFDRHV5':'HvyAlcoholConsump', 
                                         'HLTHPLN1':'AnyHealthcare', 'MEDCOST':'NoDocbcCost', 
                                         'GENHLTH':'GenHlth', 'MENTHLTH':'MentHlth', 'PHYSHLTH':'PhysHlth', 'DIFFWALK':'DiffWalk', 
                                         'SEX':'Sex', '_AGEG5YR':'Age', 'EDUCA':'Education', 'INCOME2':'Income' })

In [None]:
#See the cleaned dataset with the readable column names
brfss.head(10)

In [None]:
#Double check shape of the dataset (rows and columns)
brfss.shape

In [None]:
#Check how many respondents have had heart disease or a heart attack. Note the class imbalance!
brfss.groupby(['HeartDiseaseorAttack']).size()
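
To make the imbalance concrete, the accuracy of a trivial model that always predicts "no heart disease" can be computed from the class counts alone. The sketch below uses the two figures quoted in this notebook (253,680 cleaned rows and 23,893 positive cases) and assumes both refer to the same cleaned dataset.

```python
# Counts quoted in this notebook; assumed to describe the same cleaned dataset.
n_total = 253_680
n_pos = 23_893
n_neg = n_total - n_pos

# Accuracy of always predicting class 0 (no heart disease or attack).
baseline_acc = n_neg / n_total
print(f"{baseline_acc:.3f}")
```

Under that assumption the baseline is roughly 0.906, so an accuracy near 0.90 is not informative by itself, which is why AUC is reported alongside accuracy.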

#### Save Finalized Dataset to CSV

In [None]:
#************************************************************************************************
brfss.to_csv('brfss2015_cleaned.csv', sep=",", index=False)
#************************************************************************************************

#### Get a BALANCED 50-50 Dataset Randomly Selected
*  The brfss dataset is clearly imbalanced. When training my models, I get about 90% accuracy on many models, with AUC between 0.70 and 0.80. The high accuracy may simply reflect the models learning the class distribution in the data.
*  To check this concern, I will create a second dataset with a 50-50 balance for the HeartDiseaseorAttack response variable, just to compare performance.
*  To do this, I will take a random sample of 23,893 instances of the 0 class (No Heart Disease / Attack) and all 23,893 instances of the 1 class (Yes Heart Disease / Attack).
*  If the balanced dataset performs comparably, then I can rest assured that the results are not just an artifact of the class imbalance.
*  With roughly 48,000 datapoints, I hope that this is sufficient to train the model and that the random selection will not greatly change the results. I have the random seed set to 1.
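
The sampling plan in this list can be sketched with `DataFrame.sample`, which accepts the seed directly through `random_state`. Note that `np.random.permutation` draws from NumPy's global random state, so a seed only takes effect there if `np.random.seed(1)` is called beforehand. The helper `downsample_majority` and the toy frame are hypothetical and for illustration only.

```python
import pandas as pd

def downsample_majority(df, target, n_per_class, seed=1):
    """Keep all rows of class 1 plus a seeded random sample of class 0 (hypothetical helper)."""
    ones = df[df[target] == 1]
    zeros = df[df[target] == 0].sample(n=n_per_class, random_state=seed)
    return pd.concat([zeros, ones], ignore_index=True)

# Toy frame: 8 negatives and 2 positives.
toy = pd.DataFrame({'HeartDiseaseorAttack': [0] * 8 + [1] * 2, 'x': range(10)})

balanced = downsample_majority(toy, 'HeartDiseaseorAttack', n_per_class=2)
counts = balanced['HeartDiseaseorAttack'].value_counts().to_dict()
print(counts)
```

Calling the function again with the same seed returns the same rows, which makes the balanced dataset reproducible across runs.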

In [None]:
#Separate the 0 and 1

#Get the 1s
is1 = brfss['HeartDiseaseorAttack'] == 1
brfss_5050_1 = brfss[is1]

#Get the 0s
is0 = brfss['HeartDiseaseorAttack'] == 0
brfss_5050_0 = brfss[is0] 

#Select the 23893 random cases for 0
brfss_5050_0_rand1 = brfss_5050_0.take(np.random.permutation(len(brfss_5050_0))[:23893])

In [None]:
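#Append the 23893 1s to the 23893 randomly selected 0s
#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0; assumes pandas is imported as pd earlier in the notebook)
brfss_5050 = pd.concat([brfss_5050_0_rand1, brfss_5050_1], ignore_index = True)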

In [None]:
#Check that it worked. Now we have a dataset of 47,786 rows that is equally balanced with 50% 1 and 50% 0 for the target variable HeartDiseaseorAttack
brfss_5050

In [None]:
#See the classes are perfectly balanced now
brfss_5050.groupby(['HeartDiseaseorAttack']).size()

In [None]:
#Save the 50-50 balanced dataset to csv

#************************************************************************************************
brfss_5050.to_csv('brfss2015_5050_cleaned.csv', sep=",", index=False)
#************************************************************************************************

#### Also Get a 60-40 Dataset Randomly Selected

In [None]:
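#Also make a "60-40" dataset: 47,786 randomly selected 0s plus all 23,893 1s (this is roughly a 67-33 split)
#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0; assumes pandas is imported as pd earlier in the notebook)
brfss_6040_0_rand1 = brfss_5050_0.take(np.random.permutation(len(brfss_5050_0))[:47786])
brfss_6040 = pd.concat([brfss_6040_0_rand1, brfss_5050_1], ignore_index = True)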
#Save the 6040 balanced dataset to csv
#************************************************************************************************
brfss_6040.to_csv('brfss2015_6040_cleaned.csv', sep=",", index=False)
#************************************************************************************************
brfss_6040
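
As a check on the sampling arithmetic: with 23,893 positive cases, the number of negatives needed for a given class split follows from the ratio, and the share produced by sampling 47,786 negatives can be computed the same way. The helper `majority_count_for_split` is hypothetical and uses integer percentages to avoid floating-point rounding surprises.

```python
def majority_count_for_split(n_minority, majority_pct):
    """Majority rows needed so they form majority_pct percent of the total (rounded down)."""
    return n_minority * majority_pct // (100 - majority_pct)

n_pos = 23_893

# Negatives needed for an exact 60-40 split.
needed_6040 = majority_count_for_split(n_pos, 60)

# Share of negatives when 47,786 of them are sampled, as in the cell above.
share_sampled = 47_786 / (47_786 + n_pos)

print(needed_6040, round(share_sampled, 4))
```

Sampling 47,786 negatives is exactly twice the number of positives, so the resulting dataset is two-thirds negative; an exact 60-40 split would need about 35,839 negatives.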

#Part 2: Model Building

## Random Forests

### Random Forest - w/ Feature Selection - Full Dataset

* 10 trees & 50 trees tested (n_estimators changes)
* RF 10 trees - 5-fold cv - with Feature Selection 
 ACC: 0.89 (+/- 0.00)  |   AUC: 0.71 (+/- 0.01)  |   Runtime: 9.93 seconds
* RF 50 trees - 5-fold cv - with Feature Selection 
 ACC: 0.89 (+/- 0.00)  |   AUC: 0.74 (+/- 0.01)  |   Runtime: 48.17 seconds
* RF 50 trees - 10-fold cv - with Feature Selection 
 ACC: 0.89 (+/- 0.00)  |   AUC: 0.74 (+/- 0.01)  |   Runtime: 103.57 seconds
* RF Selected Features: ['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']
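
The feature selection used in this section (`fs_type=4` in the script) keeps every feature whose random forest importance is at or above the mean importance. The rule itself is a one-line comparison, sketched below with made-up importance scores that are not results from this project.

```python
import numpy as np

# Made-up importance scores for illustration only; real values come from
# RandomForestClassifier.feature_importances_ after fitting.
names = ['HighBP', 'BMI', 'Smoker', 'Age', 'Income']
importances = np.array([0.05, 0.30, 0.05, 0.45, 0.15])

# Keep features whose importance is at or above the mean importance.
keep = importances >= importances.mean()
selected = [name for name, flag in zip(names, keep) if flag]
print(selected)
```

Because impurity-based importances sum to 1, the mean is 1 divided by the number of features, so the rule keeps features that carry more than an equal share of the total importance.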


In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
from operator import itemgetter
import time

import joblib                                       #standalone package; sklearn.externals.joblib is no longer available in recent scikit-learn
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale


#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=0                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=1                                       #Control Switch for Feature Selection
fs_type=4                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection, 4=Random Forest importance at or above the mean)
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################




#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                          #Removes any feature with variance below 0.5
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected:', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    #Wrapper Select via model
    if fs_type==2:
        clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)                
        sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                   
        print ('Wrapper Select: ')

        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==4:
        clf= RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)
        clf.fit(data_np,target_np)
        sel_idx = []
        print('clf.feature_importances_ = ', clf.feature_importances_)
        for x in clf.feature_importances_:
          if x >= np.mean(clf.feature_importances_):
            sel_idx.append(1)
          else:
            sel_idx.append(0)

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected:', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
               
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)     #Seeded so the hold-out split is reproducible

####Classifiers####
if cross_val==0:    
    #SciKit Random Forest
    clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)  
    clf.fit(data_train,target_train)

    scores_ACC = clf.score(data_test, target_test)                                                                                                                          
    print('Random Forest Acc:', scores_ACC)
    scores_AUC = metrics.roc_auc_score(target_test, clf.predict_proba(data_test)[:,1])                                                                                      
    print('Random Forest AUC:', scores_AUC)                                                                     #AUC only works with binary classes, not multiclass            
 
####Cross-Val Classifiers####
if cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Random Forest - Cross Val
    start_ts=time.time()
    clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)   
    scores = cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)                                                                                                 

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Random Forest Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Random Forest AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("CV Runtime:", time.time()-start_ts)

### Random Forest - w/o Feature Selection - Full Dataset


* 10 trees & 50 trees tested (n_estimators changes)
* RF 10 trees - 5-fold cv ACC: 0.89 (+/- 0.00)  |   AUC: 0.71 (+/- 0.01)  |   Runtime: 9.93 seconds
* RF 50 trees - 5-fold cv ACC: 0.90 (+/- 0.00)  |   AUC: 0.82 (+/- 0.01)  |   Runtime: 66.19 seconds
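
The `(+/- ...)` figures reported in these results come from the script's print statements, which show the mean of the five fold scores plus or minus two standard deviations (NumPy's `std()` defaults to the population standard deviation). A minimal sketch of that summary in plain Python is below; the fold scores are hypothetical placeholders, not results from this notebook, and `summarize_cv` is an illustrative helper name.

```python
from statistics import mean, pstdev

def summarize_cv(fold_scores):
    """Return (mean, 2 * population std) for a list of per-fold scores."""
    return mean(fold_scores), 2 * pstdev(fold_scores)

# Hypothetical per-fold AUC values, for illustration only
example_auc = [0.81, 0.82, 0.82, 0.83, 0.82]

avg, spread = summarize_cv(example_auc)
print("AUC: %0.2f (+/- %0.2f)" % (avg, spread))
```

A spread that rounds to 0.00 therefore means the five folds agreed to within about half a percentage point, not that there was no variation at all.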

In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale


#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=0                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=0                                       #Control Switch for Feature Selection
fs_type=4                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection, 4=RF importances at or above the mean)
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())                    #float() is 0.0, so blank feature values are loaded as 0.0
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################




#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                          #Removes features whose variance falls below the 0.5 threshold
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected:', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    #Wrapper Select via model
    if fs_type==2:
        clf = RandomForestClassifier( n_estimators=100, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)                
        sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                   
        print ('Wrapper Select: ')

        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==4:
        clf= RandomForestClassifier( n_estimators=10, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)
        clf.fit(data_np,target_np)
        print('clf.feature_importances_ = ', clf.feature_importances_)
        #Flag features whose importance is at or above the mean importance (1=keep, 0=drop)
        sel_idx = (clf.feature_importances_ >= np.mean(clf.feature_importances_)).astype(int)

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected:', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
               
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)     #Seeded so the hold-out split is reproducible

####Classifiers####
if cross_val==0:    
    #SciKit Random Forest
    clf = RandomForestClassifier( n_estimators=10, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)  
    clf.fit(data_train,target_train)

    scores_ACC = clf.score(data_test, target_test)                                                                                                                          
    print('Random Forest Acc:', scores_ACC)
    scores_AUC = metrics.roc_auc_score(target_test, clf.predict_proba(data_test)[:,1])                                                                                      
    print('Random Forest AUC:', scores_AUC)                                                                     #AUC only works with binary classes, not multiclass            
 
####Cross-Val Classifiers####
if cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Random Forest - Cross Val
    start_ts=time.time()
    clf = RandomForestClassifier( n_estimators=100, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)   
    scores = cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)                                                                                                 

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Random Forest Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Random Forest AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("CV Runtime:", time.time()-start_ts)

### Random Forest - w/ and w/o Feature Selection - 50-50 Balanced Dataset

* 10 trees & 50 trees tested (n_estimators changes)
* RF 10 trees - 5-fold cv ACC: 0.75 (+/- 0.01)  |   AUC: 0.82 (+/- 0.01)  |   Runtime: 2.51 seconds
* RF 50 trees - 5-fold cv ACC: 0.76 (+/- 0.02)  |   AUC: 0.83 (+/- 0.01)  |   Runtime: 12.40 seconds
* RF 50 trees w/feat_select - 5-fold cv ACC: 0.72 (+/- 0.01)  |   AUC: 0.78 (+/- 0.01)  |   Runtime: 10.62 seconds
* Selected Features: ['HighBP', 'BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']

Notes:
* clf features: ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke', 'Diabetes', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income']
* clf.feature_importances_ =  [0.051, 0.038, 0.005, 0.146, 0.024, 0.025, 0.032, 0.022, 0.026, 0.022, 0.009, 0.007, 0.012, 0.093, 0.053, 0.072, 0.030, 0.029, 0.153, 0.056, 0.084]
* Age, BMI, GenHlth, Income, PhysHlth, Education, MentHlth, and HighBP all have importances at or above the mean, so they make up the selected feature set. Age (0.153) and BMI (0.146) are the most important.
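
The selection rule behind these notes can be checked by hand. The sketch below copies the feature names and the rounded importances listed above, then keeps every feature whose importance is at or above the mean, which is what the `fs_type==4` branch of the script does. It is written in plain Python for illustration; the variable names are not part of the course script.

```python
# Feature names and (rounded) importances copied from the notes above
features = ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke',
            'Diabetes', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump',
            'AnyHealthcare', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth',
            'DiffWalk', 'Sex', 'Age', 'Education', 'Income']
importances = [0.051, 0.038, 0.005, 0.146, 0.024, 0.025, 0.032, 0.022, 0.026,
               0.022, 0.009, 0.007, 0.012, 0.093, 0.053, 0.072, 0.030, 0.029,
               0.153, 0.056, 0.084]

# Threshold is the mean importance across all 21 features
threshold = sum(importances) / len(importances)

# Keep features at or above the mean, preserving column order
selected = [name for name, imp in zip(features, importances) if imp >= threshold]

# Rank all features from most to least important
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)

print('Mean importance: %0.3f' % threshold)
print('Selected:', selected)
print('Top two:', ranked[:2])
```

With these rounded values the rule reproduces the eight selected features reported for this run.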



In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale


#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=0                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=1                                       #Control Switch for Feature Selection
fs_type=4                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection, 4=RF importances at or above the mean)
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_5050_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())                    #float() is 0.0, so blank feature values are loaded as 0.0
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################




#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                          #Removes features whose variance falls below the 0.5 threshold
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected:', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    #Wrapper Select via model
    if fs_type==2:
        clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)                
        sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                   
        print ('Wrapper Select: ')

        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==4:
        clf= RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)
        clf.fit(data_np,target_np)
        print('clf.feature_importances_ = ', clf.feature_importances_)
        #Flag features whose importance is at or above the mean importance (1=keep, 0=drop)
        sel_idx = (clf.feature_importances_ >= np.mean(clf.feature_importances_)).astype(int)

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected:', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
               
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)     #Seeded so the hold-out split is reproducible

####Classifiers####
if cross_val==0:    
    #SciKit Random Forest
    clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)  
    clf.fit(data_train,target_train)

    scores_ACC = clf.score(data_test, target_test)                                                                                                                          
    print('Random Forest Acc:', scores_ACC)
    scores_AUC = metrics.roc_auc_score(target_test, clf.predict_proba(data_test)[:,1])                                                                                      
    print('Random Forest AUC:', scores_AUC)                                                                     #AUC only works with binary classes, not multiclass            
 
####Cross-Val Classifiers####
if cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Random Forest - Cross Val
    start_ts=time.time()
    clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)   
    scores = cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)                                                                                                 

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Random Forest Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Random Forest AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("CV Runtime:", time.time()-start_ts)

### Random Forest - w/ Feature Selection - 60-40 Balanced Dataset
RF 50 trees  
* 5-fold cv Acc: 0.73 (+/- 0.01) | AUC: 0.78 (+/- 0.01) | Runtime: 15.27 seconds
* Selected: ['HighBP', 'BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']
* Features (total/selected): 21 8
* The same eight features are selected as on the 50-50 dataset. Age and BMI are especially important.
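
Accuracy on a rebalanced dataset is easier to read next to the accuracy of always predicting the majority class, which is fixed by the class split. The sketch below makes that comparison for the two feature-selected Random Forest runs reported in this notebook. It assumes the dataset names describe the class proportions exactly (50-50 and 60-40), and `majority_baseline` is an illustrative helper, not part of the course script.

```python
def majority_baseline(majority_fraction):
    """Accuracy of a classifier that always predicts the majority class."""
    return max(majority_fraction, 1.0 - majority_fraction)

# (assumed majority-class fraction, reported 5-fold RF accuracy with feature selection)
reported = {'50-50': (0.50, 0.72), '60-40': (0.60, 0.73)}

gains = {}
for name, (majority_fraction, rf_acc) in reported.items():
    base = majority_baseline(majority_fraction)
    gains[name] = rf_acc - base
    print('%s: baseline %0.2f, RF %0.2f, gain %0.2f' % (name, base, rf_acc, gains[name]))
```

Under that assumption, the 60-40 run has the slightly higher raw accuracy but the smaller gain over its baseline (0.13 versus 0.22), which is one reason AUC is reported alongside accuracy throughout.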




In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale


#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=0                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=1                                       #Control Switch for Feature Selection
fs_type=4                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection, 4=RF importances at or above the mean)
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_6040_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())                    #float() is 0.0, so blank feature values are loaded as 0.0
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################




#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                          #Removes features whose variance falls below the 0.5 threshold
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected:', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    #Wrapper Select via model
    if fs_type==2:
        clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)                
        sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                   
        print ('Wrapper Select: ')

        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==4:
        clf= RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)
        clf.fit(data_np,target_np)
        print('clf.feature_importances_ = ', clf.feature_importances_)
        #Flag features whose importance is at or above the mean importance (1=keep, 0=drop)
        sel_idx = (clf.feature_importances_ >= np.mean(clf.feature_importances_)).astype(int)

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected:', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
               
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)     #Seeded so the hold-out split is reproducible

####Classifiers####
if cross_val==0:    
    #SciKit Random Forest
    clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)  
    clf.fit(data_train,target_train)

    scores_ACC = clf.score(data_test, target_test)                                                                                                                          
    print('Random Forest Acc:', scores_ACC)
    scores_AUC = metrics.roc_auc_score(target_test, clf.predict_proba(data_test)[:,1])                                                                                      
    print('Random Forest AUC:', scores_AUC)                                                                     #AUC only works with binary classes, not multiclass            
 
####Cross-Val Classifiers####
if cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Random Forest - Cross Val
    start_ts=time.time()
    clf = RandomForestClassifier( n_estimators=50, max_depth=None, min_samples_split=3,criterion='entropy', random_state=rand_st)   
    scores = cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)                                                                                                 

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Random Forest Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Random Forest AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("CV Runtime:", time.time()-start_ts)

## AdaBoost, GradientBoost, and Neural Networks


### AdaBoost, GradientBoost, and Neural Network - w/o Feature Selection - Full Dataset
Gradient Boosting:
* Gradient Boosting - Acc: 0.91 (+/- 0.00)
* Gradient Boosting - AUC: 0.85 (+/- 0.01)
* GB - CV Runtime: 153.49 seconds

Ada Boost:
* Ada Boost - Acc: 0.91 (+/- 0.00)
* Ada Boost - AUC: 0.84 (+/- 0.01)
* Ada - CV Runtime: 92.66 seconds

Neural Network:
* Neural Network - Acc: 0.91 (+/- 0.00)
* Neural Network - AUC: 0.85 (+/- 0.01)
* NN - CV Runtime: 113.11 seconds
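
The "x (+/- y)" figures above are the mean of the five cross-validation folds and twice their standard deviation. A minimal sketch of that computation follows, assuming a small synthetic dataset from `make_classification` as a stand-in for the BRFSS file. The dataset, the reduced `n_estimators`, and the helper `summarize` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in for the BRFSS data (hypothetical, not the real dataset)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=1)

# Same scorer dictionary as in the notebook: result keys become 'test_<name>'
scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}
clf = GradientBoostingClassifier(n_estimators=20, random_state=1)
scores = cross_validate(clf, X, y, scoring=scorers, cv=5)

def summarize(fold_scores):
    """Return (mean, 2 * std) across folds, the two numbers printed as 'x (+/- y)'."""
    fold_scores = np.asarray(fold_scores)
    return fold_scores.mean(), fold_scores.std() * 2

acc_mean, acc_spread = summarize(scores['test_Accuracy'])
auc_mean, auc_spread = summarize(scores['test_roc_auc'])
print("Acc: %0.2f (+/- %0.2f)" % (acc_mean, acc_spread))
print("AUC: %0.2f (+/- %0.2f)" % (auc_mean, auc_spread))
```

A spread of 0.00 therefore means the fold scores agree to two decimal places, not that they are identical.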


In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #Standalone joblib package (formerly bundled as sklearn.externals.joblib)
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale

#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=1                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=0                                       #Control Switch for Feature Selection                                                                                   
fs_type=2                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection)                        
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features
k_cnt=5                                             #Number of 'Top k' best ranked features to select, only applies for fs_types 1 and 3

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(0.0)                        #Blank feature values default to 0.0
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################

if norm_target==1:
    #Target normalization for continuous values
    target_np=scale(target_np)

if norm_features==1:
    #Feature normalization for continuous values
    data_np=scale(data_np)

'''if binning==1:
    #Discretize Target variable with KBinsDiscretizer
    enc = KBinsDiscretizer(n_bins=[bin_cnt], encode='ordinal', strategy='quantile')                         #Strategy here is important, quantile creating equal bins, but kmeans prob being more valid "clusters"
    target_np_bin = enc.fit_transform(target_np.reshape(-1,1))

    #Get Bin min/max
    temp=[[] for x in range(bin_cnt+1)]
    for i in range(len(target_np)):
        for j in range(bin_cnt):
            if target_np_bin[i]==j:
                temp[j].append(target_np[i])

    for j in range(bin_cnt):
        print('Bin', j, ':', min(temp[j]), max(temp[j]), len(temp[j]))
    print('\n')

    #Convert Target array back to correct shape
    target_np=np.ravel(target_np_bin)'''


#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                      #Removes any feature with variance below 0.5
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    if fs_type==1:
        #Stepwise Recursive Backwards Feature removal
        if binning==1:
            clf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=3, criterion='entropy', random_state=rand_st)
            sel = RFE(clf, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
        if binning==0:
            rgr = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_split=3, random_state=rand_st)    #Uses the default criterion (mean squared error)
            sel = RFE(rgr, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
            
        fit_mod=sel.fit(data_np, target_np)
        print(sel.ranking_)
        sel_idx=fit_mod.get_support()      

    if fs_type==2:
        #Wrapper Select via model
        if binning==1:
            clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)    #Uses the default loss (binomial deviance / log loss)
            sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                                                           #to select only based on max_features, set to integer value and set threshold=-np.inf
            print ('Wrapper Select: ')
        if binning==0:
            rgr = '''Unused in this homework'''
            sel = SelectFromModel(rgr, prefit=False, threshold='mean', max_features=None)
            print ('Wrapper Select: ')
            
        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==3:
        if binning==1:                                                              ######Only work if the Target is binned###########
            #Univariate Feature Selection - Chi-squared
            sel=SelectKBest(chi2, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)                                         #will throw error if any negative values in features, so turn off feature normalization, or switch to mutual_info_classif
            print ('Univariate Feature Selection - Chi2: ')
            sel_idx=fit_mod.get_support()

        if binning==0:                                                              ######Only work if the Target is continuous###########
            #Univariate Feature Selection - Mutual Info Regression
            sel=SelectKBest(mutual_info_regression, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)
            print ('Univariate Feature Selection - Mutual Info: ')
            sel_idx=fit_mod.get_support()

        #Print ranked variables out sorted
        temp=[]
        scores=fit_mod.scores_
        for i in range(feat_start, len(header)):            
            temp.append([header[i], float(scores[i-feat_start])])

        print('Ranked Features')
        temp_sort=sorted(temp, key=itemgetter(1), reverse=True)
        for i in range(len(temp_sort)):
            print(i, temp_sort[i][0], ':', temp_sort[i][1])
        print('\n')

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
                
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index)
    
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)

####Classifiers####
if binning==1 and cross_val==0:
    #SciKit
    '''Test/Train split unused in this homework, skip down to CV section'''
 

                                                                                                                         
 
####Cross-Val Classifiers####
if binning==1 and cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Gradient Boosting - Cross Val
    start_ts=time.time()
    clf=GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)    #Uses the default loss (binomial deviance / log loss)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']
    print("Gradient Boosting - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Gradient Boosting - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("GB - CV Runtime:", time.time()-start_ts)


    #SciKit Ada Boosting - Cross Val
    start_ts=time.time()
    clf=AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=rand_st)    #Uses the default base learner (a depth-1 decision tree)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']
    print("Ada Boost - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Ada Boost - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("Ada - CV Runtime:", time.time()-start_ts)


    #SciKit Neural Network - Cross Val
    start_ts=time.time()
    clf=MLPClassifier(activation='logistic', solver='adam', alpha=0.0001, max_iter=1000, hidden_layer_sizes=(10,), random_state=rand_st)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Neural Network - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Neural Network - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("NN - CV Runtime:", time.time()-start_ts) 

### AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - Full Dataset
Gradient Boosting:
* Gradient Boosting - Acc: 0.91 (+/- 0.00)
* Gradient Boosting - AUC: 0.85 (+/- 0.01)
* GB - CV Runtime: 54.26 seconds

Ada Boost:
* Ada Boost - Acc: 0.91 (+/- 0.00)
* Ada Boost - AUC: 0.84 (+/- 0.01)
* Ada - CV Runtime: 49.80 seconds

Neural Network:
* Neural Network - Acc: 0.91 (+/- 0.00)
* Neural Network - AUC: 0.84 (+/- 0.01)
* NN - CV Runtime: 36.23 seconds

Notes:
* Selected Features: ['HighBP', 'HighChol', 'Stroke', 'GenHlth', 'DiffWalk', 'Sex', 'Age']
* Features (total/selected): 21 7
* Note that the selected features differ from the ones identified by the Random Forest feature importance and selection.
* No significant change in ACC or AUC when using feature selection, just faster runtimes.
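
The wrapper selection used for these runs (`fs_type=2`) fits a Gradient Boosting model and keeps every feature whose importance is at least the mean importance. A minimal sketch of that step follows, assuming synthetic data from `make_classification` and placeholder feature names (`f0` to `f7`). Neither the data nor the names come from the BRFSS file, and `n_estimators` is reduced for speed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical feature names and synthetic data (not the BRFSS columns)
feature_names = ['f%d' % i for i in range(8)]
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, random_state=1)

# threshold='mean' keeps features whose importance is >= the mean importance
clf = GradientBoostingClassifier(n_estimators=50, random_state=1)
sel = SelectFromModel(clf, prefit=False, threshold='mean')
sel.fit(X, y)

mask = sel.get_support()                              # Boolean mask over the columns
importances = sel.estimator_.feature_importances_     # Importances from the fitted model
selected = [name for name, keep in zip(feature_names, mask) if keep]
X_selected = sel.transform(X)                         # Keeps only the selected columns

print('Selected', selected)
print('Features (total/selected):', X.shape[1], X_selected.shape[1])
```

Because the threshold is relative to the mean, the number of selected features is not fixed in advance, which is why `k_cnt` does not apply to this selection type.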


In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #Standalone joblib package (formerly bundled as sklearn.externals.joblib)
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale

#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=1                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=1                                       #Control Switch for Feature Selection                                                                                   
fs_type=2                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection)                        
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features
k_cnt=5                                             #Number of 'Top k' best ranked features to select, only applies for fs_types 1 and 3

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(0.0)                        #Blank feature values default to 0.0
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################

if norm_target==1:
    #Target normalization for continuous values
    target_np=scale(target_np)

if norm_features==1:
    #Feature normalization for continuous values
    data_np=scale(data_np)

'''if binning==1:
    #Discretize Target variable with KBinsDiscretizer
    enc = KBinsDiscretizer(n_bins=[bin_cnt], encode='ordinal', strategy='quantile')                         #Strategy here is important, quantile creating equal bins, but kmeans prob being more valid "clusters"
    target_np_bin = enc.fit_transform(target_np.reshape(-1,1))

    #Get Bin min/max
    temp=[[] for x in range(bin_cnt+1)]
    for i in range(len(target_np)):
        for j in range(bin_cnt):
            if target_np_bin[i]==j:
                temp[j].append(target_np[i])

    for j in range(bin_cnt):
        print('Bin', j, ':', min(temp[j]), max(temp[j]), len(temp[j]))
    print('\n')

    #Convert Target array back to correct shape
    target_np=np.ravel(target_np_bin)'''


#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                      #Removes any feature with variance below 0.5
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    if fs_type==1:
        #Stepwise Recursive Backwards Feature removal
        if binning==1:
            clf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=3, criterion='entropy', random_state=rand_st)
            sel = RFE(clf, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
        if binning==0:
            rgr = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_split=3, random_state=rand_st)    #Uses the default criterion (mean squared error)
            sel = RFE(rgr, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
            
        fit_mod=sel.fit(data_np, target_np)
        print(sel.ranking_)
        sel_idx=fit_mod.get_support()      

    if fs_type==2:
        #Wrapper Select via model
        if binning==1:
            clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)    #Uses the default loss (binomial deviance / log loss)
            sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                                                           #to select only based on max_features, set to integer value and set threshold=-np.inf
            print ('Wrapper Select: ')
        if binning==0:
            rgr = '''Unused in this homework'''
            sel = SelectFromModel(rgr, prefit=False, threshold='mean', max_features=None)
            print ('Wrapper Select: ')
            
        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==3:
        if binning==1:                                                              ######Only work if the Target is binned###########
            #Univariate Feature Selection - Chi-squared
            sel=SelectKBest(chi2, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)                                         #will throw error if any negative values in features, so turn off feature normalization, or switch to mutual_info_classif
            print ('Univariate Feature Selection - Chi2: ')
            sel_idx=fit_mod.get_support()

        if binning==0:                                                              ######Only work if the Target is continuous###########
            #Univariate Feature Selection - Mutual Info Regression
            sel=SelectKBest(mutual_info_regression, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)
            print ('Univariate Feature Selection - Mutual Info: ')
            sel_idx=fit_mod.get_support()

        #Print ranked variables out sorted
        temp=[]
        scores=fit_mod.scores_
        for i in range(feat_start, len(header)):            
            temp.append([header[i], float(scores[i-feat_start])])

        print('Ranked Features')
        temp_sort=sorted(temp, key=itemgetter(1), reverse=True)
        for i in range(len(temp_sort)):
            print(i, temp_sort[i][0], ':', temp_sort[i][1])
        print('\n')

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
                
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index)
    
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)

####Classifiers####
if binning==1 and cross_val==0:
    #SciKit
    '''Test/Train split unused in this homework, skip down to CV section'''
 

                                                                                                                         
 
####Cross-Val Classifiers####
if binning==1 and cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Gradient Boosting - Cross Val
    start_ts=time.time()
    clf=GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)    #Uses the default loss (binomial deviance / log loss)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']
    print("Gradient Boosting - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Gradient Boosting - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("GB - CV Runtime:", time.time()-start_ts)


    #SciKit Ada Boosting - Cross Val
    start_ts=time.time()
    clf=AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=rand_st)    #Uses the default base learner (a depth-1 decision tree)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']
    print("Ada Boost - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Ada Boost - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("Ada - CV Runtime:", time.time()-start_ts)


    #SciKit Neural Network - Cross Val
    start_ts=time.time()
    clf=MLPClassifier(activation='logistic', solver='adam', alpha=0.0001, max_iter=1000, hidden_layer_sizes=(10,), random_state=rand_st)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Neural Network - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Neural Network - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("NN - CV Runtime:", time.time()-start_ts) 

### AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - 50-50 Dataset
Gradient Boosting:
* Gradient Boosting - Acc: 0.76 (+/- 0.01)
* Gradient Boosting - AUC: 0.84 (+/- 0.01)
* GB - CV Runtime: 8.77 seconds

Ada Boost:
* Ada Boost - Acc: 0.76 (+/- 0.01)
* Ada Boost - AUC: 0.83 (+/- 0.01)
* Ada - CV Runtime: 8.85 seconds

Neural Network:
* Neural Network - Acc: 0.76 (+/- 0.01)
* Neural Network - AUC: 0.84 (+/- 0.01)
* NN - CV Runtime: 18.77 seconds

Notes:
* Selected Features: ['HighBP', 'HighChol', 'GenHlth', 'Sex', 'Age']
* Features (total/selected): 21 5
* Running without feature selection was also tested: there was no significant change in ACC or AUC, only in runtime.
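
Accuracy falls from 0.91 on the full dataset to 0.76 here while AUC stays at about 0.84. One likely reason is that accuracy depends on class balance: a model that always predicts the majority class scores the majority share on imbalanced data, but only 0.5 on a 50-50 dataset. The sketch below illustrates this with random undersampling. The 90/10 split is an assumed figure for illustration, not the measured BRFSS prevalence, and `undersample_balanced` and `majority_baseline_accuracy` are hypothetical helpers, not the code used to build `brfss2015_5050_cleaned.csv`.

```python
import numpy as np

def undersample_balanced(X, y, seed=1):
    """Randomly drop rows so both classes have equal counts (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([rng.choice(idx_pos, n, replace=False),
                           rng.choice(idx_neg, n, replace=False)])
    rng.shuffle(keep)                     # Mix the two classes back together
    return X[keep], y[keep]

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most common class."""
    counts = np.bincount(y.astype(int))
    return counts.max() / counts.sum()

# Synthetic labels with an assumed 90/10 class split (illustrative only)
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y), dtype=float).reshape(-1, 1)   # One dummy feature: the row index

X_bal, y_bal = undersample_balanced(X, y)
print('Rows (full/balanced):', len(y), len(y_bal))
print('Baseline acc, full dataset:', majority_baseline_accuracy(y))
print('Baseline acc, 50-50 dataset:', majority_baseline_accuracy(y_bal))
```

AUC measures how well the model ranks positives above negatives, so it is far less sensitive to the class ratio than accuracy, which makes it the more comparable metric across the full and balanced datasets.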

In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #Standalone joblib package (formerly bundled as sklearn.externals.joblib)
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale

#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=1                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=0                                       #Control Switch for Feature Selection                                                                                   
fs_type=2                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection)                        
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features
k_cnt=5                                             #Number of 'Top k' best ranked features to select, only applies for fs_types 1 and 3

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_5050_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################

if norm_target==1:
    #Target normalization for continuous values
    target_np=scale(target_np)

if norm_features==1:
    #Feature normalization for continuous values
    data_np=scale(data_np)

'''if binning==1:
    #Discretize Target variable with KBinsDiscretizer
    enc = KBinsDiscretizer(n_bins=[bin_cnt], encode='ordinal', strategy='quantile')                         #Strategy here is important, quantile creating equal bins, but kmeans prob being more valid "clusters"
    target_np_bin = enc.fit_transform(target_np.reshape(-1,1))

    #Get Bin min/max
    temp=[[] for x in range(bin_cnt+1)]
    for i in range(len(target_np)):
        for j in range(bin_cnt):
            if target_np_bin[i]==j:
                temp[j].append(target_np[i])

    for j in range(bin_cnt):
        print('Bin', j, ':', min(temp[j]), max(temp[j]), len(temp[j]))
    print('\n')

    #Convert Target array back to correct shape
    target_np=np.ravel(target_np_bin)'''


#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                      #Removes any feature with variance below 0.5
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    if fs_type==1:
        #Stepwise Recursive Backwards Feature removal
        if binning==1:
            clf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=3, criterion='entropy', random_state=rand_st)
            sel = RFE(clf, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
        if binning==0:
            rgr = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_split=3, random_state=rand_st)      #Default criterion is squared error (named 'mse' in older scikit-learn)
            sel = RFE(rgr, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
            
        fit_mod=sel.fit(data_np, target_np)
        print(sel.ranking_)
        sel_idx=fit_mod.get_support()      

    if fs_type==2:
        #Wrapper Select via model
        if binning==1:
            clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)      #Default loss is the logistic deviance ('deviance' in older scikit-learn, 'log_loss' in newer)
            sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                                                           #to select only based on max_features, set to integer value and set threshold=-np.inf
            print ('Wrapper Select: ')
        if binning==0:
            rgr = '''Unused in this homework'''
            sel = SelectFromModel(rgr, prefit=False, threshold='mean', max_features=None)
            print ('Wrapper Select: ')
            
        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==3:
        if binning==1:                                                              ######Only work if the Target is binned###########
            #Univariate Feature Selection - Chi-squared
            sel=SelectKBest(chi2, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)                                         #will throw error if any negative values in features, so turn off feature normalization, or switch to mutual_info_classif
            print ('Univariate Feature Selection - Chi2: ')
            sel_idx=fit_mod.get_support()

        if binning==0:                                                              ######Only work if the Target is continuous###########
            #Univariate Feature Selection - Mutual Info Regression
            sel=SelectKBest(mutual_info_regression, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)
            print ('Univariate Feature Selection - Mutual Info: ')
            sel_idx=fit_mod.get_support()

        #Print ranked variables out sorted
        temp=[]
        scores=fit_mod.scores_
        for i in range(feat_start, len(header)):            
            temp.append([header[i], float(scores[i-feat_start])])

        print('Ranked Features')
        temp_sort=sorted(temp, key=itemgetter(1), reverse=True)
        for i in range(len(temp_sort)):
            print(i, temp_sort[i][0], ':', temp_sort[i][1])
        print('\n')

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
                
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
    
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)      #Fixed seed so the split is reproducible

####Classifiers####
if binning==1 and cross_val==0:
    #SciKit
    '''Test/Train split unused in this homework, skip down to CV section'''
 

                                                                                                                         
 
####Cross-Val Classifiers####
if binning==1 and cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Gradient Boosting - Cross Val
    start_ts=time.time()
    clf=GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)      #Default loss is the logistic deviance ('deviance' in older scikit-learn, 'log_loss' in newer)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Gradient Boosting - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Gradient Boosting - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("GB - CV Runtime:", time.time()-start_ts)


    #SciKit Ada Boosting - Cross Val
    start_ts=time.time()
    clf=AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=rand_st)      #Default base learner (a depth-1 decision tree) is used
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Ada Boost - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Ada Boost - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("Ada - CV Runtime:", time.time()-start_ts)


    #SciKit Neural Network - Cross Val
    start_ts=time.time()
    clf=MLPClassifier(activation='logistic', solver='adam', alpha=0.0001, max_iter=1000, hidden_layer_sizes=(10,), random_state=rand_st)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Neural Network - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Neural Network - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("NN - CV Runtime:", time.time()-start_ts) 

### AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - 60-40 Dataset
Gradient Boosting:
* Gradient Boosting - Acc: 0.78 (+/- 0.01)
* Gradient Boosting - AUC: 0.84 (+/- 0.01)
* GB - CV Runtime: 13.71 seconds

Ada Boost:
* Ada Boost - Acc: 0.77 (+/- 0.01)
* Ada Boost - AUC: 0.84 (+/- 0.01)
* Ada - CV Runtime: 13.24 seconds

Neural Network:
* Neural Network - Acc: 0.78 (+/- 0.01)
* Neural Network - AUC: 0.84 (+/- 0.01)
* NN - CV Runtime: 21.01 seconds

Notes: 
* Selected Features: ['HighBP', 'HighChol', 'Stroke', 'GenHlth', 'Sex', 'Age']
* Features (total/selected): 21 6
* Running without feature selection was also tested; ACC and AUC did not change significantly, only runtime did.
* Changes to GB max_depth did not improve ACC or AUC.
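
The sketch below shows, under stated assumptions, how figures like "0.78 (+/- 0.01)" come out of 5-fold cross-validation (mean score, plus or minus two standard deviations across folds) and how a `max_depth` comparison of the kind mentioned in the notes might be run. The data is synthetic, and the depth values and the reduced `n_estimators` are illustrative choices, not the settings behind the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Synthetic binary-classification data (hypothetical stand-in for the 60-40 set)
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)

# Same scorer dictionary as the notebook: keys become 'test_<name>' in the output
scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}

results = {}
for depth in (2, 3, 5):  # illustrative depths only
    clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                     max_depth=depth, min_samples_split=3,
                                     random_state=1)
    scores = cross_validate(clf, X, y, scoring=scorers, cv=5)
    acc, auc = scores['test_Accuracy'], scores['test_roc_auc']
    results[depth] = (acc.mean(), auc.mean())
    # Mean over the 5 folds, with 2 * std as the reported spread
    print("max_depth=%d  Acc: %0.2f (+/- %0.2f)  AUC: %0.2f (+/- %0.2f)"
          % (depth, acc.mean(), acc.std() * 2, auc.mean(), auc.std() * 2))
```

If the per-depth means differ by less than the fold-to-fold spread, the depth change can reasonably be read as having no meaningful effect.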



In [None]:
#SciKit DSC540 HW1
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale

#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)


#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=1                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=1                                       #Control Switch for Feature Selection                                                                                   
fs_type=2                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection)                        
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features
k_cnt=5                                             #Number of 'Top k' best ranked features to select, only applies for fs_types 1 and 3

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_6040_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################

if norm_target==1:
    #Target normalization for continuous values
    target_np=scale(target_np)

if norm_features==1:
    #Feature normalization for continuous values
    data_np=scale(data_np)

'''if binning==1:
    #Discretize Target variable with KBinsDiscretizer
    enc = KBinsDiscretizer(n_bins=[bin_cnt], encode='ordinal', strategy='quantile')                         #Strategy here is important, quantile creating equal bins, but kmeans prob being more valid "clusters"
    target_np_bin = enc.fit_transform(target_np.reshape(-1,1))

    #Get Bin min/max
    temp=[[] for x in range(bin_cnt+1)]
    for i in range(len(target_np)):
        for j in range(bin_cnt):
            if target_np_bin[i]==j:
                temp[j].append(target_np[i])

    for j in range(bin_cnt):
        print('Bin', j, ':', min(temp[j]), max(temp[j]), len(temp[j]))
    print('\n')

    #Convert Target array back to correct shape
    target_np=np.ravel(target_np_bin)'''


#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                      #Removes any feature with variance below 0.5
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    if fs_type==1:
        #Stepwise Recursive Backwards Feature removal
        if binning==1:
            clf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=3, criterion='entropy', random_state=rand_st)
            sel = RFE(clf, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
        if binning==0:
            rgr = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_split=3, random_state=rand_st)      #Default criterion is squared error (named 'mse' in older scikit-learn)
            sel = RFE(rgr, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
            
        fit_mod=sel.fit(data_np, target_np)
        print(sel.ranking_)
        sel_idx=fit_mod.get_support()      

    if fs_type==2:
        #Wrapper Select via model
        if binning==1:
            clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)      #Default loss is the logistic deviance ('deviance' in older scikit-learn, 'log_loss' in newer)
            sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                                                           #to select only based on max_features, set to integer value and set threshold=-np.inf
            print ('Wrapper Select: ')
        if binning==0:
            rgr = '''Unused in this homework'''
            sel = SelectFromModel(rgr, prefit=False, threshold='mean', max_features=None)
            print ('Wrapper Select: ')
            
        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==3:
        if binning==1:                                                              ######Only work if the Target is binned###########
            #Univariate Feature Selection - Chi-squared
            sel=SelectKBest(chi2, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)                                         #will throw error if any negative values in features, so turn off feature normalization, or switch to mutual_info_classif
            print ('Univariate Feature Selection - Chi2: ')
            sel_idx=fit_mod.get_support()

        if binning==0:                                                              ######Only work if the Target is continuous###########
            #Univariate Feature Selection - Mutual Info Regression
            sel=SelectKBest(mutual_info_regression, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)
            print ('Univariate Feature Selection - Mutual Info: ')
            sel_idx=fit_mod.get_support()

        #Print ranked variables out sorted
        temp=[]
        scores=fit_mod.scores_
        for i in range(feat_start, len(header)):            
            temp.append([header[i], float(scores[i-feat_start])])

        print('Ranked Features')
        temp_sort=sorted(temp, key=itemgetter(1), reverse=True)
        for i in range(len(temp_sort)):
            print(i, temp_sort[i][0], ':', temp_sort[i][1])
        print('\n')

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
                
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
    
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)      #Fixed seed so the split is reproducible

####Classifiers####
if binning==1 and cross_val==0:
    #SciKit
    '''Test/Train split unused in this homework, skip down to CV section'''
 

                                                                                                                         
 
####Cross-Val Classifiers####
if binning==1 and cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit Gradient Boosting - Cross Val
    start_ts=time.time()
    clf=GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=3, random_state=rand_st)      #Default loss is the logistic deviance ('deviance' in older scikit-learn, 'log_loss' in newer)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Gradient Boosting - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Gradient Boosting - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("GB - CV Runtime:", time.time()-start_ts)


    #SciKit Ada Boosting - Cross Val
    start_ts=time.time()
    clf=AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=rand_st)      #Default base learner (a depth-1 decision tree) is used
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Ada Boost - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass
    print("Ada Boost - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
    print("Ada - CV Runtime:", time.time()-start_ts)


    #SciKit Neural Network - Cross Val
    start_ts=time.time()
    clf=MLPClassifier(activation='logistic', solver='adam', alpha=0.0001, max_iter=1000, hidden_layer_sizes=(10,), random_state=rand_st)
    scores= cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Neural Network - Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Neural Network - AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("NN - CV Runtime:", time.time()-start_ts) 

## Support Vector Machines - Dataset too Large - Stuck on Execution
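
One possible way to get an RBF-SVM to finish, sketched here as an assumption rather than something that was run in this project, is to train on a stratified subsample and standardize the features first. Kernel `SVC` fit time grows at least quadratically with the number of samples, and the cell below runs with `norm_features=0`, so unscaled inputs may also slow convergence. The data, the subsample size of 500, and the `C` and `gamma` values are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the balanced 50-50 dataset (hypothetical data)
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)

# Keep a stratified subsample so the class balance is preserved
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=500, stratify=y,
                                      random_state=1)

# Scaling lives inside the pipeline so each CV fold is standardized
# using only its own training portion
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))

scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}
scores = cross_validate(svm, X_sub, y_sub, scoring=scorers, cv=5)

print("RBF-SVM - Acc: %0.2f (+/- %0.2f)"
      % (scores['test_Accuracy'].mean(), scores['test_Accuracy'].std() * 2))
print("RBF-SVM - AUC: %0.2f (+/- %0.2f)"
      % (scores['test_roc_auc'].mean(), scores['test_roc_auc'].std() * 2))
```

Results from a subsample are only an estimate of what the full dataset would give, so they are not directly comparable to the full-data scores of the other models.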

In [None]:
#SciKit DSC540 HW4
'''created by Casey Bennett 2018, www.CaseyBennett.com'''

import sys
import csv
import math
import numpy as np
from operator import itemgetter
import time

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import joblib                                       #sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel
from sklearn.feature_selection import SelectKBest, mutual_info_regression, mutual_info_classif, chi2
from sklearn import metrics
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, scale

#Handle annoying warnings
import warnings, sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.ConvergenceWarning)



#############################################################################
#
# Global parameters
#
#####################

target_idx=0                                        #Index of Target variable
cross_val=1                                         #Control Switch for CV                                                                                                                                                      
norm_target=0                                       #Normalize target switch
norm_features=0                                     #Normalize features switch
binning=1                                           #Control Switch for Bin Target
bin_cnt=2                                           #If bin target, this sets number of classes
feat_select=1                                       #Control Switch for Feature Selection                                                                                   
fs_type=2                                           #Feature Selection type (1=Stepwise Backwards Removal, 2=Wrapper Select, 3=Univariate Selection)                        
lv_filter=0                                         #Control switch for low variance filter on features
feat_start=1                                        #Start column of features
k_cnt=5                                             #Number of 'Top k' best ranked features to select, only applies for fs_types 1 and 3

#Set global model parameters
rand_st=1                                           #Set Random State variable for randomizing splits on runs


#############################################################################
#
# Load Data
#
#####################

file1= csv.reader(open('brfss2015_5050_cleaned.csv'), delimiter=',', quotechar='"')

#Read Header Line
header=next(file1)            

#Read data
data=[]
target=[]
for row in file1:
    #Load Target
    if row[target_idx]=='':                         #If target is blank, skip row                       
        continue
    else:
        target.append(float(row[target_idx]))       #If pre-binned class, change float to int

    #Load row into temp array, cast columns  
    temp=[]
                 
    for j in range(feat_start,len(header)):
        if row[j]=='':
            temp.append(float())
        else:
            temp.append(float(row[j]))

    #Load temp into Data array
    data.append(temp)
  
#Test Print
print(header)
print(len(target),len(data))
print('\n')

data_np=np.asarray(data)
target_np=np.asarray(target)


#############################################################################
#
# Preprocess data
#
##########################################

if norm_target==1:
    #Target normalization for continuous values
    target_np=scale(target_np)

if norm_features==1:
    #Feature normalization for continuous values
    data_np=scale(data_np)

'''if binning==1:
    #Discretize Target variable with KBinsDiscretizer
    enc = KBinsDiscretizer(n_bins=[bin_cnt], encode='ordinal', strategy='quantile')                         #Strategy here is important, quantile creating equal bins, but kmeans prob being more valid "clusters"
    target_np_bin = enc.fit_transform(target_np.reshape(-1,1))

    #Get Bin min/max
    temp=[[] for x in range(bin_cnt+1)]
    for i in range(len(target_np)):
        for j in range(bin_cnt):
            if target_np_bin[i]==j:
                temp[j].append(target_np[i])

    for j in range(bin_cnt):
        print('Bin', j, ':', min(temp[j]), max(temp[j]), len(temp[j]))
    print('\n')

    #Convert Target array back to correct shape
    target_np=np.ravel(target_np_bin)'''


#############################################################################
#
# Feature Selection
#
##########################################

#Low Variance Filter
if lv_filter==1:
    print('--LOW VARIANCE FILTER ON--', '\n')
    
    #LV Threshold
    sel = VarianceThreshold(threshold=0.5)                                      #Removes any feature with variance below 0.5
    fit_mod=sel.fit(data_np)
    fitted=sel.transform(data_np)
    sel_idx=fit_mod.get_support()

    #Get lists of selected and non-selected features (names and indexes)
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)

    print('Selected', temp)
    print('Features (total, selected):', len(data_np[0]), len(temp))
    print('\n')

    #Filter selected columns from original dataset
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
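
#---------------------------------------------------------------------------
# Illustrative sketch (hypothetical helper, not the original method): a
# NumPy-only view of the low variance filter above. It assumes the rule is
# "keep a column when its population variance is strictly greater than the
# threshold". Note that after scale() every non-constant feature has
# variance 1, so a 0.5 threshold would then only drop constant columns.
#---------------------------------------------------------------------------
import numpy as np

def low_variance_mask_sketch(X, threshold=0.5):
    '''Return a boolean mask of columns whose variance exceeds threshold.'''
    return np.var(np.asarray(X, dtype=float), axis=0) > threshold

#Usage: low_variance_mask_sketch([[0, 1, 10], [0, 2, 20], [0, 3, 30]]) keeps the last two columns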


#Feature Selection
if feat_select==1:
    '''Three steps:
       1) Run Feature Selection
       2) Get lists of selected and non-selected features
       3) Filter columns from original dataset
       '''
    
    print('--FEATURE SELECTION ON--', '\n')
    
    ##1) Run Feature Selection #######
    if fs_type==1:
        #Stepwise Recursive Backwards Feature removal
        if binning==1:
            clf = RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_split=3, criterion='entropy', random_state=rand_st)
            sel = RFE(clf, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
        if binning==0:
            rgr = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_split=3, criterion='squared_error', random_state=rand_st)      #'squared_error' was named 'mse' in scikit-learn < 1.0
            sel = RFE(rgr, n_features_to_select=k_cnt, step=.1)
            print('Stepwise Recursive Backwards - Random Forest: ')
            
        fit_mod=sel.fit(data_np, target_np)
        print(sel.ranking_)
        sel_idx=fit_mod.get_support()      

    if fs_type==2:
        #Wrapper Select via model
        if binning==1:
            clf = SVC(kernel='linear', gamma='scale', C=1.0, probability=True, random_state=rand_st)
            sel = SelectFromModel(clf, prefit=False, threshold='mean', max_features=None)                                                           #to select only based on max_features, set to integer value and set threshold=-np.inf
            print ('Wrapper Select: ')
        if binning==0:
            #No regressor was configured for this branch (unused in this homework), so fail with a clear message
            raise NotImplementedError('Wrapper Select (fs_type=2) is only set up for a binned target (binning=1)')
            
        fit_mod=sel.fit(data_np, target_np)    
        sel_idx=fit_mod.get_support()

    if fs_type==3:
        if binning==1:                                                              ######Only works if the Target is binned###########
            #Univariate Feature Selection - Chi-squared
            sel=SelectKBest(chi2, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)                                         #will throw error if any negative values in features, so turn off feature normalization, or switch to mutual_info_classif
            print ('Univariate Feature Selection - Chi2: ')
            sel_idx=fit_mod.get_support()

        if binning==0:                                                              ######Only works if the Target is continuous###########
            #Univariate Feature Selection - Mutual Info Regression
            sel=SelectKBest(mutual_info_regression, k=k_cnt)
            fit_mod=sel.fit(data_np, target_np)
            print ('Univariate Feature Selection - Mutual Info: ')
            sel_idx=fit_mod.get_support()

        #Print ranked variables out sorted
        temp=[]
        scores=fit_mod.scores_
        for i in range(feat_start, len(header)):            
            temp.append([header[i], float(scores[i-feat_start])])

        print('Ranked Features')
        temp_sort=sorted(temp, key=itemgetter(1), reverse=True)
        for i in range(len(temp_sort)):
            print(i, temp_sort[i][0], ':', temp_sort[i][1])
        print('\n')

    ##2) Get lists of selected and non-selected features (names and indexes) #######
    temp=[]
    temp_idx=[]
    temp_del=[]
    for i in range(len(data_np[0])):
        if sel_idx[i]==1:                                                           #Selected Features get added to temp header
            temp.append(header[i+feat_start])
            temp_idx.append(i)
        else:                                                                       #Indexes of non-selected features get added to delete array
            temp_del.append(i)
    print('Selected', temp)
    print('Features (total/selected):', len(data_np[0]), len(temp))
    print('\n')
            
                
    ##3) Filter selected columns from original dataset #########
    header = header[0:feat_start]
    for field in temp:
        header.append(field)
    data_np = np.delete(data_np, temp_del, axis=1)                                 #Deletes non-selected features by index
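
#---------------------------------------------------------------------------
# Illustrative sketch (hypothetical helper, not the original method): steps
# 2) and 3) above can also be written with boolean-mask indexing instead of
# building a delete list. 'support' stands for the array returned by
# get_support(); the function name is an assumption made for this example.
#---------------------------------------------------------------------------
import numpy as np

def apply_support_mask_sketch(header, data, support, feat_start):
    '''Return (new_header, new_data) keeping only the selected feature columns.'''
    support = np.asarray(support, dtype=bool)
    kept_names = [name for name, keep in zip(header[feat_start:], support) if keep]
    return list(header[:feat_start]) + kept_names, np.asarray(data)[:, support]

#Usage: header, data_np = apply_support_mask_sketch(header, data_np, sel_idx, feat_start)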
    
    

#############################################################################
#
# Train SciKit Models
#
##########################################

print('--ML Model Output--', '\n')

#Test/Train split
data_train, data_test, target_train, target_test = train_test_split(data_np, target_np, test_size=0.35, random_state=rand_st)      #Seeded so the split is reproducible
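
#---------------------------------------------------------------------------
# Illustrative sketch (an assumption, not the original method): SVC fit time
# grows at least quadratically with the number of samples, which is one
# likely reason the SVM runs on this dataset are slow. LinearSVC is swapped
# in here as a faster linear SVM. It uses a different solver and loss than
# SVC(kernel='linear'), so scores may differ. The function name is
# hypothetical; the scorer names mirror the ones used in the CV section.
#---------------------------------------------------------------------------
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

def linear_svc_cv_sketch(X, y, cv=5, random_state=None):
    '''Cross-validate a LinearSVC and return the cross_validate result dict.'''
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}       #roc_auc needs a binary target
    clf = LinearSVC(C=1.0, random_state=random_state)
    return cross_validate(clf, X, y, scoring=scorers, cv=cv)

#Usage: scores = linear_svc_cv_sketch(data_np, target_np, cv=5, random_state=rand_st)
#       then read scores['test_Accuracy'] and scores['test_roc_auc'] as in the CV section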

####Classifiers####
if binning==1 and cross_val==0:
    #SciKit
    '''Test/Train split unused in this homework, skip down to CV section'''
 

                                                                                                                         
 
####Cross-Val Classifiers####
if binning==1 and cross_val==1:
    #Setup Crossval classifier scorers
    scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}                                                                                                                
    
    #SciKit RBF - SVM - Cross Val
#    start_ts=time.time()
#    clf=SVC(kernel='rbf', gamma='scale', C=1.0, probability=True, random_state=rand_st)
#    scores=cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

#    scores_Acc = scores['test_Accuracy']                                                                                                                                    
#    print("RBF-SVM Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
#    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
#    print("RBF-SVM AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
#    print("RBF-SVM CV Runtime:", time.time()-start_ts)

    #SciKit Linear - SVM - Cross Val
    start_ts=time.time()
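    #Note: gamma is ignored when kernel='linear'; probability=True adds an internal 5-fold CV during fit, which slows training
    clf=SVC(kernel='linear', gamma='scale', C=1.0, probability=True, random_state=rand_st)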
    scores=cross_validate(clf, data_np, target_np, scoring=scorers, cv=5)

    scores_Acc = scores['test_Accuracy']                                                                                                                                    
    print("Linear-SVM Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))                                                                                                    
    scores_AUC= scores['test_roc_auc']                                                                     #Only works with binary classes, not multiclass                  
    print("Linear-SVM AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))                           
    print("Linear-CV Runtime:", time.time()-start_ts)