# Project
# EN.553.602 - Research and Design in Applied Mathematics: Data Mining

**Author**: Alex Teboul

**Professor**: Kearsley, Anthony José

**Data Source**: https://www.cdc.gov/brfss/annual_data/annual_2022.html

**Data SAS format** - 2022 BRFSS Data (SAS Transport Format) [ZIP – 64.3 MB]

**Code Book** - 2022 BRFSS Codebook CDC [ZIP – 3 MB]

## About this Project

**Objective:** The objective of this project was to develop predictive models for cognitive decline utilizing the 2022 BRFSS dataset. The workflow involved obtaining the 2022 BRFSS dataset, choosing features for examination in predictive models based on risk factors previously identified in studies on cognitive decline, conducting exploratory data analysis, testing various models, and presenting the findings. The methodologies investigated in this notebook include Random Forests, Gradient Boosting, AdaBoost, and Neural Networks.


1.   **Part 1:** Getting and Cleaning the Data
*   Get the BRFSS dataset
*   Select a Relevant Subset of Features
*   Cleaning the Data (Missing Values, Modifying Values, Make Feature Names More Readable, Save Finalized Dataset to CSV)
2.   **Part 2:** Model Building

Random Forests
*   Random Forest - w/ Feature Selection - Full Dataset
*   Random Forest - w/o Feature Selection - Full Dataset
*   Random Forest - w/ and w/o Feature Selection - 50-50 Balanced Dataset
*   Random Forest - w/ Feature Selection - 60-40 Balanced Dataset

AdaBoost, GradientBoost, and Neural Networks
*   AdaBoost, GradientBoost, and Neural Network - w/o Feature Selection - Full Dataset
*   AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - Full Dataset
*   AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - 50-50 Dataset
*   AdaBoost, GradientBoost, and Neural Network - w/ Feature Selection - 60-40 Dataset

Support Vector Machines (too slow, never finishes)
*   RBF-SVM - w/ Feature Selection 50-50 Dataset




# Part 1: Getting and Cleaning the Data

In [2]:
#imports
import pandas as pd
import random
import numpy as np
random.seed(1)
np.random.seed(1)  # the sampling below uses np.random.permutation, so seed NumPy as well
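`random.seed` only controls Python's built-in generator; NumPy draws such as `np.random.permutation` keep their own state. The following is a minimal sketch of reproducible NumPy sampling, assuming a local `Generator` object is acceptable here (the names `rng_a` and `rng_b` are illustrative and not part of the notebook):

```python
import numpy as np

# Two generators created with the same seed produce the same permutation,
# so a row sample drawn this way can be reproduced across runs.
rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(1)

perm_a = rng_a.permutation(10)
perm_b = rng_b.permutation(10)

print(perm_a)
print(np.array_equal(perm_a, perm_b))  # same seed, same order
```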

### Get the BRFSS dataset

In [6]:
df=pd.read_sas('/content/drive/MyDrive/Colab Notebooks/Data Mining/Data/LLCP2022.XPT ', format='xport')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Colab Notebooks/Data Mining/Data/LLCP2022.XPT '
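A `FileNotFoundError` like the one above means the path did not resolve at run time, for example because the drive was not mounted or because the file name on disk differs by whitespace. As a hedged sketch, a small lookup helper can make the failure easier to diagnose; `find_xpt` is a hypothetical helper, not part of pandas or the notebook:

```python
from pathlib import Path

def find_xpt(data_dir, stem="LLCP2022"):
    """Return the first file in data_dir whose name starts with '<stem>.XPT'.

    The glob pattern tolerates trailing characters such as a stray space.
    Raises FileNotFoundError listing the directory contents if nothing matches.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Data directory not found: {data_dir}")
    matches = sorted(data_dir.glob(f"{stem}.XPT*"))
    if not matches:
        present = [p.name for p in data_dir.iterdir()]
        raise FileNotFoundError(f"No {stem}.XPT* in {data_dir}; found: {present}")
    return matches[0]
```

The returned path could then be passed to `pd.read_sas(..., format='xport')` in place of the hard-coded string.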

In [None]:
#check that all the data loaded in
df.shape

(445132, 328)

In [None]:
#check that the data loaded in is in the correct format
pd.set_option('display.max_columns', 500)
df.head()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,PVTRESD1,COLGHOUS,STATERE1,CELPHON1,LADULT1,COLGSEX1,NUMADULT,LANDSEX1,NUMMEN,NUMWOMEN,RESPSLCT,SAFETIME,CTELNUM1,CELLFON5,CADULT1,CELLSEX1,PVTRESD3,CCLGHOUS,CSTATE1,LANDLINE,HHADULT,SEXVAR,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,PRIMINSR,PERSDOC3,MEDCOST1,CHECKUP1,EXERANY2,SLEPTIM1,LASTDEN4,RMVTETH4,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNC1,CHCOCNC1,CHCCOPD3,ADDEPEV3,CHCKDNY2,HAVARTH4,DIABETE4,DIABAGE4,MARITAL,EDUCA,RENTHOM1,NUMHHOL4,NUMPHON4,CPDEMO1C,VETERAN3,EMPLOY1,CHILDREN,INCOME3,PREGNANT,WEIGHT2,HEIGHT3,DEAF,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,HADMAM,HOWLONG,CERVSCRN,CRVCLCNC,CRVCLPAP,CRVCLHPV,HADHYST2,HADSIGM4,COLNSIGM,COLNTES1,SIGMTES1,LASTSIG4,COLNCNCR,VIRCOLO1,VCLNTES2,SMALSTOL,STOLTEST,STOOLDN2,BLDSTFIT,SDNATES1,SMOKE100,SMOKDAY2,USENOW3,ECIGNOW2,LCSFIRST,LCSLAST,LCSNUMCG,LCSCTSC1,LCSSCNCR,LCSCTWHN,ALCDAY4,AVEDRNK3,DRNK3GE5,MAXDRNKS,FLUSHOT7,FLSHTMY3,PNEUVAC4,TETANUS1,HIVTST7,HIVTSTD3,HIVRISK5,COVIDPOS,COVIDSMP,COVIDPRM,PDIABTS1,PREDIAB2,DIABTYPE,INSULIN1,CHKHEMO3,EYEEXAM1,DIABEYE1,DIABEDU1,FEETSORE,TOLDCFS,HAVECFS,WORKCFS,IMFVPLA3,HPVADVC4,HPVADSHT,SHINGLE2,COVIDVA1,COVACGET,COVIDNU1,COVIDINT,COVIDFS1,COVIDSE1,COPDCOGH,COPDFLEM,COPDBRTH,COPDBTST,COPDSMOK,CNCRDIFF,CNCRAGE,CNCRTYP2,CSRVTRT3,CSRVDOC1,CSRVSUM,CSRVRTRN,CSRVINST,CSRVINSR,CSRVDEIN,CSRVCLIN,CSRVPAIN,CSRVCTL2,PSATEST1,PSATIME1,PCPSARS2,PSASUGST,PCSTALK1,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,CAREGIV1,CRGVREL4,CRGVLNG1,CRGVHRS1,CRGVPRB3,CRGVALZD,CRGVPER1,CRGVHOU1,CRGVEXPT,ACEDEPRS,ACEDRINK,ACEDRUGS,ACEPRISN,ACEDIVRC,ACEPUNCH,ACEHURT1,ACESWEAR,ACETOUCH,ACETTHEM,ACEHVSEX,ACEADSAF,ACEADNED,LSATISFY,EMTSUPRT,SDHISOLT,SDHEMPLY,FOODSTMP,SDHFOOD1,SDHBILLS,SDHUTILS,SDHTRNSP,SDHSTRE1,MARIJAN1,MARJSMOK,MARJEAT,MARJVAPE,MARJDAB,MARJOTHR,USEMRJN4,LASTSMK2,STOPSMK2,MENTCIGS,MENTECIG,HEATTBCO,ASBIALCH,ASBIDRNK,ASBIBING,ASBIADVC,ASBIRDUC,FIREARM5,GUNLOAD,LOADULK2,RCSGEND1,RCSXBRTH,RCSRLTN2,CASTHDX2,CASTHNO2,BIRTHSEX,SOMALE,SOFEMALE,TRNSGNDR,HADSEX,PFPPRVN4,TYPCNTR9,BRTHCNT4,WHEREGET,NOBCUSE8,BCPREFER,RRCLASS3,RRCOGNT2,RRTREAT,RRATWRK2,RRHCARE4,RRPHYSM2,QSTVER,QSTLANG,_METSTAT,_URBSTAT,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_IMPRACE,_CHISPNC,_CRACE2,_CPRACE2,CAGEG,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT2,_LLCPWT,_RFHLTH,_PHYS14D,_MENT14D,_HLTHPLN,_HCVU652,_TOTINDA,_EXTETH3,_ALTETH3,_DENVST3,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR2,_PRACE2,_MRACE2,_HISPANC,_RACE1,_RACEG22,_RACEGR4,_RACEPR1,_SEX,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG1,_RFMAM22,_MAM5023,_HADCOLN,_CLNSCP1,_HADSIGM,_SGMSCP1,_SGMS101,_RFBLDS5,_STOLDN1,_VIRCOL1,_SBONTI1,_CRCREC2,_SMOKER3,_RFSMOK3,_CURECI2,_YRSSMOK,_PACKDAY,_PACKYRS,_YRSQUIT,_SMOKGRP,_LCSREC,DRNKANY6,DROCDY4_,_RFBING6,_DRNKWK2,_RFDRHV8,_FLSHOT7,_PNEUMO3,_AIDTST4
0,1.0,1.0,b'02032022',b'02',b'03',b'2022',1100.0,b'2022000001',2022000000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,2.0,88.0,88.0,,99.0,1.0,2.0,1.0,2.0,8.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,1.0,80.0,1.0,6.0,1.0,1.0,1.0,2.0,2.0,7.0,88.0,99.0,,9999.0,9999.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,,,,2.0,1.0,3.0,2.0,3.0,,2.0,,,,,,,,2.0,,3.0,4.0,,,,2.0,,,888.0,,,,1.0,92021.0,2.0,3.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,37.418416,2.0,74.836832,1.0,9.0,,,,,1.0,0.520383,813.918517,487.612985,1.0,1.0,1.0,9.0,9.0,2.0,9.0,9.0,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,,,,,,9.0,1.0,4.0,9.0,1.0,,1.0,,1.0,,,,,,,,4.0,1.0,1.0,,,,,4.0,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.0,2.0,2.0
1,1.0,1.0,b'02042022',b'02',b'04',b'2022',1100.0,b'2022000002',2022000000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,3.0,2.0,2.0,8.0,2.0,6.0,,,2.0,2.0,2.0,2.0,,1.0,1.0,2.0,2.0,2.0,2.0,3.0,,3.0,4.0,1.0,1.0,2.0,1.0,2.0,2.0,88.0,5.0,,150.0,503.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,4.0,2.0,,,,1.0,1.0,1.0,4.0,,,2.0,,,,,,,,2.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,2.0,1.0,5.0,11011.0,37.418416,1.0,37.418416,1.0,9.0,,,,,1.0,0.520383,406.959258,432.100273,1.0,1.0,1.0,1.0,9.0,2.0,9.0,9.0,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,63.0,160.0,6804.0,2657.0,3.0,2.0,1.0,2.0,3.0,2.0,,1.0,,2.0,,,,,,,,4.0,1.0,1.0,,,,,4.0,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,2.0,2.0,2.0
2,1.0,1.0,b'02022022',b'02',b'02',b'2022',1100.0,b'2022000003',2022000000.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,2.0,2.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,5.0,,,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,10.0,,140.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,,,,1.0,2.0,,,,,2.0,,,,,,,,2.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,7.0,2.0,,2.0,1.0,1.0,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,3.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,37.418416,1.0,37.418416,1.0,9.0,,,,,1.0,0.520383,406.959258,366.743194,1.0,2.0,2.0,1.0,1.0,1.0,9.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,8.0,1.0,56.0,5.0,62.0,157.0,6350.0,2561.0,3.0,2.0,1.0,4.0,6.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,1.0,1.0,,,,,4.0,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,,,2.0
3,1.0,1.0,b'02032022',b'02',b'03',b'2022',1100.0,b'2022000004',2022000000.0,1.0,1.0,,1.0,2.0,1.0,,3.0,,2.0,1.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,99.0,1.0,2.0,1.0,1.0,7.0,,,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,,1.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,77.0,2.0,140.0,505.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,7.0,2.0,1.0,1.0,3.0,,,2.0,,,,,,,,1.0,2.0,3.0,1.0,17.0,999.0,2.0,1.0,2.0,,888.0,,,,1.0,102021.0,1.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,3.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,37.418416,3.0,112.255248,1.0,9.0,,,,,1.0,0.520383,1220.877775,1681.791487,1.0,1.0,1.0,9.0,9.0,1.0,9.0,9.0,9.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,14.0,3.0,73.0,6.0,65.0,165.0,6350.0,2330.0,2.0,1.0,1.0,2.0,9.0,,,1.0,,2.0,,,,,,,,2.0,2.0,1.0,56.0,0.1,6.0,,3.0,2.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,9.0,9.0,2.0
4,1.0,1.0,b'02022022',b'02',b'02',b'2022',1100.0,b'2022000005',2022000000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,4.0,2.0,88.0,88.0,7.0,2.0,2.0,1.0,1.0,9.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,5.0,1.0,2.0,,2.0,2.0,5.0,88.0,5.0,2.0,119.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,,,,,,,,,,,,,,2.0,,3.0,1.0,,,,1.0,2.0,,203.0,2.0,88.0,2.0,2.0,,1.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,37.418416,2.0,74.836832,1.0,9.0,,,,,1.0,0.520383,813.918517,2111.206286,2.0,2.0,1.0,1.0,1.0,1.0,9.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,5.0,1.0,43.0,3.0,62.0,157.0,5398.0,2177.0,2.0,1.0,1.0,3.0,3.0,1.0,,,,,,,,,,,,4.0,1.0,1.0,,,,,4.0,,1.0,10.0,1.0,140.0,1.0,,,2.0
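In the output above, columns such as IDATE and SEQNO come back from `pd.read_sas` as byte strings (`b'02032022'`). None of these columns are used in the later analysis, but if they were needed, a sketch along the following lines could decode them. The toy frame and the helper name `decode_bytes` are illustrative assumptions:

```python
import pandas as pd

# Toy frame mimicking the byte-string columns shown above (IDATE, SEQNO).
raw = pd.DataFrame({
    "IDATE": [b"02032022", b"02042022"],
    "SEQNO": [b"2022000001", b"2022000002"],
})

def decode_bytes(frame):
    """Return a copy with every bytes-valued object column decoded to str."""
    out = frame.copy()
    for col in out.select_dtypes(include="object").columns:
        if out[col].map(lambda v: isinstance(v, bytes)).any():
            out[col] = out[col].str.decode("utf-8")
    return out

decoded = decode_bytes(raw)
# IDATE is laid out as MMDDYYYY in the sample rows above.
decoded["IDATE"] = pd.to_datetime(decoded["IDATE"], format="%m%d%Y")
print(decoded)
```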


### Select Relevant Subset of Features

The dataset originally has 328 features (columns). Based on prior research into the factors that influence cognitive decline, only a selected subset of features is included in this analysis.



#### Important Risk Factors
Research in the field has identified the following as **important risk factors** for cognitive decline:

*   blood pressure (high)
*   cholesterol (high)
*   smoking
*   diabetes
*   obesity
*   age
*   sex
*   race
*   diet
*   exercise
*   alcohol consumption
*   BMI
*   Household Income
*   Marital Status
*   Sleep
*   Time since last checkup
*   Education
*   Health care coverage
*   Mental Health



#### Selected Subset of Features from BRFSS
Given these risk factors, I selected features (columns/questions) in the BRFSS that relate to them. To understand what each column means, I consulted the BRFSS Codebook for the question text and coding, and matched the variable names in the codebook to the variable names in the dataset.

The **selected features** from the BRFSS dataset are:

**Response Variable / Dependent Variable:**
*   Label: Have you experienced confusion or memory loss that is happening more often or is getting worse?
*   SAS Variable Name: CIMEMLOS

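**Independent Variables:**
*   _BMI5: body mass index, stored as BMI * 100
*   SMOKE100: smoked at least 100 cigarettes
*   CVDSTRK3: ever had a stroke
*   _SEX: sex of respondent
*   _TOTINDA: physical activity (1) or no physical activity (2)
*   GENHLTH: general health, from 1 (Excellent) to 5 (Poor)
*   MENTHLTH: number of bad mental health days (0-30)
*   PHYSHLTH: number of bad physical health days (0-30)
*   DIFFWALK: difficulty walking
*   _AGEG5YR: age in 5-year groups
*   EDUCA: education level (1-6)
*   ACEDEPRS: lived with anyone depressed, mentally ill, or suicidal
*   ADDEPEV3: ever told you had a depressive disorder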


#### Get Subset of Features

In [None]:
# select specific columns (copy so that later column assignments do not act on a view of df)
brfss_df_selected = df[['CIMEMLOS','_BMI5','SMOKE100',
                        'CVDSTRK3', '_SEX',
                        '_TOTINDA',
                        'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK',
                        '_AGEG5YR', 'EDUCA','ACEDEPRS','ADDEPEV3' ]].copy()


In [None]:
brfss_df_selected.shape

(445132, 14)

In [None]:
brfss_df_selected.head()

Unnamed: 0,CIMEMLOS,_BMI5,SMOKE100,CVDSTRK3,_SEX,_TOTINDA,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,_AGEG5YR,EDUCA,ACEDEPRS,ADDEPEV3
0,,,2.0,2.0,2.0,2.0,2.0,88.0,88.0,2.0,13.0,6.0,,2.0
1,,2657.0,2.0,2.0,2.0,2.0,1.0,88.0,88.0,2.0,13.0,4.0,,2.0
2,,2561.0,2.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,8.0,6.0,,2.0
3,,2330.0,1.0,2.0,2.0,1.0,1.0,88.0,88.0,2.0,14.0,4.0,,2.0
4,,2177.0,2.0,2.0,2.0,1.0,4.0,88.0,2.0,2.0,5.0,5.0,,2.0


### Cleaning the Data

#### Missing Values

In [None]:
#Drop rows with any missing value - this removes most of the rows right away (445,132 -> 17,851)
brfss_df_selected = brfss_df_selected.dropna()
brfss_df_selected.shape

(17851, 14)
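Because `dropna()` removes any row with at least one missing value, a single sparsely answered column can account for most of the loss. A minimal sketch of a per-column check, using a toy frame rather than the real data (`missing_report` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Return the share of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Toy rows shaped like the head() output above: CIMEMLOS is mostly missing.
toy = pd.DataFrame({
    "CIMEMLOS": [np.nan, np.nan, 1.0, 2.0],
    "_BMI5": [np.nan, 2657.0, 2561.0, 2330.0],
    "SMOKE100": [2.0, 2.0, 1.0, 2.0],
})

print(missing_report(toy))
print(len(toy.dropna()), "of", len(toy), "rows survive dropna()")
```

Running `missing_report(brfss_df_selected)` before the drop would show which of the 14 selected columns drive the reduction.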

In [None]:
brfss_df_selected.head()

Unnamed: 0,CIMEMLOS,_BMI5,SMOKE100,CVDSTRK3,_SEX,_TOTINDA,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,_AGEG5YR,EDUCA,ACEDEPRS,ADDEPEV3
63190,1.0,1953.0,2.0,2.0,1.0,1.0,2.0,88.0,88.0,2.0,12.0,6.0,2.0,2.0
63191,2.0,3109.0,1.0,2.0,2.0,1.0,2.0,88.0,88.0,2.0,9.0,6.0,2.0,2.0
63193,2.0,2439.0,1.0,2.0,1.0,1.0,1.0,88.0,88.0,7.0,13.0,6.0,2.0,2.0
63194,2.0,3228.0,2.0,2.0,1.0,1.0,4.0,3.0,1.0,1.0,10.0,6.0,2.0,2.0
63195,2.0,2918.0,1.0,2.0,2.0,1.0,2.0,88.0,88.0,2.0,10.0,5.0,1.0,2.0


#### Modifying Values

In [None]:
# CIMEMLOS
#Change 2 to 0 because 2 means no (no worsening confusion or memory loss)
brfss_df_selected['CIMEMLOS'] = brfss_df_selected['CIMEMLOS'].replace({2: 0})
brfss_df_selected.CIMEMLOS.unique()

array([1., 0., 7., 9.])

In [None]:
# Remove rows where 'CIMEMLOS' contains 7 or 9
brfss_df_selected = brfss_df_selected[~brfss_df_selected['CIMEMLOS'].isin([7, 9])]
brfss_df_selected.CIMEMLOS.unique()

array([1., 0.])

In [None]:
# ACEDEPRS -  Live With Anyone Depressed, Mentally Ill, Or Suicidal?
#Change 2 to 0 for no
brfss_df_selected['ACEDEPRS'] = brfss_df_selected['ACEDEPRS'].replace({2: 0})
brfss_df_selected['ACEDEPRS'] = brfss_df_selected['ACEDEPRS'].fillna(7)
brfss_df_selected = brfss_df_selected[brfss_df_selected.ACEDEPRS != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.ACEDEPRS != 9]
brfss_df_selected.ACEDEPRS.unique()

array([0., 1.])

In [None]:
# ADDEPEV3 - (Ever told) you had a depressive disorder
#Change 2 to 0 for no
brfss_df_selected['ADDEPEV3'] = brfss_df_selected['ADDEPEV3'].replace({2: 0})
brfss_df_selected['ADDEPEV3'] = brfss_df_selected['ADDEPEV3'].fillna(7)
brfss_df_selected = brfss_df_selected[brfss_df_selected.ADDEPEV3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.ADDEPEV3 != 9]
brfss_df_selected.ADDEPEV3.unique()

array([0., 1.])

In [None]:
# _SEX
#Change 2 to 0 for female
brfss_df_selected['_SEX'] = brfss_df_selected['_SEX'].replace({2: 0})
brfss_df_selected._SEX.unique()

array([1., 0.])

In [None]:
#4 _BMI5 (stored as BMI * 100, so for example 4018 is really 40.18; divide by 100 and round to the nearest whole number)
brfss_df_selected['_BMI5'] = brfss_df_selected['_BMI5'].div(100).round(0)
brfss_df_selected._BMI5.unique()

array([20., 31., 24., 32., 29., 52., 23., 36., 21., 22., 46., 35., 19.,
       30., 43., 25., 27., 28., 34., 26., 33., 42., 37., 38., 47., 45.,
       44., 39., 18., 17., 49., 41., 16., 15., 40., 51., 48., 66., 54.,
       53., 14., 96., 61., 56., 13., 57., 55., 50., 80., 63., 65., 62.,
       73., 83., 59., 60., 67., 94., 72., 84., 12., 70., 78.])

In [None]:
#5 SMOKE100
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['SMOKE100'] = brfss_df_selected['SMOKE100'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 9]
brfss_df_selected.SMOKE100.unique()

array([0., 1.])

In [None]:
#6 CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['CVDSTRK3'] = brfss_df_selected['CVDSTRK3'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 9]
brfss_df_selected.CVDSTRK3.unique()

array([0., 1.])

In [None]:
#8 _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)
brfss_df_selected['_TOTINDA'] = brfss_df_selected['_TOTINDA'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._TOTINDA != 9]
brfss_df_selected._TOTINDA.unique()

array([1., 0.])

In [None]:
#14 GENHLTH
# This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 9]
brfss_df_selected.GENHLTH.unique()

array([2., 1., 4., 3., 5.])

In [None]:
#15 MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['MENTHLTH'] = brfss_df_selected['MENTHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 99]
brfss_df_selected.MENTHLTH.unique()

array([ 0.,  3.,  2.,  1., 30.,  5., 10., 14., 12., 15.,  4.,  8.,  6.,
        7., 20., 29.,  9., 25., 21., 26., 18., 28., 23., 16., 11., 27.,
       13., 17., 22.])

In [None]:
#16 PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['PHYSHLTH'] = brfss_df_selected['PHYSHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 99]
brfss_df_selected.PHYSHLTH.unique()

array([ 0.,  1.,  3., 10., 20., 30., 14., 15.,  2.,  5.,  4.,  7., 13.,
       29., 28.,  8., 17., 21., 19.,  6., 25., 26., 18., 12., 16.,  9.,
       11., 22., 23., 24., 27.])

In [None]:
#17 DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 and 9 for don't know not sure and refused
brfss_df_selected['DIFFWALK'] = brfss_df_selected['DIFFWALK'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 9]
brfss_df_selected.DIFFWALK.unique()

array([0., 1.])

In [None]:
#19 _AGEG5YR
# already ordinal. 1 is 18-24 all the way up to 13, which is 80 and older. 5 year increments.
# remove 14 because it is don't know or missing
brfss_df_selected = brfss_df_selected[brfss_df_selected._AGEG5YR != 14]
brfss_df_selected._AGEG5YR.unique()

array([12.,  9., 10.,  6., 11.,  7., 13.,  8.])
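The remaining codes (6 through 13) are easier to interpret as age ranges. The sketch below follows the coding described in the comment above (1 is 18-24, 13 is 80 and older, 5-year steps in between); `age_group_label` is an illustrative helper, and the intermediate ranges are derived from that description rather than looked up in the codebook:

```python
def age_group_label(code):
    """Map an _AGEG5YR code (1-13) to its age range."""
    code = int(code)
    if not 1 <= code <= 13:
        raise ValueError(f"unexpected _AGEG5YR code: {code}")
    if code == 1:
        return "18-24"
    if code == 13:
        return "80+"
    lower = 25 + 5 * (code - 2)  # code 2 starts at 25, then 5-year steps
    return f"{lower}-{lower + 4}"

for code in [6.0, 9.0, 12.0, 13.0]:
    print(code, age_group_label(code))
```

Under this mapping, the smallest code left after cleaning (6) corresponds to ages 45-49.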

In [None]:
#20 EDUCA
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6
# Remove 9 for refused:
brfss_df_selected = brfss_df_selected[brfss_df_selected.EDUCA != 9]
brfss_df_selected.EDUCA.unique()

array([6., 5., 4., 3., 2., 1.])

In [None]:
#Check the shape of the dataset now: We have 16,031 cleaned rows and 14 columns (1 of which is our dependent variable)
brfss_df_selected.shape

(16031, 14)
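The cleaning cells above repeat two patterns: yes/no columns (recode 2 to 0, drop 7 and 9) and day-count columns (recode 88 to 0, drop 77 and 99). As a sketch only, these could be collected into two small helpers; `clean_yes_no` and `clean_days` are hypothetical names, and the toy frame stands in for the real data:

```python
import pandas as pd

def clean_yes_no(frame, col, drop_codes=(7, 9)):
    """Recode a 1=yes / 2=no column to 1/0 and drop rows with the given codes."""
    out = frame.copy()
    out[col] = out[col].replace({2: 0})
    return out[~out[col].isin(drop_codes)]

def clean_days(frame, col, none_code=88, drop_codes=(77, 99)):
    """Recode a day-count column: none_code becomes 0, drop_codes rows are removed."""
    out = frame.copy()
    out[col] = out[col].replace({none_code: 0})
    return out[~out[col].isin(drop_codes)]

toy = pd.DataFrame({
    "SMOKE100": [1.0, 2.0, 7.0, 9.0, 2.0],
    "MENTHLTH": [88.0, 3.0, 30.0, 77.0, 99.0],
})
cleaned = clean_days(clean_yes_no(toy, "SMOKE100"), "MENTHLTH")
print(cleaned)
```

Columns with a different coding, such as _TOTINDA (only 9 is dropped) or _AGEG5YR (14 is dropped), can pass their own `drop_codes`.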

In [None]:
#Check Class Sizes
brfss_df_selected.groupby(['CIMEMLOS']).size()

CIMEMLOS
0.0    14221
1.0     1810
dtype: int64

#### Make Feature Names More Readable

In [None]:
#Rename the columns to make them more readable
brfss = brfss_df_selected.rename(columns = {'_BMI5':'BMI',
                                            'SMOKE100':'Smoker',
                                            'CVDSTRK3':'Stroke',
                                            '_TOTINDA':'PhysActivity',
                                            'GENHLTH':'GenHlth',
                                            'MENTHLTH':'MentHlth',
                                            'PHYSHLTH':'PhysHlth',
                                            'DIFFWALK':'DiffWalk',
                                            '_AGEG5YR':'Age',
                                            'EDUCA':'Education',
                                            'CIMEMLOS':'CognitiveDecline',
                                            'ACEDEPRS':'liveWithDepressed',
                                            'ADDEPEV3':'toldDepression',
                                            '_SEX':'Sex'})

In [None]:
#See the cleaned dataset
brfss.head(10)

Unnamed: 0,CognitiveDecline,BMI,Smoker,Stroke,Sex,PhysActivity,GenHlth,MentHlth,PhysHlth,DiffWalk,Age,Education,liveWithDepressed,toldDepression
63190,1.0,20.0,0.0,0.0,1.0,1.0,2.0,0.0,0.0,0.0,12.0,6.0,0.0,0.0
63191,0.0,31.0,1.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,9.0,6.0,0.0,0.0
63194,0.0,32.0,0.0,0.0,1.0,1.0,4.0,3.0,1.0,1.0,10.0,6.0,0.0,0.0
63195,0.0,29.0,1.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,10.0,5.0,1.0,0.0
63196,0.0,24.0,0.0,0.0,0.0,1.0,3.0,2.0,3.0,0.0,6.0,4.0,0.0,0.0
63197,0.0,52.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,9.0,4.0,0.0,0.0
63198,0.0,23.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,10.0,6.0,0.0,0.0
63199,0.0,36.0,0.0,0.0,1.0,1.0,2.0,3.0,0.0,0.0,11.0,5.0,1.0,0.0
63200,0.0,31.0,1.0,0.0,1.0,0.0,4.0,1.0,10.0,0.0,11.0,6.0,0.0,0.0
63201,0.0,21.0,1.0,0.0,0.0,0.0,4.0,30.0,20.0,1.0,10.0,4.0,0.0,1.0


In [None]:

correlation_matrix = brfss.corr()

# Extract the correlation of cognitive decline with other variables
cognitive_decline_corr = correlation_matrix['CognitiveDecline'].drop('CognitiveDecline')
print("Correlation of Cognitive Decline with other variables:")
print(cognitive_decline_corr)


Correlation of Cognitive Decline with other variables:
BMI                  0.040355
Smoker               0.053978
Stroke               0.110135
Sex                 -0.007527
PhysActivity        -0.108755
GenHlth              0.240656
MentHlth             0.234320
PhysHlth             0.227483
DiffWalk             0.223705
Age                  0.041151
Education           -0.071946
liveWithDepressed    0.120254
toldDepression       0.213583
Name: CognitiveDecline, dtype: float64
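When correlations are used to screen features, the sign matters less than the magnitude: a strongly negative correlation is as informative as a strongly positive one. A minimal sketch of ranking by absolute value, using a subset of the figures printed above:

```python
import pandas as pd

# Correlations copied from the output above (a subset, for brevity).
corr = pd.Series({
    "BMI": 0.040355,
    "Stroke": 0.110135,
    "Sex": -0.007527,
    "PhysActivity": -0.108755,
    "GenHlth": 0.240656,
    "Education": -0.071946,
})

# Sort by absolute value so strong negative correlations are not ranked last.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On this ranking PhysActivity and Education sit above BMI, even though their correlations are negative.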


In [None]:
# Drop Sex, Education, and PhysActivity (the three features with negative correlations above)
brfss.drop(['Sex', 'Education', 'PhysActivity'], axis=1, inplace=True)


In [None]:
correlation_matrix = brfss.corr()

# Extract the correlation of cognitive decline with other variables
cognitive_decline_corr = correlation_matrix['CognitiveDecline'].drop('CognitiveDecline')
print("Correlation of Cognitive Decline with other variables:")
print(cognitive_decline_corr)

Correlation of Cognitive Decline with other variables:
BMI                  0.040355
Smoker               0.053978
Stroke               0.110135
GenHlth              0.240656
MentHlth             0.234320
PhysHlth             0.227483
DiffWalk             0.223705
Age                  0.041151
liveWithDepressed    0.120254
toldDepression       0.213583
Name: CognitiveDecline, dtype: float64


In [None]:
#Double check shape of the dataset (rows and columns)
brfss.shape

(16031, 11)

In [None]:
#Check how many respondents report cognitive decline. Note the class imbalance!
brfss.groupby(['CognitiveDecline']).size()

CognitiveDecline
0.0    14221
1.0     1810
dtype: int64

#### Save Finalized Dataset to CSV

In [None]:
#************************************************************************************************
brfss.to_csv('brfss_cleaned.csv', sep=",", index=False)
#************************************************************************************************

#### Get a BALANCED 50-50 Dataset Randomly Selected
*  The brfss dataset is clearly imbalanced: 14,221 of the 16,031 respondents (about 89%) report no cognitive decline. When training my models, I get about 88-89% accuracy on many models, while the random forest's AUC on the full dataset is only about 0.55. This may be because the models are learning the class distribution in the data rather than the risk factors.
*  To check these concerns, I will create a second dataset with a more balanced CognitiveDecline response variable - just to compare performance.
*  To do this, I will take a random sample of 6,003 instances of the 0 (or No Cognitive Decline) and all of the 1,810 instances of the 1 (or Yes). The resulting split is about 77-23 rather than an exact 50-50.
*  If the new dataset performs comparably, then I can rest assured that the results on the full dataset are not simply an artifact of the class imbalance.
*  With 7,813 datapoints, I hope that this is sufficient to train the model and that the random selection will not greatly change the results. I have the random seed set to 1.
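An exact class ratio can be obtained by deriving the number of majority rows from the minority count instead of hard-coding it. The following is a sketch under that assumption, not the sampling used in this notebook; `downsample_majority` and its parameters are hypothetical:

```python
import pandas as pd

def downsample_majority(frame, target, majority_share=0.5, seed=1):
    """Keep all minority rows and sample majority rows to reach majority_share."""
    counts = frame[target].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    n_minority = int(counts.min())
    # Solve n_major / (n_major + n_minor) = majority_share for n_major.
    n_majority = round(n_minority * majority_share / (1 - majority_share))
    n_majority = min(n_majority, int(counts.max()))
    sampled = frame[frame[target] == majority].sample(n=n_majority, random_state=seed)
    return pd.concat([sampled, frame[frame[target] == minority]], ignore_index=True)

toy = pd.DataFrame({"CognitiveDecline": [0.0] * 90 + [1.0] * 10, "x": range(100)})
balanced = downsample_majority(toy, "CognitiveDecline")
sixty_forty = downsample_majority(toy, "CognitiveDecline", majority_share=0.6)
print(balanced["CognitiveDecline"].value_counts())
print(sixty_forty["CognitiveDecline"].value_counts())
```

With the 1,810 positive cases in this dataset, the same arithmetic would keep 1,810 negatives for a 50-50 split and 2,715 for a 60-40 split.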

In [None]:
#Separate the 0 and 1
#Get the 1s
is1 = brfss['CognitiveDecline'] == 1
brfss_5050_1 = brfss[is1]
# Check the type of brfss_5050
print(type(brfss_5050_1))

#Get the 0s
is0 = brfss['CognitiveDecline'] == 0
brfss_5050_0 = brfss[is0]
print(type(brfss_5050_0))

#Select the 6003 random cases for 0
brfss_5050_0_rand1 = brfss_5050_0.take(np.random.permutation(len(brfss_5050_0))[:6003])
print(type(brfss_5050_0_rand1))

brfss_5050 = pd.concat([brfss_5050_0_rand1, brfss_5050_1], ignore_index=True)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [None]:
#Check that it worked. Now we have a dataset of 7,813 rows: 6,003 with 0 and 1,810 with 1 for the target variable CognitiveDecline
brfss_5050

Unnamed: 0,CognitiveDecline,BMI,Smoker,Stroke,GenHlth,MentHlth,PhysHlth,DiffWalk,Age,liveWithDepressed,toldDepression
0,0.0,27.0,1.0,1.0,4.0,0.0,30.0,1.0,9.0,0.0,0.0
1,0.0,37.0,0.0,0.0,4.0,0.0,14.0,0.0,9.0,0.0,0.0
2,0.0,31.0,1.0,0.0,3.0,0.0,5.0,0.0,12.0,0.0,0.0
3,0.0,24.0,1.0,0.0,2.0,0.0,0.0,0.0,13.0,0.0,0.0
4,0.0,20.0,0.0,0.0,2.0,0.0,0.0,0.0,12.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
7808,1.0,28.0,1.0,0.0,4.0,20.0,30.0,0.0,10.0,0.0,1.0
7809,1.0,27.0,1.0,0.0,3.0,15.0,0.0,1.0,9.0,0.0,0.0
7810,1.0,27.0,0.0,0.0,5.0,7.0,7.0,1.0,8.0,0.0,1.0
7811,1.0,26.0,0.0,1.0,5.0,15.0,30.0,1.0,6.0,0.0,1.0


In [None]:
#Check the class sizes: 6,003 vs 1,810, so closer to balanced than the full dataset but not 50-50
brfss_5050.groupby(['CognitiveDecline']).size()

CognitiveDecline
0.0    6003
1.0    1810
dtype: int64

In [None]:
#Save the 50-50 balanced dataset to csv

#************************************************************************************************
brfss_5050.to_csv('brfss_5050_cleaned.csv', sep=",", index=False)
#************************************************************************************************

#### Also Get a 60-40 Dataset Randomly Selected

In [None]:
#Also make a second sample: 12,006 random 0s plus all 1,810 1s (about 87-13 rather than 60-40)
brfss_6040_0_rand1 = brfss_5050_0.take(np.random.permutation(len(brfss_5050_0))[:12006])


brfss_6040 =  pd.concat([brfss_6040_0_rand1,brfss_5050_1], ignore_index = True)
#Save the 6040 balanced dataset to csv
#************************************************************************************************
brfss_6040.to_csv('brfss_6040_cleaned.csv', sep=",", index=False)
#************************************************************************************************
brfss_6040

Unnamed: 0,CognitiveDecline,BMI,Smoker,Stroke,GenHlth,MentHlth,PhysHlth,DiffWalk,Age,liveWithDepressed,toldDepression
0,0.0,25.0,1.0,0.0,4.0,15.0,30.0,1.0,9.0,0.0,0.0
1,0.0,33.0,0.0,0.0,1.0,0.0,0.0,0.0,8.0,0.0,0.0
2,0.0,33.0,1.0,0.0,4.0,0.0,4.0,0.0,11.0,0.0,0.0
3,0.0,28.0,1.0,0.0,4.0,10.0,4.0,1.0,9.0,0.0,0.0
4,0.0,28.0,1.0,0.0,3.0,0.0,0.0,0.0,7.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
13811,1.0,28.0,1.0,0.0,4.0,20.0,30.0,0.0,10.0,0.0,1.0
13812,1.0,27.0,1.0,0.0,3.0,15.0,0.0,1.0,9.0,0.0,0.0
13813,1.0,27.0,0.0,0.0,5.0,7.0,7.0,1.0,8.0,0.0,1.0
13814,1.0,26.0,0.0,1.0,5.0,15.0,30.0,1.0,6.0,0.0,1.0


# Part 2: Model Building

## Random Forests

In [None]:
df_full = pd.read_csv('brfss_cleaned.csv')
df_full.head()

Unnamed: 0,CognitiveDecline,BMI,Smoker,Stroke,GenHlth,MentHlth,PhysHlth,DiffWalk,Age,liveWithDepressed,toldDepression
0,1.0,20.0,0.0,0.0,2.0,0.0,0.0,0.0,12.0,0.0,0.0
1,0.0,31.0,1.0,0.0,2.0,0.0,0.0,0.0,9.0,0.0,0.0
2,0.0,32.0,0.0,0.0,4.0,3.0,1.0,1.0,10.0,0.0,0.0
3,0.0,29.0,1.0,0.0,2.0,0.0,0.0,0.0,10.0,1.0,0.0
4,0.0,24.0,0.0,0.0,3.0,2.0,3.0,0.0,6.0,0.0,0.0


In [None]:
df_5050 = pd.read_csv('brfss_5050_cleaned.csv')
df_5050.head()

Unnamed: 0,CognitiveDecline,BMI,Smoker,Stroke,GenHlth,MentHlth,PhysHlth,DiffWalk,Age,liveWithDepressed,toldDepression
0,0.0,27.0,1.0,1.0,4.0,0.0,30.0,1.0,9.0,0.0,0.0
1,0.0,37.0,0.0,0.0,4.0,0.0,14.0,0.0,9.0,0.0,0.0
2,0.0,31.0,1.0,0.0,3.0,0.0,5.0,0.0,12.0,0.0,0.0
3,0.0,24.0,1.0,0.0,2.0,0.0,0.0,0.0,13.0,0.0,0.0
4,0.0,20.0,0.0,0.0,2.0,0.0,0.0,0.0,12.0,0.0,0.0


In [None]:
df_6040 = pd.read_csv('brfss_6040_cleaned.csv')
df_6040.head()

Unnamed: 0,CognitiveDecline,BMI,Smoker,Stroke,GenHlth,MentHlth,PhysHlth,DiffWalk,Age,liveWithDepressed,toldDepression
0,0.0,25.0,1.0,0.0,4.0,15.0,30.0,1.0,9.0,0.0,0.0
1,0.0,33.0,0.0,0.0,1.0,0.0,0.0,0.0,8.0,0.0,0.0
2,0.0,33.0,1.0,0.0,4.0,0.0,4.0,0.0,11.0,0.0,0.0
3,0.0,28.0,1.0,0.0,4.0,10.0,4.0,1.0,9.0,0.0,0.0
4,0.0,28.0,1.0,0.0,3.0,0.0,0.0,0.0,7.0,0.0,0.0


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, precision_score

# Load dataframes
dfs = {
    'df_full': pd.read_csv('brfss_cleaned.csv'),
    'df_6040': pd.read_csv('brfss_6040_cleaned.csv'),
    'df_5050': pd.read_csv('brfss_5050_cleaned.csv')
}

# Loop over each dataframe
for df_name, df in dfs.items():
    print(f"\nRunning RandomForestClassifier on {df_name}...")

    # Separate features (X) and target variable (y)
    X = df.drop(columns=["CognitiveDecline"])
    y = df["CognitiveDecline"]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize the random forest classifier
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Train the classifier on the training data
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the testing data
    y_pred = rf_classifier.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy of {df_name}: {accuracy:.4f}")

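    # Calculate the AUC score (from hard 0/1 predictions, so it reflects a single threshold; predict_proba scores would give the full ROC AUC)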
    auc_score = roc_auc_score(y_test, y_pred)
    print(f"AUC of {df_name}: {auc_score:.4f}")

    # Calculate the confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    print(f"Confusion Matrix of {df_name}:\n{conf_matrix}")

    # Calculate the precision score
    precision = precision_score(y_test, y_pred)
    print(f"Precision of {df_name}: {precision:.4f}")



Running RandomForestClassifier on df_full...
Accuracy of df_full: 0.8803
AUC of df_full: 0.5504
Confusion Matrix of df_full:
[[2778   56]
 [ 328   45]]
Precision of df_full: 0.4455

Running RandomForestClassifier on df_6040...
Accuracy of df_6040: 0.8614
AUC of df_6040: 0.5669
Confusion Matrix of df_6040:
[[2320   69]
 [ 314   61]]
Precision of df_6040: 0.4692

Running RandomForestClassifier on df_5050...
Accuracy of df_5050: 0.7646
AUC of df_5050: 0.6080
Confusion Matrix of df_5050:
[[1080  113]
 [ 255  115]]
Precision of df_5050: 0.5044
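The printed confusion matrices contain enough information to recover recall and specificity, which the cell above does not report. The sketch below recomputes the df_full metrics from its matrix; `metrics_from_confusion` is an illustrative helper. It relies on the fact that an AUC computed from hard 0/1 predictions equals the mean of recall and specificity:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive common metrics from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    recall = tp / (tp + fn)        # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "recall": recall,
        "specificity": specificity,
        # AUC of a single hard-label operating point: mean of recall and specificity
        "auc_hard_labels": (recall + specificity) / 2,
    }

# Confusion matrix printed for df_full above.
full = metrics_from_confusion(tn=2778, fp=56, fn=328, tp=45)
for name, value in full.items():
    print(f"{name}: {value:.4f}")
```

The accuracy, precision, and AUC reproduce the values printed for df_full, and the recall of roughly 0.12 shows that the model finds only about one in eight respondents with cognitive decline.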


## Gradient Boosting

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load dataframes
dfs = {
    'df_full': pd.read_csv('brfss_cleaned.csv'),
    'df_6040': pd.read_csv('brfss_6040_cleaned.csv'),
    'df_5050': pd.read_csv('brfss_5050_cleaned.csv')
}

# Loop over each dataframe
for df_name, df in dfs.items():
    print(f"Running GradientBoostingClassifier on {df_name}...")

    # Separate features (X) and target variable (y)
    X = df.drop(columns=["CognitiveDecline"])
    y = df["CognitiveDecline"]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize the gradient boosting classifier
    gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)

    # Train the classifier on the training data
    gb_classifier.fit(X_train, y_train)

    # Make predictions on the testing data
    y_pred = gb_classifier.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy of {df_name}: {accuracy}\n")


Running GradientBoostingClassifier on df_full...
Accuracy of df_full: 0.888992828188338

Running GradientBoostingClassifier on df_6040...
Accuracy of df_6040: 0.8726483357452967

Running GradientBoostingClassifier on df_5050...
Accuracy of df_5050: 0.7779910428662828



## Neural Networks

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.metrics import accuracy_score

# Load dataframes
dfs = {
    'df_full': pd.read_csv('brfss_cleaned.csv'),
    'df_6040': pd.read_csv('brfss_6040_cleaned.csv'),
    'df_5050': pd.read_csv('brfss_5050_cleaned.csv')
}

# Define a simple neural network model
def create_model(input_shape):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=input_shape),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Loop over each dataframe
for df_name, df in dfs.items():
    print(f"Running Neural Network on {df_name}...")

    # Separate features (X) and target variable (y)
    X = df.drop(columns=["CognitiveDecline"])
    y = df["CognitiveDecline"]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create and compile the neural network model
    model = create_model(input_shape=(X_train.shape[1],))

    # Train the model on the training data
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Evaluate the model on the testing data
    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"Accuracy of {df_name}: {accuracy}\n")


Running Neural Network on df_full...
Accuracy of df_full: 0.8861864805221558

Running Neural Network on df_6040...
Accuracy of df_6040: 0.8737336993217468

Running Neural Network on df_5050...
Accuracy of df_5050: 0.782469630241394

