# UCF Data Analytics Bootcamp
## Final Project: Mental Health Prediction

Our team selected the data linked below as the starting point for our final project:

https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb/explore/query/SELECT%0A%20%20%60year%60%2C%0A%20%20%60stateabbr%60%2C%0A%20%20%60statedesc%60%2C%0A%20%20%60locationname%60%2C%0A%20%20%60datasource%60%2C%0A%20%20%60category%60%2C%0A%20%20%60measure%60%2C%0A%20%20%60data_value_unit%60%2C%0A%20%20%60data_value_type%60%2C%0A%20%20%60data_value%60%2C%0A%20%20%60data_value_footnote_symbol%60%2C%0A%20%20%60data_value_footnote%60%2C%0A%20%20%60low_confidence_limit%60%2C%0A%20%20%60high_confidence_limit%60%2C%0A%20%20%60totalpopulation%60%2C%0A%20%20%60locationid%60%2C%0A%20%20%60categoryid%60%2C%0A%20%20%60measureid%60%2C%0A%20%20%60datavaluetypeid%60%2C%0A%20%20%60short_question_text%60%2C%0A%20%20%60geolocation%60%2C%0A%20%20%60%3A%40computed_region_bxsw_vy29%60%2C%0A%20%20%60%3A%40computed_region_he4y_prf8%60%0AORDER%20BY%20%60statedesc%60%20DESC%20NULL%20LAST/page/filter

The data describes various health indicators at the county (FIPS code) level for the years 2019 and 2020.  However, a few issues keep the current file from being readily usable for machine learning:
1. There are multiple years in the file, and our goal is to select only a single year.
2. There is repetitive data.  For example, MeasureId is a unique key for each indicator and Measure is the description of that MeasureId.  That description field could be pulled out into a separate table to reduce the overall size of the tables.

This Jupyter notebook will detail the pre-processing work for this file.  The goal is to separate it into several CSV files that we can then load into our AWS database to more closely fit the standards of a structured database.

In [79]:
import pandas as pd

In [80]:
base_df = pd.read_csv('PLACES__Local_Data_for_Better_Health__County_Data_2022_release.csv')
base_df.head()

Unnamed: 0,Year,StateAbbr,StateDesc,LocationName,DataSource,Category,Measure,Data_Value_Unit,Data_Value_Type,Data_Value,...,High_Confidence_Limit,TotalPopulation,LocationID,CategoryID,MeasureId,DataValueTypeID,Short_Question_Text,Geolocation,States,Counties
0,2020,WY,Wyoming,Teton,BRFSS,Health Status,Physical health not good for >=14 days among a...,%,Crude prevalence,7.3,...,8.2,23497,56039,HLTHSTAT,PHLTH,CrdPrv,Physical Health,POINT (-110.426087 44.048662),14.0,3126.0
1,2020,WY,Wyoming,Goshen,BRFSS,Health Status,Fair or poor self-rated health status among ad...,%,Crude prevalence,13.8,...,15.8,13235,56015,HLTHSTAT,GHLTH,CrdPrv,General Health,POINT (-104.3535403 42.0894553),14.0,890.0
2,2020,WY,Wyoming,Laramie,BRFSS,Prevention,"Fecal occult blood test, sigmoidoscopy, or col...",%,Age-adjusted prevalence,61.6,...,64.6,100595,56021,PREVENT,COLON_SCREEN,AgeAdjPrv,Colorectal Cancer Screening,POINT (-104.660395 41.2928302),14.0,3119.0
3,2020,WY,Wyoming,Park,BRFSS,Prevention,Visits to doctor for routine checkup within th...,%,Crude prevalence,71.0,...,72.0,29331,56029,PREVENT,CHECKUP,CrdPrv,Annual Checkup,POINT (-109.5935975 44.4923865),14.0,3122.0
4,2020,WY,Wyoming,Lincoln,BRFSS,Health Outcomes,Chronic obstructive pulmonary disease among ad...,%,Age-adjusted prevalence,5.9,...,6.8,20253,56023,HLTHOUT,COPD,AgeAdjPrv,COPD,POINT (-110.6829614 42.2299932),14.0,3120.0


### Preprocessing Plan

Looking at the data, the plan for preprocessing is:
1. Create a data frame that contains only the location details with the FIPS ID.
2. Create a data frame that contains the details about the health measure alongside the Measure ID.
3. Split the remaining details into two separate data frames, one for each year.
4. Output the final data frames as individual CSV files to load into AWS.

In [81]:
# Look at the unique values in specific columns
base_df.Year.unique()

array([2020, 2019], dtype=int64)

In [82]:
base_df.Category.unique()

array(['Health Status', 'Prevention', 'Health Outcomes',
       'Health Risk Behaviors'], dtype=object)

In [83]:
base_df.MeasureId.unique()

array(['PHLTH', 'GHLTH', 'COLON_SCREEN', 'CHECKUP', 'COPD', 'CASTHMA',
       'TEETHLOST', 'CANCER', 'DENTAL', 'STROKE', 'CHOLSCREEN',
       'CSMOKING', 'MHLTH', 'OBESITY', 'DIABETES', 'ARTHRITIS', 'ACCESS2',
       'BINGE', 'LPA', 'CHD', 'KIDNEY', 'BPHIGH', 'HIGHCHOL', 'SLEEP',
       'COREW', 'CERVICAL', 'MAMMOUSE', 'BPMED', 'DEPRESSION', 'COREM'],
      dtype=object)

In [84]:
base_df.States.nunique()

51

Based on the above, the data set contains only two years of data, so we could create separate 2019 and 2020 versions of it.  I also checked what kinds of values appear in a couple of confusing columns.

In [85]:
base_df.LocationID.value_counts().sort_values()

34037    52
34041    52
34017    52
34023    52
34029    52
         ..
1103     60
1119     60
1005     60
1081     60
29161    60
Name: LocationID, Length: 3144, dtype: int64

In [86]:
base_df[base_df.Year == 2020].LocationID.value_counts().sort_values()

56039    52
55135    52
56007    52
56021    52
56029    52
         ..
1077     52
1005     52
1047     52
1081     52
1059     52
Name: LocationID, Length: 3144, dtype: int64

One of my concerns was that the measure counts or FIPS locations changed between 2019 and 2020.  Looking at the value counts of LocationID across both years, some locations have 8 fewer rows than others (52 instead of 60).  That discrepancy disappears when considering 2020 alone, where every location shown has 52 rows.  A search online indicated that there were changes in the FIPS boundaries between the two years.

For our purposes, I will be dropping 2019 data and just using the 2020 data to ensure the entire data set is consistent both with the other informational tables and with any other 2020 data that we incorporate into our analysis.
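The per-year comparison described above can also be done in a single step. The following is a minimal sketch on a toy frame, not the notebook's actual data: the `toy` frame, the placeholder measure names `M1` to `M3`, and the variable names are assumptions for illustration. It cross-tabulates row counts by location and year, then flags locations with no 2019 rows.

```python
import pandas as pd

# Toy stand-in for base_df: location 1 has rows in both years,
# location 2 only has 2020 rows (all values here are made up)
toy = pd.DataFrame({
    "Year":       [2019, 2020, 2020, 2020, 2020],
    "LocationID": [1,    1,    1,    2,    2],
    "MeasureId":  ["M1", "M2", "M3", "M2", "M3"],
})

# Rows per location, split out by year (missing combinations become 0)
counts_by_year = pd.crosstab(toy["LocationID"], toy["Year"])
print(counts_by_year)

# Locations that contribute no 2019 rows at all
missing_2019 = counts_by_year.index[counts_by_year[2019] == 0].tolist()

# True when every location has the same number of 2020 rows
consistent_2020 = counts_by_year[2020].nunique() == 1
```

On the real data, the same cross-tabulation would show directly whether the shortfall sits entirely in the 2019 column.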

In [87]:
# Drop all rows with 2019 data
base_df = base_df[base_df.Year == 2020]

In [88]:
base_df.head()

Unnamed: 0,Year,StateAbbr,StateDesc,LocationName,DataSource,Category,Measure,Data_Value_Unit,Data_Value_Type,Data_Value,...,High_Confidence_Limit,TotalPopulation,LocationID,CategoryID,MeasureId,DataValueTypeID,Short_Question_Text,Geolocation,States,Counties
0,2020,WY,Wyoming,Teton,BRFSS,Health Status,Physical health not good for >=14 days among a...,%,Crude prevalence,7.3,...,8.2,23497,56039,HLTHSTAT,PHLTH,CrdPrv,Physical Health,POINT (-110.426087 44.048662),14.0,3126.0
1,2020,WY,Wyoming,Goshen,BRFSS,Health Status,Fair or poor self-rated health status among ad...,%,Crude prevalence,13.8,...,15.8,13235,56015,HLTHSTAT,GHLTH,CrdPrv,General Health,POINT (-104.3535403 42.0894553),14.0,890.0
2,2020,WY,Wyoming,Laramie,BRFSS,Prevention,"Fecal occult blood test, sigmoidoscopy, or col...",%,Age-adjusted prevalence,61.6,...,64.6,100595,56021,PREVENT,COLON_SCREEN,AgeAdjPrv,Colorectal Cancer Screening,POINT (-104.660395 41.2928302),14.0,3119.0
3,2020,WY,Wyoming,Park,BRFSS,Prevention,Visits to doctor for routine checkup within th...,%,Crude prevalence,71.0,...,72.0,29331,56029,PREVENT,CHECKUP,CrdPrv,Annual Checkup,POINT (-109.5935975 44.4923865),14.0,3122.0
4,2020,WY,Wyoming,Lincoln,BRFSS,Health Outcomes,Chronic obstructive pulmonary disease among ad...,%,Age-adjusted prevalence,5.9,...,6.8,20253,56023,HLTHOUT,COPD,AgeAdjPrv,COPD,POINT (-110.6829614 42.2299932),14.0,3120.0


In [89]:
base_df.Year.value_counts()

2020    163488
Name: Year, dtype: int64

In [90]:
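# Note: drop() returns a new DataFrame; the result is not assigned here, so base_df still contains Year
base_df.drop(['Year'], axis = 1)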
base_df.head()

Unnamed: 0,Year,StateAbbr,StateDesc,LocationName,DataSource,Category,Measure,Data_Value_Unit,Data_Value_Type,Data_Value,...,High_Confidence_Limit,TotalPopulation,LocationID,CategoryID,MeasureId,DataValueTypeID,Short_Question_Text,Geolocation,States,Counties
0,2020,WY,Wyoming,Teton,BRFSS,Health Status,Physical health not good for >=14 days among a...,%,Crude prevalence,7.3,...,8.2,23497,56039,HLTHSTAT,PHLTH,CrdPrv,Physical Health,POINT (-110.426087 44.048662),14.0,3126.0
1,2020,WY,Wyoming,Goshen,BRFSS,Health Status,Fair or poor self-rated health status among ad...,%,Crude prevalence,13.8,...,15.8,13235,56015,HLTHSTAT,GHLTH,CrdPrv,General Health,POINT (-104.3535403 42.0894553),14.0,890.0
2,2020,WY,Wyoming,Laramie,BRFSS,Prevention,"Fecal occult blood test, sigmoidoscopy, or col...",%,Age-adjusted prevalence,61.6,...,64.6,100595,56021,PREVENT,COLON_SCREEN,AgeAdjPrv,Colorectal Cancer Screening,POINT (-104.660395 41.2928302),14.0,3119.0
3,2020,WY,Wyoming,Park,BRFSS,Prevention,Visits to doctor for routine checkup within th...,%,Crude prevalence,71.0,...,72.0,29331,56029,PREVENT,CHECKUP,CrdPrv,Annual Checkup,POINT (-109.5935975 44.4923865),14.0,3122.0
4,2020,WY,Wyoming,Lincoln,BRFSS,Health Outcomes,Chronic obstructive pulmonary disease among ad...,%,Age-adjusted prevalence,5.9,...,6.8,20253,56023,HLTHOUT,COPD,AgeAdjPrv,COPD,POINT (-110.6829614 42.2299932),14.0,3120.0


The other issue is that there are two rows of data for each measure and location: the crude prevalence and the age-adjusted prevalence.  Since we want to pull other age-based data into the analysis, such as the average age of the residents, we will drop the age-adjusted rows and use crude data only.
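A quick sanity check can confirm that filtering on `Data_Value_Type` leaves exactly one row per location and measure. This is a sketch under stated assumptions: the `toy` frame and its values are invented for illustration, and only the column names come from the data set.

```python
import pandas as pd

# Toy stand-in: each location/measure pair has a crude and an age-adjusted row
toy = pd.DataFrame({
    "LocationID": [1, 1, 2, 2],
    "MeasureId": ["MHLTH", "MHLTH", "MHLTH", "MHLTH"],
    "Data_Value_Type": ["Crude prevalence", "Age-adjusted prevalence"] * 2,
    "Data_Value": [13.5, 13.9, 15.6, 16.0],
})

# Keep only the crude prevalence rows
crude = toy[toy["Data_Value_Type"] == "Crude prevalence"]

# After filtering, each (LocationID, MeasureId) pair should appear exactly once
has_duplicates = crude.duplicated(subset=["LocationID", "MeasureId"]).any()
print(len(crude), has_duplicates)
```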


In [91]:
# Keep only the crude prevalence rows (drops the age-adjusted rows)
base_df = base_df[base_df.Data_Value_Type == 'Crude prevalence']

### Creating Sub-Tables

In [92]:
base_df.columns

Index(['Year', 'StateAbbr', 'StateDesc', 'LocationName', 'DataSource',
       'Category', 'Measure', 'Data_Value_Unit', 'Data_Value_Type',
       'Data_Value', 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote',
       'Low_Confidence_Limit', 'High_Confidence_Limit', 'TotalPopulation',
       'LocationID', 'CategoryID', 'MeasureId', 'DataValueTypeID',
       'Short_Question_Text', 'Geolocation', 'States', 'Counties'],
      dtype='object')

For the next steps, we will make copies of the data frame that include only subsets of related, repeated columns, then remove the duplicate rows.  For example, we expect the same LocationID/StateDesc/StateAbbr/LocationName combination to appear several times, once for each health measure.  Instead of keeping that repeated information, we can leave only LocationID in the main table and move a single instance of the other details into another table.
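As an illustration of the lookup-table idea, here is a minimal sketch on a three-row toy frame (the `toy` and `location_lookup` names are assumptions for this example). It keeps the result of `drop_duplicates`, which returns a new frame unless `inplace=True` is passed, and then checks that the key column is unique.

```python
import pandas as pd

# Toy stand-in: Teton appears twice, once per health measure
toy = pd.DataFrame({
    "LocationID": [56039, 56039, 56015],
    "StateAbbr": ["WY", "WY", "WY"],
    "LocationName": ["Teton", "Teton", "Goshen"],
    "MeasureId": ["PHLTH", "MHLTH", "PHLTH"],
})

# drop_duplicates returns a new frame, so the result has to be kept
location_lookup = toy[["LocationID", "StateAbbr", "LocationName"]].drop_duplicates()

# A lookup table is only safe to join on if its key is unique
key_is_unique = location_lookup["LocationID"].is_unique
print(location_lookup)
```

If a location carried two different values for a descriptive column, `drop_duplicates` would keep both rows and the uniqueness check would fail, which is a useful signal before loading the table into a database.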

In [93]:
location_df = base_df[['LocationID', 'StateAbbr', 'StateDesc', 'LocationName', 'TotalPopulation',
                      'Geolocation', 'States', 'Counties']].copy()
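# Note: without inplace=True or reassignment this call leaves location_df unchanged;
# the next cell repeats the step with inplace=True
location_df.drop_duplicates()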
location_df.head()

Unnamed: 0,LocationID,StateAbbr,StateDesc,LocationName,TotalPopulation,Geolocation,States,Counties
0,56039,WY,Wyoming,Teton,23497,POINT (-110.426087 44.048662),14.0,3126.0
1,56015,WY,Wyoming,Goshen,13235,POINT (-104.3535403 42.0894553),14.0,890.0
3,56029,WY,Wyoming,Park,29331,POINT (-109.5935975 44.4923865),14.0,3122.0
5,56001,WY,Wyoming,Albany,38950,POINT (-105.7218826 41.6655141),14.0,3079.0
6,56005,WY,Wyoming,Campbell,46676,POINT (-105.5170141 44.1919991),14.0,889.0


In [94]:
location_df = base_df[['LocationID', 'StateAbbr', 'StateDesc', 'LocationName', 'TotalPopulation',
                      'Geolocation', 'States', 'Counties']].copy()
location_df.drop_duplicates(inplace = True)
location_df.head()

Unnamed: 0,LocationID,StateAbbr,StateDesc,LocationName,TotalPopulation,Geolocation,States,Counties
0,56039,WY,Wyoming,Teton,23497,POINT (-110.426087 44.048662),14.0,3126.0
1,56015,WY,Wyoming,Goshen,13235,POINT (-104.3535403 42.0894553),14.0,890.0
3,56029,WY,Wyoming,Park,29331,POINT (-109.5935975 44.4923865),14.0,3122.0
5,56001,WY,Wyoming,Albany,38950,POINT (-105.7218826 41.6655141),14.0,3079.0
6,56005,WY,Wyoming,Campbell,46676,POINT (-105.5170141 44.1919991),14.0,889.0


In [106]:
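# Note: describe is referenced without parentheses, so the bound method's repr
# (the frame itself) is shown instead of summary statistics
location_df.describe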

<bound method NDFrame.describe of         LocationID StateAbbr StateDesc LocationName  TotalPopulation  \
0            56039        WY   Wyoming        Teton            23497   
1            56015        WY   Wyoming       Goshen            13235   
3            56029        WY   Wyoming         Park            29331   
5            56001        WY   Wyoming       Albany            38950   
6            56005        WY   Wyoming     Campbell            46676   
...            ...       ...       ...          ...              ...   
184868        1085        AL   Alabama      Lowndes             9641   
184942        1097        AL   Alabama       Mobile           412716   
184996        1037        AL   Alabama        Coosa            10650   
185002        1059        AL   Alabama     Franklin            31507   
185104        1079        AL   Alabama     Lawrence            32857   

                            Geolocation  States  Counties  
0         POINT (-110.426087 44.048662)  

In [95]:
# Look at what values are in DataSource column
base_df.DataSource.value_counts()

BRFSS    81744
Name: DataSource, dtype: int64

In [96]:
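# Note: drop() returns a new DataFrame; the result is not assigned here, so base_df still contains DataSource
base_df.drop(['DataSource'], axis = 1)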
base_df.head()

Unnamed: 0,Year,StateAbbr,StateDesc,LocationName,DataSource,Category,Measure,Data_Value_Unit,Data_Value_Type,Data_Value,...,High_Confidence_Limit,TotalPopulation,LocationID,CategoryID,MeasureId,DataValueTypeID,Short_Question_Text,Geolocation,States,Counties
0,2020,WY,Wyoming,Teton,BRFSS,Health Status,Physical health not good for >=14 days among a...,%,Crude prevalence,7.3,...,8.2,23497,56039,HLTHSTAT,PHLTH,CrdPrv,Physical Health,POINT (-110.426087 44.048662),14.0,3126.0
1,2020,WY,Wyoming,Goshen,BRFSS,Health Status,Fair or poor self-rated health status among ad...,%,Crude prevalence,13.8,...,15.8,13235,56015,HLTHSTAT,GHLTH,CrdPrv,General Health,POINT (-104.3535403 42.0894553),14.0,890.0
3,2020,WY,Wyoming,Park,BRFSS,Prevention,Visits to doctor for routine checkup within th...,%,Crude prevalence,71.0,...,72.0,29331,56029,PREVENT,CHECKUP,CrdPrv,Annual Checkup,POINT (-109.5935975 44.4923865),14.0,3122.0
5,2020,WY,Wyoming,Albany,BRFSS,Health Outcomes,Current asthma among adults aged >=18 years,%,Crude prevalence,9.8,...,10.5,38950,56001,HLTHOUT,CASTHMA,CrdPrv,Current Asthma,POINT (-105.7218826 41.6655141),14.0,3079.0
6,2020,WY,Wyoming,Campbell,BRFSS,Health Outcomes,All teeth lost among adults aged >=65 years,%,Crude prevalence,13.1,...,18.2,46676,56005,HLTHOUT,TEETHLOST,CrdPrv,All Teeth Lost,POINT (-105.5170141 44.1919991),14.0,889.0


### Create a Health Measures table

In [97]:
base_df.columns

Index(['Year', 'StateAbbr', 'StateDesc', 'LocationName', 'DataSource',
       'Category', 'Measure', 'Data_Value_Unit', 'Data_Value_Type',
       'Data_Value', 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote',
       'Low_Confidence_Limit', 'High_Confidence_Limit', 'TotalPopulation',
       'LocationID', 'CategoryID', 'MeasureId', 'DataValueTypeID',
       'Short_Question_Text', 'Geolocation', 'States', 'Counties'],
      dtype='object')

In [103]:
measures_df = base_df[['MeasureId', 'CategoryID', 'Measure', 'Category', 'Data_Value_Unit', 
                       'Data_Value_Type']].copy()
measures_df.drop_duplicates(inplace = True)
measures_df.head(30)

Unnamed: 0,MeasureId,CategoryID,Measure,Category,Data_Value_Unit,Data_Value_Type
0,PHLTH,HLTHSTAT,Physical health not good for >=14 days among a...,Health Status,%,Crude prevalence
1,GHLTH,HLTHSTAT,Fair or poor self-rated health status among ad...,Health Status,%,Crude prevalence
3,CHECKUP,PREVENT,Visits to doctor for routine checkup within th...,Prevention,%,Crude prevalence
5,CASTHMA,HLTHOUT,Current asthma among adults aged >=18 years,Health Outcomes,%,Crude prevalence
6,TEETHLOST,HLTHOUT,All teeth lost among adults aged >=65 years,Health Outcomes,%,Crude prevalence
12,MHLTH,HLTHSTAT,Mental health not good for >=14 days among adu...,Health Status,%,Crude prevalence
17,DIABETES,HLTHOUT,Diagnosed diabetes among adults aged >=18 years,Health Outcomes,%,Crude prevalence
19,CSMOKING,RISKBEH,Current smoking among adults aged >=18 years,Health Risk Behaviors,%,Crude prevalence
24,ACCESS2,PREVENT,Current lack of health insurance among adults ...,Prevention,%,Crude prevalence
26,BINGE,RISKBEH,Binge drinking among adults aged >=18 years,Health Risk Behaviors,%,Crude prevalence


In [99]:
len(measures_df.index)

26

### Create a Flat Table with Numeric Health Measurements
The remaining data, not moved to one of the separate tables, is the actual health measurement data for 2020.  However, each FIPS location is listed multiple times, once for each health measurement.  To use this data for machine learning, we need a separate column for each measure.  The end result will have a single row for each FIPS location and one column for each measure.
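The long-to-wide reshaping can be sketched on a small frame first. The example below is illustrative only: it uses four values taken from the output shown in this notebook, and the `long_df` and `wide_df` names are assumptions. It uses `DataFrame.pivot`, which raises an error if a location/measure pair appears more than once, whereas `pivot_table` would aggregate such duplicates (by mean, by default).

```python
import pandas as pd

# Long format: one row per (LocationID, MeasureId) pair
long_df = pd.DataFrame({
    "LocationID": [1001, 1001, 1003, 1003],
    "MeasureId": ["MHLTH", "SLEEP", "MHLTH", "SLEEP"],
    "Data_Value": [15.6, 37.0, 14.7, 33.9],
})

# Wide format: one row per LocationID, one column per MeasureId
wide_df = long_df.pivot(index="LocationID", columns="MeasureId", values="Data_Value")
print(wide_df)
```

When the crude-only filter has already left one row per pair, both methods should give the same table; the stricter `pivot` simply makes a violation of that assumption visible.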

In [100]:
# Pull specific columns to process
working_df = base_df[['MeasureId', 'LocationID', 'Data_Value']].copy()
working_df.head()

Unnamed: 0,MeasureId,LocationID,Data_Value
0,PHLTH,56039,7.3
1,GHLTH,56015,13.8
3,CHECKUP,56029,71.0
5,CASTHMA,56001,9.8
6,TEETHLOST,56005,13.1


In [101]:
# Use a pivot table to create one row per LocationID and one column per MeasureId
data_df = working_df.pivot_table(values='Data_Value', index='LocationID', columns='MeasureId')

In [102]:
data_df.head()

LocationID,ACCESS2,ARTHRITIS,BINGE,CANCER,CASTHMA,CERVICAL,CHD,CHECKUP,COLON_SCREEN,COPD,...,GHLTH,KIDNEY,LPA,MAMMOUSE,MHLTH,OBESITY,PHLTH,SLEEP,STROKE,TEETHLOST
59,15.3,24.2,15.5,6.5,9.2,82.8,6.4,74.7,72.4,6.4,...,14.5,3.0,23.5,78.2,13.5,31.9,10.0,32.7,3.2,13.4
1001,15.3,30.8,14.6,6.8,9.8,84.1,7.3,75.7,72.9,8.3,...,18.0,3.2,27.2,74.7,15.6,37.6,11.4,37.0,3.5,12.1
1003,14.1,33.8,14.7,8.2,9.1,85.1,8.0,76.3,74.2,8.4,...,16.8,3.3,24.7,75.2,14.7,32.9,11.2,33.9,3.5,9.5
1005,23.4,36.9,12.0,7.0,11.0,81.2,10.1,77.8,71.7,12.1,...,29.4,4.6,37.0,71.8,17.0,46.1,16.1,42.7,5.8,21.3
1007,19.2,32.7,14.4,6.8,9.9,81.8,8.3,75.1,71.9,10.2,...,22.9,3.5,32.5,71.0,16.8,38.6,13.5,39.5,4.2,16.1


In [105]:
nan_count = data_df.isna().sum()
nan_count

MeasureId
ACCESS2         0
ARTHRITIS       0
BINGE           0
CANCER          0
CASTHMA         0
CERVICAL        0
CHD             0
CHECKUP         0
COLON_SCREEN    0
COPD            0
COREM           0
COREW           0
CSMOKING        0
DENTAL          0
DEPRESSION      0
DIABETES        0
GHLTH           0
KIDNEY          0
LPA             0
MAMMOUSE        0
MHLTH           0
OBESITY         0
PHLTH           0
SLEEP           0
STROKE          0
TEETHLOST       0
dtype: int64

### Exporting the Tables as CSV
Now we can export the data to be used later for machine learning.

In [110]:
location_df.to_csv('location_lookup.csv', index=False)
measures_df.to_csv('measures_lookup.csv', index=False)
data_df.to_csv('2020_crude_health_measures.csv')
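Note that the two lookup tables are written with `index=False`, while the pivoted table is written with its index so that `LocationID` is kept as the first column. A minimal sketch of that round trip follows, using an in-memory buffer instead of a file; the `wide_df`, `buffer`, and `restored` names are assumptions for illustration.

```python
import io

import pandas as pd

# Toy stand-in for the pivoted table, indexed by LocationID
wide_df = pd.DataFrame(
    {"MHLTH": [15.6, 14.7], "SLEEP": [37.0, 33.9]},
    index=pd.Index([1001, 1003], name="LocationID"),
)

# Write with the index kept, so LocationID becomes the first CSV column
buffer = io.StringIO()
wide_df.to_csv(buffer)

# Read it back and restore LocationID as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="LocationID")
print(restored)
```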