# EDA of Mental Health Dataset
This notebook presents the exploratory data analysis process of the Mental Health Dataset through visualization.

The notebook uses code from MH_EDA.py and Merged_Map.py (the two modules imported below). These files can be found in the same folder as the notebook.

In [1]:
import MH_EDA as mh
import Merged_Map as mm

## Content
*Note: Internal links only work in a local environment*

- [Cleaning Mental Health Dataset](#cleaning-mental-health-dataset)
    - Result: Removed chronic diseases unrelated to mental illness and less relevant columns.

- [Explore MH Overall Status](#explore-mh-overall-status)
    - Top 5 cities with the most severe mental illness: New Bedford, MA; Fall River, MA; Springfield, MA; Flint, MI; Reading, PA.
    - Top 2 states with the most severe mental illness: Ohio (OH), Massachusetts (MA).
    - Possible Bias: The number of cities surveyed varies across different states.

- [Regional and Divisional MH Prevalence](#regional-and-divisional-mh-prevalence)
    - Regional: The Northeast region appears to have more severe mental health issues, averaging 14.1%.
    - Divisional: Three divisions show more severe mental health issues than others: East South Central (14.42%), Middle Atlantic (14.36%), and New England (13.94%).

- [Population Influence](#population-influence)
    - City population size shows no clear relationship with mental health prevalence.

## Cleaning Mental Health Dataset
First, we load the mental health dataset, clean it, and remove less relevant features.

In [2]:
mh_file_path = 'MHDS/Original/500_Cities__City-level_Data__GIS_Friendly_Format___2017_release_20240514.csv'

raw_df= mh.mh_load_file(mh_file_path)
raw_df.head()

Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,ACCESS2_CrudePrev,ACCESS2_Crude95CI,ACCESS2_AdjPrev,ACCESS2_Adj95CI,ARTHRITIS_CrudePrev,ARTHRITIS_Crude95CI,...,SLEEP_Adj95CI,STROKE_CrudePrev,STROKE_Crude95CI,STROKE_AdjPrev,STROKE_Adj95CI,TEETHLOST_CrudePrev,TEETHLOST_Crude95CI,TEETHLOST_AdjPrev,TEETHLOST_Adj95CI,Geolocation
0,AL,Birmingham,107000,212237,19.6,"(19.2, 20.0)",19.8,"(19.5, 20.2)",30.9,"(30.8, 31.1)",...,"(46.6, 47.0)",5.2,"( 5.1, 5.3)",5.2,"( 5.1, 5.2)",26.1,"(25.1, 27.2)",25.9,"(25.0, 26.9)","(33.52756637730, -86.7988174678)"
1,AL,Hoover,135896,81619,9.7,"( 9.3, 10.1)",9.9,"( 9.5, 10.4)",25.3,"(25.0, 25.7)",...,"(34.2, 35.0)",2.2,"( 2.1, 2.3)",2.2,"( 2.1, 2.2)",9.6,"( 8.6, 10.8)",9.5,"( 8.5, 10.9)","(33.37676027290, -86.8051937568)"
2,AL,Huntsville,137000,180105,15.1,"(14.7, 15.4)",15.1,"(14.8, 15.5)",27.5,"(27.3, 27.7)",...,"(39.4, 40.0)",3.4,"( 3.3, 3.4)",3.3,"( 3.2, 3.3)",14.9,"(14.1, 15.7)",14.7,"(13.8, 15.5)","(34.69896926710, -86.6387042882)"
3,AL,Mobile,150000,195111,16.9,"(16.6, 17.2)",17.2,"(16.9, 17.5)",30.5,"(30.3, 30.6)",...,"(42.0, 42.4)",4.4,"( 4.3, 4.5)",4.1,"( 4.1, 4.2)",24.3,"(23.4, 25.3)",24.1,"(23.1, 25.0)","(30.67762486480, -88.1184482714)"
4,AL,Montgomery,151000,205764,17.4,"(17.0, 17.9)",17.5,"(17.1, 17.9)",29.8,"(29.7, 30.0)",...,"(41.0, 41.5)",4.1,"( 4.1, 4.2)",4.2,"( 4.1, 4.3)",21.2,"(20.3, 22.2)",21.2,"(20.1, 22.2)","(32.34726453330, -86.2677059552)"


We are not interested in the other chronic diseases, so we drop those columns and keep only the features related to mental health, along with the essential identifying features.

In [3]:
mh_df = mh.mh_remove_chronics(raw_df)
mh_df.head()

Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,MHLTH_CrudePrev,MHLTH_Crude95CI,MHLTH_AdjPrev,MHLTH_Adj95CI,Geolocation
0,AL,Birmingham,107000,212237,15.6,"(15.4, 15.8)",15.6,"(15.4, 15.8)","(33.52756637730, -86.7988174678)"
1,AL,Hoover,135896,81619,10.4,"(10.1, 10.7)",10.4,"(10.1, 10.7)","(33.37676027290, -86.8051937568)"
2,AL,Huntsville,137000,180105,13.3,"(13.1, 13.6)",13.4,"(13.2, 13.7)","(34.69896926710, -86.6387042882)"
3,AL,Mobile,150000,195111,14.9,"(14.7, 15.1)",15.0,"(14.9, 15.2)","(30.67762486480, -88.1184482714)"
4,AL,Montgomery,151000,205764,14.9,"(14.7, 15.2)",14.8,"(14.6, 15.1)","(32.34726453330, -86.2677059552)"


The dataset has been refined by excluding other chronic diseases, resulting in a dataframe focused on mental health. The remaining features are explained in the table below:

|Features|Type|Meaning|
|--|--|--|
|StateAbbr|Plain Text|State abbreviation|
|PlaceName|Plain Text|City name|
|PlaceFIPS|Number|City FIPS Code|
|Population2010|Number|2010 Census population count|
|MHLTH_CrudePrev|Number|Crude prevalence of poor mental health for 14 days or more among adults aged 18 years and older, 2015. <br> Crude prevalence represents the ratio of the total number of responses of 'not good' to the total number of valid responses (excluding those who refused to answer, provided no response, or indicated 'don’t know/not sure').|
|MHLTH_Crude95CI|Plain Text|Estimated 95% confidence interval for crude prevalence|
|MHLTH_AdjPrev|Number|Age-adjusted prevalence, standardized by the direct method to the year 2000 standard U.S. population, distribution 9. `[1]` |
|MHLTH_Adj95CI|Plain Text|Estimated 95% Confidence interval for age-adjusted prevalence|
|Geolocation|Plain Text|Latitude, longitude of city centroid|

Further cleaning and manipulation will be necessary as some features are less useful or stored in an incorrect format:

Removing Features:

- MHLTH_CrudePrev, MHLTH_Crude95CI: We will use age-adjusted prevalence because it represents standardized prevalence.

Transforming Format:

- Geolocation: Geolocation needs to be converted into a list of two floats representing latitude and longitude.
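
The Geolocation conversion described above can be sketched in a few lines. This is only an illustration of the idea, assuming the raw value is a string such as `"(33.52756637730, -86.7988174678)"`; the helper name `parse_geolocation` is hypothetical and is not necessarily how `mh.mh_secondary_remove_and_transform` does it.

```python
import ast


def parse_geolocation(text):
    """Convert a string such as "(33.5, -86.8)" into [latitude, longitude].

    Hypothetical helper for illustration; the notebook's own transformation
    lives in MH_EDA.py and may be implemented differently.
    """
    # literal_eval safely parses the tuple literal without executing code
    lat, lon = ast.literal_eval(text.strip())
    return [float(lat), float(lon)]


# Value taken from the Birmingham row shown in the output above
print(parse_geolocation("(33.52756637730, -86.7988174678)"))
```

`ast.literal_eval` is used here instead of `eval` because it only accepts Python literals, which is safer for values read from a file.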

`[1]` The direct method, applied with the year 2000 standard U.S. population (distribution 9), is a statistical technique that adjusts for age differences by assigning a fixed weight to each age group. Its use is mandated by the Department of Health and Human Services (DHHS) across all its agencies to make age-adjusted rates more comparable among data systems. [(reference)](https://www.cdc.gov/places/measure-definitions/health-status/index.html#mental-health) Distribution 9 refers to the specific set of weighting factors used for this age-adjusted prevalence. For more information about the weights, see [page 3](https://www.cdc.gov/nchs/data/statnt/statnt20.pdf).
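
To make the direct method more concrete, here is a minimal sketch of the weighted average it computes. The age groups, prevalences, and weights are made-up placeholders chosen for easy arithmetic; they are not the real Distribution 9 values, which are listed in the CDC document linked above.

```python
def age_adjusted_prevalence(age_specific_prev, standard_weights):
    """Direct standardization: weight each age group's prevalence by the
    share of that age group in the standard population."""
    total_weight = sum(standard_weights.values())
    weighted_sum = sum(
        age_specific_prev[group] * weight
        for group, weight in standard_weights.items()
    )
    return weighted_sum / total_weight


# Placeholder numbers for illustration only (NOT Distribution 9)
age_specific_prev = {"18-44": 14.0, "45-64": 12.0, "65+": 8.0}
standard_weights = {"18-44": 0.5, "45-64": 0.3, "65+": 0.2}

adjusted = age_adjusted_prevalence(age_specific_prev, standard_weights)
print(round(adjusted, 2))
```

Because every city is weighted with the same standard age structure, two cities with identical age-specific prevalences get the same adjusted value even if one city has a much older population.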

In [4]:
mhdf = mh.mh_secondary_remove_and_transform(mh_df)
# output the cleaned data
mh.mh_to_csv(mhdf)

mhdf.head()

MH_cleaned.csv already exists.


Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,MHLTH_AdjPrev,MHLTH_Adj95CI,Geolocation
0,AL,Birmingham,107000,212237,15.6,"(15.4, 15.8)","[33.5275663773, -86.7988174678]"
1,AL,Hoover,135896,81619,10.4,"(10.1, 10.7)","[33.3767602729, -86.8051937568]"
2,AL,Huntsville,137000,180105,13.4,"(13.2, 13.7)","[34.6989692671, -86.6387042882]"
3,AL,Mobile,150000,195111,15.0,"(14.9, 15.2)","[30.6776248648, -88.1184482714]"
4,AL,Montgomery,151000,205764,14.8,"(14.6, 15.1)","[32.3472645333, -86.2677059552]"


[Return to Content](#content)

## Explore MH Overall Status
I chose a treemap to present the overall status of all 500 cities instead of a bar chart because it effectively utilizes size and color to clearly depict the mental health prevalence in each city. Imagine trying to read data from a bar chart with 500 bars!

Plotly is an interactive visualization tool that allows us to extract more information by hovering over or clicking on a box to obtain detailed information.

Unfortunately, the interactive function is not available in the GitHub environment. The following pictures are treated as static visualizations to provide an overview. To explore the interactive functions, download the visualizations using the following links and open them with any web browser:

- [fig_city](MHDS/Visuals/fig_city.html)
- [fig_statecity](MHDS/Visuals/fig_statecity.html)

In [5]:
fig_city = mh.mh_city_level_treemap(mhdf, title='Mental Health Prevalence by City')

# output png and html files
mh.output_visuals(fig_city, 'MHDS/Visuals/fig_city.png')
mh.output_visuals(fig_city, 'MHDS/Visuals/fig_city.html', tohtml = True)

fig_city.show()

# The static image below is intended for display on GitHub;
# it can stay disabled when running in a local environment:
# Image(filename='MHDS/Visuals/fig_city.png') 


MHDS/Visuals/fig_city.png already exists.
MHDS/Visuals/fig_city.html already exists.


According to the treemap above, the top 5 cities with the most severe mental health issues are:
- New Bedford, MA
- Fall River, MA
- Springfield, MA
- Flint, MI
- Reading, PA

Among these five cities, three are located in Massachusetts (MA). This observation raises the question of whether Massachusetts has the most severe mental health issues among all U.S. states.

To explore this, I will use a treemap to present state-level data. Treemaps are effective for displaying hierarchical data and illustrating the status of states along with the relationship between cities and their parent state. By clicking on a state box in the treemap, users can view the average prevalence of mental health issues. This interactive approach can help identify states facing severe mental health challenges.

In [6]:
fig_statecity = mh.mh_plotly_treemap(mhdf)

# output png and html files
mh.output_visuals(fig_statecity, 'MHDS/Visuals/fig_statecity.png')
mh.output_visuals(fig_statecity, 'MHDS/Visuals/fig_statecity.html', tohtml = True)

fig_statecity.show()

# disable following code when running in local environment:
# Image(filename='MHDS/Visuals/fig_statecity.png') 

MHDS/Visuals/fig_statecity.png already exists.
MHDS/Visuals/fig_statecity.html already exists.


The treemap shows that Ohio (OH) has the highest average prevalence (15.37), followed by Massachusetts (MA) with an average prevalence of 15.06.

Despite this, Massachusetts has more cities with significant mental health challenges: three of its 13 cities have a prevalence over 17, whereas Ohio has only one such city. However, Massachusetts's sample also includes cities with lower prevalence, such as Newton (9.2), which pull the state's average down. This variation in city selection can introduce significant bias if we analyze at a geographic level larger than the city.
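
The sensitivity of a state average to which cities are sampled can be illustrated with a small sketch. It uses the five Alabama rows shown in the `mhdf.head()` output (not necessarily every Alabama city in the dataset) and compares a simple mean of city prevalences with a population-weighted mean. The weighted variant is an alternative shown for comparison, not what the notebook computes.

```python
# (city, Population2010, MHLTH_AdjPrev) from the mhdf.head() output above
cities = [
    ("Birmingham", 212237, 15.6),
    ("Hoover", 81619, 10.4),
    ("Huntsville", 180105, 13.4),
    ("Mobile", 195111, 15.0),
    ("Montgomery", 205764, 14.8),
]


def simple_mean(rows):
    """Every city counts equally, regardless of its size."""
    return sum(prev for _, _, prev in rows) / len(rows)


def population_weighted_mean(rows):
    """Each city contributes in proportion to its population."""
    total_pop = sum(pop for _, pop, _ in rows)
    return sum(pop * prev for _, pop, prev in rows) / total_pop


# Drop the lowest-prevalence city to see how much the simple mean moves
without_hoover = [row for row in cities if row[0] != "Hoover"]

print(round(simple_mean(cities), 2))
print(round(population_weighted_mean(cities), 2))
print(round(simple_mean(without_hoover), 2))
```

Including or excluding a single low-prevalence city shifts the simple mean noticeably, which is the same effect Newton has on the Massachusetts average.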

This raises a further question: does mental health prevalence vary across larger geographic units, such as Census regions and divisions?

[Return to Content](#content) 

## Regional and Divisional MH Prevalence

How would geographic locations (regions and divisions) affect mental health prevalence?

Inference: Environmental and socioeconomic statuses vary significantly among regions and divisions. Intuitively, locations with less green space and lower socioeconomic status might exhibit higher mental health prevalence. Thus, I infer that central areas of the US could have more severe mental health issues. However, the results might also be influenced by the data collection method, such as fewer data points from central areas.

Regardless, let's dive into the data. I will explore how mental health prevalence varies among regions and divisions using the choropleth map provided by the Folium package, which offers interactive functionalities, making the map more informative.

We will use the [Census Regions and Divisions of the United States](https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf) to categorize states into different regional and divisional groups.

Both regional and divisional boundary data can be found and downloaded [here](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html).
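
As a rough illustration of the labeling and aggregation steps, the sketch below maps state abbreviations to the four Census regions and then sums population and averages prevalence per region. It is written in plain Python for clarity. The dictionary layout, the helper name `aggregate_by_region`, and the Massachusetts row are assumptions for illustration; the actual structure returned by `mm.us_region()` is not shown in this notebook.

```python
from collections import defaultdict
from statistics import mean

# Census regions and their states (District of Columbia included)
CENSUS_REGIONS = {
    "Northeast": ["CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"],
    "Midwest": ["IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO",
                "NE", "ND", "SD"],
    "South": ["DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV",
              "AL", "KY", "MS", "TN", "AR", "LA", "OK", "TX"],
    "West": ["AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY",
             "AK", "CA", "HI", "OR", "WA"],
}

# Invert to a state -> region lookup
STATE_TO_REGION = {
    state: region
    for region, states in CENSUS_REGIONS.items()
    for state in states
}


def aggregate_by_region(rows):
    """Sum population and average prevalence per region (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[STATE_TO_REGION[row["StateAbbr"]]].append(row)
    return {
        region: {
            "Population2010": sum(r["Population2010"] for r in members),
            "MHLTH_AdjPrev": mean(r["MHLTH_AdjPrev"] for r in members),
        }
        for region, members in grouped.items()
    }


rows = [
    # Two rows from the cleaned dataset shown earlier
    {"StateAbbr": "AL", "Population2010": 212237, "MHLTH_AdjPrev": 15.6},
    {"StateAbbr": "AL", "Population2010": 81619, "MHLTH_AdjPrev": 10.4},
    # Made-up placeholder row, only to show a second region
    {"StateAbbr": "MA", "Population2010": 100000, "MHLTH_AdjPrev": 12.0},
]

result = aggregate_by_region(rows)
print(result)
```

The aggregation mirrors the `agg_dict` used in the cells below: population is summed, while prevalence is a simple (unweighted) mean over cities.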

As with the Plotly treemaps, the GitHub environment doesn't support the interactive choropleth maps created by Folium. To explore the interactive functions, download the visualizations using the following links and open them with any web browser:

- [m_regions](MHDS/Visuals/m_regions.html)
- [m_divisions](MHDS/Visuals/m_divisions.html)

In [7]:
# regions and divisions data built based on census regions of the US: https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
# download related boundary files from: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
# unzip the downloaded file and set the path to the .shp file
# NOTE: Do not delete any files in the unzipped folder. Although only the .shp file is used, the other files are necessary for the .shp file to work.
# regional_path = 'MHDS/Original/cb_2018_us_region_500k/cb_2018_us_region_500k.shp' 
# division_path = 'MHDS/Original/cb_2018_us_division_500k/cb_2018_us_division_500k.shp'
# city_path = 'MHDS/Original/500Cities_City_11082016/CityBoundaries.shp'


In [8]:
# prepare the regional average df 
# call function to return the us_region dictionary
us_region = mm.us_region()

# apply labels from dictionary to mhdf and output a grouped df by regions
mh_region = mm.apply_geo_labels(mhdf, 'Region', us_region, 'StateAbbr')

agg_dict = {'Population2010': 'sum', 'MHLTH_AdjPrev': 'mean'}

mh_region_avg = mh.mh_aggregation(mh_region, 'Region', agg_dict)
mh_region_avg

Unnamed: 0,Region,Population2010,MHLTH_AdjPrev
0,Midwest,17271027,12.143011
1,Northeast,16906935,14.125
2,South,31712597,12.705128
3,West,37130249,11.875897


In [9]:
# output the geojson file of the region boundaries if it does not already exist
regions_path = 'MHDS/Original/cb_2018_us_region_500k/cb_2018_us_region_500k.shp'
# regional_bound = gb.bound_load_file_output_geojson(regions_path, df = _, file = True,  full_state = True, output = True, output_folder = 'MHDS/', output_filename ='region_gdf.geojson')

regional_bound = mm.file_to_geojson(regions_path, 'GEOJSON/region_gdf.geojson')

File already exists.


In [10]:
# create regional choropleth map
boundary_file_path = 'GEOJSON/region_gdf.geojson'
df = mh_region_avg
col_list = ['Region','Population2010', 'MHLTH_AdjPrev']

m_regions = mm.merged_choropleth_map(boundary_file_path,df,col_list,
    lat=39.5,
    lon=-98.35,
    geo_col=["Region", "MHLTH_AdjPrev"],
    key="NAME",
    color="YlGnBu",
    opacity=0.7,
    weight=1,
    zoom_start=5,
    legend="Average Mental Health Prevalence (%)",)

# output the map to html file
# m_regions.save('MHDS/Visuals/m_regions.html')

# display(m_regions)

Although the map shows that the Northeast region appears to have more severe mental health issues (14.1% on average), four regions are too coarse to be very informative. Therefore, we will explore further based on the nine divisions of the U.S.

In [11]:
# same as above, create divisional average df and boundary file

# prepare the divisional average df 
# call function to return the us_division dictionary
us_division = mm.us_division()

# apply labels from dictionary to mhdf and output a grouped df by divisions
mh_division = mm.apply_geo_labels(mhdf, 'Division', us_division, 'StateAbbr')

agg_dict = {'Population2010': 'sum', 'MHLTH_AdjPrev': 'mean'}

mh_division_avg = mh.mh_aggregation(mh_division, 'Division', agg_dict)
mh_division_avg

Unnamed: 0,Division,Population2010,MHLTH_AdjPrev
0,East North Central,12084075,12.727869
1,East South Central,3936094,14.425
2,Middle Atlantic,12765714,14.356
3,Mountain,10102952,11.46
4,New England,3449409,13.941379
5,Pacific,27027297,12.01931
6,South Atlantic,13260549,13.007692
7,West North Central,5186952,11.028125
8,West South Central,15207766,11.94375


In [13]:
# create the divisional choropleth map
boundary_file_path = 'GEOJSON/division_gdf.geojson'
df = mh_division_avg
col_list = ['Division','Population2010', 'MHLTH_AdjPrev']

m_divisions = mm.merged_choropleth_map(boundary_file_path,df,col_list,
    lat=39.5,
    lon=-98.35,
    geo_col=["Division", "MHLTH_AdjPrev"],
    key="NAME",
    color="YlGnBu",
    opacity=0.7,
    weight=1,
    zoom_start=5,
    legend="Average Mental Health Prevalence (%)",)

# output the map as html file
m_divisions.save('MHDS/Visuals/m_divisions.html')

# display(m_divisions)

From the map above, we can see that three divisions appear to have more severe mental health issues than other divisions: East South Central (14.42%), Middle Atlantic (14.36%), and New England (13.94%).

In fact, the entire Eastern region seems to experience more severe mental health issues compared to the Central and Western regions. Given that the Eastern area has a distinct environment and socioeconomic status compared to the Central and Western regions, this distinction provides a valuable starting point to further explore how environmental and socioeconomic factors correlate with mental health prevalence.

[Return to Content](#content) 

## Population Influence

Does the size of a population affect mental health (MH) prevalence?

Inference: The size of the population, often a reference to the size of a city, can influence mental health through two aspects:
- Accessibility to Green Spaces: Larger cities, having denser populations, typically offer less access to green spaces, which may exacerbate MH prevalence.
- Socioeconomic Status: Conversely, large cities often have higher socioeconomic statuses, which can mitigate MH prevalence.

The relationship is complex, so let’s explore it using the dataset `mhdf`.

First, we need to sort cities into different size groups based on their population. Following the [OECD Classification](https://data.oecd.org/popregion/urban-population-by-city-size.htm), we can categorize cities into four groups under a new column `CitySize`:

- Small Urban Areas: 50,000 to 200,000 people.
- Medium-Size Urban Areas: 200,000 to 500,000 people.
- Metropolitan Areas: 500,000 to 1.5 million people.
- Large Metropolitan Areas: 1.5 million or more people.

We will then apply this classification to `mhdf` and create a new DataFrame, `df_CitySize`, that holds the number of cities and the average MH prevalence for each city-size group (using groupby on `CitySize`). In the output, the number of cities appears under the `Population2010` column. We also add a squared MH prevalence column (`square_MHLTH_AdjPrev`) for better visualization.
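
The classification step can be sketched as a simple range lookup. This is an illustration under stated assumptions: the helper name `classify_city_size` is hypothetical, each range is treated as half-open (lower bound included, upper bound excluded), and `float("inf")` stands in for the open-ended upper bound. The notebook's `mh.mh_apply_CitySize` may handle boundaries differently.

```python
from collections import Counter

city_size_dict = {
    "Small Urban Areas": (50_000, 200_000),
    "Medium-Size Urban Areas": (200_000, 500_000),
    "Metropolitan Areas": (500_000, 1_500_000),
    "Large Metropolitan Areas": (1_500_000, float("inf")),
}


def classify_city_size(population, bounds=city_size_dict):
    """Return the size label whose [low, high) range contains the population."""
    for label, (low, high) in bounds.items():
        if low <= population < high:
            return label
    return None  # below the smallest threshold


# Populations from the five Alabama rows shown earlier
populations = {
    "Birmingham": 212237,
    "Hoover": 81619,
    "Huntsville": 180105,
    "Mobile": 195111,
    "Montgomery": 205764,
}

labels = {city: classify_city_size(pop) for city, pop in populations.items()}
counts = Counter(labels.values())
print(labels)
print(counts)
```

Counting labels per group corresponds to the "number of cities" column in the grouped DataFrame; averaging prevalence within each label would give the other column.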

Finally, we will use [Altair](https://altair-viz.github.io/index.html) (a visualization package that allows for flexible customizations) to create a visualization of `Population vs. MH Prev` combining a bar chart (presenting the number of cities for each group) and a scatter plot (presenting the average MH prevalence) to analyze the influence of population size on MH prevalence.

In [14]:
city_size_dict = {
    'Small Urban Areas': [50000,200000],
    'Medium-Size Urban Areas': [200000,500000],
    'Metropolitan Areas': [500000,1500000],
    'Large Metropolitan Areas': [1500000, 100**100]
}

# call function to add CitySize col to mhdf and output a grouped df

df_CitySize = mh.mh_apply_CitySize(mhdf, city_size_dict)
df_CitySize

Unnamed: 0,CitySize,Population2010,MHLTH_AdjPrev,square_MHLTH_AdjPrev
0,Large Metropolitan Areas,5,12.88,165.8944
1,Medium-Size Urban Areas,73,12.549315,157.485309
2,Metropolitan Areas,29,12.458621,155.217229
3,Small Urban Areas,392,12.410204,154.013165


In [15]:
# call function to present the Population vs. MH Prev
mh.mh_pop_vs_mh(df_CitySize)

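According to the chart above, population size seems uncorrelated with mental health prevalence: the group averages differ by less than half a percentage point (12.41% to 12.88%), and the group with the highest average, Large Metropolitan Areas, contains only 5 cities.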
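
Comparing group averages is one way to look for a relationship. A complementary check, not performed in this notebook, is a city-level Pearson correlation between `Population2010` and `MHLTH_AdjPrev`. The sketch below implements the coefficient in plain Python and runs it on the five Alabama rows shown earlier, purely to demonstrate the calculation. Five rows from one state say nothing about the full dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Demonstration input only: the five Alabama rows from mhdf.head()
population = [212237, 81619, 180105, 195111, 205764]
prevalence = [15.6, 10.4, 13.4, 15.0, 14.8]

r = pearson_r(population, prevalence)
print(round(r, 3))
```

Applied to all cities in `mhdf`, a coefficient near zero would support the conclusion above, while a clearly non-zero value would suggest the size groups are hiding a trend.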

I think it would be better to delve deeper into accessibility to green spaces and socioeconomic status instead of focusing on population size.

[Return to Content](#content) 