# EDA of Mental Health Dataset
This notebook presents the exploratory data analysis (EDA) and the plots for the Mental Health Dataset. \
The helper functions are imported from *.py* files stored in the same directory.

In [1]:
from CL_MHDS import *
from City_Boundary import *
from MH_choropleth import *
from MH_altair_plots import *

remove_warnings()

In [2]:
mh_file_path = 'MHDS/Original/500_Cities__City-level_Data__GIS_Friendly_Format___2017_release_20240514.csv'
city_path = 'MHDS/Original/500Cities_City_11082016/CityBoundaries.shp'
key_lst = ['StateAbbr','PlaceName','PlaceFIPS','Population2010','Geolocation']


# Load and clean the mental health data
mh_df = load_cleaning(mh_file_path, key_lst)
mh_df.head()


Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,Geolocation,MHLTH_CrudePrev,MHLTH_AdjPrev,Crude95CI_Low,Crude95CI_High,Adjusted95CI_Low,Adjusted95CI_High
0,AL,Birmingham,107000,212237,"[33.5275663773, -86.7988174678]",15.6,15.6,15.4,15.8,15.4,15.8
1,AL,Hoover,135896,81619,"[33.3767602729, -86.8051937568]",10.4,10.4,10.1,10.7,10.1,10.7
2,AL,Huntsville,137000,180105,"[34.6989692671, -86.6387042882]",13.3,13.4,13.1,13.6,13.2,13.7
3,AL,Mobile,150000,195111,"[30.6776248648, -88.1184482714]",14.9,15.0,14.7,15.1,14.9,15.2
4,AL,Montgomery,151000,205764,"[32.3472645333, -86.2677059552]",14.9,14.8,14.7,15.2,14.6,15.1


## General Exploration
The following plots were created using the Altair package. \
Altair produces elegantly styled plots that are easy to use and customize. \
It features detailed documentation and allows for direct manipulation of data within the charts through data transformation. \
For more information about Altair, see the [overview documentation](https://altair-viz.github.io/getting_started/overview.html).

In [3]:
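# correlation between population and mental health prevalence
import numpy as np  # explicit import; np is otherwise assumed to come from the star imports above

mh_df['log_Population2010'] = np.log(mh_df['Population2010'])

# signature reminder:
# scatter_plot(df, x_col, y_col, title, x_title, y_title, size=60, opacity=0.7)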

pop_mhp = scatter_plot(mh_df, 'MHLTH_AdjPrev','Population2010','Population vs Mental Health Prevalence(%)', 'Mental Health Prevalence(%)', 'Population',opacity=0.5)

# transform the population data to log scale to create a better visualization

log_pop_mhdf = feature_trans(mh_df,'Population2010','log_Population2010', lambda x: np.log(x))

log_pop_mhp = scatter_plot(log_pop_mhdf, 'MHLTH_AdjPrev','log_Population2010','Log Population vs Mental Health Prevalence(%)', 'Mental Health Prevalence(%)', 'Log Population', opacity=0.5)

pop_mhp | log_pop_mhp


Neither the raw nor the log-scaled plot shows a clear relationship between population and mental health prevalence.
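
A correlation coefficient gives a numeric check on what the scatter plots suggest. The following is a minimal sketch, not part of the notebook's code: it computes Pearson's r with the standard library on the five Alabama rows shown in the `mh_df.head()` output above. Five rows from a single state are far too few to draw a conclusion, so the value says nothing about the full dataset. The helper name `pearson_r` is hypothetical. On the full data frame, the pandas call `mh_df['log_Population2010'].corr(mh_df['MHLTH_AdjPrev'])` should serve the same purpose.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Population2010 and MHLTH_AdjPrev for the five rows shown in the head() output
population = [212237, 81619, 180105, 195111, 205764]
adj_prev = [15.6, 10.4, 13.4, 15.0, 14.8]

# Same log transform as in the notebook cell above
log_population = [math.log(p) for p in population]
r = pearson_r(log_population, adj_prev)
print(f"Pearson r on the five sample rows: {r:.3f}")
```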

In [4]:
# present mental health prevalence status by state
box_plot(mh_df, 'StateAbbr', 'MHLTH_AdjPrev', 'Mental Health Prevalence(%) by State', 'State', 'Mental Health Prevalence(%)')

A box plot summarizes the mental health prevalence of the cities in each state through the maximum, minimum, and median values. \
Some states show wide variation, while others are more consistent. This variation may partly reflect the number of cities sampled within each state. \
To mitigate this effect and keep states comparable, we should standardize the number of cities considered in each state if we pursue a state-level analysis further.
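
One way to act on this is to count the cities per state first and restrict state-level comparisons to states with enough cities. The sketch below does not use the notebook's helper functions. The `AL` values come from the `head()` output above, while the `XX` row is a made-up placeholder added only to show the filter excluding a sparsely sampled state. The function name `states_with_enough_cities` and the threshold `min_cities=3` are arbitrary choices for illustration.

```python
from collections import defaultdict

def states_with_enough_cities(rows, min_cities):
    """Group prevalence values by state and keep states with >= min_cities cities."""
    by_state = defaultdict(list)
    for state, prevalence in rows:
        by_state[state].append(prevalence)
    return {s: v for s, v in by_state.items() if len(v) >= min_cities}

rows = [
    ("AL", 15.6), ("AL", 10.4), ("AL", 13.4), ("AL", 15.0), ("AL", 14.8),
    ("XX", 12.0),  # hypothetical placeholder row, not from the dataset
]

kept = states_with_enough_cities(rows, min_cities=3)
for state, values in kept.items():
    print(state, len(values), "cities")
```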

In [5]:
# present avg mental health prevalence by state
state_avgmhp = bar_plot(mh_df, 'StateAbbr', 'mean(MHLTH_AdjPrev)', 'State', 'Average Mental Health Prevalence(%)', 'Average Mental Health Prevalence(%) by State', width = 800, height = 200)
state_avgmhp

The three states with the highest average mental health prevalence are Ohio (OH), Mississippi (MS), and Tennessee (TN).

In [10]:
# present median mental health prevalence by state
state_medmhp = bar_plot(mh_df, 'StateAbbr', 'median(MHLTH_AdjPrev)', 'State', 'Median Mental Health Prevalence(%)', 'Median Mental Health Prevalence(%) by State', width = 800, height = 200, color='Navy')
state_medmhp

The three states with the highest median mental health prevalence are Ohio (OH), New Jersey (NJ), and Mississippi (MS).
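
The mean and median rankings differ because the mean is pulled toward unusually low or high cities, while the median is not. As a small illustration (a sketch on the five Alabama rows from the `head()` output above, not the full state data), Hoover's comparatively low value pulls the mean below the median:

```python
import statistics

# MHLTH_AdjPrev for the five Alabama cities shown in the head() output
al_adj_prev = {
    "Birmingham": 15.6,
    "Hoover": 10.4,
    "Huntsville": 13.4,
    "Mobile": 15.0,
    "Montgomery": 14.8,
}

values = list(al_adj_prev.values())
mean_prev = statistics.mean(values)
median_prev = statistics.median(values)

print(f"mean:   {mean_prev:.2f}")
print(f"median: {median_prev:.2f}")
```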

## Choropleth Maps
Choropleth maps are created using GeoPandas and Folium. \
GeoPandas handles dataframes that contain geographic information and can output GeoJSON, which is essential for Folium to delineate regional boundaries accurately. \
Folium builds interactive Leaflet maps, so users can pan and zoom to examine details thoroughly. \
For more information, see the [GeoPandas documentation](https://geopandas.org/en/stable/index.html) and the [Folium choropleth guide](https://python-visualization.github.io/folium/latest/user_guide/geojson/choropleth.html).
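
To make the GeoJSON step concrete, the sketch below builds a minimal GeoJSON `FeatureCollection` of point features using only the standard library. It assumes the `Geolocation` values are `[latitude, longitude]` strings, as they appear in the `head()` output above. The helper name `to_point_feature` is hypothetical, and the notebook's own `convert_geodf` and `export_geojson` may work differently (the latter is applied to city boundary data rather than points). Note that GeoJSON stores coordinates as `[longitude, latitude]`, the reverse of the order in the dataset.

```python
import json

def to_point_feature(name, geolocation, properties=None):
    """Build a GeoJSON Point feature from a '[lat, lon]' string."""
    lat, lon = json.loads(geolocation)  # the string is a valid JSON array
    return {
        "type": "Feature",
        # GeoJSON coordinate order is longitude first, then latitude
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"PlaceName": name, **(properties or {})},
    }

# Two rows taken from the head() output
cities = [
    ("Birmingham", "[33.5275663773, -86.7988174678]", {"MHLTH_AdjPrev": 15.6}),
    ("Hoover", "[33.3767602729, -86.8051937568]", {"MHLTH_AdjPrev": 10.4}),
]

feature_collection = {
    "type": "FeatureCollection",
    "features": [to_point_feature(n, g, p) for n, g, p in cities],
}

geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text[:200])
```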

In [7]:
# convert mh_df to a GeoDataFrame
gdf = convert_geodf(mh_df)

# create choropleth map at state level, centered near the geographic center of the contiguous US
m_state = choropleth_map(gdf, 37.0902, -95.7129, city_level = False, start = 4)
display(m_state)


In [8]:
# compute the centroid of a chosen state
lat,lon = geo_centroid(gdf, ['CA'])

# load city boundary data and output geojson file for map
city_bound_df = load_shp_convert_geo(city_path)
export_geojson(city_bound_df)

# create choropleth map of chosen state
m = choropleth_map(gdf, lat, lon)
display(m)

Output geojson file only contains data of the following states: ['CA']
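
One simple way to obtain a map center for a chosen state is to average the coordinates of its cities. The sketch below does this for the five Alabama rows from the `head()` output. It illustrates the general idea only: the actual `geo_centroid` implementation is not shown in this notebook and may instead use a geometric centroid from GeoPandas. The function name `mean_center` is hypothetical.

```python
def mean_center(points):
    """Return the arithmetic mean (lat, lon) of a list of (lat, lon) pairs."""
    if not points:
        raise ValueError("need at least one point")
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return lat, lon

# (latitude, longitude) of the five Alabama cities in the head() output
al_points = [
    (33.5275663773, -86.7988174678),
    (33.3767602729, -86.8051937568),
    (34.6989692671, -86.6387042882),
    (30.6776248648, -88.1184482714),
    (32.3472645333, -86.2677059552),
]

center_lat, center_lon = mean_center(al_points)
print(f"map center: {center_lat:.4f}, {center_lon:.4f}")
```

Averaging raw latitude and longitude is adequate for cities within a single state, but it becomes unreliable for regions that span the antimeridian or cover very large areas.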
