# Data Cleaning and Merge

This notebook cleans the mental health and greenspace datasets, merges them, and summarizes key findings. It outputs the merged dataframe, labeled with regions and divisions, as `merged_df.csv` (temporary name) in the `data/cleaned/` directory.

## Content
- Cleaning Mental Health Dataset
- Cleaning Greenspace Dataset
- Merging Mental Health and Greenspace Dataset
- Adding Regions and Divisions Labels
- Output to CSV

In [9]:
import file_path as fp

# pandas is required by load_file_df below
import pandas as pd
import plotly.express as px
import altair as alt
import os

In [2]:
fp.mh_file

'../data/raw_data/500_Cities__City-level_Data__GIS_Friendly_Format___2017_release_20240514.csv'

## Cleaning Mental Health and Key Findings

We will clean the raw mental health dataset, explain the meaning of its important columns, and present our key findings about it.


In [3]:
def load_file_df(file_path):
    """
    Load the file from the file path.
    Return the dataframe.
    """
    df = pd.read_csv(file_path)
    return df

mh_raw = load_file_df(fp.mh_file)
mh_raw.head(2)

Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,ACCESS2_CrudePrev,ACCESS2_Crude95CI,ACCESS2_AdjPrev,ACCESS2_Adj95CI,ARTHRITIS_CrudePrev,ARTHRITIS_Crude95CI,...,SLEEP_Adj95CI,STROKE_CrudePrev,STROKE_Crude95CI,STROKE_AdjPrev,STROKE_Adj95CI,TEETHLOST_CrudePrev,TEETHLOST_Crude95CI,TEETHLOST_AdjPrev,TEETHLOST_Adj95CI,Geolocation
0,AL,Birmingham,107000,212237,19.6,"(19.2, 20.0)",19.8,"(19.5, 20.2)",30.9,"(30.8, 31.1)",...,"(46.6, 47.0)",5.2,"( 5.1, 5.3)",5.2,"( 5.1, 5.2)",26.1,"(25.1, 27.2)",25.9,"(25.0, 26.9)","(33.52756637730, -86.7988174678)"
1,AL,Hoover,135896,81619,9.7,"( 9.3, 10.1)",9.9,"( 9.5, 10.4)",25.3,"(25.0, 25.7)",...,"(34.2, 35.0)",2.2,"( 2.1, 2.3)",2.2,"( 2.1, 2.2)",9.6,"( 8.6, 10.8)",9.5,"( 8.5, 10.9)","(33.37676027290, -86.8051937568)"


In [4]:
def mh_remove_chronics(df, remove_key_words=("Crude", "Adj"), mh_key_words="MH"):
    """
    Drop every column whose name contains a word in remove_key_words,
    unless the name also contains mh_key_words (the mental health columns).
    Return a new dataframe; the input dataframe is left unchanged.
    """
    remove_lst = [
        x
        for x in df.columns
        if any(word in x for word in remove_key_words) and mh_key_words not in x
    ]
    # drop() returns a new dataframe, so no explicit copy is needed
    return df.drop(columns=remove_lst)

mh_data = mh_remove_chronics(mh_raw)
mh_data.head(2)


Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,MHLTH_CrudePrev,MHLTH_Crude95CI,MHLTH_AdjPrev,MHLTH_Adj95CI,Geolocation
0,AL,Birmingham,107000,212237,15.6,"(15.4, 15.8)",15.6,"(15.4, 15.8)","(33.52756637730, -86.7988174678)"
1,AL,Hoover,135896,81619,10.4,"(10.1, 10.7)",10.4,"(10.1, 10.7)","(33.37676027290, -86.8051937568)"
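
To make the selection rule easier to check, here is a minimal sketch of the same keyword logic applied to a plain list of column names. The helper name `columns_to_drop` and the shortened column list are illustrative assumptions, not part of the notebook.

```python
def columns_to_drop(columns, remove_key_words=("Crude", "Adj"), keep_key_word="MH"):
    """Return names that contain a removal keyword but not the keep keyword."""
    return [
        col
        for col in columns
        if any(word in col for word in remove_key_words) and keep_key_word not in col
    ]


# Shortened, illustrative subset of the raw column names
columns = [
    "StateAbbr",
    "PlaceName",
    "ACCESS2_CrudePrev",
    "ACCESS2_Adj95CI",
    "MHLTH_CrudePrev",
    "MHLTH_AdjPrev",
    "Geolocation",
]

dropped = columns_to_drop(columns)
kept = [col for col in columns if col not in dropped]

print("dropped:", dropped)
print("kept:", kept)
```

Columns that contain neither keyword, such as `StateAbbr` and `Geolocation`, are kept, and any column containing `MH` survives even though it also contains `Crude` or `Adj`.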


The dataset has been refined by excluding the other chronic diseases, resulting in a dataframe focused on mental health. The remaining features are explained in the table below:

|Features|Type|Meaning|
|--|--|--|
|StateAbbr|Plain Text|State abbreviation|
|PlaceName|Plain Text|City name|
|PlaceFIPS|Number|City FIPS Code|
|Population2010|Number|2010 Census population count|
|MHLTH_CrudePrev|Number|Crude prevalence of poor mental health for 14 days or more among adults aged 18 years and older, 2015. <br> Crude prevalence represents the ratio of the total number of responses of 'not good' to the total number of valid responses (excluding those who refused to answer, provided no response, or indicated 'don’t know/not sure').|
|MHLTH_Crude95CI|Plain Text|Estimated 95% confidence interval for crude prevalence|
|MHLTH_AdjPrev|Number|Age-adjusted prevalence, standardized by the direct method to the year 2000 standard U.S. population, distribution 9. `[1]` |
|MHLTH_Adj95CI|Plain Text|Estimated 95% Confidence interval for age-adjusted prevalence|
|Geolocation|Plain Text|Latitude, longitude of city centroid|

Further cleaning and manipulation will be necessary as some features are less useful or stored in an incorrect format:

Removing Features:

- PlaceFIPS: We will use `PlaceName` (city name) as the primary key, so this column is less important.
- MHLTH_CrudePrev, MHLTH_Crude95CI: We will use age-adjusted prevalence because it represents standardized prevalence.

Transforming Format:

- Geolocation: Geolocation needs to be converted into a list of two floats representing latitude and longitude.

`[1]` The direct method, aligned with the year 2000 standard U.S. population distribution 9, is a statistical technique that adjusts for age differences by assigning a different weight to each age group. The Department of Health and Human Services (DHHS) mandates this method across all its agencies to make age-adjusted rates more comparable among data systems [(reference)](https://www.cdc.gov/places/measure-definitions/health-status/index.html#mental-health). Distribution 9 indicates that this age-adjusted prevalence uses the weighting factors provided by Distribution 9. For more information about the weights, see [page 3](https://www.cdc.gov/nchs/data/statnt/statnt20.pdf).
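
The weighting idea behind the direct method can be sketched in a few lines. The age groups, prevalences, and weights below are hypothetical placeholders chosen for easy arithmetic; they are not the actual Distribution 9 weights, which are listed in the linked CDC document.

```python
def weighted_prevalence(age_specific_prev, weights):
    """Weighted average of age-specific prevalences (in percent)."""
    if set(age_specific_prev) != set(weights):
        raise ValueError("prevalences and weights must cover the same age groups")
    total_weight = sum(weights.values())
    weighted_sum = sum(age_specific_prev[g] * weights[g] for g in weights)
    return weighted_sum / total_weight


# Hypothetical age-specific prevalence of poor mental health in one city (%)
city_prev = {"18-44": 14.0, "45-64": 12.0, "65+": 8.0}

# Hypothetical standard population shares (NOT the real Distribution 9 weights)
standard_weights = {"18-44": 0.5, "45-64": 0.3, "65+": 0.2}

# Hypothetical age composition of the city itself (a younger population)
city_weights = {"18-44": 0.7, "45-64": 0.2, "65+": 0.1}

# Weighting by the city's own age mix gives the crude prevalence;
# weighting by the standard population gives the age-adjusted prevalence.
crude = weighted_prevalence(city_prev, city_weights)
adjusted = weighted_prevalence(city_prev, standard_weights)

print(round(crude, 1), round(adjusted, 1))
```

In this toy case the crude value is higher than the adjusted one only because the city skews young. Standardizing to a common age distribution removes that effect, which is why the adjusted column is the one used for comparing cities.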

In [6]:
def mh_clean_transfrom(
    df, col_lst=["MHLTH_CrudePrev", "MHLTH_Crude95CI"], trans_col="Geolocation"
):
    """
    Return a new dataframe with the columns in col_lst removed and the
    trans_col column (Geolocation) converted from a '(lat, lon)' string
    to a list of two floats: [latitude, longitude].
    """
    new_df = df.drop(columns=col_lst).copy()
    new_df[trans_col] = new_df[trans_col].apply(
        lambda x: x.replace("(", "").replace(")", "")
    )
    new_df[trans_col] = new_df[trans_col].apply(lambda x: x.split(","))
    new_df[trans_col] = new_df[trans_col].apply(lambda x: [float(x[0]), float(x[1])])
    return new_df

mh_cleaned = mh_clean_transfrom(mh_data)
mh_cleaned.head(2)

Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,MHLTH_AdjPrev,MHLTH_Adj95CI,Geolocation
0,AL,Birmingham,107000,212237,15.6,"(15.4, 15.8)","[33.5275663773, -86.7988174678]"
1,AL,Hoover,135896,81619,10.4,"(10.1, 10.7)","[33.3767602729, -86.8051937568]"
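
The confidence interval column is still stored as plain text in the same `(a, b)` layout as the raw `Geolocation` strings. As a hedged alternative to chaining `replace` and `split`, a single helper could parse both kinds of string. The name `parse_pair` is hypothetical, and the sketch assumes every value is a well-formed two-number tuple.

```python
import ast


def parse_pair(text):
    """Parse a string such as '(33.5, -86.8)' into a list of two floats."""
    value = ast.literal_eval(text.strip())
    if not (isinstance(value, tuple) and len(value) == 2):
        raise ValueError(f"expected a pair of numbers, got: {text!r}")
    return [float(value[0]), float(value[1])]


# Strings copied from the sample rows shown above
print(parse_pair("(33.52756637730, -86.7988174678)"))
print(parse_pair("( 5.1, 5.3)"))
```

With pandas, the helper could be applied column-wise, for example `df["MHLTH_Adj95CI"].apply(parse_pair)`. This is an illustration of the idea and not a step the notebook performs.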


In [11]:
# presenting top 5 cities with highest MHLTH_AdjPrev
# the higher the MHLTH_AdjPrev, the worse the mental health condition
mh_cleaned.sort_values(by="MHLTH_AdjPrev", ascending=False).head(5)

Unnamed: 0,StateAbbr,PlaceName,PlaceFIPS,Population2010,MHLTH_AdjPrev,MHLTH_Adj95CI,Geolocation
275,MA,New Bedford,2545000,95072,18.3,"(18.0, 18.6)","[41.6712667258, -70.9441204537]"
271,MA,Fall River,2523000,88857,18.2,"(17.8, 18.5)","[41.7139907598, -71.0996396919]"
279,MA,Springfield,2567000,153060,17.5,"(17.3, 17.7)","[42.1154977999, -72.5395254143]"
390,PA,Reading,4263624,88082,17.4,"(17.1, 17.6)","[40.3399678686, -75.9266128837]"
285,MI,Flint,2629000,102434,17.4,"(17.2, 17.6)","[43.0236339386, -83.6920640313]"


In [7]:
def mh_plotly_treemap(
    df,
    path_lst=[px.Constant("US"), "StateAbbr", "PlaceName"],
    color="MHLTH_AdjPrev",
    values="MHLTH_AdjPrev",
    style="Blues",
    title="Mental Health Prevalence by State and City",
    width=1000,
    height=600,
):
    """
    Return a plotly treemap figure

    Parameters:
        df: dataframe
        path_lst: list, the hierarchy of the treemap from root to leaf; use px.Constant() to add a constant parent node
        color: str, the column name for color
        values: str, the column name for values
        style: str, the color style
        title: str, the title of the treemap
        width: int, the width of the treemap
        height: int, the height of the treemap
    """
    fig = px.treemap(
        df,
        path=path_lst,
        values=values,
        title=title,
        color=color,
        color_continuous_scale=style,
        width=width,
        height=height,
    )
    fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
    return fig

# treemap to present an overview of the mental health prevalence by state and city
mh_treemap = mh_plotly_treemap(mh_cleaned)
mh_treemap.show()

### Key Findings

- **Top 5 Cities with the Highest Prevalence of Poor Mental Health**: New Bedford, MA (18.3%), Fall River, MA (18.2%), Springfield, MA (17.5%), Reading, PA (17.4%), and Flint, MI (17.4%).

- **State-Level Analysis**: While Massachusetts (MA) might intuitively seem the most affected state, it actually ranks second, with an average prevalence of poor mental health of 15.06%. Ohio (OH) ranks highest, with an average prevalence of 15.37%.

- **Distribution of Severe Cases**: Massachusetts has a higher concentration of cities with severe mental health challenges; three out of the top 13 cities have prevalences over 17%. In contrast, Ohio has only one city above this threshold.

- **Impact of Sample Variation**: The inclusion of cities like Newton, MA (9.2%), which has a lower prevalence, impacts the average for Massachusetts. This demonstrates how city selection can significantly affect state-level averages and potentially introduce biases if not considered carefully.

- **Potential Bias**: Aggregating city-level data to a larger geographic scale, such as the state, might introduce selection bias because each state is represented only by the cities included in the dataset. This is a significant limitation of this study.
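
The effect of city selection on a state average can be illustrated with a small sketch. It uses only the six cities named in the findings above, so the averages it prints will not match the full-dataset figures. The helper names `state_means` and `count_above` are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# (state, city, MHLTH_AdjPrev) for the cities mentioned in the findings only
cities = [
    ("MA", "New Bedford", 18.3),
    ("MA", "Fall River", 18.2),
    ("MA", "Springfield", 17.5),
    ("MA", "Newton", 9.2),
    ("PA", "Reading", 17.4),
    ("MI", "Flint", 17.4),
]


def state_means(rows):
    """Average prevalence per state over the given rows."""
    by_state = defaultdict(list)
    for state, _city, prev in rows:
        by_state[state].append(prev)
    return {state: mean(values) for state, values in by_state.items()}


def count_above(rows, threshold=17.0):
    """Number of cities per state with prevalence above the threshold."""
    counts = defaultdict(int)
    for state, _city, prev in rows:
        if prev > threshold:
            counts[state] += 1
    return dict(counts)


with_newton = state_means(cities)
without_newton = state_means([row for row in cities if row[1] != "Newton"])

print(round(with_newton["MA"], 2), round(without_newton["MA"], 2))
print(count_above(cities))
```

On the full dataframe, the comparable pandas call would be `mh_cleaned.groupby("StateAbbr")["MHLTH_AdjPrev"].mean()`. The plain-Python version is used here only so the sketch runs without the notebook's data.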