# Goal
## Original Questions
- Can we use Chronic Health Conditions to accurately predict Health Care Access?
- Are there Demographic clusters that are disproportionately affected by Chronic Health Conditions?
- Can unsupervised learning methods reveal distinct clusters that account for the bulk of Chronic Health Conditions?

### Questions:
- I got a bit hung up because, as worded, questions 2 and 3 seem to be asking the same thing.
- Were there established chronic health conditions?
- Established demographic features?
- What features are used in RQ1 and RQ3?

#### Rough Plan:
Refined question: are there diagnostically useful demographic clusters that indicate chronic health conditions?
- Are there demographic clusters that strongly indicate certain chronic health conditions?
- Can we predict chronic health conditions from demographics, and how does an ML model compare with simpler cluster membership?

##### Part 1: Clustering
- Cluster the demographic features of the BRFSS data
- Visualize clusters and prevalence of chronic health conditions within each cluster
    - Which chronic health conditions to use?
    - VISUAL: Clustering results
    - VISUAL: Heatmaps of cluster membership vs chronic health conditions
- Run statistical tests to determine if certain clusters are significantly more affected by chronic health conditions
    - Translation - test the strength of correlation between cluster membership and chronic health conditions
    - Does being a member of a cluster correlate with having a chronic health condition?
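
The association test in the last bullet could be sketched as below. This is a minimal illustration, not the project's chosen method: it computes a chi-square statistic and Cramér's V (an effect size between 0 and 1) from a cluster label and a yes/no condition flag per respondent. The function name `chi_square_cramers_v` and the toy data are hypothetical. If a p-value is wanted as well, `scipy.stats.chi2_contingency` provides one; in very large samples p-values tend to be tiny even for weak associations, which is why an effect size is shown here.

```python
from collections import Counter
from math import sqrt

def chi_square_cramers_v(labels, outcomes):
    """Return (chi2, cramers_v) for two equal-length categorical sequences."""
    n = len(labels)
    observed = Counter(zip(labels, outcomes))  # cell counts of the contingency table
    row_totals = Counter(labels)
    col_totals = Counter(outcomes)

    chi2 = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            chi2 += (observed[(row, col)] - expected) ** 2 / expected

    # Cramér's V rescales chi2 to [0, 1] so tables of different sizes are comparable
    k = min(len(row_totals), len(col_totals)) - 1
    cramers_v = sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, cramers_v

# Toy data (made up): cluster label and condition flag for eight respondents
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
has_condition = [1, 1, 1, 0, 0, 0, 0, 1]
chi2, v = chi_square_cramers_v(clusters, has_condition)
print(f"chi2={chi2:.3f}, Cramér's V={v:.3f}")
```

With real data, `clusters` would be the cluster labels from Part 1 and `has_condition` a binary column for one chronic condition, with the test repeated per condition.
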
##### Part 2: Predicting Chronic Health Conditions from Demographics with a DL Model (Inversion of Question 3)
- Use cluster labels as features?
- Compare the performance of a deep learning model against a simpler cluster-membership baseline
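
One way to read "simpler cluster membership" is as a lookup baseline: predict, for every respondent, the most common outcome in their cluster. The sketch below assumes that reading. All function names and the toy data are hypothetical, and accuracy is used only for brevity; any metric could be swapped in.

```python
from collections import Counter, defaultdict

def fit_cluster_baseline(train_clusters, train_outcomes):
    """Map each cluster label to the most common outcome seen in training."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(train_clusters, train_outcomes):
        by_cluster[c][y] += 1
    # Fallback for cluster labels that never appear in the training data
    fallback = Counter(train_outcomes).most_common(1)[0][0]
    lookup = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    return lookup, fallback

def predict_cluster_baseline(lookup, fallback, clusters):
    """Predict the stored majority outcome for each cluster label."""
    return [lookup.get(c, fallback) for c in clusters]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (made up): cluster labels and a yes/no condition flag
train_clusters = [0, 0, 0, 1, 1, 1]
train_outcomes = [1, 1, 0, 0, 0, 1]
test_clusters = [0, 1, 1, 0]
test_outcomes = [1, 0, 1, 1]

lookup, fallback = fit_cluster_baseline(train_clusters, train_outcomes)
preds = predict_cluster_baseline(lookup, fallback, test_clusters)
baseline_acc = accuracy(test_outcomes, preds)
print(f"Cluster-membership baseline accuracy: {baseline_acc:.2f}")
```

Scoring the deep learning model on the same held-out rows with the same metric would then give a like-for-like comparison.
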



### Misc
- potential 'linchpin' variables given we are clustering on demographics (and ran random forest on demographics)


In [None]:
import logging
import os
import subprocess
import sys

from IPython import get_ipython

try:
    import kmodes
    print("kmodes already installed")
except ImportError:
    print("Installing kmodes...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "kmodes"])
    import kmodes
    print("kmodes installed successfully")

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def is_colab():
    return 'google.colab' in str(get_ipython())

# Set up environment and paths
if is_colab():
    print("Running in Google Colab")

    # Clone the repository if not already cloned
    if not os.path.exists('dat490'):
        print("Cloning repository...")
        subprocess.run(['git', 'clone', 'https://github.com/sksizer/dat490.git'], check=True)
        print("Repository cloned successfully")

    # Add the repository to Python path for imports
    sys.path.insert(0, '/content/dat490')

    # Set paths to use data from the cloned repository
    BFRSS_DATA_PATH = 'dat490/data/LLCP2023.parquet'
    BFRSS_CODEBOOK_PATH = 'dat490/data/codebook_USCODE23_LLCP_021924.HTML'
    BFRSS_DESC_PATH = 'dat490/data/LLCP2023_desc.parquet'  # Additional metadata file if needed
else:
    print("Running in local environment")

    # Add parent directory to path for dat490 module imports
    sys.path.insert(0, os.path.abspath('..'))

    # Use local data paths
    BFRSS_DATA_PATH = '../data/LLCP2023.parquet'
    BFRSS_CODEBOOK_PATH = '../data/codebook_USCODE23_LLCP_021924.HTML'
    BFRSS_DESC_PATH = '../data/LLCP2023_desc.parquet'  # Additional metadata file if needed

# Verify files exist
print(f"\nData path: {BFRSS_DATA_PATH}")
print(f"Codebook path: {BFRSS_CODEBOOK_PATH}")

if not os.path.exists(BFRSS_DATA_PATH):
    raise FileNotFoundError(f"Data file not found at {BFRSS_DATA_PATH}")

if not os.path.exists(BFRSS_CODEBOOK_PATH):
    raise FileNotFoundError(f"Codebook file not found at {BFRSS_CODEBOOK_PATH}")

print("\nAll required files found!")
logger.info('Environment setup complete')

##################################
# Load BRFSS data and metadata using the new wrapper
from dat490 import load_bfrss

# Single function call to load everything
# exclude_desc_columns=True will exclude _DESC columns from metadata generation
bfrss = load_bfrss(exclude_desc_columns=True)

# Get a copy of the raw DataFrame
bfrss_raw_df = bfrss.cloneDF()
bfrss_raw_df.info()

DEMOGRAPHIC_FEATURE_COLUMNS = [
    # Demographic columns (16 total: 12 survey questions plus 4 calculated variables)
    'MARITAL',    # https://singular-eclair-6a5a16.netlify.app/columns/MARITAL
    'EDUCA',      # https://singular-eclair-6a5a16.netlify.app/columns/EDUCA
    'RENTHOM1',   # https://singular-eclair-6a5a16.netlify.app/columns/RENTHOM1
    'NUMHHOL4',   # https://singular-eclair-6a5a16.netlify.app/columns/NUMHHOL4
    'NUMPHON4',   # https://singular-eclair-6a5a16.netlify.app/columns/NUMPHON4
    'CPDEMO1C',   # https://singular-eclair-6a5a16.netlify.app/columns/CPDEMO1C
    'VETERAN3',   # https://singular-eclair-6a5a16.netlify.app/columns/VETERAN3
    'EMPLOY1',    # https://singular-eclair-6a5a16.netlify.app/columns/EMPLOY1
    'CHILDREN',   # https://singular-eclair-6a5a16.netlify.app/columns/CHILDREN
    'INCOME3',    # https://singular-eclair-6a5a16.netlify.app/columns/INCOME3
    'PREGNANT',   # https://singular-eclair-6a5a16.netlify.app/columns/PREGNANT
    'SEXVAR',     # https://singular-eclair-6a5a16.netlify.app/columns/SEXVAR
    '_HISPANC',   # https://singular-eclair-6a5a16.netlify.app/columns/_HISPANC # Calculated but not sure from what
    '_CRACE1',    # https://singular-eclair-6a5a16.netlify.app/columns/_CRACE1 # Child race
    '_IMPRACE',   # https://singular-eclair-6a5a16.netlify.app/columns/_IMPRACE
    '_AGE80',     # https://singular-eclair-6a5a16.netlify.app/columns/_AGE80
]




# K-Modes Analysis of BRFSS Data

K-Modes clustering is an extension of K-Means designed for categorical data. Instead of using means to define cluster centers, K-Modes uses modes (the most frequent value of each attribute) and measures dissimilarity between two records as the number of attributes on which they differ.
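
The two ideas above, mismatch counting and mode-based centers, can be illustrated with a small standalone sketch. It is not the `kmodes` library's implementation. The helper names `matching_dissimilarity` and `column_modes` and the toy records are hypothetical.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Return the most frequent value in each column (the cluster's mode)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

# Toy records (made up): (marital status, education, home ownership)
cluster_members = [
    ("married", "college", "own"),
    ("married", "high school", "own"),
    ("single", "college", "own"),
]

center = column_modes(cluster_members)
distances = [matching_dissimilarity(r, center) for r in cluster_members]
print("Cluster mode:", center)
print("Mismatches to mode:", distances)
```

K-Modes alternates between these two steps: assign each record to the center with the fewest mismatches, then recompute each center as the column-wise mode of its members.
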

In [None]:
import pandas as pd
from kmodes.kmodes import KModes

def kmode_analysis(df: pd.DataFrame):
    """Cluster the demographic columns with K-Modes and add a 'Clusters' column to df."""
    km = KModes(n_clusters=2, init='Huang', n_init=5, verbose=1)
    data = df[DEMOGRAPHIC_FEATURE_COLUMNS].copy()

    # Data preprocessing: impute NaN values column by column
    for col in data.columns:
        if data[col].dtype == 'object' or data[col].dtype.name == 'category':
            # Fill NaN with the mode for categorical columns
            data[col] = data[col].fillna(data[col].mode()[0])
        else:
            # For numerical columns, fill NaN with the mean (if any).
            # NOTE: if coded categorical columns are loaded as numeric, the mean
            # is not a valid code and K-Modes will treat it as its own category.
            data[col] = data[col].fillna(data[col].mean())

    # Fit the model and attach the labels (assignment avoids chained inplace fillna)
    clusters = km.fit_predict(data)
    df['Clusters'] = clusters
    data['Clusters'] = clusters

    # Print the cluster centroids
    print("Cluster centroids:")
    print(km.cluster_centroids_)

    # View the data with cluster labels
    print(data)

kmode_analysis(bfrss_raw_df)