# Health Risk Assessment for UIU Project

This notebook documents the full workflow for analysing the provided health screening dataset. It covers exploratory analysis, data quality checks, preprocessing, risk scoring, predictive modelling, clustering-based segmentation, and actionable insights.

## Objectives
- Understand the structure and quality of the raw screening data.
- Design and justify a preprocessing and feature engineering pipeline that is robust to missingness and noise.
- Quantify household and regional health risks and link them to socioeconomic signals.
- Train and evaluate a high-recall model that flags high-risk individuals for clinical follow-up.
- Segment the population with clustering and validate that high-risk clusters exhibit abnormal clinical indicators.
- Deliver reproducible outputs: processed dataset, modelling artefacts, visual insights, and a technical summary.

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
from sklearn.metrics import confusion_matrix

from health_risk_pipeline import (
    MODEL_DECISION_THRESHOLD,
    aggregate_risk,
    cleanse_dataframe,
    compute_correlations,
    compute_risk_components,
    load_raw_data,
    perform_clustering,
    train_predictive_model,
)

pd.set_option('display.max_columns', 60)
pd.set_option('display.precision', 2)

# Configure Plotly renderer to avoid missing-renderer warnings in headless or IDE environments.
for renderer in (
    'notebook_connected',
    'vscode',
    'browser',
    'iframe',
    'colab',
):
    try:
        pio.renderers.default = renderer
        break
    except ValueError:
        continue
else:
    for fallback in ('png', 'json'):
        try:
            pio.renderers.default = fallback
            break
        except ValueError:
            continue


## 1. Data Loading & Early Exploration

In [2]:
RAW_PATH = Path('test-dataset.xlsx - test data.csv')
# RAW_PATH is not passed below; load_raw_data() is assumed to resolve its own default path.
raw_df = load_raw_data()
print(f'Records: {raw_df.shape[0]:,} | Columns: {raw_df.shape[1]}')
raw_df.head()

Records: 29,999 | Columns: 34


Unnamed: 0.1,Unnamed: 0,household_id,total_income,union_name,user_id,profile_name,father_name,mother_name,birthday,age,gender,is_poor,is_freedom_fighter,had_stroke,has_cardiovascular_disease,disabilities_name,diabetic,profile_hypertensive,SYSTOLIC,DIASTOLIC,RESULT_STAT_BP,HEIGHT,WEIGHT,BMI,RESULT_STAT_BMI,SUGAR,TAG_NAME,RESULT_STAT_SUGAR,PULSE_RATE,RESULT_STAT_PR,SPO2,RESULT_STAT_SPO2,MUAC,RESULT_STAT_MUAC
0,1,241175,Lower class,KOLA,988794,মো: সাগরহোসেন,0.0,0.0,2001-11-05 18:00:00,19,Male,0,0,0,0,0,False,False,130.0,84.0,Prehypertension,,,,,,,,96.0,Normal,97.0,Normal,,
1,2,241176,Lower class,KOLA,988796,মোছা:তামান্না,0.0,0.0,2000-06-18 18:00:00,20,Female,0,0,0,0,0,False,False,148.0,74.0,Mild High,,,,,,,,89.0,Normal,,,,
2,3,241179,Lower class,KOLA,988802,শুকুরুচন্দ্র,0.0,0.0,1978-06-04 18:00:00,42,Male,0,0,0,0,0,False,False,121.0,75.0,Prehypertension,,,,,,,,69.0,Normal,,,,
3,4,241180,Lower class,KOLA,988807,দিপালীরাণী,0.0,0.0,1956-02-02 18:00:00,64,Female,0,0,0,0,0,False,False,111.0,64.0,Normal,,,,,8.72,Random,Normal,85.0,Normal,,,,
4,5,241181,Lower class,KOLA,988809,বুলবুলি,0.0,0.0,1996-12-25 18:00:00,23,Female,0,0,0,0,0,False,False,123.0,66.0,Prehypertension,,,,,,,,101.0,High,,,,


In [3]:
# Summary statistics for numeric columns
raw_df.describe().transpose().head(10)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,29999.0,15000.0,8660.11,1.0,7500.5,15000.0,22500.0,30000.0
household_id,29999.0,221000.0,158927.21,12300.0,81244.5,219654.0,280000.0,786000.0
user_id,29999.0,978000.0,808133.04,96804.0,351611.0,905099.0,1120000.0,4040000.0
age,29999.0,38.6,17.49,0.0,26.0,37.0,50.0,120.0
is_poor,29999.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
is_freedom_fighter,29999.0,0.0002,0.01,0.0,0.0,0.0,0.0,1.0
had_stroke,29999.0,0.000767,0.03,0.0,0.0,0.0,0.0,1.0
has_cardiovascular_disease,29999.0,0.00113,0.03,0.0,0.0,0.0,0.0,1.0
SYSTOLIC,27600.0,122.0,20.86,52.0,107.0,120.0,132.0,227.0
DIASTOLIC,27600.0,76.1,12.28,30.0,68.0,75.0,83.0,144.0


In [4]:
# Missingness profile
missing_summary = (
    raw_df.isna().sum().to_frame(name='missing_count')
    .assign(missing_pct=lambda df_: (df_['missing_count'] / len(raw_df) * 100).round(2))
    .sort_values('missing_pct', ascending=False)
)
missing_summary.head(15)

Unnamed: 0,missing_count,missing_pct
RESULT_STAT_MUAC,29925,99.75
MUAC,29925,99.75
WEIGHT,28871,96.24
BMI,28871,96.24
RESULT_STAT_BMI,28871,96.24
HEIGHT,28871,96.24
RESULT_STAT_SUGAR,28416,94.72
SUGAR,28416,94.72
TAG_NAME,28416,94.72
RESULT_STAT_SPO2,25654,85.52
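
With MUAC, BMI, and sugar missing for well over 90% of records, it can help to record explicitly whether each vital was measured, so that later scoring can tell "not measured" apart from "normal". The sketch below illustrates that idea on a toy frame. `add_measurement_flags` is a hypothetical helper and is not necessarily what `cleanse_dataframe` does.

```python
import numpy as np
import pandas as pd


def add_measurement_flags(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: add a 0/1 `<col>_measured` flag for each listed column."""
    out = df.copy()
    for col in columns:
        out[f"{col.lower()}_measured"] = out[col].notna().astype(int)
    return out


# Toy frame mirroring the sparsity pattern above (values are illustrative only).
toy = pd.DataFrame(
    {
        "SYSTOLIC": [130.0, 148.0, np.nan],
        "BMI": [np.nan, np.nan, 22.5],
    }
)
flagged = add_measurement_flags(toy, ["SYSTOLIC", "BMI"])
print(flagged)
```

Keeping the flags as separate columns leaves the raw values untouched, so any imputation strategy can still be applied afterwards.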


In [5]:
top_missing = missing_summary.reset_index().rename(columns={'index': 'field'})
fig = px.bar(
    top_missing.head(12),
    x='field',
    y='missing_pct',
    text='missing_pct',
    title='Columns with the Highest Missingness (%)',
)
fig.update_layout(xaxis_tickangle=45)
fig

In [6]:
fig = px.histogram(
    raw_df,
    x='age',
    nbins=40,
    title='Age Distribution of Screened Individuals',
    color_discrete_sequence=['#1f77b4'],
)
fig

## 2. Preprocessing & Feature Engineering

In [7]:
clean_df = cleanse_dataframe(raw_df)
enriched_df = compute_risk_components(clean_df)
print(f'Cleaned records: {enriched_df.shape[0]:,}')
enriched_df[
    ['user_id', 'risk_score', 'risk_level', 'bp_score', 'bmi_score', 'sugar_score', 'spo2_score', 'pulse_score', 'chronic_score']
].head()

Cleaned records: 29,999


Unnamed: 0,user_id,risk_score,risk_level,bp_score,bmi_score,sugar_score,spo2_score,pulse_score,chronic_score
0,988794,10.33,Low,2.0,0.0,0.0,0.0,0.0,0
1,988796,13.0,Low,3.0,0.0,0.0,0.0,0.0,0
2,988802,13.0,Low,2.0,0.0,0.0,0.0,0.0,0
3,988807,12.0,Low,0.0,0.0,0.0,0.0,0.0,0
4,988809,18.0,Low,2.0,0.0,0.0,0.0,4.0,0
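
The component scores above appear to be derived from the categorical status columns: `Normal` rows carry a `bp_score` of 0.0, `Prehypertension` rows 2.0, and the `Mild High` row 3.0, while unmeasured indicators score 0.0. A minimal sketch of such a lookup follows. `score_bp_status` is a hypothetical function, and the points for the two most severe labels are assumptions rather than values confirmed by `compute_risk_components`.

```python
import pandas as pd

# Scores for the first three labels match the output above; the last two are
# illustrative assumptions, not values confirmed by the pipeline.
BP_STATUS_SCORES = {
    "normal": 0.0,
    "prehypertension": 2.0,
    "mild high": 3.0,
    "moderate high": 4.0,  # assumed
    "severe high": 5.0,  # assumed
}


def score_bp_status(status: pd.Series) -> pd.Series:
    """Hypothetical scorer: map BP status labels to points; missing or unknown labels score 0."""
    normalised = status.str.strip().str.lower()
    return normalised.map(BP_STATUS_SCORES).fillna(0.0)


statuses = pd.Series(["Prehypertension", "Mild High", "Normal", None])
bp_score = score_bp_status(statuses)
print(bp_score.tolist())
```

Lower-casing the labels before the lookup keeps the mapping robust to inconsistent capitalisation in the raw status fields.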


In [8]:
risk_counts = (
    enriched_df['risk_level']
    .value_counts()
    .rename_axis('risk_level')
    .reset_index(name='people')
    .sort_values('risk_level')
)
risk_counts

Unnamed: 0,risk_level,people
0,Low,27426
1,Moderate,2492
2,High,81


In [9]:
fig = px.bar(
    risk_counts,
    x='risk_level',
    y='people',
    text='people',
    title='Population by Risk Level',
    color='risk_level',
    color_discrete_map={'Low': '#2ca02c', 'Moderate': '#ff7f0e', 'High': '#d62728'},
)
fig

In [10]:
indicator_summary = (
    enriched_df.groupby('risk_level')[['SYSTOLIC', 'DIASTOLIC', 'BMI', 'SUGAR', 'SPO2', 'PULSE_RATE', 'risk_score']]
    .median()
    .round(2)
)
indicator_summary





risk_level,SYSTOLIC,DIASTOLIC,BMI,SUGAR,SPO2,PULSE_RATE,risk_score
Low,119.0,74.0,21.77,7.1,99.0,83.0,11.33
Moderate,143.0,87.0,26.87,11.46,98.0,86.0,23.0
High,150.0,88.0,,11.83,85.5,81.0,37.0


## 3. Health Risk Analysis

In [11]:
income_summary = (
    enriched_df.groupby('total_income')
    .agg(
        mean_risk=('risk_score', 'mean'),
        high_risk_rate=('is_high_risk', 'mean'),
        population=('user_id', 'count'),
        poverty_share=('is_poor', 'mean'),
    )
    .sort_values('mean_risk', ascending=False)
    .round({'mean_risk': 2, 'high_risk_rate': 4, 'poverty_share': 3})
)
income_summary

total_income,mean_risk,high_risk_rate,population,poverty_share
Lower class,13.65,0.003,19975,0.0
Lower-middle class,12.04,0.002,3052,0.0
Middle class,11.14,0.0019,6262,0.0
Upper class,9.99,0.0056,710,0.0
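
Because high-risk cases are rare, the rates in this table rest on very small counts. As a rough check, the snippet below multiplies each rate by its population to recover the approximate number of high-risk people per income class. The figures are copied from the table above; since the rates are rounded to four decimals, the implied counts are approximate.

```python
import pandas as pd

# Figures copied from the income summary above.
income = pd.DataFrame(
    {
        "high_risk_rate": [0.0030, 0.0020, 0.0019, 0.0056],
        "population": [19975, 3052, 6262, 710],
    },
    index=["Lower class", "Lower-middle class", "Middle class", "Upper class"],
)

# Rates are rounded to four decimals, so the implied counts are approximate.
income["approx_high_risk_people"] = (
    (income["high_risk_rate"] * income["population"]).round().astype(int)
)
print(income)
```

The upper-class rate of 0.0056 corresponds to roughly 4 people out of 710, against roughly 60 in the lower class, so differences in rate between classes should be read with caution.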


In [12]:
fig = px.box(
    enriched_df,
    x='total_income',
    y='risk_score',
    color='total_income',
    points='outliers',
    title='Risk Score Distribution by Income Class',
)
fig

In [13]:
aggregates = aggregate_risk(enriched_df)
household_risk = aggregates['household'].reset_index().head(10)
union_risk = aggregates['union'].reset_index().head(10)
household_risk

Unnamed: 0,household_id,household_size,mean_risk,max_risk,high_risk_ratio
0,167363,1,47.0,47.0,1.0
1,169180,1,47.0,47.0,1.0
2,195158,1,45.33,45.33,1.0
3,228573,1,42.0,42.0,1.0
4,292923,1,40.33,40.33,1.0
5,224659,1,40.33,40.33,1.0
6,177040,1,40.33,40.33,1.0
7,191696,2,39.67,43.0,1.0
8,185907,1,38.67,38.67,1.0
9,337881,1,38.67,38.67,1.0
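
The columns in this table suggest a per-household group-by over `risk_score` and `is_high_risk`. The sketch below shows one way such an aggregation could be written. `aggregate_by_household` is a hypothetical stand-in, not the actual implementation of `aggregate_risk`, and the toy values are illustrative.

```python
import pandas as pd


def aggregate_by_household(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the household part of aggregate_risk."""
    return (
        df.groupby("household_id")
        .agg(
            household_size=("user_id", "count"),
            mean_risk=("risk_score", "mean"),
            max_risk=("risk_score", "max"),
            high_risk_ratio=("is_high_risk", "mean"),
        )
        .sort_values("mean_risk", ascending=False)
        .round(2)
    )


# Toy data: household 1 has two members, household 2 has one (values illustrative).
toy = pd.DataFrame(
    {
        "household_id": [1, 1, 2],
        "user_id": [10, 11, 12],
        "risk_score": [43.0, 36.0, 12.0],
        "is_high_risk": [1, 1, 0],
    }
)
household_risk = aggregate_by_household(toy)
print(household_risk)
```

Nine of the ten households listed above have a single member, so their mean risk is simply one person's score. A minimum household size filter may be worth considering before ranking.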


In [14]:
union_risk

Unnamed: 0,union_name,population,mean_risk,high_risk_ratio,poverty_rate
0,BARUIPARA,717,15.49,0.0112,0.0
1,DHALAHAR,106,14.63,0.0,0.0
2,BILASHBARI,1922,14.2,0.00624,0.0
3,KOLA,1678,14.2,0.00119,0.0
4,MAJITPUR,353,14.15,0.0113,0.0
5,UTHALI,210,13.83,0.0,0.0
6,ANDULBARIA,2118,13.63,0.00142,0.0
7,GABTALI SADAR,935,13.08,0.00749,0.0
8,BARATARA,3700,12.92,0.00135,0.0
9,SAGHATA,1690,12.78,0.00118,0.0


In [15]:
fig = px.bar(
    union_risk,
    x='union_name',
    y='mean_risk',
    color='high_risk_ratio',
    hover_data=['population', 'poverty_rate'],
    text='high_risk_ratio',
    title='Unions with the Highest Mean Health Risk',
    color_continuous_scale='Reds',
)
fig.update_layout(xaxis_tickangle=45)
fig

In [16]:
correlation_matrix = compute_correlations(enriched_df).round(2)
correlation_matrix

Unnamed: 0,income_score,risk_score,bp_score,bmi_score,sugar_score,spo2_score,pulse_score,age,poverty_score
income_score,1.0,-0.25,0.07,-0.02,0.04,-0.03,0.04,0.03,
risk_score,-0.25,1.0,0.62,0.1,0.24,0.1,0.38,0.51,
bp_score,0.07,0.62,1.0,-0.06,0.0,-0.05,0.1,0.21,
bmi_score,-0.02,0.1,-0.06,1.0,-0.02,0.01,-0.02,-0.05,
sugar_score,0.04,0.24,0.0,-0.02,1.0,-0.0,-0.0,0.1,
spo2_score,-0.03,0.1,-0.05,0.01,-0.0,1.0,-0.02,-0.05,
pulse_score,0.04,0.38,0.1,-0.02,-0.0,-0.02,1.0,-0.02,
age,0.03,0.51,0.21,-0.05,0.1,-0.05,-0.02,1.0,
poverty_score,,,,,,,,,
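
The empty `poverty_score` row and column are expected: `is_poor` is 0 for every record, and a constant column has no rank variance, so its correlation with anything is undefined. The toy example below reproduces this behaviour with pandas, assuming `compute_correlations` uses a Spearman correlation as the heatmap title in the next cell indicates. The values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: poverty_score is constant, as is_poor is in the raw data.
toy = pd.DataFrame(
    {
        "risk_score": [10.0, 13.0, 12.0, 18.0, 25.0],
        "age": [19, 20, 42, 64, 23],
        "poverty_score": [0, 0, 0, 0, 0],
    }
)

# Spearman correlation is rank-based; a constant column yields NaN.
corr = toy.corr(method="spearman")
print(corr.round(2))
```

Dropping constant columns before computing the matrix would keep the heatmap free of blank cells.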


In [17]:
fig = px.imshow(
    correlation_matrix,
    text_auto=True,
    title='Spearman Correlation: Socioeconomic vs Clinical Risk Features',
    color_continuous_scale='RdBu',
    zmin=-1,
    zmax=1,
)
fig

## 4. Predictive Modelling: High-Risk Flagging

In [18]:
artifacts = train_predictive_model(enriched_df)
print(f'Model decision threshold for high-risk flag: {MODEL_DECISION_THRESHOLD}')
print(artifacts.report)

Model decision threshold for high-risk flag: 0.05
              precision    recall  f1-score   support

Low/Moderate      0.999     0.991     0.995      5984
        High      0.203     0.812     0.325        16

    accuracy                          0.991      6000
   macro avg      0.601     0.902     0.660      6000
weighted avg      0.997     0.991     0.994      6000
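
A threshold of 0.05 means a person is flagged as high-risk as soon as the model assigns even a small probability to that class, which trades precision for recall. The sketch below shows how such a cut-off could be applied to predicted probabilities. `flag_high_risk` is a hypothetical helper, the constant is a local copy of the printed value, the scores are illustrative, and whether the pipeline compares with `>=` or `>` is an assumption.

```python
import numpy as np

MODEL_DECISION_THRESHOLD = 0.05  # local copy of the value printed above


def flag_high_risk(scores: np.ndarray, threshold: float = MODEL_DECISION_THRESHOLD) -> np.ndarray:
    """Return 1 where the predicted high-risk probability meets the threshold."""
    return (np.asarray(scores) >= threshold).astype(int)


# Illustrative probabilities only; real scores would come from the trained model.
scores = np.array([0.01, 0.04, 0.05, 0.29])
low_cutoff_flags = flag_high_risk(scores)
default_cutoff_flags = flag_high_risk(scores, threshold=0.5)
print(low_cutoff_flags, default_cutoff_flags)
```

With these toy scores, the 0.05 cut-off flags the last two people while a conventional 0.5 cut-off flags nobody, which is why a low threshold suits a rare, high-stakes class.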



In [19]:
cm = confusion_matrix(artifacts.y_test, artifacts.y_pred)
cm_df = pd.DataFrame(
    cm,
    index=['Actual Low/Moderate', 'Actual High'],
    columns=['Predicted Low/Moderate', 'Predicted High'],
)
cm_df

Unnamed: 0,Predicted Low/Moderate,Predicted High
Actual Low/Moderate,5933,51
Actual High,3,13
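
The high-risk metrics in the classification report can be recomputed directly from this confusion matrix, which also makes the review workload explicit. The snippet below uses only the four counts shown above.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm = np.array([[5933, 51], [3, 13]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)  # share of true high-risk cases that were flagged
precision = tp / (tp + fp)  # share of flagged cases that were truly high-risk
reviews_per_true_case = (tp + fp) / tp  # flagged people per confirmed high-risk case

print(f"recall={recall:.3f} precision={precision:.3f} reviews per true case={reviews_per_true_case:.1f}")
```

In other words, 13 of the 16 true high-risk cases are caught, and clinicians would review about five flagged people for each confirmed case.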


In [20]:
fig = px.imshow(
    cm_df,
    text_auto=True,
    title='Confusion Matrix (High-risk vs Others)',
    color_continuous_scale='Blues',
)
fig

In [21]:
top_importances = (
    artifacts.feature_importances.head(20)
    .to_frame(name='importance')
    .reset_index()
    .rename(columns={'index': 'feature'})
)
fig = px.bar(
    top_importances,
    x='importance',
    y='feature',
    orientation='h',
    title='Top Features Driving High-Risk Predictions',
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig

In [22]:
validation_view = enriched_df.loc[artifacts.X_test.index].copy()
validation_view['predicted_high_risk'] = artifacts.y_pred
validation_view['predicted_probability'] = artifacts.y_scores
predicted_high = (
    validation_view[validation_view['predicted_high_risk'] == 1]
    .sort_values('predicted_probability', ascending=False)
)
predicted_high[
    ['user_id', 'risk_score', 'RESULT_STAT_BP', 'RESULT_STAT_SUGAR', 'RESULT_STAT_BMI', 'SYSTOLIC', 'DIASTOLIC', 'BMI', 'SUGAR', 'predicted_probability']
].head(10)

Unnamed: 0,user_id,risk_score,RESULT_STAT_BP,RESULT_STAT_SUGAR,RESULT_STAT_BMI,SYSTOLIC,DIASTOLIC,BMI,SUGAR,predicted_probability
13917,1711351,34.67,normal,,,134.0,79.0,,,0.29
9078,1075168,28.67,normal,,,120.0,80.0,,,0.19
21048,785118,24.67,moderate high,,,168.0,92.0,,,0.15
16260,296822,31.33,moderate high,,,139.0,102.0,,,0.15
21235,793679,26.33,mild high,,,154.0,82.0,,,0.14
15520,1605287,38.0,severe high,,,183.0,103.0,,,0.14
14947,666950,33.0,mild high,,,154.0,90.0,,,0.13
14920,666749,38.67,mild high,high,,143.0,86.0,,10.16,0.13
5618,1130628,24.33,normal,normal,,128.0,78.0,,7.5,0.12
14693,665613,29.67,mild high,,,143.0,92.0,,,0.12


## 5. Clustering & Segmentation

In [23]:
clustered_df, cluster_summary = perform_clustering(enriched_df)
cluster_summary_df = pd.DataFrame.from_dict(cluster_summary, orient='index').round(2)
cluster_summary_df

Unnamed: 0,population,mean_risk,bp_score,bmi_score,sugar_score,spo2_score
Outlier,600,27.6,2.12,0.48,3.04,0.23
High-Risk Group,5111,19.56,2.23,0.01,0.39,0.02
Low-Risk Group,24288,11.11,1.26,0.06,0.0,0.04
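
Cluster ids returned by a clustering algorithm are arbitrary, so readable names such as "High-Risk Group" have to be assigned afterwards, for example by ranking clusters on mean risk. The sketch below illustrates that step with k-means on toy data. The algorithm, features, and values are assumptions: the source does not state how `perform_clustering` works, and the sketch does not cover how the "Outlier" group is identified.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data with two well-separated groups (values illustrative only).
toy = pd.DataFrame(
    {
        "SYSTOLIC": [110, 115, 118, 120, 150, 155, 160, 165],
        "risk_score": [10, 11, 11, 12, 24, 26, 28, 30],
    }
)

# Standardise features so neither dominates the distance computation.
features = StandardScaler().fit_transform(toy[["SYSTOLIC", "risk_score"]])
toy["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Name clusters by their mean risk so labels do not depend on arbitrary cluster ids.
mean_risk = toy.groupby("cluster")["risk_score"].mean()
names = {mean_risk.idxmax(): "High-Risk Group", mean_risk.idxmin(): "Low-Risk Group"}
toy["cluster_label"] = toy["cluster"].map(names)
print(toy.groupby("cluster_label")["risk_score"].agg(["count", "mean"]))
```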


In [24]:
fig = px.scatter(
    clustered_df,
    x='SYSTOLIC',
    y='risk_score',
    color='cluster_label',
    hover_data=['user_id', 'RESULT_STAT_BP', 'RESULT_STAT_SUGAR'],
    opacity=0.7,
    title='Risk vs Systolic Blood Pressure by Cluster',
)
fig

In [25]:
fig = px.histogram(
    clustered_df,
    x='risk_score',
    color='cluster_label',
    nbins=40,
    barmode='overlay',
    opacity=0.65,
    title='Risk Score Distribution Across Clusters',
)
fig

In [26]:
cluster_health_summary = (
    clustered_df.groupby('cluster_label')[['risk_score', 'SYSTOLIC', 'DIASTOLIC', 'BMI', 'SUGAR', 'SPO2']]
    .median()
    .round(2)
)
cluster_health_summary

cluster_label,risk_score,SYSTOLIC,DIASTOLIC,BMI,SUGAR,SPO2
High-Risk Group,19.67,129.0,81.0,22.76,9.84,99.0
Low-Risk Group,11.33,118.0,74.0,21.77,6.59,99.0
Outlier,27.33,135.0,83.5,26.87,12.12,98.0


In [27]:
outlier_examples = clustered_df[clustered_df['cluster_label'] == 'Outlier'][
    ['user_id', 'risk_score', 'SYSTOLIC', 'DIASTOLIC', 'BMI', 'SPO2', 'RESULT_STAT_BP', 'RESULT_STAT_SPO2']
].head(10)
outlier_examples

Unnamed: 0,user_id,risk_score,SYSTOLIC,DIASTOLIC,BMI,SPO2,RESULT_STAT_BP,RESULT_STAT_SPO2
35,988917,27.0,154.0,85.0,,,mild high,
66,989244,32.0,144.0,96.0,,,mild high,
78,989331,28.67,204.0,112.0,,,severe high,
94,989635,28.67,150.0,108.0,,,moderate high,
357,1582617,24.33,147.0,96.0,,96.0,mild high,normal
381,1582805,29.67,180.0,118.0,,,severe high,
446,1583445,27.0,118.0,82.0,,,prehypertension,
518,1003704,22.67,162.0,106.0,,98.0,moderate high,normal
539,1004163,32.0,146.0,98.0,,,mild high,
811,1019266,28.67,196.0,102.0,,,severe high,


## 6. Key Insights
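- High-risk individuals exhibit markedly elevated blood pressure (median 150/88 versus 119/74 in the low-risk group), lower SPO2 (median 85.5 versus 99.0), and higher chronic condition scores, which is consistent with the intent of the custom risk scoring design.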
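- Mean risk scores fall steadily with income class, from 13.65 (lower class) to 9.99 (upper class), and income_score correlates negatively with risk_score (-0.25). The high-risk rate does not follow the same pattern: it is highest in the upper class (0.56%, on only 710 people) and 0.30% in the lower class. Given the limited measurement coverage and an is_poor flag that is 0 for every record, these results point to, but do not establish, socioeconomic drivers of vulnerability.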
- The calibrated random forest (threshold = 0.05) attains 81% recall on the rare high-risk class (13 of 16 test cases) at 20% precision (51 false positives among 5,984 low/moderate test cases), a review load that remains manageable for downstream clinical follow-up.
- Clustering separates consistently healthy profiles from those with abnormal vitals and isolates an outlier group (600 people, mean risk 27.6) with the highest median blood pressure, BMI, and blood sugar, enabling prioritised outreach by health workers.
- Household and union rankings surface geographically concentrated hotspots (e.g., BARUIPARA, BILASHBARI) for targeted intervention and resourcing.