# Feature Engineering
## Early Multi-Disease Risk Prediction System

### Objective
The objective of this notebook is to transform raw health indicators into
meaningful, medically interpretable features that improve model performance
and support early disease risk prediction. Feature engineering focuses on
risk factor aggregation, ordinal encoding, and creation of composite health
risk indices.


In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:.3f}")


In [2]:
# Load raw dataset
df = pd.read_csv("../data/raw/brfss_2015.csv")

# Quick sanity checks: shape and column names
print("Dataset shape:", df.shape)
df.columns


Dataset shape: (253680, 22)


Index(['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')

### STANDARDIZE TARGET VARIABLES

In [None]:
# Create binary diabetes target for early risk prediction
# The diabetes column is encoded as:
# 0 = No diabetes, 1 = Pre-diabetes, 2 = Diabetes
# For early risk prediction, we combine pre-diabetes and diabetes
# into a single binary risk indicator.
df["Diabetes_binary"] = df["Diabetes_012"].apply(lambda x: 1 if x > 0 else 0)

# List final target variables
target_cols = [
    "HeartDiseaseorAttack",
    "Diabetes_binary",
    "Stroke",
    "HighBP"
]

df[target_cols].head()


Unnamed: 0,HeartDiseaseorAttack,Diabetes_binary,Stroke,HighBP
0,0.0,0,0.0,1.0
1,0.0,0,0.0,0.0
2,0.0,0,0.0,1.0
3,0.0,0,0.0,1.0
4,0.0,0,0.0,1.0
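
The binarization above can also be written without `apply`. The sketch below is a minimal illustration on a toy frame (the `toy` DataFrame and its values are assumptions made for the example, not rows from the dataset); it uses a boolean comparison cast to integers, which gives the same 0/1 result as the lambda.

```python
import pandas as pd

# Toy stand-in for the Diabetes_012 column
# (0 = no diabetes, 1 = pre-diabetes, 2 = diabetes); values are illustrative only
toy = pd.DataFrame({"Diabetes_012": [0.0, 1.0, 2.0, 0.0]})

# Vectorized equivalent of apply(lambda x: 1 if x > 0 else 0)
toy["Diabetes_binary"] = (toy["Diabetes_012"] > 0).astype(int)

print(toy["Diabetes_binary"].tolist())
```

On a frame with a few hundred thousand rows, the comparison runs as a single array operation instead of one Python call per row.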


### TARGET DISTRIBUTION RECHECK

In [None]:
# Re-check class distribution after target transformation
# Disease outcomes are the minority class in population health surveys,
# so imbalanced targets are expected here

for col in target_cols:
    print(f"\n{col} distribution:")
    print(df[col].value_counts(normalize=True))



HeartDiseaseorAttack distribution:
0.000   0.906
1.000   0.094
Name: HeartDiseaseorAttack, dtype: float64

Diabetes_binary distribution:
0   0.842
1   0.158
Name: Diabetes_binary, dtype: float64

Stroke distribution:
0.000   0.959
1.000   0.041
Name: Stroke, dtype: float64

HighBP distribution:
0.000   0.571
1.000   0.429
Name: HighBP, dtype: float64
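
One common way to account for this kind of imbalance later in modeling is to weight classes inversely to their frequency. The sketch below is an illustration only: the toy target `y` is made up (its 96/4 split merely resembles the Stroke proportions), and the "balanced" heuristic `n_samples / (n_classes * count)` is one possible choice, not a step taken in this notebook.

```python
import pandas as pd

# Toy binary target with a strong imbalance (illustrative values only)
y = pd.Series([0] * 96 + [1] * 4, name="toy_target")

counts = y.value_counts().sort_index()
n_samples, n_classes = len(y), counts.size

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = (n_samples / (n_classes * counts)).to_dict()

print(class_weights)
```

The rare class receives the larger weight, so errors on it count more during training.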


### SELECT BASE FEATURE SET

In [None]:
# Base features are common health indicators shared across all diseases
# Using the same feature space ensures a unified ML pipeline

base_features = [
    "Age",
    "Sex",
    "BMI",
    "HighChol",
    "CholCheck",
    "Smoker",
    "PhysActivity",
    "HvyAlcoholConsump",
    "Fruits",
    "Veggies",
    "GenHlth",
    "PhysHlth",
    "MentHlth",
    "DiffWalk",
    "AnyHealthcare",
    "NoDocbcCost"
]

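# .copy() makes df_base an independent DataFrame, so adding columns to it
# later does not trigger SettingWithCopyWarning
df_base = df[base_features + target_cols].copy()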

df_base.head()


Unnamed: 0,Age,Sex,BMI,HighChol,CholCheck,Smoker,PhysActivity,HvyAlcoholConsump,Fruits,Veggies,GenHlth,PhysHlth,MentHlth,DiffWalk,AnyHealthcare,NoDocbcCost,HeartDiseaseorAttack,Diabetes_binary,Stroke,HighBP
0,9.0,0.0,40.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,5.0,15.0,18.0,1.0,1.0,0.0,0.0,0,0.0,1.0
1,7.0,0.0,25.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0.0,0.0
2,9.0,0.0,28.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,5.0,30.0,30.0,1.0,1.0,1.0,0.0,0,0.0,1.0
3,11.0,0.0,27.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1.0
4,11.0,0.0,24.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,2.0,0.0,3.0,0.0,1.0,0.0,0.0,0,0.0,1.0


### FEATURE TYPE CLASSIFICATION

In [None]:
# Explicitly defining feature types helps in:
# - Correct encoding
# - Better preprocessing pipeline design
# - Improved model performance

# Note: "Sex" (coded 0/1) is in base_features but is not assigned to any list below
binary_features = [
    "HighChol", "CholCheck", "Smoker", "PhysActivity",
    "HvyAlcoholConsump", "Fruits", "Veggies",
    "DiffWalk", "AnyHealthcare", "NoDocbcCost"
]

ordinal_features = [
    "Age", "GenHlth"
]

continuous_features = [
    "BMI", "PhysHlth", "MentHlth"
]

print("Binary features:", binary_features)
print("Ordinal features:", ordinal_features)
print("Continuous features:", continuous_features)


Binary features: ['HighChol', 'CholCheck', 'Smoker', 'PhysActivity', 'HvyAlcoholConsump', 'Fruits', 'Veggies', 'DiffWalk', 'AnyHealthcare', 'NoDocbcCost']
Ordinal features: ['Age', 'GenHlth']
Continuous features: ['BMI', 'PhysHlth', 'MentHlth']
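
A quick coverage check helps confirm that every base feature lands in exactly one type list. The sketch below repeats the lists from the cells above and uses two hypothetical helper names (`unclassified`, `duplicates`) introduced only for this check.

```python
base_features = [
    "Age", "Sex", "BMI", "HighChol", "CholCheck", "Smoker", "PhysActivity",
    "HvyAlcoholConsump", "Fruits", "Veggies", "GenHlth", "PhysHlth",
    "MentHlth", "DiffWalk", "AnyHealthcare", "NoDocbcCost",
]
binary_features = [
    "HighChol", "CholCheck", "Smoker", "PhysActivity",
    "HvyAlcoholConsump", "Fruits", "Veggies",
    "DiffWalk", "AnyHealthcare", "NoDocbcCost",
]
ordinal_features = ["Age", "GenHlth"]
continuous_features = ["BMI", "PhysHlth", "MentHlth"]

classified = binary_features + ordinal_features + continuous_features

# Base features that were not assigned to any type list
unclassified = [c for c in base_features if c not in classified]
# Features that appear in more than one list
duplicates = sorted({c for c in classified if classified.count(c) > 1})

print("Unclassified:", unclassified)  # Unclassified: ['Sex']
print("Duplicated:", duplicates)      # Duplicated: []
```

With the lists as defined, `Sex` is the only base feature without a type, so it would be skipped by any preprocessing step driven by these lists.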


### BMI CATEGORY

In [None]:
# BMI categories provide better medical interpretability
# compared to raw BMI values

def bmi_category(bmi):
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

df_base["BMI_Category"] = df_base["BMI"].apply(bmi_category)

df_base[["BMI", "BMI_Category"]].head()



Unnamed: 0,BMI,BMI_Category
0,40.0,Obese
1,25.0,Overweight
2,28.0,Overweight
3,27.0,Overweight
4,24.0,Normal
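
The same binning can be expressed with `pd.cut`, which avoids a Python-level function call per row and returns an ordered categorical. This is a sketch on a toy series (the values are illustrative); `right=False` makes each interval closed on the left, which matches the `<` comparisons in `bmi_category`.

```python
import numpy as np
import pandas as pd

# Toy BMI values, including the boundary case 25.0
bmi = pd.Series([40.0, 25.0, 28.0, 24.0, 18.4])

# Intervals: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_cat = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

print(bmi_cat.tolist())
# ['Obese', 'Overweight', 'Overweight', 'Normal', 'Underweight']
```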


### LIFESTYLE RISK INDEX

In [None]:
# Lifestyle Risk Index aggregates multiple behavioral risk factors
# Higher values indicate poorer lifestyle habits and higher disease risk

df_base["Lifestyle_Risk_Index"] = (
    df_base["Smoker"] +
    (1 - df_base["PhysActivity"]) +
    df_base["HvyAlcoholConsump"] +
    (1 - df_base["Fruits"]) +
    (1 - df_base["Veggies"])
)

# Validate distribution of the risk index
df_base["Lifestyle_Risk_Index"].value_counts().sort_index()



0.000    68351
1.000    87786
2.000    60326
3.000    28735
4.000     7993
5.000      489
Name: Lifestyle_Risk_Index, dtype: int64

### HEALTH STRESS SCORE

In [None]:
# PhysHlth and MentHlth count the days (0-30) of poor physical and mental
# health in the past 30 days. Health Stress Score is their sum (0-60) and
# can act as an early indicator of chronic disease risk

df_base["Health_Stress_Score"] = (
    df_base["PhysHlth"] + df_base["MentHlth"]
)
df_base[["PhysHlth", "MentHlth", "Health_Stress_Score"]].describe()


Unnamed: 0,PhysHlth,MentHlth,Health_Stress_Score
count,253680.0,253680.0,253680.0
mean,4.242,3.185,7.427
std,8.718,7.413,13.291
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,1.0
75%,3.0,2.0,7.0
max,30.0,30.0,60.0
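
Because each component is capped at 30, the score lies between 0 and 60. If a bounded feature is preferred, the sum can be rescaled to the 0-1 range. The sketch below uses toy rows, and the column name `Health_Stress_Scaled` is a hypothetical addition that does not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy rows; both columns are day counts in the range 0-30
toy = pd.DataFrame({
    "PhysHlth": [15.0, 0.0, 30.0],
    "MentHlth": [18.0, 0.0, 30.0],
})

toy["Health_Stress_Score"] = toy["PhysHlth"] + toy["MentHlth"]

# Divide by the theoretical maximum (30 + 30) to map the score onto [0, 1]
toy["Health_Stress_Scaled"] = toy["Health_Stress_Score"] / 60.0

print(toy)
```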


### Verify required columns

In [14]:
# Check if all required columns for Preventive Care Index exist
pci_features = [
    'PhysActivity',
    'Fruits',
    'Veggies',
    'HvyAlcoholConsump',
    'Smoker'
]

print("Missing columns:", [col for col in pci_features if col not in df.columns])


Missing columns: []


### Create Preventive Care Index

In [15]:
# Preventive Care Index (higher = better health behavior)
df['PreventiveCareIndex'] = (
    df['PhysActivity'] +        # 1 if physically active
    df['Fruits'] +              # 1 if consumes fruits
    df['Veggies'] +             # 1 if consumes vegetables
    (1 - df['HvyAlcoholConsump']) +  # 1 if NOT heavy drinker
    (1 - df['Smoker'])               # 1 if NOT smoker
)


### Inspect distribution

In [16]:
# Check distribution of Preventive Care Index
df['PreventiveCareIndex'].value_counts().sort_index()


0.000      489
1.000     7993
2.000    28735
3.000    60326
4.000    87786
5.000    68351
Name: PreventiveCareIndex, dtype: int64
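
The counts above mirror the Lifestyle Risk Index distribution (489 rows at 0 here, 489 rows at 5 there). That follows from the definitions: each of the five terms in one index is one minus the matching term in the other, so the two indices always sum to 5. The sketch below checks the identity on toy rows (the values are illustrative).

```python
import pandas as pd

# Toy rows covering a worst case, a best case, and a mixed case
toy = pd.DataFrame({
    "Smoker":            [1, 0, 1],
    "PhysActivity":      [0, 1, 1],
    "HvyAlcoholConsump": [1, 0, 0],
    "Fruits":            [0, 1, 0],
    "Veggies":           [0, 1, 1],
})

lifestyle_risk = (
    toy["Smoker"] + (1 - toy["PhysActivity"]) + toy["HvyAlcoholConsump"]
    + (1 - toy["Fruits"]) + (1 - toy["Veggies"])
)
preventive_care = (
    toy["PhysActivity"] + toy["Fruits"] + toy["Veggies"]
    + (1 - toy["HvyAlcoholConsump"]) + (1 - toy["Smoker"])
)

# The two indices are exact complements: risk + preventive == 5 for every row
print((lifestyle_risk + preventive_care).tolist())  # [5, 5, 5]
```

Since the two features are perfectly collinear, a model gains nothing from receiving both of them.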

### Create Risk Score

In [17]:
# Risk Score based on major medical risk factors (range 0-4)
df['RiskScore'] = (
    df['HighBP'] +
    df['HighChol'] +
    (df['BMI'] >= 30).astype(int) +     # Obesity
    (df['GenHlth'] >= 4).astype(int)    # Fair or poor general health (GenHlth 4-5)
)


### Map Risk Level (Low / Medium / High)

In [18]:
# Risk Level Mapping: score 0-1 = Low, 2-3 = Medium, 4 = High
def map_risk(score):
    if score <= 1:
        return 'Low'
    elif score <= 3:
        return 'Medium'
    else:
        return 'High'

df['RiskLevel'] = df['RiskScore'].apply(map_risk)


### Validate Risk Levels

In [19]:
# Check risk level distribution
df['RiskLevel'].value_counts()


Low       145267
Medium     97698
High       10715
Name: RiskLevel, dtype: int64
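
As a worked example of the scoring and mapping steps, the sketch below computes the score on four toy rows and bins it with `pd.cut` instead of a row-wise function. The toy values are illustrative, and the bin edges `[-1, 1, 3, 4]` are chosen here to reproduce the `map_risk` thresholds (right-closed intervals).

```python
import pandas as pd

toy = pd.DataFrame({
    "HighBP":   [1, 0, 1, 1],
    "HighChol": [1, 0, 0, 1],
    "BMI":      [40, 25, 28, 31],
    "GenHlth":  [5, 3, 2, 3],
})

# One point per risk factor present, so the score ranges from 0 to 4
toy["RiskScore"] = (
    toy["HighBP"]
    + toy["HighChol"]
    + (toy["BMI"] >= 30).astype(int)
    + (toy["GenHlth"] >= 4).astype(int)
)

# Intervals (-1, 1], (1, 3], (3, 4] correspond to Low, Medium, High
toy["RiskLevel"] = pd.cut(
    toy["RiskScore"], bins=[-1, 1, 3, 4], labels=["Low", "Medium", "High"]
)

print(toy[["RiskScore", "RiskLevel"]])
```

Only rows with all four risk factors reach `High`, which is consistent with it being the smallest group in the distribution above.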

### Select final features

In [20]:
final_features = [
    'Diabetes_012',
    'Age',
    'BMI',
    'HighBP',
    'HighChol',
    'PhysActivity',
    'Smoker',
    'PreventiveCareIndex',
    'RiskScore',
    'RiskLevel'
]

# .copy() avoids SettingWithCopyWarning when columns are added to df_final
df_final = df[final_features].copy()


### Encode RiskLevel

In [21]:
# Encode RiskLevel for ML models
risk_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df_final['RiskLevelEncoded'] = df_final['RiskLevel'].map(risk_mapping)
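
An ordered categorical is an alternative to the dictionary mapping: the category order is declared once and the integer codes follow from it. The sketch below runs on a toy series (the values are illustrative) and is not the encoding used in the saved dataset.

```python
import pandas as pd

levels = pd.Series(["Low", "High", "Medium", "Low"])

# Declare the order explicitly so that Low < Medium < High
cat = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)

# Integer codes follow the declared order: Low -> 0, Medium -> 1, High -> 2
encoded = pd.Series(cat.codes, index=levels.index)

print(encoded.tolist())  # [0, 2, 1, 0]
```

A label outside the declared categories gets the code -1, whereas `.map()` with a dictionary yields NaN for unseen labels, so both approaches make unexpected values easy to spot.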



### Save engineered dataset

In [22]:
# Save final engineered dataset
df_final.to_csv('../data/processed/diabetes_engineered.csv', index=False)

print("✅ Engineered dataset saved successfully")


✅ Engineered dataset saved successfully
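
`to_csv` does not create missing directories, so the save fails if the target folder does not exist. The sketch below shows one way to guard against that and to verify the file by reading it back. It writes a toy frame to a temporary directory; the `processed` subfolder and the file name `toy_engineered.csv` are placeholders for the example, not the notebook's actual output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "RiskScore": [0, 4],
    "RiskLevel": ["Low", "High"],
    "RiskLevelEncoded": [0, 2],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"
    # Create the output directory first; to_csv raises an error otherwise
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "toy_engineered.csv"
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm shape and columns survived the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == toy.shape, list(reloaded.columns) == list(toy.columns))
```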
