## Feature Engineering

In this notebook, we derived new features from the cleaned dataset to support downstream modeling and reporting. These features include:

- A synthetic `patient_id` to support longitudinal tracking
- A `length_of_stay` (in days) computed from the admission and discharge dates
- A proxy variable for `treatment_cost` using `billing_amount`
- A `readmitted` flag indicating if the patient was readmitted within 30 days of discharge


---

### Loading the Cleaned Dataset

In [14]:
import pandas as pd

# Loading the dataset
cleaned_df = pd.read_csv('../data/cleaned/cleaned_dataset.csv')

# Quick look at the structure
print(cleaned_df.shape)
print(cleaned_df.dtypes)

# Working on a copy so that cleaned_df stays unchanged
dashboard_df = cleaned_df.copy()
dashboard_df.head()

(54966, 15)
name                   object
age                     int64
gender                 object
blood_type             object
medical_condition      object
date_of_admission      object
doctor                 object
hospital               object
insurance_provider     object
billing_amount        float64
room_number             int64
admission_type         object
discharge_date         object
medication             object
test_results           object
dtype: object


Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results
0,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.28,328,Urgent,2024-02-02,Paracetamol,Normal
1,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.33,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.1,205,Emergency,2022-10-07,Aspirin,Normal
3,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,Hernandez Rogers and Vang,Medicare,37909.78,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.32,458,Urgent,2022-10-09,Penicillin,Abnormal


### Patient ID Generation – Design Summary

To uniquely identify patients in the dataset, we proposed an initial ID structure using the first letter of the patient's first and last name, their room number, and a short abbreviation of their medical condition (e.g., Bobby Jackson, Room 328, with Cancer → `BJ328CA`).

While this format is readable and informative, we raised concerns about possible ID duplication, especially in a large dataset (54,966 records), due to:

- Shared names (common initials)
- Reuse of hospital room numbers
- Limited set of medical conditions (only 6 unique values)

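To address the risk of duplication, the final solution includes:

- Generating a base ID using the original format
- Appending a short hash suffix (e.g., `_A7F`) derived from the base ID, which gives every ID a fixed, recognizable shape without sacrificing readability

Because the suffix is a deterministic function of the base ID, two records with the same base ID still receive the same `patient_id`; uniqueness should therefore be verified (for example with `duplicated()`) rather than assumed. With that caveat, the hybrid format remains readable and reproducible for both internal tracking and external reporting.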
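
As a quick sanity check on the design above, the sketch below builds IDs for a small made-up table and counts duplicates. It illustrates that a suffix hashed from the base ID alone is identical for identical base IDs, and shows one assumed alternative (a per-base-ID counter from `groupby(...).cumcount()`) that does guarantee uniqueness. The toy rows, the `short_hash` helper, and the `patient_id_v2` column are hypothetical and are not part of the notebook's pipeline.

```python
import hashlib

import pandas as pd


def short_hash(text: str, length: int = 3) -> str:
    """Return the first `length` hex digits of the MD5 digest, upper-cased."""
    return hashlib.md5(text.encode()).hexdigest()[:length].upper()


# Hypothetical toy data: two different patients who happen to share a base ID.
toy = pd.DataFrame({
    "name": ["Bobby Jackson", "Brenda Jones", "Leslie Terry"],
    "base_id": ["BJ328CA", "BJ328CA", "LT265OB"],
})

# Same construction as in the notebook: the suffix depends only on the base ID,
# so rows 0 and 1 end up with exactly the same patient_id.
toy["patient_id"] = toy["base_id"] + "_" + toy["base_id"].map(short_hash)
n_duplicates = int(toy["patient_id"].duplicated().sum())
print(f"Duplicate IDs: {n_duplicates}")

# Assumed alternative: number the rows within each base ID (01, 02, ...).
# This is unique by construction, but it depends on row order.
counter = (toy.groupby("base_id").cumcount() + 1).astype(str).str.zfill(2)
toy["patient_id_v2"] = toy["base_id"] + "_" + counter
print(toy[["name", "patient_id", "patient_id_v2"]])
```

Running the same `duplicated().sum()` check on `dashboard_df['patient_id']` would show how often this actually happens in the full dataset.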

In [15]:
# F1: Create a synthetic patient_id
import hashlib

# Condition abbreviation map
condition_map = {
    'Cancer': 'CA',
    'Obesity': 'OB',
    'Diabetes': 'DB',
    'Asthma': 'AS',
    'Hypertension': 'HT',
    'Arthritis': 'AR'
}

def create_patient_id(row):

    # ============== Parameters ==============

    # Initials of the first two name tokens ('X' when a token is missing)
    full_name = row['name'].strip().split()
    first_initial = full_name[0][0].upper() if len(full_name) > 0 else 'X'
    last_initial = full_name[1][0].upper() if len(full_name) > 1 else 'X'

    # Room Number
    room = str(row['room_number']) if pd.notnull(row['room_number']) else '000'

    # Medical Condition
    condition_abbr = condition_map.get(row['medical_condition'], 'XX')  # Use a predefined map

    # ============== Adding the short hash ==============

    # Putting all parameters together to create the unique ID
    base_id = f"{first_initial}{last_initial}{room}{condition_abbr}"

    hash_suffix = hashlib.md5(base_id.encode()).hexdigest()[:3].upper()

    return f"{base_id}_{hash_suffix}"

dashboard_df['patient_id'] = dashboard_df.apply(create_patient_id, axis=1)
dashboard_df.head()

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,patient_id
0,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.28,328,Urgent,2024-02-02,Paracetamol,Normal,BJ328CA_628
1,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.33,265,Emergency,2019-08-26,Ibuprofen,Inconclusive,LT265OB_C0E
2,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.1,205,Emergency,2022-10-07,Aspirin,Normal,DS205OB_EF6
3,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,Hernandez Rogers and Vang,Medicare,37909.78,450,Elective,2020-12-18,Ibuprofen,Abnormal,AW450DB_991
4,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.32,458,Urgent,2022-10-09,Penicillin,Abnormal,AB458CA_E7B


In [16]:
# Moving patient_id to the first column
cols = ['patient_id'] + [col for col in dashboard_df.columns if col != 'patient_id']
dashboard_df = dashboard_df[cols]
dashboard_df.head()

Unnamed: 0,patient_id,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results
0,BJ328CA_628,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.28,328,Urgent,2024-02-02,Paracetamol,Normal
1,LT265OB_C0E,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.33,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DS205OB_EF6,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.1,205,Emergency,2022-10-07,Aspirin,Normal
3,AW450DB_991,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,Hernandez Rogers and Vang,Medicare,37909.78,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,AB458CA_E7B,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.32,458,Urgent,2022-10-09,Penicillin,Abnormal


### Length of Stay – Design Summary

To derive the Length of Stay for each patient, we took the following steps:

- Converted the `date_of_admission` and `discharge_date` columns to proper datetime format to ensure accurate date calculations.
- Calculated the length of stay as the difference (in days) between discharge and admission dates.
- Replaced any zero or negative stay durations (which may arise from data issues) with a minimum value of 1 day, ensuring logical consistency for downstream analysis.

This new feature provides a foundational metric for healthcare utilization and cost-related insights.
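
The steps above can be sketched on a small example. The first two rows reuse dates shown in the notebook output; the last two (a same-day discharge and a discharge recorded before admission) are made up to exercise the 1-day floor. `clip(lower=1)` is used here as a vectorized way to apply that floor; the `toy` frame and `raw_days` name are illustrative only.

```python
import pandas as pd

# Rows 0-1 mirror dates from the notebook output; rows 2-3 are hypothetical edge cases.
toy = pd.DataFrame({
    "date_of_admission": ["2024-01-31", "2022-09-22", "2023-05-10", "2023-05-12"],
    "discharge_date": ["2024-02-02", "2022-10-07", "2023-05-10", "2023-05-11"],
})

# Step 1: parse both columns as datetimes.
for col in ["date_of_admission", "discharge_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Step 2: difference in whole days between discharge and admission.
raw_days = (toy["discharge_date"] - toy["date_of_admission"]).dt.days

# Step 3: floor zero or negative durations at 1 day.
toy["length_of_stay"] = raw_days.clip(lower=1)

print(raw_days.tolist())               # [2, 15, 0, -1]
print(toy["length_of_stay"].tolist())  # [2, 15, 1, 1]
```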

In [19]:
# Ensuring date columns are datetime
dashboard_df['date_of_admission'] = pd.to_datetime(dashboard_df['date_of_admission'])
dashboard_df['discharge_date'] = pd.to_datetime(dashboard_df['discharge_date'])

# Calculating length of stay and handling negative/zero values
dashboard_df['length_of_stay'] = (dashboard_df['discharge_date'] - dashboard_df['date_of_admission']).dt.days
dashboard_df['length_of_stay'] = dashboard_df['length_of_stay'].clip(lower=1)
dashboard_df.head()

Unnamed: 0,patient_id,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,length_of_stay
0,BJ328CA_628,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.28,328,Urgent,2024-02-02,Paracetamol,Normal,2
1,LT265OB_C0E,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.33,265,Emergency,2019-08-26,Ibuprofen,Inconclusive,6
2,DS205OB_EF6,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.1,205,Emergency,2022-10-07,Aspirin,Normal,15
3,AW450DB_991,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,Hernandez Rogers and Vang,Medicare,37909.78,450,Elective,2020-12-18,Ibuprofen,Abnormal,30
4,AB458CA_E7B,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.32,458,Urgent,2022-10-09,Penicillin,Abnormal,20


### Treatment Cost – Design Summary

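`billing_amount` is used as a proxy for `treatment_cost`, since it is the only cost-related column in the dataset. The cell below spot-checks the engineered columns for a single patient.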
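
The introduction describes `treatment_cost` as a proxy built from `billing_amount`. A minimal sketch of that step is shown below, assuming a direct copy of the column. The `cost_per_day` column is a hypothetical follow-on that combines the proxy with `length_of_stay`; it is not something the notebook defines. The three rows reuse values from the notebook output.

```python
import pandas as pd

# Values taken from the first three rows of the notebook output.
toy = pd.DataFrame({
    "billing_amount": [18856.28, 33643.33, 27955.10],
    "length_of_stay": [2, 6, 15],
})

# Assumption: the proxy is a direct copy of billing_amount.
toy["treatment_cost"] = toy["billing_amount"]

# Hypothetical derived metric: average cost per day of stay.
# length_of_stay is floored at 1 upstream, so there is no division by zero.
toy["cost_per_day"] = (toy["treatment_cost"] / toy["length_of_stay"]).round(2)

print(toy)
```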

In [28]:
dashboard_df.loc[dashboard_df['name'] == 'Bobby Jackson', 
    ['patient_id', 'medical_condition', 'medication', 'test_results', 'length_of_stay']]

Unnamed: 0,patient_id,medical_condition,medication,test_results,length_of_stay
0,BJ328CA_628,Cancer,Paracetamol,Normal,2


In [11]:
# Saving final dataset
dashboard_df.to_csv("../powerbi/data/dashboard_dataset.csv", index=False)

print("✅ Feature Engineering complete! Processed data saved in 'powerbi/data/dashboard_dataset.csv'.")

✅ Feature Engineering complete! Processed data saved in 'powerbi/data/dashboard_dataset.csv'.
