# Cancer patient readmission model training
This notebook develops and evaluates machine learning models for predicting **unplanned 30-day emergency readmissions after chemotherapy** among oncology inpatients.

## Data:  
- Source: De-identified EHR data from cancer inpatients 
- Coverage: 18 months of cancer patient EHR records
- Training set / test set split: 0.8 / 0.2
- Positive cases: ~0.4% (extremely imbalanced dataset)  
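
With a positive rate around 0.4%, a purely random 0.8/0.2 split can leave very few positives in the test set. The sketch below shows one way to keep the positive rate equal in both parts. It is an illustration on toy labels, not necessarily the split used in this notebook; `stratified_split` is a hypothetical helper, and scikit-learn's `train_test_split(..., stratify=y)` offers the same behavior.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for cls in np.unique(y):
        # Shuffle the indices of this class, then cut off the test share
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(len(idx) * test_size))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Toy labels (assumption): 10,000 visits with 40 positives, i.e. 0.4%
y = np.zeros(10_000, dtype=int)
y[:40] = 1
train_idx, test_idx = stratified_split(y)
print(len(train_idx), len(test_idx))
print(y[train_idx].sum(), y[test_idx].sum())
```

If one patient contributes several visits, splitting by patient rather than by visit may also be worth considering so that the same patient does not appear on both sides.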

## Model:
Baseline: logistic regression with `class_weight="balanced"`, `solver="liblinear"`, `penalty="l2"`\
Tree-based: XGBoost with `scale_pos_weight = neg/pos`, `max_depth = 4`, followed by temperature scaling of the XGBoost outputs
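
The two XGBoost settings above can be illustrated with a small NumPy sketch. It computes `scale_pos_weight` as neg/pos and then fits a single temperature `T` by minimizing the negative log-likelihood of `sigmoid(logit / T)` on held-out data. The grid search, the helper names and the simulated logits are assumptions made for illustration; the notebook does not show how its temperature is fitted. With XGBoost, the raw logits would typically come from `predict(..., output_margin=True)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, p, eps=1e-12):
    """Mean negative log-likelihood of binary labels y under probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_temperature(logits, y, grid=np.linspace(0.25, 10.0, 391)):
    """Pick the temperature on the grid that minimizes NLL of sigmoid(logits / T)."""
    losses = [nll(y, sigmoid(logits / t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

# scale_pos_weight = neg / pos on toy training labels (assumption: 4 positives in 1,000)
y_train = np.array([0] * 996 + [1] * 4)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()  # 249.0

# Simulated validation set: the "model" logits are 3x too confident
rng = np.random.default_rng(0)
z_true = rng.normal(-3.0, 1.5, size=20_000)
y_val = (rng.random(z_true.size) < sigmoid(z_true)).astype(float)
model_logits = 3.0 * z_true

T = fit_temperature(model_logits, y_val)
nll_before = nll(y_val, sigmoid(model_logits))
nll_after = nll(y_val, sigmoid(model_logits / T))
print(scale_pos_weight, T, nll_before, nll_after)
```

Reweighting the positive class tends to inflate predicted probabilities, which is one reason a calibration step is commonly paired with `scale_pos_weight`. Temperature scaling rescales confidence without changing the ranking, so AUROC and AUPRC are unaffected.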

## Evaluation metrics for both models:
| Metric | Meaning |
|--------|----------|
| **AUROC** | Overall discrimination between readmission vs non-readmission |
| **AUPRC** | Precision–recall balance under extreme imbalance |
| **Brier Score** | Probability calibration quality |
| **Top-k** | Screening efficiency under limited intervention capacity |
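
As a rough illustration of the last two rows, the sketch below computes the Brier score and a top-k summary with NumPy. Reading "Top-k" as precision and recall among the k highest-risk cases is an assumption; the notebook may define it differently. The toy labels and scores are made up. AUROC and AUPRC are usually taken from `sklearn.metrics.roc_auc_score` and `sklearn.metrics.average_precision_score`.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def top_k_stats(y_true, y_prob, k):
    """Precision and recall among the k cases with the highest predicted risk."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_prob), kind="stable")[:k]
    hits = int(y_true[order].sum())
    return hits / k, hits / max(int(y_true.sum()), 1)

# Toy example (assumption): 10 discharges, 2 of them readmitted
y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
y_prob = np.array([0.1, 0.2, 0.9, 0.05, 0.3, 0.6, 0.4, 0.1, 0.2, 0.1])

brier = brier_score(y_true, y_prob)
precision_at_3, recall_at_3 = top_k_stats(y_true, y_prob, k=3)
print(brier, precision_at_3, recall_at_3)
```

Here k stands for the number of patients the care team can follow up, so the top-k figures answer "how many true readmissions are caught if only k patients are screened".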

## Features:

| Category | Variables | Description |
|-----------|------------|-------------|
| **Demographics** | `In_city`, `Gender`, `Occupation` | Basic patient info |
| **Hospitalization Info** | `Fee`, `Inhospital_days`, `Diagnose` | Admission-level data |
| **Cancer Type (ICD-10)** | `'C11','C16','C18','C20','C22','C23','C24','C25','C34','C50','C56','C61','C64','C67','C71','C79','C82','C83','C85','C90','C92','D46'` | Major solid and hematologic malignancies |
| **Surgery** | `Is_surgery`, `Class4_surgery` *(highest difficulty)*, `Surgery_complication` | Indicates surgical treatment complexity |
| **Treatment Intensity** | `RT_under_96h`, `RT_over_96h`, `ECMO`, `CRRT` | Acute or critical care indicators |
| **Within 72h before discharge** | `Ascites_72h`, `Fever_72h`, `Positive_bacteria_72h` | Short-term clinical deterioration features |
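
The cancer-type row suggests one indicator column per ICD-10 code. A minimal sketch of how such columns could be derived is shown below. The format of the diagnosis text is not shown in this notebook, so the toy `Diagnose` values and the substring match are assumptions, and only a subset of the codes is used for brevity.

```python
import pandas as pd

CANCER_CODES = ['C11', 'C16', 'C18', 'C34', 'C50']  # subset of the codes in the table

# Hypothetical toy data; the real diagnosis format may differ
visits = pd.DataFrame({
    'visit_sn': [1, 2, 3],
    'Diagnose': ['C34.1', 'C50.9;C18.0', 'I10'],
})

for code in CANCER_CODES:
    # Literal substring match; missing diagnoses count as "no match"
    visits[code] = visits['Diagnose'].str.contains(code, regex=False, na=False).astype(int)

print(visits)
```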


## Import the libraries

In [1]:
import pandas as pd
import numpy as np
import yaml
import os
import data_cleaning_func as dcf
import re

## Identify readmission patients

In [None]:
with open('xxx.yaml', 'r', encoding='utf-8') as f:
    CFG = yaml.safe_load(f)
def get_readmission_patients(df, cancer_codes, days_threshold=30, cfg = CFG, emergency = 'all'):
    """
    Identify visits that are followed by a readmission within `days_threshold` days of discharge.
    This function does NOT filter for chemotherapy procedures. Use orders to filter chemotherapy visit_sn afterwards.
    Args:
        df (pd.DataFrame): DataFrame containing patient admission data.
        cancer_codes (list): ICD-10 codes of the cancers included (currently only printed; the diagnosis filter is commented out).
        days_threshold (int): Number of days after discharge to consider for readmission.
        cfg (dict): Configuration loaded from YAML (currently unused).
        emergency (str): Readmission type to keep: 'all' for all, 'Y' for emergency only, 'N' for non-emergency only.
    Returns:
        (pairs, readmit_visit_sn, first_admit_visit_sn) (tuple): Tuple of three lists.
        pairs is a list of (readmit_visit_sn, first_admit_visit_sn) tuples.
        readmit_visit_sn is the visit_sn of each readmission within the threshold.
        first_admit_visit_sn is the visit_sn of the admission preceding each readmission.
    """
    ## Convert date columns to datetime
    print(f'Getting readmission patients with cancer codes {cancer_codes}')
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['B12'] = pd.to_datetime(df['B12'], errors='coerce')
    df['B15'] = pd.to_datetime(df['B15'], errors='coerce')
    # Report unparseable or missing dates for each column independently
    conversion_ok = True
    for col in ['B12', 'B15']:
        if df[col].isna().any():
            conversion_ok = False
            print(f"Date type conversion error in {col}:")
            print(df.loc[df[col].isna(), ['patient_id', 'visit_sn', col]])
    if conversion_ok:
        print("Date type conversion successful!")
    # df = df[['visit_sn','patient_id','B12','B15','B11C','C03C']].copy() # check the list, which column to keep

    ## calculate next visit date to identify readmission
    df = df.sort_values(by=['patient_id','B12'])
    df['next_admit_date'] = df.groupby('patient_id')['B12'].shift(-1)
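    # Admission type of the next visit, required when emergency is 'Y' or 'N'
    # (B11C is assumed to hold the admission type, as in the original commented-out line)
    if 'B11C' in df.columns:
        df['next_admit_type'] = df.groupby('patient_id')['B11C'].shift(-1)
    df['next_visit_sn'] = df.groupby('patient_id')['visit_sn'].shift(-1)
    df = df[df['next_admit_date'].notna()].copy()
    df['days_to_next_admit'] = df['next_admit_date'] - df['B15']
    df = df[df['days_to_next_admit'].dt.days < days_threshold]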
    # df = df[df['C03C'].str.contains('|'.join(cancer_codes))] # filter cancer diagnosis
    if emergency == 'all':
        print(f'Total readmission patients in {days_threshold} days with cancer diag: {df["patient_id"].nunique()}\nTotal readmission visits in {days_threshold} days with cancer diag: {df["visit_sn"].nunique()}\n')
    elif emergency == 'Y':
        df = df[df['next_admit_type'] == '1']
        print(f'Total readmission EMERGENCY patients in {days_threshold} days with cancer diag: {df["patient_id"].nunique()}\nTotal readmission EMERGENCY visits in {days_threshold} days with cancer diag: {df["visit_sn"].nunique()}\n')
    elif emergency == 'N':
        df = df[df['next_admit_type'] != '1']
        print(f'Total readmission NON-EMERGENCY patients in {days_threshold} days with cancer diag: {df["patient_id"].nunique()}\nTotal readmission NON-EMERGENCY visits in {days_threshold} days with cancer diag: {df["visit_sn"].nunique()}\n')
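    # Take both IDs from the same rows so each readmission stays paired with its preceding visit
    pairs = df[['next_visit_sn', 'visit_sn']].drop_duplicates()
    readmit_visit_sn_list = pairs['next_visit_sn'].tolist()
    first_admit_patient_list = pairs['visit_sn'].tolist()
    return list(zip(readmit_visit_sn_list, first_admit_patient_list)), readmit_visit_sn_list, first_admit_patient_list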
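
The core of the function is the per-patient `shift(-1)`, which places the next admission next to each visit. The toy example below walks through that step in isolation. It assumes, as the function's arithmetic suggests, that `B12` is the admission date and `B15` the discharge date; the patients and dates are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'patient_id': ['A', 'A', 'A', 'B'],
    'visit_sn':   [101, 102, 103, 201],
    'B12': pd.to_datetime(['2024-01-01', '2024-01-20', '2024-04-01', '2024-02-01']),  # admission (assumed)
    'B15': pd.to_datetime(['2024-01-05', '2024-01-25', '2024-04-03', '2024-02-04']),  # discharge (assumed)
})

toy = toy.sort_values(['patient_id', 'B12'])
# Within each patient, pull the following visit's admission date and ID onto the current row
toy['next_admit_date'] = toy.groupby('patient_id')['B12'].shift(-1)
toy['next_visit_sn'] = toy.groupby('patient_id')['visit_sn'].shift(-1)
toy['days_to_next_admit'] = (toy['next_admit_date'] - toy['B15']).dt.days

# Keep visits whose next admission starts fewer than 30 days after discharge
pairs = toy.loc[toy['days_to_next_admit'] < 30, ['visit_sn', 'next_visit_sn']]
print(pairs)
```

Only visit 101 qualifies: patient A is readmitted 15 days after that discharge, while the gap after visit 102 is 67 days. Note that `next_visit_sn` becomes a float column because `shift` introduces missing values, which is why later cells strip a trailing `.0` before converting to `int`.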
    

## Get the patients in the first 4 months to check the readmission rate and choose positive samples

In [None]:
facesheet_202401 = pd.read_excel('.../facesheet.xlsx')
facesheet_202402 = pd.read_excel('.../facesheet.xlsx')
facesheet_202403 = pd.read_excel('.../facesheet.xlsx')
facesheet_202404 = pd.read_excel('.../facesheet.xlsx')
facesheet_202401to202404 = pd.concat([facesheet_202401,facesheet_202402,facesheet_202403,facesheet_202404])

In [None]:
lung_cancer_code = []
tuple_30,list_30,list_30_firsttime = get_readmission_patients(facesheet_202401to202404,lung_cancer_code)
tuple_15,list_15,list_15_firsttime = get_readmission_patients(facesheet_202401to202404,lung_cancer_code, days_threshold=15)
tuple_7,list_7,list_7_firsttime = get_readmission_patients(facesheet_202401to202404,lung_cancer_code, days_threshold=7)
tuple_emer,list_emer,list_emer_firsttime = get_readmission_patients(facesheet_202401to202404,lung_cancer_code,emergency='Y')
print(list_30)

In [None]:
## Save the readmitted visit lists so that their diagnoses can be checked
df15 = pd.DataFrame(list_15, columns=['visit_sn'])
cancer15 = facesheet_202401to202404[facesheet_202401to202404.visit_sn.isin(df15.visit_sn)]
cancer15.to_excel('xxx.xlsx', index=False)
df30 = pd.DataFrame(list_30, columns=['visit_sn'])
cancer30 = facesheet_202401to202404[facesheet_202401to202404.visit_sn.isin(df30.visit_sn)]
cancer30.to_excel('xxx.xlsx', index=False)

### Get the chemo patient list and filter the readmit patient list

In [None]:
drg = pd.read_excel('.../2024DRG.xlsx')
temp_drg = pd.read_excel('.../2025DRG.xlsx')
drg = pd.concat([drg, temp_drg])
drg['WR'] = pd.to_numeric(drg['WR'], errors='coerce')
drg = drg.sort_values(by='WR', ascending=False)


In [None]:
info_202401 = pd.read_excel('.../basic_info.xlsx')
info_202402 = pd.read_excel('.../basic_info.xlsx')
info_202403 = pd.read_excel('.../basic_info.xlsx')
info_202404 = pd.read_excel('.../basic_info.xlsx')
info_202401to202404 = pd.concat([info_202401,info_202402,info_202403,info_202404])

In [None]:
# '化疗' means chemotherapy; na=False treats missing procedure text as no match
chemo_drg = drg[drg['all_surgery_concat'].str.contains('化疗', na=False)]
chemo_202401to202404 = info_202401to202404[info_202401to202404.medical_record_no.isin(chemo_drg.medical_record_no)]
chemo_visit_sn_202401to202404 = chemo_202401to202404['visit_sn']

list_chemo_30_firsttime = pd.Series(list_30_firsttime)[pd.Series(list_30_firsttime).isin(chemo_visit_sn_202401to202404)]
list_chemo_30_firsttime = [str(x).split('.')[0] for x in list_chemo_30_firsttime.tolist()]
list_chemo_30_firsttime = [int(x) for x in list_chemo_30_firsttime]
print("list_chemo_30_firsttime =", len(list_chemo_30_firsttime))
print(list_chemo_30_firsttime[:5])

first_readmit_dict = {k:v for v,k in tuple_30} # dict: first admit visit_sn as key, readmit visit_sn as value
list_chemo_30_readmit = [first_readmit_dict[x] for x in list_chemo_30_firsttime]
list_chemo_30_readmit = [str(x).split('.')[0] for x in list_chemo_30_readmit]
list_chemo_30_readmit = [int(x) for x in list_chemo_30_readmit]
print("list_chemo_30_readmit =", len(list_chemo_30_readmit))
print(list_chemo_30_readmit[:5])

facesheet_chemo_30_readmit = facesheet_202401to202404[facesheet_202401to202404['visit_sn'].isin(list_chemo_30_readmit)]
print('cancer 30 with chemo shape =', facesheet_chemo_30_readmit.shape)
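
The `str(x).split('.')[0]` step above exists because the visit IDs arrive as floats. As a hedged alternative, the sketch below normalizes mixed ID values with `pd.to_numeric` and the nullable `Int64` dtype. The sample values are invented, and whether real `visit_sn` values are always purely numeric is an assumption.

```python
import pandas as pd

# Hypothetical raw IDs: floats from Excel, a missing value, and a numeric string
raw = pd.Series([1001.0, 1002.0, None, '1003'])

# Coerce to numbers, then to nullable integers so missing values survive as <NA>
clean = pd.to_numeric(raw, errors='coerce').astype('Int64')
visit_ids = clean.dropna().tolist()
print(visit_ids)
```

Non-numeric IDs would become missing under `errors='coerce'`, so this approach only fits if the IDs are known to be numeric.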


### Find patients without chemotherapy

In [None]:
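# '化疗' means chemotherapy; na=False keeps the mask boolean so that ~ works when the text is missing
no_chemo_drg = drg[~drg['all_surgery_concat'].str.contains('化疗', na=False)]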
no_chemo_202401to202404 = info_202401to202404[info_202401to202404['medical_record_no'].isin(no_chemo_drg['medical_record_no'])]
print('no chemo patient shape =', no_chemo_202401to202404.shape)
list_chemo_30_readmitwithoutchemo = [x for x in list_chemo_30_readmit if x not in no_chemo_202401to202404['visit_sn'].values]
print('list_chemo_30_readmitwithoutchemo =', len(list_chemo_30_readmitwithoutchemo))
master_chemo_30_readmitwithoutchemo = facesheet_202401to202404[facesheet_202401to202404['visit_sn'].isin(list_chemo_30_readmitwithoutchemo)]
master_chemo_30_readmitwithoutchemo.to_excel('xxx.xlsx', index=False)
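
The list comprehension above removes visits that appear in another table, which is an anti-join. A sketch of the same idea with `merge(..., indicator=True)` is shown below on invented IDs; it is an alternative formulation, not the notebook's code, and it assumes `visit_sn` is unique in the right-hand table.

```python
import pandas as pd

readmit = pd.DataFrame({'visit_sn': [1, 2, 3, 4]})   # hypothetical readmission visits
no_chemo = pd.DataFrame({'visit_sn': [2, 4, 9]})     # hypothetical visits to exclude

# A left merge with an indicator marks rows that have no match on the right
merged = readmit.merge(no_chemo, on='visit_sn', how='left', indicator=True)
kept = merged.loc[merged['_merge'] == 'left_only', 'visit_sn'].tolist()
print(kept)
```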

### Find patients with an emergency readmission

In [None]:
emer_drg = drg[drg['inpatient_type']=='急诊']  # '急诊' = emergency admission
emer_drg = pd.merge(emer_drg, info_202401to202404[['patient_id','medical_record_no','visit_sn']], on='medical_record_no', how='inner')
print(emer_drg.head(5))
print(emer_drg.shape)
emer_drg_readmit = pd.merge(emer_drg, pd.DataFrame(list_30, columns=['visit_sn']), on='visit_sn', how='inner')
print('emer_drg_readmit shape =', emer_drg_readmit.shape)
print(emer_drg_readmit)

### Find the 15-day readmissions with chemo

In [None]:
list_chemo_15_firsttime = pd.Series(list_15_firsttime)[pd.Series(list_15_firsttime).isin(chemo_visit_sn_202401to202404)]
list_chemo_15_firsttime = [str(x).split('.')[0] for x in list_chemo_15_firsttime.tolist()]
list_chemo_15_firsttime = [int(x) for x in list_chemo_15_firsttime]
print("list_chemo_15_firsttime =", len(list_chemo_15_firsttime))
print(list_chemo_15_firsttime[:5])

first_readmit_dict = {k:v for v,k in tuple_15} # dict: first admit visit_sn as key, readmit visit_sn as value
list_chemo_15_readmit = [first_readmit_dict[x] for x in list_chemo_15_firsttime]
list_chemo_15_readmit = [str(x).split('.')[0] for x in list_chemo_15_readmit]
list_chemo_15_readmit = [int(x) for x in list_chemo_15_readmit]
print("list_chemo_15_readmit =", len(list_chemo_15_readmit))
print(list_chemo_15_readmit[:5])

master_chemo_15_readmit = facesheet_202401to202404[facesheet_202401to202404['visit_sn'].isin(list_chemo_15_readmit)]
print('cancer 15 with chemo shape =', master_chemo_15_readmit.shape)
master_chemo_15_readmit.to_excel('xxx.xlsx', index=False)

In [None]:
drg_202401to202404 = pd.merge(drg, info_202401to202404[['patient_id','medical_record_no','visit_sn']], on='medical_record_no', how='inner')
drg_202401to202404.head(5)
drg_tuple_readmit_30,drg_list_readmit_30,drg_list_readmit_30_firsttime = get_readmission_patients(drg_202401to202404,lung_cancer_code,days_threshold=30)

emer_drg = drg[drg['inpatient_type']=='急诊']  # '急诊' = emergency admission
emer_drg = pd.merge(emer_drg, info_202401to202404[['patient_id','medical_record_no','visit_sn']], on='medical_record_no', how='inner')
unplan_drg = drg[drg['unplaned'] == '是']  # '是' = yes, i.e. flagged as unplanned in the DRG data
unplan_drg = pd.merge(unplan_drg, info_202401to202404[['patient_id','medical_record_no','visit_sn']], on='medical_record_no', how='inner')
emer_and_unplan_drg = pd.concat([emer_drg, unplan_drg]).drop_duplicates()  # union of emergency and unplanned

temp_emer_drg_readmit = pd.merge(emer_drg, pd.DataFrame(drg_list_readmit_30, columns=['visit_sn']), on='visit_sn', how='inner')
print('emer readmit,', temp_emer_drg_readmit.shape)

emer_and_unplan_drg_readmit = pd.merge(emer_and_unplan_drg, pd.DataFrame(drg_list_readmit_30, columns=['visit_sn']), on='visit_sn', how='inner')
print('emer + unplanned readmit,', emer_and_unplan_drg_readmit.shape)
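
The emergency-or-unplanned union above can also be written as a single boolean mask. The sketch below shows this on a toy table. The column names follow the DRG data used in this notebook, while the rows and the values '门诊' (outpatient referral) and '否' (no) are hypothetical.

```python
import pandas as pd

drg_toy = pd.DataFrame({
    'visit_sn': [1, 2, 3, 4, 5],
    'inpatient_type': ['急诊', '门诊', '急诊', '门诊', '门诊'],  # '急诊' = emergency
    'unplaned': ['否', '是', '是', '否', '否'],                # '是' = yes (unplanned)
})
readmit_visits = [2, 3, 5]  # hypothetical readmission visit_sn within 30 days

# Emergency OR unplanned, restricted to visits that are 30-day readmissions
is_emer_or_unplanned = (drg_toy['inpatient_type'] == '急诊') | (drg_toy['unplaned'] == '是')
is_readmit = drg_toy['visit_sn'].isin(readmit_visits)
positives = drg_toy.loc[is_emer_or_unplanned & is_readmit, 'visit_sn'].tolist()
print(positives)
```

Visit 1 is an emergency admission but not a readmission, and visit 5 is a readmission that is neither emergency nor unplanned, so only visits 2 and 3 remain.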


### Summary of positive sample selection
- Emergency readmissions alone yield too few samples (21 readmissions in 4 months).
- Readmissions after chemotherapy yield enough samples (241 readmissions in 4 months), but they cannot be distinguished from planned chemotherapy cycles, which are sometimes scheduled at 7-day or 15-day intervals depending on the treatment plan.
- Combining emergency readmissions with unplanned inpatient admissions (as categorized in the DRG system) gives a moderate sample size (112 readmissions in 4 months) while preserving the quality of the positive samples.

Emergency + unplanned readmissions are therefore chosen as the positive samples.
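
Once the positive readmissions are fixed, they need to become a binary label on the modeling table. The sketch below shows one way this could look. It assumes the label is attached to the admission that precedes the readmission, which this notebook does not state explicitly; the cohort and IDs are made up.

```python
import pandas as pd

cohort = pd.DataFrame({'visit_sn': [10, 11, 12, 13, 14]})  # hypothetical modeling cohort
positive_index_visits = [11, 14]  # hypothetical visits followed by an emergency/unplanned readmission

# 1 if the visit is followed by a qualifying readmission, else 0
cohort['label'] = cohort['visit_sn'].isin(positive_index_visits).astype(int)
prevalence = cohort['label'].mean()
print(cohort, prevalence)
```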
