# Patient Readmission Risk – Exploratory Data Analysis (EDA)

This notebook explores the **Diabetes 130-US hospitals dataset** (UCI Machine Learning Repository) to prepare it for modeling patient readmission risk within 30 days.

---


import pandas as pd

# Load the dataset from the data folder
df = pd.read_csv("../data/diabetic_data.csv")

# Show dataset dimensions
print("Shape (rows, columns):", df.shape)

# Show first 5 rows
df.head()


In [1]:

df["readmit_30"] = (df["readmitted"] == "<30").astype(int)

# Check balance of the target
df["readmit_30"].value_counts(normalize=True)



In [None]:
# See column names and data types
df.info()


In [None]:
(df == "?").sum().sort_values(ascending=False).head(10)


In [None]:
for col in ["race", "gender", "age", "max_glu_serum", "A1Cresult"]:
    print(f"\n{col} value counts:")
    print(df[col].value_counts())


## Step 3: Data Cleaning

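- Drop ID columns (`encounter_id`, `patient_nbr`) since they don’t add predictive value.  
- Drop sparse or near-constant columns (`weight`, `payer_code`, `medical_specialty`, `examide`, `citoglipton`) and the original `readmitted` label, which `readmit_30` replaces.  
- Replace `"?"` with `NaN` for proper missing value handling.  
- Check which columns have the most missing values.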


# 1) Drop identifiers and very low-value columns
to_drop = [
    "encounter_id", "patient_nbr",   # pure IDs
    "weight",                         # ~95% missing
    "payer_code",                     # very sparse
    "medical_specialty",              # many missing, ultra-high cardinality
    "examide", "citoglipton",         # almost all 'No'
    "readmitted"                      # replaced by readmit_30
]
df = df.drop(columns=[c for c in to_drop if c in df.columns])

In [None]:
import numpy as np

df = df.replace("?", np.nan)


df.isnull().sum().sort_values(ascending=False).head(15)


In [None]:
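# Check where missing values are, then impute (fill) them:
# numeric columns get the median, categorical columns get an "Unknown" label.
num_cols = df.select_dtypes(include="number").columns.drop("readmit_30", errors="ignore")
cat_cols = df.select_dtypes(include="object").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna("Unknown")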

In [None]:
df.isnull().sum().sum()


## Step 6: Train-Test Split

- Separated features (`X`) and target (`y`).
- Used an 80/20 split (train = 81,412 rows, test = 20,354 rows).
- Stratified by target to preserve class balance in both sets.


In [None]:
from sklearn.model_selection import train_test_split

target = "readmit_30"
X = df.drop(columns=[target])
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train.shape, X_test.shape, y_train.mean(), y_test.mean()


Here’s what each part of the output means:

- **(81412, 42)** → `X_train` has 81,412 rows (examples) and 42 columns (features).  
  These rows are used to fit (train) the model.  

- **(20354, 42)** → `X_test` has 20,354 rows and the same 42 features.  
  These rows are held out for evaluation.  

- **0.1116 (train)** → About 11.16% of patients in the training set are positive cases (`readmit_30 = 1`).  

- **0.1116 (test)** → About 11.16% of patients in the test set are positive.  

💡 Because we used `stratify=y` in the split, the class balance (≈11% positives) is preserved in both train and test. This is critical for imbalanced datasets — otherwise, you might end up with too few positive cases in one set.
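
To make the effect of `stratify=y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the helper name `stratified_split` and the synthetic labels (1,116 positives out of 10,000, mimicking the ≈11% prevalence) are assumptions for illustration only.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test rows separately within each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Synthetic labels (assumption): about 11% positives
labels = [1] * 1116 + [0] * 8884

train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positive rate: {train_rate:.4f}")
print(f"test positive rate:  {test_rate:.4f}")
```

Because each class is split on its own, the two positive rates can differ only by rounding, which is the behavior the stratified `train_test_split` call relies on.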


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


The cell above imports the preprocessing and modeling tools.

In [None]:
# Separate categorical vs numeric features
cat_cols = X_train.select_dtypes(include="object").columns.tolist()
num_cols = X_train.select_dtypes(exclude="object").columns.tolist()

cat_cols[:5], num_cols[:5]   # quick peek at a few


## Step 7: Preprocessing + Baseline Model (Logistic Regression)

- Imported preprocessing tools (`ColumnTransformer`, `OneHotEncoder`, `Pipeline`) and model (`LogisticRegression`).
- Identified **categorical** columns (strings/objects) and **numeric** columns (integers/floats).
- Built a preprocessing pipeline:
  - One-hot encode categorical variables.
  - Standardize numeric variables with `StandardScaler`.
- Chose Logistic Regression as the baseline model:
  - Simple, interpretable, widely used in healthcare.
  - Added `class_weight="balanced"` to account for imbalanced data (~11% positives).
- Trained the pipeline on the training set (`X_train`, `y_train`).
- Evaluated performance on the test set using:
  - **AUROC** (how well the model ranks positive vs negative cases).
  - **AUPRC** (how well the model identifies positives in an imbalanced dataset).
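
As a rough illustration of what `class_weight="balanced"` does, scikit-learn documents the weights as `n_samples / (n_classes * count_per_class)`. The sketch below applies that formula by hand. The class counts are assumed for illustration (9,086 positives out of 81,412 training rows, consistent with the ≈11.16% prevalence reported earlier but not taken from notebook output), and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(counts):
    """counts: dict mapping class label -> number of samples in that class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Assumed counts for illustration (~11.16% positives)
counts = {0: 72326, 1: 9086}
weights = balanced_class_weights(counts)
for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")

# With these weights, each class contributes the same total weight to the loss
total_per_class = {cls: weights[cls] * n for cls, n in counts.items()}
print(total_per_class)
```

With these assumed counts, each positive row weighs about eight times as much as a negative row, which is the negative-to-positive ratio.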


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

cat_cols = X_train.select_dtypes(include="object").columns.tolist()
num_cols = X_train.select_dtypes(exclude="object").columns.tolist()

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scaler", StandardScaler())]), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # sparse by default
    ],
    remainder="drop",
    force_int_remainder_cols=False  # silences the future warning & adopts new behavior
)

In [None]:
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[
    ("prep", preprocess),
    ("model", LogisticRegression(
        solver="saga",            # handles large sparse designs better
        max_iter=5000,            # give it room to converge
        class_weight="balanced",  # handle ~11% positives
        n_jobs=-1,                # parallelizes only one-vs-rest multiclass fits; no speedup for this binary target
        C=0.5                     # a bit more regularization; adjust if needed
    ))
])

clf.fit(X_train, y_train)


## Step 8: Model Evaluation

- Evaluating the fitted Logistic Regression pipeline on the test set (20,354 rows).
- Metrics:
  - **AUROC** (Area Under ROC Curve): measures ability to rank positives vs negatives.
  - **AUPRC** (Area Under Precision-Recall Curve): better for imbalanced data, shows how well we identify positives.
- Both metrics are important:
  - AUROC tells us general ranking performance.
  - AUPRC tells us how useful the model is in catching the rare 11% positives.
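
Both metrics can be computed by hand on a toy example, which may help make the definitions concrete. The sketch below is a plain-Python illustration, not the scikit-learn implementation: `auroc` counts correctly ordered (positive, negative) pairs, and `average_precision` averages the precision at the rank of each true positive, assuming no tied scores. The five labels and scores are made up.

```python
def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true positive (no tied scores assumed)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Toy data (assumption): two positives among five patients
y_true = [1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.2]
print("AUROC:", round(auroc(y_true, scores), 3))
print("AUPRC (average precision):", round(average_precision(y_true, scores), 3))
```

On this toy input, 4 of the 6 positive-negative pairs are ordered correctly (AUROC ≈ 0.667), and the average precision is (1/1 + 2/4) / 2 = 0.75.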


In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score

# Predict probabilities for test set
p_test = clf.predict_proba(X_test)[:, 1]

# Metrics
auroc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print("AUROC:", round(auroc, 3))
print("AUPRC:", round(auprc, 3))


## Step 8: Model Evaluation Results

- Evaluated pipeline using AUROC and AUPRC on the test set (20,354 rows).
- **AUROC = 0.646** → Modest ranking ability: a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted patient about 65% of the time.
- **AUPRC = 0.205** → About 1.8× the random baseline (prevalence ≈ 0.11).
- Interpretation:
  - Logistic Regression provides a solid baseline.
  - Model is better than chance but not highly accurate → suggests trying stronger models (e.g., XGBoost).
- Next:
  - Experiment with thresholding (e.g., flag top 10% high-risk patients) to calculate precision/recall.
  - Compare with tree-based models to improve AUROC/AUPRC.


## Step 9: Thresholding Predictions

- Logistic Regression outputs probabilities, but in practice we need to choose a threshold to classify patients as "at risk."
- Instead of the default threshold (0.5), we flagged the **top 10% of patients by predicted risk** as positives.
- This approach simulates how hospitals might focus limited resources on the riskiest patients.

### Why this matters
- **Precision** → Of the patients we flagged, how many truly were readmitted within 30 days?  
- **Recall** → Of all patients who were actually readmitted, how many did we catch?  
- **F1-score** → Balance of precision and recall.  

### Key learning
- Choosing a threshold depends on business/clinical priorities:
  - **High precision** → fewer false alarms (good when resources are expensive).  
  - **High recall** → catch more true cases (good when missing a case is very costly).  
- Evaluating at the top 10% risk level shows how the model could be used in a real-world hospital setting.


In [None]:
import numpy as np
from sklearn.metrics import classification_report

# Define threshold as the 90th percentile of predicted risk scores
thresh = np.quantile(p_test, 0.90)

# Classify as positive if risk >= threshold
pred_10 = (p_test >= thresh).astype(int)

print("Threshold (90th percentile risk):", round(thresh, 4))
print(classification_report(y_test, pred_10, digits=3))


### Step 9: Thresholding Results (Top 10% Risk Patients)

- **Threshold chosen:** 0.6536 (90th percentile of predicted probabilities).  
- **Results at this threshold:**
  - Precision (positives) = 0.242 → Of the patients flagged as high risk, ~24% were truly readmitted.  
  - Recall (positives) = 0.217 → The model identified ~22% of all actual readmissions.  
  - F1-score = 0.228 → Balance of precision and recall is modest.  
  - Accuracy = 0.837, but accuracy is less meaningful with imbalance: flagging no one at all would already score about 0.888 (the share of patients not readmitted).
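
The reported figures can be cross-checked with a little arithmetic. The sketch below rebuilds an approximate confusion matrix from the test-set size, prevalence, and precision quoted in this notebook. The counts are rounded estimates derived from those summary numbers, not actual notebook output.

```python
n_test = 20354          # test rows
prevalence = 0.1116     # share of positives in the test set
flag_rate = 0.10        # top 10% of patients flagged
precision = 0.242       # reported precision among flagged patients

positives = round(n_test * prevalence)
flagged = round(n_test * flag_rate)
tp = round(flagged * precision)     # flagged and readmitted
fp = flagged - tp                   # flagged but not readmitted
fn = positives - tp                 # readmitted but not flagged
tn = n_test - tp - fp - fn

recall = tp / positives
accuracy = (tp + tn) / n_test
all_negative_accuracy = (n_test - positives) / n_test

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"recall ≈ {recall:.3f}, accuracy ≈ {accuracy:.3f}")
print(f"accuracy when nobody is flagged ≈ {all_negative_accuracy:.3f}")
```

The reconstructed recall and accuracy land close to the reported 0.217 and 0.837, and the last line shows why accuracy alone says little here.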

### Interpretation
- The model does focus on a smaller high-risk group: flagged patients are more than **2× as likely** to be readmitted compared to baseline prevalence (24% vs 11%).  
- However, recall is limited: ~78% of true readmissions were missed at this threshold.  
- This highlights the **precision–recall trade-off**:
  - Raising threshold → fewer flagged, higher precision, lower recall.  
  - Lowering threshold → more flagged, lower precision, higher recall.  

### Next Steps
- Experiment with different thresholds (e.g., top 20%, 30%) to explore trade-offs.  
- Try stronger models (e.g., XGBoost) to improve AUROC/AUPRC.  
- Consider calibration to improve probability estimates.  
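
A threshold sweep along these lines could look like the sketch below. It runs on synthetic scores (an assumption, so the printed numbers say nothing about the real model), and it flags a fixed share of rows by rank rather than with `np.quantile`, which is equivalent when scores have no ties. `top_fraction_metrics` is a hypothetical helper name.

```python
import random

def top_fraction_metrics(y_true, scores, fraction):
    """Flag the highest-scoring `fraction` of rows and return (precision, recall)."""
    n_flag = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = order[:n_flag]
    tp = sum(y_true[i] for i in flagged)
    return tp / n_flag, tp / sum(y_true)

# Synthetic data (assumption): ~11% positives whose scores are shifted upward on average
rng = random.Random(0)
y_true = [1 if rng.random() < 0.11 else 0 for _ in range(5000)]
scores = [rng.gauss(0.6 if y else 0.4, 0.2) for y in y_true]

results = {}
for fraction in (0.10, 0.20, 0.30):
    precision, recall = top_fraction_metrics(y_true, scores, fraction)
    results[fraction] = (precision, recall)
    print(f"top {fraction:.0%}: precision={precision:.3f}, recall={recall:.3f}")
```

Recall can only go up as the flagged share grows, because each larger group contains the smaller one; precision typically falls as lower-scored patients are added.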


## Step 10: Stronger Model (XGBoost)

- XGBoost (Extreme Gradient Boosting) is a tree-based ensemble method:
  - Builds many decision trees sequentially, each correcting errors of the last.
  - Handles non-linear relationships better than Logistic Regression.
  - Can compensate for class imbalance by up-weighting positives with `scale_pos_weight`.
- Expectation: Better AUROC and AUPRC compared to Logistic Regression.
- Goal: Train and evaluate XGBoost with the same preprocessing pipeline for a fair comparison.
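
The "each tree corrects the errors of the last" idea can be sketched in a few lines. The example below is a deliberately stripped-down illustration of boosting on residuals with one-split trees (stumps) and squared error, on a made-up 1-D regression target. It is not XGBoost's algorithm, which adds regularization, second-order gradient information, and much more. All function names here are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Return the training MSE after each boosting round."""
    prediction = [sum(y) / len(y)] * len(y)   # start from the mean of y
    history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)       # the new tree targets the current errors
        prediction = [pi + learning_rate * stump(xi) for pi, xi in zip(prediction, x)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    return history

# Toy non-linear target (assumption for illustration)
x = [i / 10 for i in range(50)]
y = [(xi - 2.5) ** 2 for xi in x]

mse_history = boost(x, y)
print("MSE after round 1: ", round(mse_history[0], 4))
print("MSE after round 20:", round(mse_history[-1], 4))
```

A single stump cannot fit the curved target, but the sum of many small corrections can, which is the intuition behind handling non-linear relationships.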


In [None]:
from xgboost import XGBClassifier
print("XGBoost is working!")


In [None]:
import sys, platform
print("Python:", sys.version)
print("Exec:", sys.executable)
import numpy as np, pandas as pd
import sklearn
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("sklearn:", sklearn.__version__)
try:
    import xgboost as xgb
    print("xgboost:", xgb.__version__)
except Exception as e:
    print("xgboost import error:", e)


In [None]:
# ==== Imports ====
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, classification_report
from sklearn.linear_model import LogisticRegression

# ==== Load data ====
df = pd.read_csv("../data/diabetic_data.csv")

# Target
df["readmit_30"] = (df["readmitted"] == "<30").astype(int)

# Drop low-value columns
to_drop = [
    "encounter_id", "patient_nbr",
    "weight", "payer_code", "medical_specialty",
    "examide", "citoglipton",
    "readmitted",
]
df = df.drop(columns=[c for c in to_drop if c in df.columns])

# Normalize missing markers and simple imputations (fast baseline)
df = df.replace("?", np.nan)

num_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
cat_cols = df.select_dtypes(include=["object"]).columns.tolist()
if "readmit_30" in num_cols: num_cols.remove("readmit_30")

for c in num_cols:
    df[c] = df[c].astype(float).fillna(df[c].median())
for c in cat_cols:
    df[c] = df[c].fillna("Unknown")

# Train/test split (stratified)
X = df.drop(columns=["readmit_30"])
y = df["readmit_30"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocess: scale numeric, OHE categorical (sparse)
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scaler", StandardScaler())]), X_train.select_dtypes(exclude="object").columns.tolist()),
        ("cat", OneHotEncoder(handle_unknown="ignore"), X_train.select_dtypes(include="object").columns.tolist()),
    ],
    remainder="drop",
    force_int_remainder_cols=False,
)

# ==== Quick sanity model (LogReg) on a subset for speed ====
subset = min(10000, len(X_train))
X_tr_sub = X_train.sample(subset, random_state=42)
y_tr_sub = y_train.loc[X_tr_sub.index]

logreg = Pipeline(steps=[
    ("prep", preprocess),
    ("model", LogisticRegression(
        solver="saga", max_iter=2000, class_weight="balanced", n_jobs=-1, C=0.7
    ))
])

logreg.fit(X_tr_sub, y_tr_sub)
p_test_lr = logreg.predict_proba(X_test)[:, 1]
print("Sanity AUROC (LogReg subset):", round(roc_auc_score(y_test, p_test_lr), 3))
print("Sanity AUPRC (LogReg subset):", round(average_precision_score(y_test, p_test_lr), 3))

# ==== Try XGBoost if available; else fall back to HistGradientBoosting ====
try:
    from xgboost import XGBClassifier
    model = Pipeline(steps=[
        ("prep", preprocess),
        ("model", XGBClassifier(
            n_estimators=300, max_depth=4, learning_rate=0.08,
            subsample=0.9, colsample_bytree=0.9, eval_metric="logloss",
            scale_pos_weight=(y_train.value_counts()[0]/y_train.value_counts()[1])
        ))
    ])
    model.fit(X_train, y_train)
    p_test = model.predict_proba(X_test)[:, 1]
    print("XGB AUROC:", round(roc_auc_score(y_test, p_test), 3))
    print("XGB AUPRC:", round(average_precision_score(y_test, p_test), 3))
    chosen = "XGBoost"
except Exception as e:
    # Broad catch: this branch runs if xgboost is missing OR if fitting it fails
    print("XGBoost unavailable or failed, using HistGradientBoosting:", e)
    from sklearn.base import clone
    from sklearn.ensemble import HistGradientBoostingClassifier
    # HistGradientBoosting requires dense input, so make the preprocessor return a
    # dense array (memory-heavy with 2,000+ one-hot columns)
    dense_preprocess = clone(preprocess).set_params(sparse_threshold=0)
    model = Pipeline(steps=[
        ("prep", dense_preprocess),
        ("model", HistGradientBoostingClassifier(
            learning_rate=0.08, max_depth=4, max_iter=300,
            class_weight={0:1, 1:(y_train.value_counts()[0]/y_train.value_counts()[1])}
        ))
    ])
    model.fit(X_train, y_train)
    p_test = model.predict_proba(X_test)[:, 1]
    print("HGB AUROC:", round(roc_auc_score(y_test, p_test), 3))
    print("HGB AUPRC:", round(average_precision_score(y_test, p_test), 3))
    chosen = "HistGradientBoosting"

# Threshold @ top 10% for the chosen model
thresh = np.quantile(p_test, 0.90)
pred_10 = (p_test >= thresh).astype(int)
print(f"Threshold (90th percentile, {chosen}):", round(thresh, 4))
print(classification_report(y_test, pred_10, digits=3))


## Step 10: Stronger Model (XGBoost) — Results

**Metrics (test set):**
- **AUROC:** 0.682  
- **AUPRC:** 0.232  
- (Baseline Logistic Regression earlier: AUROC 0.646, AUPRC 0.205)

**Thresholded evaluation (top 10% highest risk):**
- **Precision:** 0.272  
- **Recall:** 0.244  
- **Accuracy:** 0.843  
- Interpretation: The flagged cohort is readmitted at ~2.4× the baseline prevalence (27% vs 11%), and we capture ~24% of true readmissions at this operating point.

**Takeaways:**
- XGBoost improves ranking and early-warning utility over Logistic Regression.
- Further gains likely from threshold tuning, hyperparameter search, and feature engineering (e.g., diagnosis grouping, medication aggregates).


In [None]:
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

# (Re)build the pipeline with a non-conflicting name
xgb_pipe = Pipeline(steps=[
    ("prep", preprocess),
    ("model", XGBClassifier(
        n_estimators=300,
        max_depth=4,
        learning_rate=0.08,
        subsample=0.9,
        colsample_bytree=0.9,
        eval_metric="logloss",
        scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1])
    ))
])

# Fit if needed (skip if you already have a fitted pipeline under a different name)
xgb_pipe.fit(X_train, y_train)

# Ensure the reports folder exists
import os
os.makedirs("../reports", exist_ok=True)

# Save the fitted pipeline
import joblib
joblib.dump(xgb_pipe, "../reports/readmission_xgb_pipeline.joblib")


## Step 11: Model Saving & Reuse

- Saved the trained XGBoost pipeline to `../reports/readmission_xgb_pipeline.joblib`.
- The file contains the **entire pipeline**:
  - Preprocessing (scaling + one-hot encoding).
  - Trained XGBoost model.
- Benefits:
  - Can reload later without retraining.
  - Ensures consistent preprocessing and model logic.
- Verified by reloading and confirming AUROC/AUPRC match previous results.


In [None]:
import joblib

# Load the saved pipeline
loaded_model = joblib.load("../reports/readmission_xgb_pipeline.joblib")

# Sanity check: evaluate on test set
p_loaded = loaded_model.predict_proba(X_test)[:, 1]

print("Reloaded AUROC:", round(roc_auc_score(y_test, p_loaded), 3))
print("Reloaded AUPRC:", round(average_precision_score(y_test, p_loaded), 3))


### Step 11: Model Saving & Verification

- Reloaded the saved pipeline (`../reports/readmission_xgb_pipeline.joblib`) using `joblib.load`.  
- Evaluated the reloaded model on the test set:  
  - AUROC ≈ 0.682  
  - AUPRC ≈ 0.232  
- Results match the original Step 10 evaluation, confirming that:
  - The entire pipeline (preprocessing + XGBoost model) was saved successfully.  
  - Reloading works as expected and can be used for future predictions without retraining.  

**Key takeaway:**  
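Saving the full pipeline with `joblib` supports deployment and reproducibility: preprocessing and model logic stay consistent across training, testing, and future use. Reload it with the same scikit-learn and xgboost versions used for training, since pickled estimators are not guaranteed to load correctly across versions.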
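
Comparing rounded AUROC/AUPRC values is a fairly coarse check; a stricter one compares the predictions themselves before and after reloading. The sketch below shows the pattern with stand-ins so that it runs without any saved artifact: the standard-library `pickle` module replaces `joblib`, and a plain dict of made-up logistic coefficients replaces the fitted pipeline. Both substitutions are assumptions for illustration.

```python
import math
import os
import pickle
import tempfile

def predict_proba(model, rows):
    """Logistic prediction from a plain dict of parameters (hypothetical stand-in model)."""
    out = []
    for row in rows:
        z = model["bias"] + sum(w * v for w, v in zip(model["weights"], row))
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

model = {"weights": [0.8, -0.5], "bias": -2.0}      # made-up parameters
rows = [[1.0, 0.0], [3.0, 1.0], [0.5, 2.0]]         # made-up inputs
before = predict_proba(model, rows)

# Save and reload through a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

after = predict_proba(reloaded, rows)
identical = all(math.isclose(a, b, rel_tol=0.0, abs_tol=1e-12) for a, b in zip(before, after))
print("Predictions identical after reload:", identical)
```

With the real pipeline, the analogous check would compare `xgb_pipe.predict_proba(X_test)` against `loaded_model.predict_proba(X_test)` element by element.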


## Step 12: Demo Predictions (Deployment-Style)

- Goal: Show how the saved model can be reused to score **new patient data** without retraining.
- Process:
  1. Load the saved pipeline using `joblib.load`.
  2. Create a small sample input (new patient encounter) with the same feature structure as the training data.
  3. Use `.predict_proba()` to generate readmission risk probabilities.
  4. Interpret the output:
     - A probability close to 1 → high risk of 30-day readmission.
     - A probability close to 0 → low risk.
- Importance:
  - Demonstrates that the saved pipeline can score raw patient records **end to end**, a first step toward deployment.
  - Clinicians, analysts, or downstream systems could plug in real patient data to get a risk score.
  - Closing the loop: from raw data → trained model → saved pipeline → real-world predictions.
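
One practical detail in step 2 is that `pd.DataFrame([row], columns=feature_cols)` keeps only the listed columns, so a misspelled key in the input dict is dropped without any error and the default value is used instead. A small guard like the sketch below can catch that before scoring. `build_input_row` and the shortened feature lists are hypothetical and only illustrate the idea.

```python
def build_input_row(feature_cols, cat_cols, overrides):
    """Fill every expected feature with a default, then apply overrides; reject unknown keys."""
    unknown = sorted(set(overrides) - set(feature_cols))
    if unknown:
        raise KeyError(f"Unexpected feature name(s): {unknown}")
    row = {c: ("Unknown" if c in cat_cols else 0) for c in feature_cols}
    row.update(overrides)
    return row

# Hypothetical, shortened feature lists for illustration
cat_cols = ["race", "gender", "age"]
num_cols = ["time_in_hospital", "num_medications"]
feature_cols = num_cols + cat_cols

row = build_input_row(feature_cols, cat_cols, {"age": "[60-70)", "time_in_hospital": 5})
print(row)

try:
    build_input_row(feature_cols, cat_cols, {"time_in_hosptial": 5})  # misspelled on purpose
except KeyError as err:
    print("Rejected:", err)
```

The returned dict can then be passed to `pd.DataFrame([row], columns=feature_cols)` exactly as in the demo cell.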


In [None]:
import pandas as pd, joblib

# Load saved pipeline
loaded_model = joblib.load("../reports/readmission_xgb_pipeline.joblib")

# Get original raw feature columns from the ColumnTransformer
ct = loaded_model.named_steps["prep"]
cat_cols, num_cols = [], []
for name, trans, cols in ct.transformers_:
    if name == "cat":
        cat_cols = list(cols)
    elif name == "num":
        num_cols = list(cols)

feature_cols = num_cols + cat_cols  # raw input order

# Build defaults then overwrite with your demo values
demo_data = {c: ("Unknown" if c in cat_cols else 0) for c in feature_cols}
demo_data.update({
    "race": "Caucasian",
    "gender": "Female",
    "age": "[60-70)",
    "admission_type_id": 1,
    "discharge_disposition_id": 1,
    "admission_source_id": 7,
    "time_in_hospital": 5,
    "num_lab_procedures": 45,
    "num_procedures": 1,
    "num_medications": 12,
    "number_diagnoses": 4,
    "diag_1": "250.00",
    "diag_2": "401.9",
    "diag_3": "414.01",
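    "max_glu_serum": "None",   # caution: if read_csv parsed "None" as missing, training saw "Unknown" here and "None" is an unseen category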
    "A1Cresult": ">7",
    "metformin": "No",
    "insulin": "Up",
    "change": "Ch",
    "diabetesMed": "Yes"
})

demo_patient = pd.DataFrame([demo_data], columns=feature_cols)
risk_score = loaded_model.predict_proba(demo_patient)[:, 1][0]
print("Predicted 30-day readmission risk:", f"{risk_score:.1%}")


In [3]:
# Text-only SHAP explanation for the demo patient
import numpy as np
import pandas as pd
import shap

# 1) Grab pieces from the saved pipeline
prep = loaded_model.named_steps["prep"]          # ColumnTransformer
model = loaded_model.named_steps["model"]        # XGBClassifier

# 2) Transform the demo row to the encoded feature space
Xp = prep.transform(demo_patient)

# 3) Get output feature names (after OHE + scaling)
feat_names = prep.get_feature_names_out()

# 4) Compute SHAP values for this single row (tree explainer is fast)
explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(Xp)            # shape: (n_samples, n_features)
shap_row = np.asarray(shap_vals)[0]              # take first (and only) row

# 5) Build a tidy table: feature, contribution (signed), |contribution|
contrib = pd.DataFrame({
    "feature": feat_names,
    "shap": shap_row,
    "abs_shap": np.abs(shap_row)
}).sort_values("abs_shap", ascending=False)

# 6) Show top drivers up and down
top_k = 10  # adjust to see more/less
top_up = contrib[contrib["shap"] > 0].head(top_k)[["feature","shap"]]
top_down = contrib[contrib["shap"] < 0].head(top_k)[["feature","shap"]]

print("Top factors INCREASING risk:")
display(top_up)

print("\nTop factors DECREASING risk:")
display(top_down)

# Optional: concise bullet list
def bullets(df):
    return [f"{r.feature}: {r.shap:+.3f}" for r in df.itertuples(index=False)]
print("\nQuick summary ↑:", bullets(top_up))
print("Quick summary ↓:", bullets(top_down))



In [4]:
# === Self-contained SHAP explainability for one demo patient ===
import numpy as np, pandas as pd, joblib, shap

# 1) Load the saved pipeline
loaded_model = joblib.load("../reports/readmission_xgb_pipeline.joblib")
prep = loaded_model.named_steps["prep"]     # ColumnTransformer
model = loaded_model.named_steps["model"]   # XGBClassifier

# 2) Rebuild the raw feature list from the fitted ColumnTransformer
cat_cols, num_cols = [], []
for name, trans, cols in prep.transformers_:
    if name == "cat":
        cat_cols = list(cols)
    elif name == "num":
        num_cols = list(cols)
feature_cols = num_cols + cat_cols

# 3) Create a demo patient with safe defaults, then overwrite key values
demo_data = {c: ("Unknown" if c in cat_cols else 0) for c in feature_cols}
demo_data.update({
    "race": "Caucasian",
    "gender": "Female",
    "age": "[60-70)",
    "admission_type_id": 1,
    "discharge_disposition_id": 1,
    "admission_source_id": 7,
    "time_in_hospital": 5,
    "num_lab_procedures": 45,
    "num_procedures": 1,
    "num_medications": 12,
    "number_diagnoses": 4,
    "diag_1": "250.00",
    "diag_2": "401.9",
    "diag_3": "414.01",
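    "max_glu_serum": "None",   # caution: if read_csv parsed "None" as missing, training saw "Unknown" here and "None" is an unseen category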
    "A1Cresult": ">7",
    "metformin": "No",
    "insulin": "Up",
    "change": "Ch",
    "diabetesMed": "Yes"
})
demo_patient = pd.DataFrame([demo_data], columns=feature_cols)

# 4) Predict risk (sanity check)
risk = float(loaded_model.predict_proba(demo_patient)[:, 1][0])

# 5) Transform to model input space and compute SHAP
Xp = prep.transform(demo_patient)
feat_names = prep.get_feature_names_out()

explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(Xp)
shap_row = np.asarray(shap_vals)[0]

# 6) Rank features by contribution
contrib = pd.DataFrame({
    "feature": feat_names,
    "shap": shap_row,
    "abs_shap": np.abs(shap_row)
}).sort_values("abs_shap", ascending=False)

top_k = 10
top_up = contrib[contrib["shap"] > 0].head(top_k)[["feature","shap"]]
top_down = contrib[contrib["shap"] < 0].head(top_k)[["feature","shap"]]

print(f"Predicted 30-day readmission risk: {risk:.1%}\n")
print("Top factors INCREASING risk:")
display(top_up)
print("\nTop factors DECREASING risk:")
display(top_down)

def bullets(df):
    return [f"{r.feature}: {r.shap:+.3f}" for r in df.itertuples(index=False)]
print("\nQuick summary ↑:", bullets(top_up))
print("Quick summary ↓:", bullets(top_down))


Predicted 30-day readmission risk: 36.3%

Top factors INCREASING risk:


| index | feature | shap |
|---|---|---|
| 3 | `num__time_in_hospital` | 0.03546 |
| 6 | `num__num_medications` | 0.024733 |
| 2277 | `cat__diabetesMed_No` | 0.024201 |
| 2203 | `cat__max_glu_serum_Unknown` | 0.014354 |
| 2209 | `cat__metformin_No` | 0.013106 |
| 25 | `cat__age_[50-60)` | 0.011048 |
| 556 | `cat__diag_1_786` | 0.010925 |
| 1519 | `cat__diag_3_250` | 0.009701 |
| 348 | `cat__diag_1_486` | 0.009194 |
| 861 | `cat__diag_2_285` | 0.007404 |



Top factors DECREASING risk:


| index | feature | shap |
|---|---|---|
| 9 | `num__number_inpatient` | -0.319334 |
| 1 | `num__discharge_disposition_id` | -0.20642 |
| 10 | `num__number_diagnoses` | -0.11758 |
| 2207 | `cat__A1Cresult_Unknown` | -0.054839 |
| 8 | `num__number_emergency` | -0.02557 |
| 2248 | `cat__acarbose_No` | -0.024632 |
| 2259 | `cat__insulin_Down` | -0.015481 |
| 299 | `cat__diag_1_428` | -0.013505 |
| 2234 | `cat__glyburide_No` | -0.007645 |
| 27 | `cat__age_[70-80)` | -0.005954 |



Quick summary ↑: ['num__time_in_hospital: +0.035', 'num__num_medications: +0.025', 'cat__diabetesMed_No: +0.024', 'cat__max_glu_serum_Unknown: +0.014', 'cat__metformin_No: +0.013', 'cat__age_[50-60): +0.011', 'cat__diag_1_786: +0.011', 'cat__diag_3_250: +0.010', 'cat__diag_1_486: +0.009', 'cat__diag_2_285: +0.007']
Quick summary ↓: ['num__number_inpatient: -0.319', 'num__discharge_disposition_id: -0.206', 'num__number_diagnoses: -0.118', 'cat__A1Cresult_Unknown: -0.055', 'num__number_emergency: -0.026', 'cat__acarbose_No: -0.025', 'cat__insulin_Down: -0.015', 'cat__diag_1_428: -0.014', 'cat__glyburide_No: -0.008', 'cat__age_[70-80): -0.006']
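
A note on reading these numbers: assuming the default `shap.TreeExplainer` settings used in the cell above, SHAP values for an XGBoost classifier are expressed in log-odds units, not probability points. The sketch below converts between the two scales using the predicted risk (36.3%) and the largest negative contribution (`num__number_inpatient`, -0.319) from the output above. The "what-if" figure is only a rough reading of the additive explanation, not a re-prediction by the model.

```python
import math

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))

predicted_risk = 0.363                     # from the output above
total_log_odds = logit(predicted_risk)     # where the additive SHAP terms end up

# Rough what-if: cancel the -0.319 contribution of num__number_inpatient
without_inpatient = sigmoid(total_log_odds + 0.319)

print(f"log-odds of the prediction: {total_log_odds:.3f}")
print(f"risk if that contribution were zero: {without_inpatient:.1%}")
```

The same shift of 0.319 in log-odds would move a prediction by a different number of percentage points at a different starting risk, which is why the SHAP column should not be read as percentages.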
