# Medical Appointment No-Shows – Tree-Based Classification

Goal: Predict whether a patient will miss their medical appointment ("No-show") based on demographic and health-related features, using Decision Tree and Random Forest classifiers.

The dataset is from Kaggle and includes information such as gender, age, diabetes, hypertension, and whether the patient received an SMS reminder.
The target variable is No-show:

- "Yes" means the patient did not show up.
- "No" means the patient attended the appointment.


In [1]:
## Step 1 - Import Libraries

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


In [2]:
## Step 2 - Load the Dataset

# Read the dataset from CSV file

df = pd.read_csv("KaggleV2-May-2016.csv")

print("Shape of dataset (rows, columns):", df.shape)

# Show first 5 rows
df.head()


Shape of dataset (rows, columns): (110527, 14)


Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [3]:
## Step 3 - Select Columns + Clean Data

# Keep only the columns we need
columns_to_keep = [
    "Gender",
    "Age",
    "Scholarship",
    "Hipertension",
    "Diabetes",
    "Alcoholism",
    "Handcap",
    "SMS_received",
    "No-show"
]

df = df[columns_to_keep].copy()

# Drop any rows with missing values (if there are any)
df = df.dropna()

print("Shape after selecting columns and dropping NaNs:", df.shape)
df.head()


Shape after selecting columns and dropping NaNs: (110527, 9)


Unnamed: 0,Gender,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,F,62,0,1,0,0,0,0,No
1,M,56,0,0,0,0,0,0,No
2,F,62,0,0,0,0,0,0,No
3,F,8,0,0,0,0,0,0,No
4,F,56,0,1,1,0,0,0,No


In [4]:
## Step 4 - Encode Variables + Create X, y


# Encode categorical variables

# Gender: F -> 1, M -> 0
df["Gender"] = (df["Gender"] == "F").astype(int)

# Target: "Yes" (no-show) -> 1, "No" (show) -> 0
df["No-show"] = (df["No-show"] == "Yes").astype(int)

# Define features (X) and target (y)
feature_names = [
    "Gender",
    "Age",
    "Scholarship",
    "Hipertension",
    "Diabetes",
    "Alcoholism",
    "Handcap",
    "SMS_received"
]

X = df[feature_names].copy()
y = df["No-show"].copy()

X.head()


Unnamed: 0,Gender,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
0,1,62,0,1,0,0,0,0
1,0,56,0,0,0,0,0,0
2,1,62,0,0,0,0,0,0
3,1,8,0,0,0,0,0,0
4,1,56,0,1,1,0,0,0


In [5]:
## Step 5 - Split 80 / 10 / 10

# First split: 80% train, 20% temporary (val + test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Second split: from the 20%, make 10% validation and 10% test (each half)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=0, stratify=y_temp
)

print("Train size:", X_train.shape[0])
print("Validation size:", X_val.shape[0])
print("Test size:", X_test.shape[0])


Train size: 88421
Validation size: 11053
Test size: 11053
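
Because both splits pass `stratify`, each subset should keep roughly the same no-show rate as the full data. Below is a minimal sketch of that check. It uses a synthetic label vector as a stand-in for the real dataset; the 20% positive rate, the random features, and the helper name `three_way_split` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, seed=0):
    """Split into 80% train, 10% validation, 10% test, stratified on y."""
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, random_state=seed, stratify=y_temp
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic stand-in: 10,000 rows, about 20% positives (assumed, not the real data)
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.20).astype(int)
X_demo = rng.normal(size=(10_000, 3))

splits = three_way_split(X_demo, y_demo)

# Positive-class rate in each subset; stratification should keep these close
rates = {
    name: float(part[1].mean())
    for name, part in zip(["train", "val", "test"], splits)
}
for name, rate in rates.items():
    print(f"{name}: positive rate = {rate:.3f}")
```

On the real data, the same check could be run by calling `.mean()` on `y_train`, `y_val`, and `y_test`.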


In [6]:
## Step 6 - Decision Tree (Gini vs Entropy)

criterions = ["gini", "entropy"]
dt_val_accuracies = {}

for crit in criterions:
    # Build the decision tree model with a given criterion
    dt_model = DecisionTreeClassifier(criterion=crit, random_state=0)
    dt_model.fit(X_train, y_train)
    
    # Predict on validation set
    y_val_pred = dt_model.predict(X_val)
    
    # Compute validation accuracy
    val_acc = accuracy_score(y_val, y_val_pred)
    dt_val_accuracies[crit] = val_acc
    
    print(f"Decision Tree ({crit}) - validation accuracy: {val_acc:.4f}")


Decision Tree (gini) - validation accuracy: 0.7944
Decision Tree (entropy) - validation accuracy: 0.7945


In [7]:
## Step 7 - Choose Best Decision Tree + Confusion Matrix

# Choose the best criterion based on validation accuracy
best_criterion = max(dt_val_accuracies, key=dt_val_accuracies.get)
print("Best criterion for Decision Tree:", best_criterion)

# Train the best model on the training set
best_dt = DecisionTreeClassifier(criterion=best_criterion, random_state=0)
best_dt.fit(X_train, y_train)

# Evaluate on the test set
y_test_pred_dt = best_dt.predict(X_test)

test_acc_dt = accuracy_score(y_test, y_test_pred_dt)
print("Decision Tree - test accuracy:", test_acc_dt)

# Confusion matrix
cm_dt = confusion_matrix(y_test, y_test_pred_dt)
print("Decision Tree - confusion matrix:\n", cm_dt)


Best criterion for Decision Tree: entropy
Decision Tree - test accuracy: 0.794987786121415
Decision Tree - confusion matrix:
 [[8749   72]
 [2194   38]]
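
Accuracy alone hides how the tree treats each class. The small sketch below uses only the confusion matrix printed above to derive recall and precision for the no-show class. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, with class 0 ("show") first.

```python
import numpy as np

# Decision Tree confusion matrix copied from the output above
cm_dt = np.array([[8749, 72],
                  [2194, 38]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_dt.ravel()

accuracy = (tn + tp) / cm_dt.sum()
recall_no_show = tp / (tp + fn)       # share of actual no-shows that were caught
precision_no_show = tp / (tp + fp)    # share of predicted no-shows that were right

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall_no_show:.4f}")
print(f"precision: {precision_no_show:.4f}")
```

The tree correctly flags only 38 of the 2,232 actual no-shows in the test set, so its accuracy comes almost entirely from the "show" class.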


In [8]:
## Step 8 - Random Forest (10, 50, 100 trees)


n_estimators_list = [10, 50, 100]
rf_val_accuracies = {}

for n in n_estimators_list:
    # Build random forest with n trees
    rf_model = RandomForestClassifier(
        n_estimators=n,
        criterion="entropy",   # one criterion, same as slides
        random_state=0
    )
    rf_model.fit(X_train, y_train)
    
    # Predict on validation set
    y_val_pred_rf = rf_model.predict(X_val)
    
    val_acc_rf = accuracy_score(y_val, y_val_pred_rf)
    rf_val_accuracies[n] = val_acc_rf
    
    print(f"Random Forest (n_estimators={n}) - validation accuracy: {val_acc_rf:.4f}")


Random Forest (n_estimators=10) - validation accuracy: 0.7930
Random Forest (n_estimators=50) - validation accuracy: 0.7942
Random Forest (n_estimators=100) - validation accuracy: 0.7945


In [9]:
## Step 9 - Best Random Forest + Confusion Matrix

# Choose the best number of trees based on validation accuracy
best_n = max(rf_val_accuracies, key=rf_val_accuracies.get)
print("Best n_estimators for Random Forest:", best_n)

# Train final random forest model
best_rf = RandomForestClassifier(
    n_estimators=best_n,
    criterion="entropy",
    random_state=0
)
best_rf.fit(X_train, y_train)

# Evaluate on the test set
y_test_pred_rf = best_rf.predict(X_test)

test_acc_rf = accuracy_score(y_test, y_test_pred_rf)
print("Random Forest - test accuracy:", test_acc_rf)

# Confusion matrix
cm_rf = confusion_matrix(y_test, y_test_pred_rf)
print("Random Forest - confusion matrix:\n", cm_rf)


Best n_estimators for Random Forest: 100
Random Forest - test accuracy: 0.7944449470731928
Random Forest - confusion matrix:
 [[8734   87]
 [2185   47]]
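
To put both test accuracies in context, the sketch below compares them with a majority-class baseline that always predicts "show". It relies only on the two confusion matrices printed above; the helper name `summarize` is an assumption for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, columns: predicted)
matrices = {
    "Decision Tree": np.array([[8749, 72], [2194, 38]]),
    "Random Forest": np.array([[8734, 87], [2185, 47]]),
}


def summarize(cm):
    """Return (accuracy, majority-class baseline accuracy) for a 2x2 matrix."""
    accuracy = np.trace(cm) / cm.sum()
    # Accuracy of a model that always predicts the most common true class
    baseline = cm.sum(axis=1).max() / cm.sum()
    return float(accuracy), float(baseline)


summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, base) in summary.items():
    print(f"{name}: accuracy = {acc:.4f}, majority baseline = {base:.4f}")
```

In this run both models land slightly below the baseline, which suggests they add little beyond predicting the majority class.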


## Conclusion

- I used a medical appointment dataset to predict whether a patient will miss an appointment ("No-show").
- Features: Gender, Age, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received.
- Models: Decision Tree and Random Forest (tree-based classifiers).
- I split the data into 80% training, 10% validation, and 10% test.
- I tried different criteria for the Decision Tree (gini, entropy) and different numbers of trees for the Random Forest (10, 50, 100).
- I selected the best models based on validation accuracy and evaluated them on the test set using accuracy and the confusion matrix.
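- In my run, the two models were nearly tied on the test set: the Decision Tree reached 0.7950 accuracy and the Random Forest 0.7944, so the Random Forest did not outperform the single tree.
- Both confusion matrices show that the models rarely predict a no-show (38 and 47 correct predictions out of 2,232 actual no-shows), so accuracy mostly reflects the majority "show" class.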
