### 📥 Data Loading & Preprocessing

We begin by importing the necessary libraries and loading the training and test datasets. We check their dimensions and preview the first few rows to understand the structure and types of features. 

Then, so that we can work with the target variable 'Risk_Label', we use LabelEncoder from sklearn.preprocessing to convert its string labels into numeric values. We also separate the target label from the features so that we can train and test our model.





In [12]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


# Load the datasets
train_df = pd.read_csv("dataset/train_credit_data.csv")
test_df = pd.read_csv("dataset/test_credit_data.csv")

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
train_df.head()

# Copy to avoid modifying original data
train_data = train_df.copy()
test_data = test_df.copy()

# Label encode the target variable
le = LabelEncoder()
train_data['Risk_Label'] = le.fit_transform(train_data['Risk_Label'])
test_data['Risk_Label'] = le.transform(test_data['Risk_Label'])

# Separate features and labels
X_train = train_data.drop("Risk_Label", axis=1)
y_train = train_data["Risk_Label"]

X_test = test_data.drop("Risk_Label", axis=1)
y_test = test_data["Risk_Label"]

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

X_train.head()




Train shape: (3772, 13)
Test shape: (3428, 13)
Train shape: (3772, 12) (3772,)
Test shape: (3428, 12) (3428,)


|   | Income | DebtToIncome | CreditHistory | LatePayments | Employment | CreditLines | LoanAmount | HomeOwnership | Age | Savings | Education | ReliabilityScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101.692304 | 6.046014 | 12.538993 | 2 | 3.371397 | 2 | 223.749888 | 0 | 29.654235 | 63.632573 | 3 | 4.387126 |
| 1 | 65.263115 | 20.920108 | 16.524981 | 2 | 2.558994 | 4 | 39.948799 | 1 | 42.493071 | 61.098483 | 2 | 5.378843 |
| 2 | 83.760685 | 20.932257 | 6.834911 | 0 | 5.294267 | 2 | 206.694898 | 1 | 41.379374 | 82.461972 | 4 | 7.623962 |
| 3 | 73.293188 | 16.535676 | 12.673553 | 0 | 1.043815 | 4 | 120.73471 | 1 | 50.546432 | 80.630408 | 2 | 12.79508 |
| 4 | 76.857072 | 19.444921 | 10.359301 | 2 | 0.0 | 7 | 97.532696 | 0 | 29.366066 | 39.001485 | 3 | 3.478817 |
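
To make the label-encoding step easier to interpret, it can help to look at which integer each class receives. The sketch below is only illustrative: the label strings (`"Low"`, `"Medium"`, `"High"`) and the names `toy_labels` and `le_demo` are hypothetical, since the actual values of 'Risk_Label' are not shown in this notebook. LabelEncoder assigns codes in sorted order of the class names, so the codes do not necessarily reflect any natural ordering of the risk levels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values; the real 'Risk_Label' classes may differ.
toy_labels = ["Low", "High", "Medium", "Low"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(toy_labels)

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print("Mapping:", mapping)
print("Encoded:", list(encoded))

# inverse_transform recovers the original strings from the integer codes.
decoded = le_demo.inverse_transform(encoded)
print("Decoded:", list(decoded))
```

The same inspection could be applied to the fitted `le` from the cell above through `le.classes_`, which is useful later when reading predictions or confusion matrices.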


## 1️⃣ Part 1 

### 🧠 Train and Tune a Decision Tree Classifier

We train a decision tree classifier using GridSearchCV to automatically find the best hyperparameters. The search uses 5-fold cross-validation on the training data to score every combination of the candidate values. Below are the hyperparameters we tune:

<ul>
<li>criterion</li>
<li>max_depth</li>
<li>min_samples_split</li>
</ul>
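
To get a feel for the cost of an exhaustive search, the number of candidate combinations can be counted before fitting. The sketch below copies the grid values used in the tuning cell of this section and counts them with `ParameterGrid`; the names `n_candidates` and `n_fits` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid searched in this section.
params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, None],
    'min_samples_split': [2, 3, 4, 5, 10, 20]
}

# One candidate per combination: 2 criteria x 13 depths x 6 split sizes.
n_candidates = len(ParameterGrid(params))

# With 5-fold cross-validation, each candidate is fitted once per fold.
n_fits = n_candidates * 5

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

With the default `refit=True`, GridSearchCV also performs one additional fit of the best candidate on the full training set after the search.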

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the model and parameters to tune
dt = DecisionTreeClassifier(random_state=42)
params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1,2,3,4,5,6,7,8,9, 10, 15, 20, None],
    'min_samples_split': [2, 3, 4, 5, 10, 20]
}

# Cross-validation to find the best hyperparameters
grid = GridSearchCV(dt, params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)


Best parameters: {'criterion': 'entropy', 'max_depth': 3, 'min_samples_split': 2}
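
A natural follow-up is to compare the cross-validated accuracy of the selected model with its accuracy on held-out data. The sketch below shows one way this might look. It makes several assumptions: it runs on synthetic data from `make_classification` rather than the credit dataset, it uses a reduced grid to keep the run short, and the names `grid_demo`, `cv_acc`, and `test_acc` are hypothetical. Its numbers say nothing about the results on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data (assumption: 12 features, binary target).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Reduced grid for illustration only.
grid_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, None]},
    cv=5,
    scoring='accuracy',
)
grid_demo.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy of the best candidate.
cv_acc = grid_demo.best_score_

# predict() uses the best estimator refitted on the whole training split.
test_acc = accuracy_score(y_te, grid_demo.predict(X_te))

print("Best parameters:", grid_demo.best_params_)
print("CV accuracy:", round(cv_acc, 3))
print("Test accuracy:", round(test_acc, 3))
```

In this notebook the equivalent calls would presumably be `grid.best_score_` and `grid.predict(X_test)` compared against `y_test`.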
