<a href="https://www.kaggle.com/code/selvetelifdemirel/credit-approval-ml-project?scriptVersionId=249407988" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Credit Card Approval ML Project**


# Credit Card Approval Prediction – Machine Learning Project

This project aims to predict whether a credit card application should be approved or not using machine learning techniques. Based on applicants' personal and financial information, the model classifies individuals as "good" or "bad" clients.

## About the Dataset

The dataset contains two tables that can be merged using the `ID` column:

### 1. `application_record.csv`  
Contains demographic and socio-economic information of applicants.

| Feature | Description |
|--------|-------------|
| ID | Client number |
| CODE_GENDER | Gender |
| FLAG_OWN_CAR | Owns a car |
| FLAG_OWN_REALTY | Owns a property |
| CNT_CHILDREN | Number of children |
| AMT_INCOME_TOTAL | Annual income |
| NAME_INCOME_TYPE | Income type |
| NAME_EDUCATION_TYPE | Education level |
| NAME_FAMILY_STATUS | Marital status |
| NAME_HOUSING_TYPE | Housing type |
| DAYS_BIRTH | Days since birth (negative values) |
| DAYS_EMPLOYED | Days employed (negative values, positive means unemployed) |
| FLAG_MOBIL | Has mobile phone |
| FLAG_WORK_PHONE | Has work phone |
| FLAG_PHONE | Has phone |
| FLAG_EMAIL | Has email |
| OCCUPATION_TYPE | Occupation |
| CNT_FAM_MEMBERS | Number of family members |

### 2. `credit_record.csv`  
Includes the applicants’ monthly credit history.

| Feature | Description |
|--------|-------------|
| ID | Client number |
| MONTHS_BALANCE | Months since the record (0 = current month, -1 = previous, etc.) |
| STATUS | Credit status: |
|        | 0: 1-29 days overdue |
|        | 1: 30-59 days overdue |
|        | 2: 60-89 days overdue |
|        | 3: 90-119 days overdue |
|        | 4: 120-149 days overdue |
|        | 5: 150+ days overdue / bad debt |
|        | C: Paid off that month |
|        | X: No loan for the month |


In [2]:
!pip install imblearn
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Collecting scikit-learn<2,>=1.3.2 (from imbalanced-learn->imblearn)
  Downloading scikit_learn-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (17 kB)
  Downloading scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Downloading scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
Installing collected packages: scikit-learn, imblearn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


# 1. Dataset Preprocessing

### Application Record Dataset Features

In [4]:
application= pd.read_csv("/kaggle/input/credit-card-approval-prediction/application_record.csv")
application.head(5)

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0


In [5]:
application.shape

(438557, 18)

In [6]:
application.rename(columns={'CODE_GENDER':'Gender','FLAG_OWN_CAR':'Car','FLAG_OWN_REALTY':'Realty',
                         'CNT_CHILDREN':'Childnmbr','AMT_INCOME_TOTAL':'TotalIncome',
                         'NAME_INCOME_TYPE':'Incometype','NAME_EDUCATION_TYPE':'Edu','NAME_FAMILY_STATUS':'Fam',
                        'NAME_HOUSING_TYPE':'Housing','DAYS_BIRTH':'Birthday', 'DAYS_EMPLOYED':'EmplymntDate',
                        'FLAG_EMAIL':'email','FLAG_WORK_PHONE':'workphn',
                         'FLAG_MOBIL':'mobil','FLAG_PHONE':'phone','CNT_FAM_MEMBERS':'famsize',
                        'OCCUPATION_TYPE':'Occupation'
                        },inplace=True)

In [7]:
application.head(3)

Unnamed: 0,ID,Gender,Car,Realty,Childnmbr,TotalIncome,Incometype,Edu,Fam,Housing,Birthday,EmplymntDate,mobil,workphn,phone,email,Occupation,famsize
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0


Next, check the dataset for missing values.

In [8]:
application.isnull().sum()

ID                   0
Gender               0
Car                  0
Realty               0
Childnmbr            0
TotalIncome          0
Incometype           0
Edu                  0
Fam                  0
Housing              0
Birthday             0
EmplymntDate         0
mobil                0
workphn              0
phone                0
email                0
Occupation      134203
famsize              0
dtype: int64

As shown above, `Occupation` is the only column with missing values (134,203 of 438,557 rows, about 31%), so we drop those rows entirely.

In [9]:
application=application.dropna()

In [10]:
application.info()

<class 'pandas.core.frame.DataFrame'>
Index: 304354 entries, 2 to 438556
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   ID            304354 non-null  int64  
 1   Gender        304354 non-null  object 
 2   Car           304354 non-null  object 
 3   Realty        304354 non-null  object 
 4   Childnmbr     304354 non-null  int64  
 5   TotalIncome   304354 non-null  float64
 6   Incometype    304354 non-null  object 
 7   Edu           304354 non-null  object 
 8   Fam           304354 non-null  object 
 9   Housing       304354 non-null  object 
 10  Birthday      304354 non-null  int64  
 11  EmplymntDate  304354 non-null  int64  
 12  mobil         304354 non-null  int64  
 13  workphn       304354 non-null  int64  
 14  phone         304354 non-null  int64  
 15  email         304354 non-null  int64  
 16  Occupation    304354 non-null  object 
 17  famsize       304354 non-null  float64
dtypes: float6

Next, check for duplicated `ID` values.

In [11]:
application.duplicated(subset=['ID']).sum()

23

In the `Application` dataset, duplicated ID values can be removed since each ID uniquely represents a customer. However, in the Credit dataset, ID values correspond to the customer's monthly debt repayment status, so duplicates are expected and should not be removed.

In [12]:
application.drop_duplicates(subset=['ID'], inplace=True)


Two new columns, `Age` and `WorkingYears`, were created by transforming the original `DAYS_BIRTH` and `DAYS_EMPLOYED` values into a more interpretable format (in years). To simplify the dataset and improve model training, the original `Birthday` and `EmplymntDate` columns were removed after transformation.


In [13]:
application['Age']=(-application['Birthday']/365).astype(int)

In the `EmplymntDate` column (originally `DAYS_EMPLOYED`), negative values represent the number of days a person has been employed, counted backward from today. These values are converted into **years** and stored as **floats** in the new `WorkingYears` column for better interpretability. Positive values indicate that the person is currently **not working**; these rows are assigned the sentinel value `-1` (a plain placeholder, not Label Encoding) to distinguish them clearly from employed individuals.


In [14]:
application["WorkingYears"] = application["EmplymntDate"].apply(lambda x: float(-x / 365) if pd.notnull(x) and x < 0 else -1)
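
The same rule can also be written without a row-by-row `apply`. The sketch below is a vectorized equivalent using `np.where`, run on a small made-up series; the name `days_employed` and its values are illustrative assumptions, not rows taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values: two employed applicants (negative day counts)
# and one applicant who is not working (positive placeholder value).
days_employed = pd.Series([-4542, -1134, 365243])

# Negative day counts become positive years; everything else gets the sentinel -1.
working_years = pd.Series(
    np.where(days_employed < 0, -days_employed / 365, -1.0),
    index=days_employed.index,
)
print(working_years.round(2).tolist())
```

On a column with several hundred thousand rows, the vectorized form is usually faster than `apply` with a lambda, while producing the same values.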

In [15]:
application=application.drop(['Birthday', 'EmplymntDate'], axis=1)


### Visualization of the Application dataset


Visualizing the dataset makes the distribution of each variable easier to see.

In [16]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def pie_charts(df, columns, titles=None, max_cols=2):
    n = len(columns)
    rows = (n + max_cols - 1) // max_cols


    fig = make_subplots(rows=rows, cols=max_cols,
                        specs=[[{'type': 'domain'}]*max_cols for _ in range(rows)],
                        subplot_titles=titles if titles else columns)

    for i, col in enumerate(columns):
        row = i // max_cols + 1
        col_num = i % max_cols + 1

        value_counts = df[col].value_counts()
        labels = value_counts.index.tolist()
        values = value_counts.values.tolist()

        fig.add_trace(
            go.Pie(labels=labels, values=values, name=col, hole=0.3),
            row=row, col=col_num
        )

    fig.update_layout(
        height=300 * rows,
        width=400 * max_cols,
        title_text="Distributions of Applicants",
        showlegend=False
    )

    fig.show()

In [17]:
pie_charts(
    df=application,
    columns=['Edu', 'Incometype', 'Fam', 'Housing', 'Gender', 'Car', 'Realty', 'Occupation'],
    titles=['Education', 'Income Type', 'Marital Status', 'Housing', 'Gender', 'Car', 'Realty', 'Occupation'],
    max_cols=2
)


In [36]:
import plotly.express as px

fig = px.histogram(
    application,
    x='TotalIncome',
    nbins=30,
    marginal='rug',  
    opacity=0.7,
    title='Annual Income Distribution (Histogram)',
    color_discrete_sequence=['skyblue']
)
fig.add_vline(
    x=application['TotalIncome'].median(),
    line_dash="dash",
    line_color="red",
    annotation_text="Median",
    annotation_position="top left"
)
fig.show()


In [37]:
fig = px.box(
    application,
    x='TotalIncome',
    title='Annual Income Boxplot',
    color_discrete_sequence=['orange']
)
fig.show()


### Credit Record Dataset

In [18]:
credit=pd.read_csv("/kaggle/input/credit-card-approval-prediction/credit_record.csv")
credit.head(5)

Unnamed: 0,ID,MONTHS_BALANCE,STATUS
0,5001711,0,X
1,5001711,-1,0
2,5001711,-2,0
3,5001711,-3,0
4,5001712,0,C


In [19]:
credit['STATUS'].value_counts()

STATUS
C    442031
0    383120
X    209230
1     11090
5      1693
2       868
3       320
4       223
Name: count, dtype: int64

The `MONTHS_BALANCE` column indicates the month of each credit record, counted backward from the current month (`0` = current month, `-1` = previous month, etc.). The `STATUS` column shows the payment status and takes the values `X` (no loan), `C` (paid off), and `0` through `5` (increasing levels of overdue payments). For modeling purposes, these values are grouped into two categories: **`Good_Debt`** (no delay, values `C` and `X`) and **`Bad_Debt`** (any delay, values `0`, `1`, `2`, `3`, `4`, `5`).


In [20]:
credit["STATUS"]=credit["STATUS"].astype(str)
credit['STATUS']=credit['STATUS'].apply(lambda x: "Good_Debt" if x in ["X", "C"] else "Bad_Debt")


In [21]:
credit['STATUS'].value_counts(normalize=True)

STATUS
Good_Debt    0.621091
Bad_Debt     0.378909
Name: proportion, dtype: float64

After converting the `STATUS` values into these two categories, we build a new DataFrame holding the count of each status per `ID`.

In [22]:
status_counts = credit.groupby("ID")['STATUS'].value_counts().unstack(fill_value=0)
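
To make the shape of `status_counts` concrete, here is a small sketch of the same `groupby` / `value_counts` / `unstack` chain on made-up records. The frame `toy_credit` and its two customer IDs are hypothetical.

```python
import pandas as pd

# Toy monthly records for two hypothetical customers.
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "STATUS": ["Good_Debt", "Bad_Debt", "Good_Debt", "Good_Debt", "Good_Debt"],
})

# Count each status per ID, then pivot the status level into columns.
# fill_value=0 covers customers who never had one of the two statuses.
toy_counts = toy_credit.groupby("ID")["STATUS"].value_counts().unstack(fill_value=0)
print(toy_counts)
```

The result has one row per `ID` and one column per status, which is the layout the later join relies on.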


Let's visualize the total status counts to get a clearer picture of this DataFrame.


In [23]:
import plotly.express as px

debt_summary = pd.DataFrame({
    'Debt_Type': ['Good Debt', 'Bad Debt'],
    'Count': [status_counts['Good_Debt'].sum(), status_counts['Bad_Debt'].sum()]
})

fig = px.pie(debt_summary, names='Debt_Type', values='Count', title='Debt status', hole=0.3)
fig.show()


Next, we join this new DataFrame onto the application DataFrame by `ID`.

In [24]:
df_app=application.set_index("ID").join(status_counts)

In [25]:
df_app.info()
print(df_app.shape)

<class 'pandas.core.frame.DataFrame'>
Index: 304331 entries, 5008806 to 6842885
Data columns (total 19 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Gender        304331 non-null  object 
 1   Car           304331 non-null  object 
 2   Realty        304331 non-null  object 
 3   Childnmbr     304331 non-null  int64  
 4   TotalIncome   304331 non-null  float64
 5   Incometype    304331 non-null  object 
 6   Edu           304331 non-null  object 
 7   Fam           304331 non-null  object 
 8   Housing       304331 non-null  object 
 9   mobil         304331 non-null  int64  
 10  workphn       304331 non-null  int64  
 11  phone         304331 non-null  int64  
 12  email         304331 non-null  int64  
 13  Occupation    304331 non-null  object 
 14  famsize       304331 non-null  float64
 15  Age           304331 non-null  int64  
 16  WorkingYears  304331 non-null  float64
 17  Bad_Debt      25134 non-null   float64
 18  Go

# 2. Dealing with Imbalanced Data

When merging the `Status` DataFrame with the `Application` DataFrame, missing values (NaNs) will appear. This is expected because the `Status` information comes from the `Credit` DataFrame, which only includes historical records of customers who have previously taken out loans. As a result, not all applicants will have corresponding credit history data.

To handle these missing values and build the target variable:

1. **Fill missing values with `0`**, assuming no credit history indicates no debt issues.
2. **Create a binary `debt_status` label**: applicants with no credit record, or with more `Good_Debt` than `Bad_Debt` months, are labeled `good`; all others are labeled `bad`.
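
As a rough illustration of the join, fill, and labeling steps, the sketch below applies the labeling rule from the notebook's code to three made-up applicants. The frames `toy_app` and `toy_counts` and all their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical applicants (IDs 1-3); only IDs 1 and 2 have a credit history.
toy_app = pd.DataFrame({"ID": [1, 2, 3], "Age": [30, 45, 52]}).set_index("ID")
toy_counts = pd.DataFrame(
    {"Bad_Debt": [1, 4], "Good_Debt": [5, 2]},
    index=pd.Index([1, 2], name="ID"),
)

# The left join keeps every applicant; ID 3 has no history, so its counts are NaN.
toy = toy_app.join(toy_counts)
toy[["Good_Debt", "Bad_Debt"]] = toy[["Good_Debt", "Bad_Debt"]].fillna(0)

# Labeling rule: no record, or more good than bad months -> 'good'; otherwise 'bad'.
no_record = (toy["Good_Debt"] == 0) & (toy["Bad_Debt"] == 0)
toy["debt_status"] = np.where(
    no_record | (toy["Good_Debt"] > toy["Bad_Debt"]), "good", "bad"
)
print(toy)
```

Note that under this rule a tie between non-zero good and bad counts falls into the `bad` class.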




In [26]:
df_app[['Good_Debt', 'Bad_Debt']] = df_app[['Good_Debt', 'Bad_Debt']].fillna(0)
((df_app['Good_Debt'] == 0) & (df_app['Bad_Debt'] == 0)).sum()

279197

In [27]:
df_app['debt_status'] = np.where(
    (df_app['Good_Debt'] == 0) & (df_app['Bad_Debt'] == 0), 
    'good',  
    np.where(
        df_app['Good_Debt'] > df_app['Bad_Debt'], 
        'good',   
        'bad'))
print(df_app['debt_status'].value_counts())

debt_status
good    292294
bad      12037
Name: count, dtype: int64


Applicants with no credit record are labeled "good", because in this project the model classifies customers based on their credit record and these applicants have no overdue months on record.

## Encoding Categorical Features

For model training, categorical variables need to be converted into numerical format. To achieve this, we use **Label Encoding**.

First, a copy of the main DataFrame is created to preserve the original data. Then, **LabelEncoder** is applied to all categorical columns, including the target `debt_status`. Because `LabelEncoder` assigns codes in sorted order, `bad` becomes `0` and `good` becomes `1`.


In [28]:
df_encoded=df_app.copy()


In [29]:

categorical_cols = ['Gender', 'Car', 'Realty', 'Incometype', 'Edu', 'Fam', 'Housing', 'Occupation','debt_status']

label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
    label_encoders[col] = le 

In [30]:
df_encoded['debt_status'].value_counts()

debt_status
1    292294
0     12037
Name: count, dtype: int64
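
The integer codes above follow from how `LabelEncoder` orders its classes. The short sketch below, on a made-up list of labels, shows the mapping and how a fitted encoder can translate the codes back.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["good", "bad", "good", "good"])

# classes_ is sorted, so 'bad' maps to 0 and 'good' maps to 1.
print(dict(zip(le.classes_, le.transform(le.classes_))))

# inverse_transform recovers the original labels from the integer codes.
print(le.inverse_transform(codes))
```

In the notebook, the fitted encoders are kept in the `label_encoders` dictionary, so the same `inverse_transform` call is available for every encoded column.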

In [31]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 304331 entries, 5008806 to 6842885
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Gender        304331 non-null  int64  
 1   Car           304331 non-null  int64  
 2   Realty        304331 non-null  int64  
 3   Childnmbr     304331 non-null  int64  
 4   TotalIncome   304331 non-null  float64
 5   Incometype    304331 non-null  int64  
 6   Edu           304331 non-null  int64  
 7   Fam           304331 non-null  int64  
 8   Housing       304331 non-null  int64  
 9   mobil         304331 non-null  int64  
 10  workphn       304331 non-null  int64  
 11  phone         304331 non-null  int64  
 12  email         304331 non-null  int64  
 13  Occupation    304331 non-null  int64  
 14  famsize       304331 non-null  float64
 15  Age           304331 non-null  int64  
 16  WorkingYears  304331 non-null  float64
 17  Bad_Debt      304331 non-null  float64
 18  Go

In [37]:
df_encoded.corr()

Unnamed: 0,Gender,Car,Realty,Childnmbr,TotalIncome,Incometype,Edu,Fam,Housing,mobil,workphn,phone,email,Occupation,famsize,Age,WorkingYears,Bad_Debt,Good_Debt,debt_status
Gender,1.0,0.331666,-0.034239,0.040314,0.153764,0.015468,0.040543,-0.042518,0.046243,,-0.016474,-0.02794,-0.003179,0.307669,0.063062,-0.106156,-0.106556,0.001551,0.002263,-0.000265
Car,0.331666,1.0,0.012058,0.076338,0.173572,-0.018457,-0.073924,-0.08999,-0.011374,,-0.035635,-0.01033,0.02378,0.032535,0.114243,-0.067988,-0.042734,0.006088,0.007035,0.002247
Realty,-0.034239,0.012058,1.0,0.014345,0.029055,-0.022703,0.00915,-0.008503,-0.179857,,-0.180182,-0.077381,0.071083,-0.010183,0.019271,0.091761,0.024376,-0.016895,-0.017753,0.00819
Childnmbr,0.040314,0.076338,0.014345,1.0,-0.02125,0.027592,-0.008468,-0.146252,0.005709,,-0.021572,-0.044691,0.006431,0.022442,0.899205,-0.261153,-0.072993,-0.00012,-0.001759,-0.000538
TotalIncome,0.153764,0.173572,0.029055,-0.02125,1.0,-0.154294,-0.207163,0.010583,-0.025421,,-0.069974,0.005998,0.094298,-0.133517,-0.025198,0.063212,0.021938,0.008373,0.001214,-0.001745
Incometype,0.015468,-0.018457,-0.022703,0.027592,-0.154294,1.0,0.128649,-0.000597,-0.001941,,0.072568,0.009653,-0.059678,0.113332,0.025184,-0.008295,0.029564,-0.006003,-0.000535,0.003499
Edu,0.040543,-0.073924,0.00915,-0.008468,-0.207163,0.128649,1.0,-0.026791,-0.008093,,0.009291,-0.038072,-0.105867,0.195744,0.005,0.113432,0.033767,0.001097,-0.004487,-0.004166
Fam,-0.042518,-0.08999,-0.008503,-0.146252,0.010583,-0.000597,-0.026791,1.0,0.064371,,-0.024406,-0.011805,0.006269,-0.014646,-0.515466,0.008263,-0.001096,0.001867,-0.00459,-0.007817
Housing,0.046243,-0.011374,-0.179857,0.005709,-0.025421,-0.001941,-0.008093,0.064371,1.0,,-3e-06,-0.028457,-0.00048,0.025779,-0.031483,-0.201515,-0.088176,0.000972,0.002168,-0.003333
mobil,,,,,,,,,,,,,,,,,,,,


In [33]:
fig = px.imshow(
    df_encoded.corr(),
    text_auto=True,  
    color_continuous_scale='RdBu_r',  
    title='Correlation Matrix'
)


fig.show()
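
The `mobil` row of the correlation table is entirely NaN, which is what `corr()` returns for a column with zero variance. The sketch below reproduces this on a made-up frame and shows one possible way to drop constant columns before plotting; the frame `toy` and its column names are illustrative assumptions.

```python
import pandas as pd

# Toy frame: 'flag' never varies, mimicking a constant indicator column.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 1, 4, 3], "flag": [1, 1, 1, 1]})

# Correlations involving a constant column are undefined and come back as NaN.
print(toy.corr()["flag"].isna().all())

# Drop columns with a single unique value before computing correlations.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
corr = toy.drop(columns=constant_cols).corr()
print(constant_cols, corr.shape)
```

A constant column carries no information for the classifier either, so it is a candidate for removal from the feature set as well.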

## SMOTE

We applied `SMOTE` to address the severe class imbalance in the dataset: only 12,037 applicants are labeled *bad*, against 292,294 labeled *good* (a group that includes applicants with no credit record). Imbalanced data can bias models toward the majority class, leading to poor detection of high-risk cases.

In [38]:
X = df_encoded.drop(['debt_status', 'Good_Debt', 'Bad_Debt'], axis=1)
y = df_encoded['debt_status']

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(pd.Series(y_resampled).value_counts())

debt_status
1    292294
0    292294
Name: count, dtype: int64
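
To make the resampling step less of a black box, here is a minimal NumPy sketch of the interpolation idea behind SMOTE: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like_samples` and the toy points are illustrative assumptions, and this is a simplification, not imbalanced-learn's implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours for each new sample.
    base = rng.integers(0, len(X), size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment between the two points.
    lam = rng.random((n_new, 1))
    return X[base] + lam * (X[partner] - X[base])

# Five made-up minority samples with two features each.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(X_min, n_new=10, k=3)
print(synthetic.shape)
```

Because the new points are interpolated, label-encoded categorical columns can end up with fractional codes in the resampled data; imbalanced-learn offers `SMOTENC` for datasets that mix categorical and continuous features.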


# 3. Modelling

## Model Comparison Summary:

To address the imbalanced credit risk dataset, we evaluated Logistic Regression, KNN, and Random Forest using the weighted F1-score with cross-validation. Random Forest outperformed the others, thanks to its ability to capture non-linear patterns and feature interactions. It can also compensate for class imbalance through `class_weight='balanced'`, although the model below relies on the SMOTE-balanced data instead. While slower to train, its robust predictions and built-in feature importances justified the choice.

**Why Random Forest?**
Unlike Logistic Regression, which is limited to linear decision boundaries, or KNN, which is sensitive to feature scaling, Random Forest captures complex relationships (e.g., Income × Age) and is less prone to overfitting than a single decision tree. Its stronger performance on the minority class (`bad`) made it the clear choice for this real-world risk prediction task.
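
A comparison of this kind could be set up roughly as follows. This is a minimal sketch on synthetic data from `make_classification`, not the notebook's data; the sample size, class weights, and hyperparameters are placeholder assumptions, and the scores it prints say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features (about 10% minority class).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Scale-sensitive models get a StandardScaler inside the pipeline,
# so scaling is fitted on the training folds only.
models = {
    "LogisticRegression": Pipeline([("scale", StandardScaler()),
                                    ("clf", LogisticRegression(max_iter=1000))]),
    "KNN": Pipeline([("scale", StandardScaler()),
                     ("clf", KNeighborsClassifier(n_neighbors=5))]),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = {}
for name, model in models.items():
    result = cross_validate(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
    scores[name] = result["test_score"].mean()

for name, score in scores.items():
    print(f"{name}: weighted F1 = {score:.3f}")
```

Stratified folds keep the class ratio the same in every split, which matters when one class is rare.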


### Splitting data set into train and test sets with resampled data

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, 
    test_size=0.3, 
    random_state=42)

print("Train set:\n", y_train.value_counts())
print("\nTest set:\n", y_test.value_counts())

Train set:
 debt_status
1    204663
0    204548
Name: count, dtype: int64

Test set:
 debt_status
0    87746
1    87631
Name: count, dtype: int64


### Hyperparameter optimization of the Random Forest model with Optuna

In [40]:
import optuna
from sklearn.model_selection import cross_val_score
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 200),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10)
    }
    model = RandomForestClassifier(**params)
    return cross_val_score(model, X_train, y_train, cv=3, scoring='f1_weighted').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)       

print("Best parameters:", study.best_params)
print("F1 Score:", study.best_value)

[I 2025-07-07 15:13:29,458] A new study created in memory with name: no-name-81c3eba3-2e5c-411a-b345-92bb591e0037
[I 2025-07-07 15:15:39,397] Trial 0 finished with value: 0.9195957425457705 and parameters: {'n_estimators': 113, 'max_depth': 16, 'min_samples_split': 5}. Best is trial 0 with value: 0.9195957425457705.
[I 2025-07-07 15:19:35,628] Trial 1 finished with value: 0.9682134868116563 and parameters: {'n_estimators': 182, 'max_depth': 24, 'min_samples_split': 9}. Best is trial 1 with value: 0.9682134868116563.
[I 2025-07-07 15:21:01,502] Trial 2 finished with value: 0.6770901708118408 and parameters: {'n_estimators': 127, 'max_depth': 7, 'min_samples_split': 6}. Best is trial 1 with value: 0.9682134868116563.
[I 2025-07-07 15:24:21,396] Trial 3 finished with value: 0.9059673758087922 and parameters: {'n_estimators': 190, 'max_depth': 15, 'min_samples_split': 8}. Best is trial 1 with value: 0.9682134868116563.
[I 2025-07-07 15:25:33,578] Trial 4 finished with value: 0.735595315278

Best parameters: {'n_estimators': 150, 'max_depth': 30, 'min_samples_split': 2}
F1 Score: 0.9768832752209896


### **Random Forest Classification Model**

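The Random Forest classification model below uses the following hyperparameters:

- `n_estimators`: 127
- `max_depth`: 30
- `min_samples_split`: 5

These differ from the best parameters printed by the Optuna run above (`n_estimators=150`, `max_depth=30`, `min_samples_split=2`). The study is not seeded, so the best trial can change from run to run.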

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

#Define the model and train it
rf_model = RandomForestClassifier(
    n_estimators=127,       
    max_depth=30,           
    min_samples_split=5,    
    random_state=42,        
    n_jobs=-1               
)

rf_model.fit(X_train, y_train)

#Prediction on the held-out test set
y_pred = rf_model.predict(X_test)
y_proba = rf_model.predict_proba(X_test)[:, 1]  

#Performance Metrics
print("📊 Classification Report:")
print(classification_report(y_test, y_pred))

#Confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), 
            annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.show()

#Importance Score Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=rf_model.feature_importances_, y=X_train.columns, palette='viridis')
plt.title('Feature Importances')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()