# ML Challenge 

<img src="https://imageio.forbes.com/specials-images/imageserve/5ecd179f798e4c00060d2c7c/0x0.jpg?format=jpg&height=600&width=1200&fit=bounds" width="500" height="300">

In the bustling city of Financia, the Central Lending Institution (CLI) is the largest provider of loans to individuals and businesses. With a mission to support economic growth and financial stability, CLI processes thousands of loan applications every month. However, the traditional manual review process is time-consuming and prone to human error, leading to delays and inconsistencies in loan approvals.
To address these challenges, CLI has decided to use machine learning to streamline its loan approval process. It has compiled a dataset of historical loan application records, including credit scores, income levels, employment status, loan terms (measured in years), loan amounts, asset values, and the final loan status (approved or denied).


**Your task is to develop a predictive model that can accurately determine the likelihood of loan approval based on the provided features. By doing so, you will help CLI make faster, more accurate, and fairer lending decisions, ultimately contributing to the financial well-being of the community.**

It is recommended that you follow the typical machine learning workflow, though you are not required to follow every step strictly:
1. Data Collection: Gather the data you need for your model. (Already done for you)

2. Data Preprocessing: Clean and prepare the data for analysis. (Already done for you)

3. Exploratory Data Analysis (EDA): Understand the data and its patterns. (Partially done for you)

4. Feature Engineering: Create new features or modify existing ones to improve model performance. (Partially done for you)

5. Model Selection: Choose the appropriate machine learning algorithm.

6. Model Training: Train the model using the training dataset.

7. Model Evaluation: Evaluate the model's performance using a validation dataset.

8. Model Optimization: Optimize the model's parameters to improve performance.

9. Model Testing: Test the final model on a separate test dataset.

**Please include ALL your work and thought process in this notebook**

In [None]:
# You may include any package you deem fit. We suggest looking into scikit-learn
import pandas as pd

## Dataset


In [None]:
# DO NOT MODIFY
loan_data = pd.read_csv("../../data/loan_approval.csv")


## EDA
Run the cell below to see the output. Add more analysis if you like.
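
As one possible extra analysis, the sketch below computes the approval rate per credit-score band. It is only an illustration: the rows are made up, the band edges (300, 550, 700, 900) are arbitrary assumptions, and it assumes `loan_status` has already been encoded as 0/1, which in this notebook only happens in the Feature Engineering section.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "cibil_score": [320, 480, 540, 610, 680, 720, 790, 850],
    "loan_status": [0, 0, 0, 1, 0, 1, 1, 1],  # 1 = approved, 0 = denied
})

# Bucket the score into bands (band edges are an assumption, not from the data).
bands = pd.cut(
    toy["cibil_score"],
    bins=[300, 550, 700, 900],
    labels=["low", "medium", "high"],
)

# The mean of a 0/1 target within each band is that band's approval rate.
approval_rate = toy.groupby(bands, observed=True)["loan_status"].mean()
print(approval_rate)
```

A steep rise in approval rate across bands would suggest that the score is a strong predictor, which is worth checking on the real data before modeling.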

In [None]:

import matplotlib.pyplot as plt

# ------ Display basic information ------
print(loan_data.columns)
print(loan_data.describe())

# ------ Check for missing values ------
print(loan_data.isnull().sum())

# ------ Visualize the distribution of loan status ------
loan_status_counts = loan_data['loan_status'].value_counts()
plt.bar(loan_status_counts.index, loan_status_counts.values)
plt.title('Distribution of Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Count')

# ------ Visualize the distribution of numerical features ------ 
loan_data.hist(bins=30, figsize=(20, 15))

# ------ Correlation matrix ------
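# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
corr_matrix = loan_data.corr(numeric_only=True)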
fig, ax = plt.subplots(figsize=(10, 8))
cax = ax.matshow(corr_matrix, cmap='coolwarm')
fig.colorbar(cax)
plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns, rotation=90)
plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)

# ----- MORE (Encouraged but not required) ------
# TODO 

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')


## Feature Engineering

Most machine learning algorithms expect numerical inputs, so categorical variables need to be converted. For example, education takes the values Graduate and Not Graduate, which can be encoded as 1 and 0.
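
A minimal sketch of such a conversion on a toy column follows. The stray whitespace and mixed casing are assumptions about what a raw export might contain, not facts about this dataset; the point is that any label missing from the dictionary silently becomes NaN, so it is worth counting those.

```python
import pandas as pd

# Toy column with stray whitespace, mixed casing and one missing value (made up).
toy = pd.DataFrame({"education": [" Graduate", "Not Graduate", "graduate ", None]})

# Normalize through the string accessor so missing values stay missing.
normalized = toy["education"].str.strip().str.lower()

# Map the cleaned labels to 1/0; anything not in the dictionary becomes NaN.
toy["education_num"] = normalized.map({"graduate": 1, "not graduate": 0})

# Count rows that are missing or failed to map before feeding the data to a model.
unmapped = int(toy["education_num"].isna().sum())
print(toy)
print("Unmapped or missing rows:", unmapped)
```

Mapping the same column twice is a common pitfall: once the strings have been replaced by numbers, a second string-based mapping finds no matches and yields NaN everywhere.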

In [None]:
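# Example of a direct mapping, kept commented out: the normalized mapping below
# expects the original strings, so running both would turn every value into NaN.
# loan_data['education'] = loan_data['education'].map({'Graduate': 1, 'Not Graduate': 0})
# Hint: Other categorical variables are self_employed and loan_status
import numpy as np
import pandas as pd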

df = loan_data.copy()

# Mapping education
df["education"] = (
    df["education"]
    .astype(str)
    .str.strip()
    .str.lower()
    .map({"graduate": 1, "not graduate": 0})
)

# Mapping self_employed
def yes_no_to_int(x):
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s in {"yes", "y", "true", "1"}:
        return 1
    if s in {"no", "n", "false", "0"}:
        return 0
    return np.nan

df["self_employed"] = df["self_employed"].apply(yes_no_to_int)

# Map target loan_status -> y (Approved = 1, Denied = 0)
def status_to_int(x):
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s in {"approved", "approve", "1", "yes", "y", "true"}:
        return 1
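    # "rejected" is accepted as a synonym in case the CSV labels denials that way
    if s in {"denied", "deny", "rejected", "reject", "0", "no", "n", "false"}: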
        return 0
    return np.nan

df["loan_status"] = df["loan_status"].apply(status_to_int)

# Define features
target_col = "loan_status"
drop_cols = ["loan_id"] if "loan_id" in df.columns else []

X = df.drop(columns=drop_cols + [target_col])
y = df[target_col]

print("\nFeature dtypes summary:")
print(X.dtypes.value_counts())

print("\nX shape:", X.shape, "y shape:", y.shape)
print("Class balance (y):")
print(y.value_counts(dropna=False))

## Model Selection

You are free to use any classification machine learning models you like: Logistic Regression, Decision Trees/Random Forests, Support Vector Machines, KNN ... 
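
One way to shortlist candidates is to score each with the same cross-validation setup. The sketch below does this on synthetic data from `make_classification`, used here as a stand-in for the loan data; the three candidates, their hyperparameters, and the choice of F1 as the metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: 300 rows, 8 numeric features).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scale-sensitive models get a StandardScaler in front; the tree does not need one.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean F1 over 5 stratified folds gives a like-for-like comparison.
cv_scores = {
    name: cross_val_score(estimator, X_toy, y_toy, cv=5, scoring="f1").mean()
    for name, estimator in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking on synthetic data says nothing about the loan data itself; the same loop would need to be run on the real features to be informative.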

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval,
    test_size=0.25,
    random_state=42,
    stratify=y_trainval
)

print("Train:", X_train.shape, "Val:", X_val.shape, "Test:", X_test.shape)

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

print("\nNumeric features:", numeric_features)
print("Categorical features:", categorical_features)

# Preprocessing:
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Baseline models:
log_reg = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced", random_state=42))
])

rf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(
        n_estimators=400,
        random_state=42,
        class_weight="balanced",
        n_jobs=-1
    ))
])

models = {
    "LogisticRegression": log_reg,
    "RandomForest": rf
}

print("Baselines ready.")

## Model Training and Evaluation
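
The metrics reported in this section all derive from the confusion matrix. As a worked example, the sketch below computes them by hand from a hypothetical matrix; the four counts are made up for illustration and are not results from this dataset.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class, class order [0 (denied), 1 (approved)].
tn, fp = 70, 10   # actual denied: correctly denied, wrongly approved
fn, tp = 5, 115   # actual approved: wrongly denied, correctly approved

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all decisions that were correct
precision = tp / (tp + fp)     # of predicted approvals, how many were truly approved
recall = tp / (tp + fn)        # of true approvals, how many the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

For a lender, a false positive (approving a loan that should be denied) and a false negative usually carry different costs, which is one reason to look beyond accuracy alone.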

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

def evaluate(model, X_tr, y_tr, X_va, y_va, name = "model"):
    model.fit(X_tr, y_tr)

    #Predictions
    y_pred = model.predict(X_va)

    # ROC-AUC
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_va)[:,1]
        auc = roc_auc_score(y_va, y_proba)
    else:
        y_proba = None
        auc = np.nan

    metrics = {
        "model": name,
        "accuracy": accuracy_score(y_va, y_pred),
        "precision": precision_score(y_va, y_pred, zero_division=0),
        "recall": recall_score(y_va, y_pred, zero_division=0),
        "f1": f1_score(y_va, y_pred, zero_division=0),
        "roc_auc": auc
    }

    print()
    print("=== {} (Validation) ===".format(name))
    print("Confusion Matrix:")
    print(confusion_matrix(y_va, y_pred))
    print()
    print("Classification Report:")
    print(classification_report(y_va, y_pred, digits=4))
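    print("Metrics:", metrics)

    # Return both values so the caller can unpack them as (metrics, fitted model)
    return metrics, model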

results = []
trained_models = {}

for name, mdl in models.items():
    metrics, trained = evaluate(mdl, X_train, y_train, X_val, y_val, name=name)
    results.append(metrics)
    trained_models[name] = trained

results_df = pd.DataFrame(results).sort_values(by=["f1", "roc_auc", "accuracy"], ascending=False)
print("\nSummary (sorted):")
print(results_df)

best_name = results_df.iloc[0]["model"]
best_baseline = trained_models[best_name]
print("\nBest baseline:", best_name)

## Model Optimization and Testing
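
Before launching a grid search it helps to estimate how many model fits it will trigger. The sketch below counts them for the random-forest grid used in this section, with the `model__` pipeline prefix dropped for brevity; the fold count of 5 matches the `cv=5` setting.

```python
from itertools import product

# Same values as the random-forest grid in the search cell, without the "model__" prefix.
param_grid = {
    "n_estimators": [300, 500],
    "max_depth": [None, 6, 12],
    "min_samples_split": [2, 8, 16],
    "min_samples_leaf": [1, 3, 6],
}
cv_folds = 5

# An exhaustive grid search tries every combination of the listed values.
combinations = list(product(*param_grid.values()))

# Each combination is fitted once per fold; the final refit on all data is one more fit.
cv_fits = len(combinations) * cv_folds

print("Combinations:", len(combinations))
print("Cross-validation fits:", cv_fits)
```

Because each fit here trains a forest of several hundred trees, trimming the grid or switching to a randomized search may be worthwhile if runtime becomes a problem.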

In [None]:
from sklearn.model_selection import GridSearchCV

if best_name == "LogisticRegression":
    param_grid = {
        "model__C": [0.01, 0.1, 10],
        "model__penalty": ["l2"],
        "model__solver": ["lbfgs"]
    }
    base_for_search = log_reg
else:
    param_grid = {
        "model__n_estimators": [300, 500],
        "model__max_depth": [None, 6, 12],
        "model__min_samples_split": [2, 8, 16],
        "model__min_samples_leaf": [1, 3, 6]
    }
    base_for_search = rf

grid = GridSearchCV(
    estimator=base_for_search,
    param_grid=param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_trainval, y_trainval)

print("\nBest params:", grid.best_params_)
print("Best CV f1:", grid.best_score_)

best_model = grid.best_estimator_


In [None]:
y_test_pred = best_model.predict(X_test)
if hasattr(best_model, "predict_proba"):
    y_test_proba = best_model.predict_proba(X_test)[:, 1]
    test_auc = roc_auc_score(y_test, y_test_proba)
else:
    test_auc = np.nan

print("\n=== FINAL TEST RESULTS ===")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))
print("\nClassification Report:\n", classification_report(y_test, y_test_pred, digits=4))
print("Accuracy:", accuracy_score(y_test, y_test_pred))
print("Precision:", precision_score(y_test, y_test_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_test_pred, zero_division=0))
print("F1:", f1_score(y_test, y_test_pred, zero_division=0))
print("ROC-AUC:", test_auc)