
Build and evaluate a machine learning model to predict loan payback . The evaluation should include relevant metrics and a summary of the model's performance and key findings.

## Load Data


Load the provided 'train_sample.csv' file into a pandas DataFrame.


In [2]:
import pandas as pd

# Load the train_sample.csv file into a pandas DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/ushareng/LoanPrediction_ColabVSCode/refs/heads/main/train_sample.csv")

# Display the first few rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(df.head())

# Display the column names and their data types
print("\nDataFrame Info:")
df.info()

First 5 rows of the DataFrame:
       id  annual_income  debt_to_income_ratio  credit_score  loan_amount  \
0  404674       52470.61                 0.241           724     27172.82   
1  549728       28424.82                 0.033           779     15895.96   
2  125237       25229.55                 0.195           569     15216.06   
3  512666       91612.58                 0.166           659     13166.90   
4  101001       79712.79                 0.079           767     23642.37   

   interest_rate  gender marital_status education_level employment_status  \
0          13.35    Male        Married             PhD          Employed   
1          11.57    Male        Married           Other          Employed   
2          13.83  Female         Single        Master's          Employed   
3          12.47  Female         Single        Master's          Employed   
4          11.23    Male         Single      Bachelor's          Employed   

         loan_purpose grade_subgrade  loan_

## Preprocess Data


Clean and prepare the data for machine learning by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and testing sets.


In [3]:
print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:
id                      0
annual_income           0
debt_to_income_ratio    0
credit_score            0
loan_amount             0
interest_rate           0
gender                  0
marital_status          0
education_level         0
employment_status       0
loan_purpose            0
grade_subgrade          0
loan_paid_back          0
dtype: int64


In [4]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import pandas as pd # Ensure pandas is imported if not already in this cell

# Drop the 'id' column as it is an identifier and not a feature
df_processed = df.drop('id', axis=1)

# Separate target variable
X = df_processed.drop('loan_paid_back', axis=1)
y = df_processed['loan_paid_back']

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# --- Manual Preprocessing Steps to ensure correct shapes ---

# 1. Scale numerical features
scaler = StandardScaler()
X_numerical_scaled = scaler.fit_transform(X[numerical_features])
numerical_cols_names = [f'scaled__{col}' for col in numerical_features]
X_numerical_scaled_df = pd.DataFrame(X_numerical_scaled, columns=numerical_cols_names, index=X.index)

# 2. One-hot encode categorical features
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # Explicitly ensure dense output
X_categorical_encoded = encoder.fit_transform(X[categorical_features])
categorical_cols_names = encoder.get_feature_names_out(categorical_features)
X_categorical_encoded_df = pd.DataFrame(X_categorical_encoded, columns=categorical_cols_names, index=X.index)

# 3. Combine preprocessed features into a single DataFrame
X_preprocessed_df = pd.concat([X_numerical_scaled_df, X_categorical_encoded_df], axis=1)

# Convert back to numpy array for train_test_split (if preferred by subsequent steps, or use DataFrame directly)
X_preprocessed = X_preprocessed_df.values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed, y, test_size=0.2, random_state=42, stratify=y
)

print("Shape of preprocessed features (X_preprocessed):"), X_preprocessed.shape
print("Shape of training features (X_train):"), X_train.shape
print("Shape of testing features (X_test):"), X_test.shape
print("Shape of training target (y_train):"), y_train.shape
print("Shape of testing target (y_test):"), y_test.shape
print("\nFirst 5 rows of X_preprocessed_df (optional for viewing):\n")
print(X_preprocessed_df.head())


Shape of preprocessed features (X_preprocessed):
Shape of training features (X_train):
Shape of testing features (X_test):
Shape of training target (y_train):
Shape of testing target (y_test):

First 5 rows of X_preprocessed_df (optional for viewing):

   scaled__annual_income  scaled__debt_to_income_ratio  scaled__credit_score  \
0               0.158550                      1.757044              0.783552   
1              -0.736036                     -1.275195              1.776526   
2              -0.854911                      1.086453             -2.014830   
3               1.614766                      0.663688             -0.389963   
4               1.172053                     -0.604604              1.559877   

   scaled__loan_amount  scaled__interest_rate  gender_Female  gender_Male  \
0             1.755050               0.491512            0.0          1.0   
1             0.125067              -0.391179            0.0          1.0   
2             0.026793             

## Train Classifier Model

**Reasoning**:
Now that the data is preprocessed and split into training and testing sets, the next step is to train a classifier model. I will use a `RandomForestClassifier` as specified. The model will be initialized with a `random_state` for reproducibility and then trained using the `fit` method on the training data (`X_train`, `y_train`). This will prepare the model for subsequent evaluation.

In [5]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the RandomForestClassifier with a random_state for reproducibility
model = RandomForestClassifier(random_state=42)

# Train the model using the training data
model.fit(X_train, y_train)

print("RandomForestClassifier model trained successfully.")

RandomForestClassifier model trained successfully.


## Evaluate Model Performance


Assess the performance of the trained model using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score. A classification report and confusion matrix will be generated to visualize the results.


**Reasoning**:
To evaluate the trained RandomForestClassifier model, I will first make predictions on the test set, then calculate and print key classification metrics including accuracy, precision, recall, and F1-score. Finally, I will generate and print a confusion matrix and a classification report to provide a comprehensive assessment of the model's performance.



In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# 1. Use the trained model to make predictions on the test set
y_pred = model.predict(X_test)

# 2. Calculate and print accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Model Precision: {precision:.4f}")
print(f"Model Recall: {recall:.4f}")
print(f"Model F1-Score: {f1:.4f}")

# 3. Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

# 4. Generate and print the classification report
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:\n", class_report)

Model Accuracy: 0.8981
Model Precision: 0.9020
Model Recall: 0.9784
Model F1-Score: 0.9386

Confusion Matrix:
 [[1405 1006]
 [ 205 9264]]

Classification Report:
               precision    recall  f1-score   support

         0.0       0.87      0.58      0.70      2411
         1.0       0.90      0.98      0.94      9469

    accuracy                           0.90     11880
   macro avg       0.89      0.78      0.82     11880
weighted avg       0.90      0.90      0.89     11880

