<a href="https://colab.research.google.com/github/sathvika-balabhadra/Loan-Prediction/blob/main/Credit_Based_Loan_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [51]:
# ==============================
# LOAN PREDICTION - ML PROJECT
# Author: Sathvika
# ==============================

# -----------------------------
# (1) IMPORT LIBRARIES
# -----------------------------
import pandas as pd
import numpy as np

# ML imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import tree, svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [52]:
# -----------------------------
# (2) LOAD DATA
# -----------------------------
# Raw CSV URLs from the project's GitHub repo; edit these to load your own copy of the data
train_url = 'https://raw.githubusercontent.com/sathvika-balabhadra/Loan-Prediction/refs/heads/main/Data/train.csv'
test_url = 'https://raw.githubusercontent.com/sathvika-balabhadra/Loan-Prediction/refs/heads/main/Data/test.csv'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

print("Train shape:", train.shape)
print("Test shape:", test.shape)

Train shape: (614, 13)
Test shape: (367, 12)


In [53]:
# Keep copy of Loan_ID for submission
test_ids = test['Loan_ID'].copy()

In [54]:
# -----------------------------
# (3) DATA CLEANING AND PREPROCESSING
# -----------------------------
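# Fill numeric columns with the training-set mean
# (the same means are reused for test so both sets are imputed consistently)
num_cols = train.select_dtypes(include='number').columns
train_means = train[num_cols].mean()
train[num_cols] = train[num_cols].fillna(train_means)
test[num_cols]  = test[num_cols].fillna(train_means)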
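
Mean imputation can also be written with scikit-learn's `SimpleImputer`, which learns the fill values from one frame and applies them to another. The sketch below is only an illustration on toy data: the frames `toy_train` and `toy_test` and their values are made up and are not part of the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for train/test (values are invented)
toy_train = pd.DataFrame({
    "LoanAmount": [100.0, 200.0, np.nan],
    "Credit_History": [1.0, np.nan, 1.0],
})
toy_test = pd.DataFrame({
    "LoanAmount": [np.nan, 500.0],
    "Credit_History": [0.0, np.nan],
})

# Learn column means on the training frame only
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)

# Reuse those training means to fill gaps in the test frame
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(test_filled)
```

Fitting on the training frame and only transforming the test frame keeps test-set statistics out of preprocessing.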

In [55]:
# Fill categorical columns with mode
# (assign the result back instead of using inplace=True on a column selection,
#  which pandas warns will stop working in 3.0)
cat_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed']
for col in cat_cols:
    train[col] = train[col].fillna(train[col].mode()[0])
    test[col]  = test[col].fillna(test[col].mode()[0])


In [56]:
# Outlier treatment: log transform Loan_Amount_Term
train['Loan_Amount_Term'] = np.log(train['Loan_Amount_Term'])
test['Loan_Amount_Term']  = np.log(test['Loan_Amount_Term'])
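
One thing to keep in mind with `np.log` is that it is undefined at zero, so a column containing a 0 would produce `-inf`. The sketch below uses a made-up series (not the loan data) to contrast `np.log` with `np.log1p`, which computes log(1 + x) and is a common alternative when zeros may appear. Whether the real `Loan_Amount_Term` column contains zeros is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical term values, including a zero to show the edge case
terms = pd.Series([0.0, 12.0, 360.0])

# Plain log: the zero entry becomes -inf (divide warning silenced for the demo)
with np.errstate(divide="ignore"):
    plain_log = np.log(terms)

# log1p: log(1 + x), finite for zero
safe_log = np.log1p(terms)

print(pd.DataFrame({"term": terms, "log": plain_log, "log1p": safe_log}))
```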

In [57]:
# -----------------------------
# (4) FEATURE ENGINEERING
# -----------------------------
# Create TotalIncome feature
train['TotalIncome'] = train['ApplicantIncome'] + train['CoapplicantIncome']
test['TotalIncome']  = test['ApplicantIncome'] + test['CoapplicantIncome']

In [58]:
# Drop irrelevant columns
train = train.drop('Loan_ID', axis=1)
test  = test.drop('Loan_ID', axis=1)

In [59]:
# Separate features and target
X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']

In [60]:
# One-hot encode categorical variables
# (only X and test are used from here on, so train itself is not re-encoded)
X = pd.get_dummies(X)
test  = pd.get_dummies(test)

In [61]:
# Align test dataset to train columns
test = test.reindex(columns=X.columns, fill_value=0)
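
To see why the alignment step matters, here is a small sketch on invented data where the test frame is missing some categories that appear in training. The frames and values are hypothetical; only `pd.get_dummies` and `DataFrame.reindex` mirror the cells above.

```python
import pandas as pd

# Hypothetical frames: test only contains the "Urban" category
toy_train = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "LoanAmount": [120, 80, 100],
})
toy_test = pd.DataFrame({
    "Property_Area": ["Urban", "Urban"],
    "LoanAmount": [90, 110],
})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)  # lacks the Rural and Semiurban columns

# Add the missing dummy columns (filled with 0) and match the training column order
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

Without the `reindex`, a model fitted on `train_enc` would receive a test matrix with a different number of columns.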

In [62]:
# -----------------------------
# (5) TRAIN/VALIDATION SPLIT
# -----------------------------
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=42)

In [63]:
# Scale all features for LR, SVM, KNN (scaler is fit on the training split only)
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_cv_scaled    = scaler.transform(x_cv)
test_scaled    = scaler.transform(test)  # for models that require scaling
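
An alternative to keeping separate scaled copies is to bundle the scaler and the model in a scikit-learn `Pipeline`, so the scaler is always fitted on whatever data the model is fitted on. This is a sketch on synthetic data from `make_classification`, not the approach used in this notebook; the names `X_demo`, `y_demo`, and `svm_pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan features (assumption: any numeric matrix works)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling and classification in one estimator
svm_pipe = make_pipeline(StandardScaler(), SVC())
svm_pipe.fit(X_tr, y_tr)            # scaler statistics come from X_tr only
val_acc = svm_pipe.score(X_va, y_va)  # X_va is scaled with those same statistics

print(f"Validation accuracy: {val_acc:.3f}")
```

Calling `fit` again on a different dataset refits the scaler and the classifier together, so the two cannot drift apart.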

In [64]:
# -----------------------------
# (6) DEFINE AND TRAIN MODELS
# -----------------------------
models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": tree.DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": svm.SVC(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

In [65]:
# Train and evaluate
model_scores = {}
for name, model in models.items():
    if name in ["Logistic Regression", "SVM", "KNN"]:
        model.fit(x_train_scaled, y_train)
        pred_cv = model.predict(x_cv_scaled)
    else:
        model.fit(x_train, y_train)
        pred_cv = model.predict(x_cv)

    acc = accuracy_score(y_cv, pred_cv)
    print(f"{name} Accuracy: {acc*100:.2f}%")
    print("Confusion Matrix:\n", confusion_matrix(y_cv, pred_cv), "\n")
    model_scores[name] = acc

Logistic Regression Accuracy: 78.86%
Confusion Matrix:
 [[18 25]
 [ 1 79]] 

Decision Tree Accuracy: 70.73%
Confusion Matrix:
 [[18 25]
 [11 69]] 

Random Forest Accuracy: 78.05%
Confusion Matrix:
 [[19 24]
 [ 3 77]] 

SVM Accuracy: 79.67%
Confusion Matrix:
 [[18 25]
 [ 0 80]] 

Naive Bayes Accuracy: 78.86%
Confusion Matrix:
 [[19 24]
 [ 2 78]] 

KNN Accuracy: 73.98%
Confusion Matrix:
 [[16 27]
 [ 5 75]] 

Gradient Boosting Accuracy: 74.80%
Confusion Matrix:
 [[17 26]
 [ 5 75]] 
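
The confusion matrices carry more information than the accuracy figure alone. As a worked example, the sketch below takes the SVM matrix printed above and derives accuracy and per-class recall from it. It assumes scikit-learn's default layout (rows are true labels, columns are predicted labels, labels sorted so that `N` comes before `Y`) and treats `Y` as the positive class.

```python
import numpy as np

# SVM confusion matrix copied from the output above
cm = np.array([[18, 25],
               [0, 80]])

# With labels ordered [N, Y] and Y as positive:
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
recall_pos = tp / (tp + fn)       # share of true Y predicted as Y
recall_neg = tn / (tn + fp)       # share of true N predicted as N
precision_pos = tp / (tp + fp)    # share of predicted Y that are truly Y

print(f"Accuracy:       {accuracy*100:.2f}%")
print(f"Recall (Y):     {recall_pos*100:.2f}%")
print(f"Recall (N):     {recall_neg*100:.2f}%")
print(f"Precision (Y):  {precision_pos*100:.2f}%")
```

Under these assumptions the accuracy reproduces the 79.67% reported above, while recall on the `N` class is much lower (18 of 43), which accuracy alone does not reveal.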



In [66]:
# -----------------------------
# (7) SELECT BEST MODEL
# -----------------------------
best_model_name = max(model_scores, key=model_scores.get)
best_model = models[best_model_name]
print(f"Best model: {best_model_name} with accuracy {model_scores[best_model_name]*100:.2f}%")

Best model: SVM with accuracy 79.67%
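
The top few models are within about one percentage point of each other on a 123-row validation split, so the ranking could change with a different split. One way to get a steadier comparison is k-fold cross-validation. The sketch below shows the idea on synthetic data; the `candidates` dictionary, the 5-fold setting, and the data are illustrative assumptions rather than results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded loan features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Pipelines refit the scaler inside each fold, avoiding leakage across folds
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv_means = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
    for name, est in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

for name, score in cv_means.items():
    print(f"{name}: mean CV accuracy {score*100:.2f}%")
print("Best by CV:", best_name)
```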


In [67]:
# -----------------------------
# (8) PREDICT TEST DATA AND SAVE CSV
# -----------------------------
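# Refit the best model on the full training data
if best_model_name in ["Logistic Regression", "SVM", "KNN"]:
    # The scaler is refit on all of X here, so test must be transformed with
    # this refit scaler (test_scaled came from the scaler fit on x_train only)
    X_scaled = scaler.fit_transform(X)
    best_model.fit(X_scaled, y)
    pred_test = best_model.predict(scaler.transform(test))
else:
    best_model.fit(X, y)
    pred_test = best_model.predict(test)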

In [68]:
# Save submission CSV
output = pd.DataFrame({'Loan_ID': test_ids, 'Loan_Status': pred_test})
output.to_csv('Credit_Predictions.csv', index=False)
print("Test predictions saved to Credit_Predictions.csv")

Test predictions saved to Credit_Predictions.csv
