# EMI Predict AI: Model Training
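This notebook trains three classifiers (EMI eligibility) and three regressors (maximum monthly EMI), evaluates each on a held-out 20% test split, and saves the fitted models with joblib.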


In [1]:
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, mean_squared_error

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor


# ========================
# LOAD DATA
# ========================
df = pd.read_csv("../Data/featured_emi_dataset.csv")

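# Encode categorical columns (one encoder per column, kept in a dict so
# each column's mapping can be reused or inverted later)
cat_cols = df.select_dtypes(include=["object", "string"]).columns
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))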


# ========================
# CLASSIFICATION (EMI ELIGIBILITY)
# ========================
y_class = df["emi_eligibility"]

# Remove BOTH targets from features
X_class = df.drop(columns=[
    "emi_eligibility",
    "max_monthly_emi"
])

X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

clf_models = {
    "logistic": LogisticRegression(max_iter=4000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgb": XGBClassifier(eval_metric="mlogloss", random_state=42)
}

print("\nClassification Results")
for name, model in clf_models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    joblib.dump(model, f"../models/{name}_classifier.pkl")
    print(name, "accuracy:", acc)


# ========================
# REGRESSION (MAX EMI)
# ========================
y_reg = df["max_monthly_emi"]

# Remove BOTH targets from features
X_reg = df.drop(columns=[
    "max_monthly_emi",
    "emi_eligibility"
])

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg_models = {
    "linear": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=200, random_state=42),
    "xgb": XGBRegressor(random_state=42)
}

print("\nRegression Results")
for name, model in reg_models.items():
    model.fit(Xr_train, yr_train)
    preds = model.predict(Xr_test)
    rmse = np.sqrt(mean_squared_error(yr_test, preds))
    joblib.dump(model, f"../models/{name}_regressor.pkl")
    print(name, "RMSE:", rmse)


Classification Results


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=4000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


logistic accuracy: 0.8753458498023715
rf accuracy: 0.9477890316205534
xgb accuracy: 0.977050395256917

Regression Results
linear RMSE: 3728.487530006363
rf RMSE: 257.8778197429596
xgb RMSE: 424.0291524855336


In [None]:
🔎 Classification Model Results: EMI Eligibility Prediction

In this section, we trained multiple classification models to predict whether a customer is eligible for EMI approval.

Models Trained

Logistic Regression

Random Forest Classifier

XGBoost Classifier

Observations

Logistic Regression reached 87.5% accuracy, which suggests that part of the relationship between the financial variables and eligibility is linear. The solver hit the max_iter=4000 limit without converging (see the warning in the output above), so this score may understate what a converged model on scaled features could achieve.

Random Forest raised accuracy to 94.8%, plausibly by capturing nonlinear interactions between income, expenses, and credit history.

XGBoost achieved the highest test accuracy of the three models (97.7%), about 2.9 percentage points above Random Forest.
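
The convergence warning in the output recommends scaling the data. A minimal sketch of that idea follows, assuming a pipeline that standardizes features before the logistic model. The synthetic data from `make_classification`, the inflated column, and the name `scaled_logistic` are illustrative assumptions, not part of the original notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the EMI features (assumption: the real data mixes
# columns on very different scales, such as income next to small integer codes).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6,
    n_clusters_per_class=1, random_state=42,
)
X[:, 0] *= 100_000  # inflate one column to mimic a currency-scale feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler lives inside the pipeline, so it is fit on the training split only.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000))

# Record warnings so we can check whether the solver converged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_logistic.fit(X_train, y_train)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
scaled_accuracy = scaled_logistic.score(X_test, y_test)
print("converged:", converged)
print("scaled logistic accuracy:", scaled_accuracy)
```

Because the pipeline is a single estimator, it could be dropped into the `clf_models` dictionary in place of the bare `LogisticRegression` and saved with `joblib.dump` in the same way.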

Final Metrics

| Model | Accuracy |
|---|---|
| Logistic Regression | 87.5% |
| Random Forest | 94.8% |
| XGBoost | 97.7% ⭐ |

Conclusion

Tree-based ensemble models outperform the linear model for EMI eligibility prediction on this test split.
With both target columns excluded from the features to avoid data leakage, XGBoost has the highest accuracy and is selected as the final classification model for deployment.
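
The comparison above rests on accuracy alone, which can hide weak performance on rare classes. The sketch below shows how a confusion matrix and macro-averaged F1 could complement it. The class balance of the EMI dataset is not shown in this notebook, so the imbalanced synthetic data here is an assumption, and a Random Forest stands in for the selected classifier to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data with a 70/20/10 class split (assumption for illustration).
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=6, n_classes=3,
    n_clusters_per_class=1, weights=[0.7, 0.2, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

accuracy = accuracy_score(y_test, preds)
macro_f1 = f1_score(y_test, preds, average="macro")  # weights every class equally
cm = confusion_matrix(y_test, preds)  # rows: true class, columns: predicted class

print("accuracy:", accuracy)
print("macro F1:", macro_f1)
print(cm)
```

A macro F1 well below the accuracy would indicate that the minority classes are predicted less reliably than the headline number suggests.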

💰 Regression Model Results: Maximum EMI Prediction

In this section, we trained regression models to estimate the maximum EMI amount a customer can safely afford.

Models Trained

Linear Regression

Random Forest Regressor

XGBoost Regressor

Observations

Linear Regression produced by far the highest prediction error (RMSE of about 3728), suggesting that EMI affordability does not follow a simple linear relationship.

Random Forest achieved the lowest RMSE (about 258), showing a strong ability to capture nonlinear financial behavior.

XGBoost ranked second with an RMSE of about 424, roughly 64% higher than Random Forest's.
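
These RMSE values come from a single train/test split. One way to check whether a ranking like this is stable is k-fold cross-validation, sketched below under stated assumptions: the data is synthetic (`make_friedman1`), only two scikit-learn models are compared, and the dictionary name `cv_rmse` is hypothetical.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic nonlinear regression problem standing in for the EMI data (assumption).
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "linear": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn reports the negated RMSE so that larger is always better.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores
    print(name, "RMSE mean:", cv_rmse[name].mean(), "std:", cv_rmse[name].std())
```

If the gap between two models is small relative to the fold-to-fold standard deviation, the single-split ranking should be treated with caution.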

Final Metrics

| Model | RMSE (₹) |
|---|---|
| Linear Regression | 3728 |
| Random Forest | 258 ⭐ |
| XGBoost | 424 |

Conclusion

Ensemble models reduce RMSE by roughly 9 to 14 times relative to linear regression for EMI estimation.
Random Forest Regressor is selected as the final regression model for deployment.
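
For deployment, a model saved with `joblib.dump` can be restored with `joblib.load`. The sketch below illustrates the round trip on a small synthetic model written to a temporary directory; the data and path are illustrative assumptions rather than the notebook's `../models/` files. At inference time the input would need the same columns, in the same order and with the same categorical encoding, as the training features.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the saved regressor (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

# Save and reload inside a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "rf_regressor.pkl"
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
new_rows = X[:3]
original_preds = model.predict(new_rows)
loaded_preds = loaded.predict(new_rows)
print("predictions match:", np.allclose(original_preds, loaded_preds))
```

Pickled models are generally tied to the library versions that produced them, so recording the scikit-learn and XGBoost versions alongside the `.pkl` files is a sensible precaution.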