# Project Objective

This notebook aims to prepare and explore the `emi_prediction_dataset.csv` to support building a reliable model for predicting EMI (Equated Monthly Installment) eligibility.

Goals:
- Perform exploratory data analysis to understand distributions, missingness, and relationships between features and the target `emi_eligibility`.
- Clean and normalize raw fields (fix types, remove formatting issues), impute missing values, and encode categorical variables.
- Engineer features that capture affordability and repayment capacity (e.g., debt ratios, net income after expenses).
- Produce a reproducible preprocessing pipeline that can be reused for modeling and inference.
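
The affordability features mentioned in the goals could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame: the input column names mirror the dataset, but the feature names (`total_expenses`, `net_income`, `debt_to_income`) and their formulas are assumptions for illustration, not final definitions.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "monthly_salary": [50000.0, 30000.0, 0.0],
    "monthly_rent": [10000.0, 8000.0, 5000.0],
    "groceries_utilities": [8000.0, 6000.0, 4000.0],
    "other_monthly_expenses": [2000.0, 1000.0, 1000.0],
    "current_emi_amount": [5000.0, 0.0, 2000.0],
})

expense_cols = ["monthly_rent", "groceries_utilities", "other_monthly_expenses"]

# Hypothetical feature definitions (assumptions, not the notebook's final ones).
toy["total_expenses"] = toy[expense_cols].sum(axis=1)
toy["net_income"] = (
    toy["monthly_salary"] - toy["total_expenses"] - toy["current_emi_amount"]
)

# Guard against division by zero: a zero salary yields NaN instead of inf.
salary = toy["monthly_salary"].replace(0, np.nan)
toy["debt_to_income"] = toy["current_emi_amount"] / salary
```

Replacing a zero salary with NaN before dividing keeps infinite ratios out of the feature, so such rows can be handled by whatever imputation step follows.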

Deliverables from this notebook:
- Cleaned and typed dataset ready for modeling.
- A documented preprocessing pipeline (transformations, imputations, encodings).
- Summary EDA plots and tables describing key predictors and data quality issues.

Assumptions & Notes:
- The data contains inconsistent string/number formats (e.g. `monthly_salary`, `bank_balance`, `age`) and missing values; these will be corrected and imputed.
- Class balance for `emi_eligibility` will be checked and handled during modeling in downstream notebooks.
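
The notebook imports scikit-learn's `KNNImputer`, but how it is applied is not shown in this section. One possible use on numeric columns is sketched below with made-up values; the choice of columns and `n_neighbors=2` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric rows with a few gaps; column names mirror the dataset.
toy = pd.DataFrame({
    "credit_score": [700.0, 710.0, np.nan, 650.0],
    "bank_balance": [100000.0, 110000.0, 105000.0, np.nan],
    "monthly_salary": [50000.0, 52000.0, 51000.0, 30000.0],
})

# Each missing entry is filled with the mean of that feature over the
# n_neighbors closest rows (distance computed on the non-missing features).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
```

Because the neighbor search is distance-based, columns on very different scales (salary versus credit score) may need scaling first; that step is omitted here.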

Next steps: finalize preprocessing, persist the cleaned dataset, and move to model training and evaluation notebooks.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
import warnings
warnings.filterwarnings('ignore')

In [0]:
emi_pred=pd.read_csv("emi_prediction_dataset.csv", low_memory=False)

In [0]:
emi_pred.head()

In [0]:
emi_pred.shape

In [0]:
emi_pred.info()

In [0]:
emi_pred.columns

In [0]:
col_x=['age', 'gender', 'marital_status', 'education', 'monthly_salary',
       'employment_type', 'years_of_employment', 'company_type', 'house_type',
       'monthly_rent', 'family_size', 'dependents', 'school_fees',
       'college_fees', 'travel_expenses', 'groceries_utilities',
       'other_monthly_expenses', 'existing_loans', 'current_emi_amount',
       'credit_score', 'bank_balance', 'emergency_fund', 'emi_scenario',
       'requested_amount', 'requested_tenure']

In [0]:
emi_pred["emi_eligibility"].value_counts()
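
Raw counts show the class sizes; class shares are often easier to read when judging imbalance. A minimal sketch on made-up labels (in the notebook the column would be `emi_pred["emi_eligibility"]`):

```python
import pandas as pd

# Made-up labels for illustration only.
labels = pd.Series(
    ["Not_Eligible", "Not_Eligible", "Not_Eligible", "Eligible"],
    name="emi_eligibility",
)

# normalize=True returns the share of each class instead of its raw count.
shares = labels.value_counts(normalize=True)
```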

In [0]:
emi_pred[emi_pred["emi_eligibility"]=="Not_Eligible"].head()

In [0]:
emi_pred[emi_pred["emi_eligibility"]=="Eligible"].head()

In [0]:
emi_pred.isna().sum()
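
Missing counts can also be expressed as percentages and restricted to the affected columns. The sketch below uses a small made-up frame; the summary column names `n_missing` and `pct_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; in the notebook this would be emi_pred.
toy = pd.DataFrame({
    "education": ["Graduate", None, "High School", None],
    "credit_score": [700.0, np.nan, 650.0, 720.0],
    "age": [30, 41, 25, 38],
})

# isna().mean() is the fraction of missing rows per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values(
    "pct_missing", ascending=False
)
```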

In [0]:
emi_pred[emi_pred["education"].isna()].head()

In [0]:
emi_pred["education"].value_counts()

In [0]:
emi_pred[emi_pred["monthly_rent"].isna()].head()

In [0]:
emi_pred[emi_pred["credit_score"].isna()].head()

In [0]:
emi_pred[emi_pred["bank_balance"].isna()].head()

In [0]:
emi_pred[emi_pred["emergency_fund"].isna()].head()

In [0]:
emi_pred.dtypes

In [0]:
emi_pred[emi_pred["bank_balance"].isna()]

In [0]:
# Same rows selected by negating notna(); should match the isna() filter
emi_pred[~emi_pred["bank_balance"].notna()]

In [0]:
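def type_corrector(x):
    """Return the integer part of a messy numeric value, or NaN if it cannot be parsed."""
    # Pass missing values through unchanged
    if pd.isna(x):
        return np.nan
    try:
        # Keep only the text before the first "." so values with a decimal
        # part or repeated dots reduce to their leading integer (truncation)
        return int(float(str(x).split(".")[0]))
    except (ValueError, OverflowError):
        # Non-numeric text raises ValueError; infinities raise OverflowError
        return np.nan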
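
For comparison, a vectorized way to get similar behavior is sketched below. This is an alternative under stated assumptions, not the method used in this notebook, and the sample values are made up.

```python
import pandas as pd

# Made-up raw values covering the cases described: plain integers,
# decimals, repeated dots, padding, non-numeric text, missing, real floats.
raw = pd.Series(["45000", "45000.0", "45000..0", " 38 ", "abc", None, 1250.75])

# Cast everything to text, trim whitespace, keep the part before the first "."
whole = raw.astype(str).str.strip().str.split(".").str[0]

# Anything that still is not a number (e.g. "abc", "None") becomes NaN.
cleaned = pd.to_numeric(whole, errors="coerce")
```

Like `type_corrector`, this truncates the decimal part rather than rounding it.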

In [0]:
# monthly_salary should be numeric but is stored as object, so convert it to whole numbers.
# Some values contain multiple "." characters; type_corrector strips those as well.

emi_pred["monthly_salary"] = emi_pred["monthly_salary"].apply(type_corrector)

In [0]:
emi_pred["bank_balance"] = emi_pred["bank_balance"].apply(type_corrector)

In [0]:
emi_pred["age"] = emi_pred["age"].apply(type_corrector)

In [0]:
emi_pred.dtypes

In [0]:
x_data=emi_pred[col_x].copy()

In [0]:
emi_pred

In [0]:
# Formatting string columns to have consistent capitalization and removing leading/trailing spaces
obj_cols = x_data.select_dtypes(include=['object']).columns
x_data[obj_cols] = x_data[obj_cols].apply(lambda s: s.str.strip().str.title())

In [0]:
x_data.select_dtypes(include=['object']).columns

In [0]:
numeric_cols=x_data.select_dtypes(include=['int64', 'float64']).columns

In [0]:
x_data.nunique()

In [0]:
x_data.replace({"existing_loans":{"Yes":1, "No":0}}, inplace=True)

In [0]:
x_data.replace({"marital_status":{"Married":1, "Single":0}}, inplace=True)

In [0]:
x_data["gender"].value_counts()

In [0]:
x_data.replace({"gender":{"M":1, "F":0, "Male":1, "Female":0}}, inplace=True)
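
`replace` leaves any value that is not in the mapping untouched, so an unexpected category would pass through unnoticed. A quick check for such values could look like the sketch below; the sample values, including `"Other"`, are made up for illustration.

```python
import pandas as pd

# Made-up values; "Other" stands in for any category not in the mapping.
gender = pd.Series(["M", "F", "Male", "Female", "Other", None])
gender_map = {"M": 1, "F": 0, "Male": 1, "Female": 0}

# Unlike replace, map turns every value absent from the dict into NaN.
encoded = gender.map(gender_map)

# Values present in the raw column but missing from the mapping.
unmapped = gender[encoded.isna() & gender.notna()].unique()
```

An empty `unmapped` array would indicate that the mapping covers every category present in the column.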

In [0]:
x_data["gender"].value_counts()

In [0]:
x_data["emi_scenario"].value_counts()