# Objective
### Forecast the likelihood of a loan applicant defaulting or successfully repaying the loan using their financial and personal data

**Personal and Demographic Features:**
1. Gender: Male/Female (categorical)
2. Married: Whether the applicant is married (Yes/No)
3. Dependents: Number of dependents (numerical/categorical)
4. Education: Education level of the applicant (Graduate/Not Graduate)
5. Self_Employed: Whether the applicant is self-employed (Yes/No)

**Financial Features:**

6. ApplicantIncome: Monthly or yearly income of the applicant (numerical)
7. CoapplicantIncome: Income of the co-applicant, if applicable (numerical)
8. LoanAmount: Total loan amount requested (numerical)
9. Loan_Amount_Term: Term of the loan in months (e.g., 360 months for 30 years)
10. Credit_History: History of loan repayment (0 = No, 1 = Yes)

**Loan Features:**

11. Loan_ID: Unique identifier for each loan application (usually dropped as it's not predictive).

12. Property_Area: Area where the applicant resides (Urban, Semiurban, Rural)

13. Loan_Status: Target variable indicating loan approval or rejection (Y = Approved, N = Rejected)
    

In [170]:
import pandas as pd

In [318]:
data = pd.read_csv("C:\\Users\\user\\Desktop\\New_Loan_Status_Prediction-Project\\loan_prediction.csv")

### 1. Display Top 5 Rows

In [320]:
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### 2. Display Last 5 Rows

In [322]:
data.tail()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y
613,LP002990,Female,No,0,Graduate,Yes,4583,0.0,133.0,360.0,0.0,Semiurban,N


### 3. Shape of the dataset (no. of rows and columns)

In [324]:
data.shape

(614, 13)

In [180]:
print("No. of Rows: ", data.shape[0])
print("No. of Columns: ", data.shape[1])

No. of Rows:  614
No. of Columns:  13


### 4. Info about the dataset

In [326]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


Gender has 601 non-null values -> 13 missing values (614 - 601).

Married has 611 non-null values -> 3 missing values.

Dependents has 599 non-null values -> 15 missing values.

Non-Null Count: number of rows where the value is not null (not missing).

Total rows = 614.

If the non-null count is less than 614, the remaining rows have missing values (NaN).

**The blank cells in our CSV file are null (missing) values. When the file is loaded into a pandas DataFrame, these blank cells are automatically read as NaN (Not a Number), which is the standard marker pandas uses for missing data.**
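
The relationship between blank cells, NaN, and the non-null counts can be checked on a tiny example. This is only a sketch: the CSV text below is made up for illustration and is not part of the loan dataset.

```python
import io
import pandas as pd

# Made-up CSV; the empty fields stand in for blank cells in the real file.
csv_text = "Gender,LoanAmount\nMale,\n,128\nFemale,66\n"
toy = pd.read_csv(io.StringIO(csv_text))

missing = toy.isnull().sum()      # missing values per column
non_null = toy.notnull().sum()    # what data.info() reports as "Non-Null Count"

print(missing)
print(len(toy) - non_null)        # same numbers, derived from the non-null counts
```

Both printed series agree, which is the "total rows minus non-null count" arithmetic used above.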

### 5. Check Null Values

In [328]:
# Check for missing values in each column
data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [330]:
data.isnull().sum()*100/len(data) # % of missing values

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

**Drop rows with missing values only in columns where less than 5% of the values are missing**

**The 5% threshold is a practical choice that balances data quality and quantity: it is low enough that dropping those rows loses little data, yet it still removes the missing values that could otherwise hurt model performance.**
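
As a rough sketch of how such a threshold could be applied programmatically (the toy frame and the names `pct_missing`, `low_missing_cols` and `cleaned` are assumptions for illustration, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Made-up frame: 'a' is missing in 1 of 40 rows (2.5%), 'b' in 4 of 40 rows (10%).
toy = pd.DataFrame({"a": np.arange(40, dtype=float),
                    "b": np.arange(40, dtype=float)})
toy.loc[0, "a"] = np.nan
toy.loc[[1, 2, 3, 4], "b"] = np.nan

pct_missing = toy.isnull().mean() * 100

# Columns that have some missing values, but fewer than 5%
low_missing_cols = pct_missing[(pct_missing > 0) & (pct_missing < 5)].index.tolist()

# Drop rows only where those columns are missing; 'b' is left for imputation
cleaned = toy.dropna(subset=low_missing_cols)
print(low_missing_cols, len(cleaned))
```

Only the row with a missing `a` is removed; the four gaps in `b` remain and would be handled by imputation.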

### 6. Handling Missing Values

In [332]:
data.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [334]:
data=data.drop('Loan_ID', axis = 1)
data.head(1)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y


In [336]:
data.isnull().sum()*100/len(data) # % of missing values

Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

**Practical Strategy:**

The 5% rule is a guideline, not a hard rule. By dropping only the rows with missing values for these selected columns:

1. You avoid significant data loss

2. You focus on features that are critical for your analysis

3. You leave room to impute missing values in other columns where dropping data would be less efficient.

**Why are rows dropped only for these columns? ('Gender', 'Dependents', 'LoanAmount', 'Loan_Amount_Term')**

**Critical Features:**

Gender, Dependents, LoanAmount, and Loan_Amount_Term are treated as important predictors for the task (loan status prediction).
Dropping rows with missing values in these specific columns ensures that these key variables remain clean.

**Selective Dropping:**

Dropping rows with missing values across all columns, even where the percentage of missing data is very small (e.g., Married: 0.49%), may lead to unnecessary data loss.
Instead, focusing only on important features with missing values retains more data while maintaining model quality.

**Imputation for other columns: (Self_Employed, Credit_History)**

Self_Employed and Credit_History are each missing in more than 5% of rows, so instead of dropping those rows their missing values are imputed (here with the mode, the most frequent value).

In [338]:
data['Self_Employed'].mode()[0]

'No'

In [340]:
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

In [342]:
data.isnull().sum()*100/len(data) # % of missing values

Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        0.000000
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

In [344]:
data['Credit_History'].unique()

array([ 1.,  0., nan])

In [346]:
data['Self_Employed'].unique()

array(['No', 'Yes'], dtype=object)

In [348]:
data['Credit_History'].mode()[0]

1.0

In [350]:
data['Credit_History']=data['Credit_History'].fillna(data['Credit_History'].mode()[0])

In [352]:
data.isnull().sum()*100/len(data) # % of missing values

Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        0.000000
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       0.000000
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

### 7. Handling Categorical Columns (Gender, Married, Education, Self_Employed)

In [354]:
data.sample(5)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
123,Male,Yes,2,Graduate,No,2957,0.0,81.0,360.0,1.0,Semiurban,Y
93,Male,No,0,Graduate,No,4133,0.0,122.0,360.0,1.0,Semiurban,Y
95,Male,No,0,Graduate,No,6782,0.0,,360.0,1.0,Urban,N
167,Male,Yes,0,Graduate,No,2439,3333.0,129.0,360.0,1.0,Rural,Y
153,Male,Yes,2,Not Graduate,No,2281,0.0,113.0,360.0,1.0,Rural,N


In [356]:
data['Dependents'].unique()

array(['0', '1', '2', '3+', nan], dtype=object)

**Since Dependents contains the label 3+, we replace it with 4**

In [358]:
data['Dependents']=data['Dependents'].replace(to_replace='3+',value = '4')

In [360]:
data['Dependents'].unique()

array(['0', '1', '2', '4', nan], dtype=object)

**Simplification for Machine Learning**

Machine learning models require numerical input, so a categorical label like 3+ has to be converted into a number.

Choosing 4 is an approximation: it stands in for "3 or more" with a single integer that sorts above 0, 1, and 2.
This simplification lets the model treat all values consistently and avoids complications caused by the 3+ label.
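
Note that after the replacement the column still holds strings (`'0'`, `'1'`, `'2'`, `'4'`), as the `dtype=object` output above shows. A minimal sketch of one way to get an actual numeric column, assuming `pd.to_numeric` is acceptable here (the notebook itself does not do this step):

```python
import numpy as np
import pandas as pd

# Same labels as in the Dependents column, including a missing value
dependents = pd.Series(['0', '1', '2', '3+', np.nan], name='Dependents')

# Replace the '3+' label, then convert the strings to numbers (NaN stays NaN)
as_numbers = pd.to_numeric(dependents.replace('3+', '4'))
print(as_numbers.tolist())
```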

In [362]:
data['Gender'].unique()#Categorical

array(['Male', 'Female', nan], dtype=object)

In [364]:

data = data.dropna(subset=['Gender'])
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0}).astype(int)


In [366]:
data['Gender'].unique() # Numeric

array([1, 0])

In [368]:
data['Married'].unique()#Categorical

array(['No', 'Yes', nan], dtype=object)

In [370]:
data['Education'].unique()

array(['Graduate', 'Not Graduate'], dtype=object)

In [372]:
data['Self_Employed'].unique()

array(['No', 'Yes'], dtype=object)

In [374]:
data['Property_Area'].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [376]:
data['Loan_Status'].unique()

array(['Y', 'N'], dtype=object)

In [378]:
# Converting to numeric
data = data.dropna(subset=['Married'])
data['Married'] = data['Married'].map({'Yes': 1, 'No': 0}).astype(int)


In [380]:
data['Married'].unique()

array([0, 1])

In [382]:
# Converting to numeric
data['Education']=data['Education'].map({'Graduate':1,'Not Graduate':0}).astype('int')

In [384]:
data['Self_Employed']=data['Self_Employed'].map({'Yes':1,'No':0}).astype('int')
data['Property_Area']=data['Property_Area'].map({'Rural':0,'Urban':1,'Semiurban':2}).astype('int')
data['Loan_Status']=data['Loan_Status'].map({'N':0,'Y':1}).astype('int')

In [386]:
data['Married'].unique()

array([0, 1])

In [388]:
data['Education'].unique()

array([1, 0])

In [390]:
data['Self_Employed'].unique()

array([0, 1])

In [392]:
data['Property_Area'].unique()

array([1, 0, 2])

In [394]:
data['Loan_Status'].unique()

array([1, 0])

In [396]:
data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,1,0,0,1,0,5849,0.0,,360.0,1.0,1,1
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1


In [398]:
data['LoanAmount'].unique()

array([ nan, 128.,  66., 120., 141., 267.,  95., 158., 168., 349.,  70.,
       109., 200., 114.,  17., 125., 100.,  76., 133., 115., 104., 315.,
       116., 151., 191., 122., 110.,  35., 201.,  74., 106., 320., 144.,
       184.,  80.,  47.,  75., 134.,  96.,  88.,  44., 112., 286.,  97.,
       135., 180.,  99., 165., 258., 126., 312., 136., 172.,  81., 187.,
       113., 176., 130., 111., 167., 265.,  50., 210., 175., 131., 188.,
        25., 137., 225., 216.,  94., 139., 152., 118., 185., 154.,  85.,
       259., 194.,  93., 160., 182., 650., 102., 290.,  84., 242., 129.,
        30., 244., 600., 255.,  98., 275., 121.,  63.,  87., 101., 495.,
        67.,  73., 260., 108.,  58.,  48., 164., 170.,  83.,  90., 166.,
       124.,  55.,  59., 127., 214., 240.,  72.,  60., 138.,  42., 280.,
       140., 155., 123., 279., 192., 304., 330., 150., 207., 436.,  78.,
        54.,  89., 143., 105., 132., 480.,  56., 300., 376., 117.,  71.,
       490., 173.,  46., 228., 308., 236., 570., 38

In [400]:
data = data.dropna(subset=['LoanAmount'])


In [402]:
print(data['LoanAmount'].isna().sum())  # Should print 0


0


In [404]:
data=data.dropna()
data

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,2900,0.0,71.0,360.0,1.0,0,1
610,1,1,4,1,0,4106,0.0,40.0,180.0,1.0,0,1
611,1,1,1,1,0,8072,240.0,253.0,360.0,1.0,1,1
612,1,1,2,1,0,7583,0.0,187.0,360.0,1.0,1,1


### 8. Store Features (independent variables) in X & Response (dependent variable, target) in y

In [406]:
X = data.drop('Loan_Status',axis = 1)

In [408]:
X

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,2900,0.0,71.0,360.0,1.0,0
610,1,1,4,1,0,4106,0.0,40.0,180.0,1.0,0
611,1,1,1,1,0,8072,240.0,253.0,360.0,1.0,1
612,1,1,2,1,0,7583,0.0,187.0,360.0,1.0,1


In [410]:
X.shape

(553, 11)

In [412]:
y=data['Loan_Status']

In [414]:
y

1      0
2      1
3      1
4      1
5      1
      ..
609    1
610    1
611    1
612    1
613    0
Name: Loan_Status, Length: 553, dtype: int32

In [416]:
y.shape

(553,)

**The last index label is 613 although there are only 553 rows because rows with missing values were dropped during preprocessing.**

**The index labels stay unchanged after rows are dropped: pandas does not reset the index by default when rows are removed.**

**If you want the index to start from 0 and run sequentially, you can reset it using:
data.reset_index(drop=True, inplace=True)**
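
The index behaviour described above can be reproduced on a toy frame. This is an illustrative sketch; the four-row DataFrame is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"LoanAmount": [np.nan, 128.0, np.nan, 120.0]})

dropped = toy.dropna()
print(dropped.index.tolist())      # original labels of the surviving rows

renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())   # labels renumbered from 0
```

`drop=True` discards the old labels instead of keeping them as a new column.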

### 9. Feature Scaling

**Feature scaling matters for ML algorithms that compute distances between data points or are trained with gradient-based optimization. Without it, features with a large value range dominate the result. Affected: K-Nearest Neighbors, SVM, linear and logistic regression (when regularized or fitted by gradient descent), and neural networks. Tree-based models (Decision Tree, Random Forest) and Naive Bayes are largely unaffected by feature scaling.**

**If features like ApplicantIncome, CoapplicantIncome, LoanAmount and Loan_Amount_Term are not scaled, their large magnitudes can dominate smaller-scaled features, resulting in biased or poor model performance.**
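
To make concrete what standardization does, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on a handful of income values taken from the rows shown earlier. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few ApplicantIncome values, as a single-column 2D array
incomes = np.array([[2583.0], [3000.0], [4583.0], [5849.0], [6000.0]])

scaled = StandardScaler().fit_transform(incomes)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (incomes - incomes.mean(axis=0)) / incomes.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 (up to floating-point error), so it no longer outweighs the 0/1 encoded features.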

In [419]:
data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1,1


**Gender, Married, Dependents, Education and Self_Employed are in the same small range, but the other columns (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term) are not, so we need to scale them.**

Feature scaling is performed only on the columns ApplicantIncome, CoapplicantIncome, LoanAmount and Loan_Amount_Term.

In [421]:
cols=['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']

In [423]:
from sklearn.preprocessing import StandardScaler
st = StandardScaler()
X[cols] = st.fit_transform(X[cols])

In [425]:
X # scaled features

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,1,1,1,1,0,-0.128694,-0.049699,-0.214368,0.279961,1.0,0
2,1,1,0,1,1,-0.394296,-0.545638,-0.952675,0.279961,1.0,1
3,1,1,0,0,0,-0.464262,0.229842,-0.309634,0.279961,1.0,1
4,1,0,0,1,0,0.109057,-0.545638,-0.059562,0.279961,1.0,1
5,1,1,2,1,1,0.011239,0.834309,1.440866,0.279961,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,-0.411075,-0.545638,-0.893134,0.279961,1.0,0
610,1,1,4,1,0,-0.208727,-0.545638,-1.262287,-2.468292,1.0,0
611,1,1,1,1,0,0.456706,-0.466709,1.274152,0.279961,1.0,1
612,1,1,2,1,0,0.374659,-0.545638,0.488213,0.279961,1.0,1


### 10. Split Data into Training & Testing Sets & Apply K-Fold Cross-Validation

In [427]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np

In [429]:
# Train the model on X_train, y_train, predict on X_test (unseen samples), and compare
# the model's predictions with y_test

# Alongside the train/test split we use K-Fold cross-validation (lets us compare different ML algorithms and get a sense of how well they will work in practice)

model_df={}
def model_val(model,X,y):
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
    model.fit(X_train,y_train)
    y_pred= model.predict(X_test)
    print(f'{model} accuracy is {accuracy_score(y_test,y_pred)}')
    score= cross_val_score(model,X,y, cv = 5) #5 fold CV
    print(f'{model} Avg. cross val score is : {np.mean(score)}')
    model_df[model]= round(np.mean(score)*100,2) # Key value
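
One caveat: `model_val` receives an `X` whose columns were already standardized using all rows, so the held-out part of each fold contributed to the scaler's mean and standard deviation. A common alternative, sketched here on synthetic data as an assumption rather than as the notebook's method, is to put the scaler inside a pipeline so it is re-fitted on the training part of every fold. For simplicity this sketch scales all columns, not just the four selected ones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data; one feature gets an income-like scale
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
X_demo[:, 0] *= 1000

# The scaler is fitted inside each CV fold, on that fold's training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

With a dataset of this size the difference in scores is usually small, but the pipeline keeps the cross-validation estimate free of information from the held-out rows.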

### 11. Logistic Regression

In [431]:
data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1,1


In [433]:
from sklearn.linear_model import LogisticRegression
model= LogisticRegression()
model_val(model,X,y)

LogisticRegression() accuracy is 0.8018018018018018
LogisticRegression() Avg. cross val score is : 0.8047829647829647


In [435]:
model_df

{LogisticRegression(): 80.48}

### 12. SVC

In [438]:
from sklearn import svm
model = svm.SVC()
model_val(model,X,y)

SVC() accuracy is 0.7927927927927928
SVC() Avg. cross val score is : 0.7938902538902539


In [440]:
model_df

{LogisticRegression(): 80.48, SVC(): 79.39}

### 13. Decision Tree Classifier

In [443]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model_val(model,X,y)

DecisionTreeClassifier() accuracy is 0.7117117117117117
DecisionTreeClassifier() Avg. cross val score is : 0.7198034398034397


In [445]:
model_df

{LogisticRegression(): 80.48, SVC(): 79.39, DecisionTreeClassifier(): 71.98}

### 14. Random Forest Classifier

In [448]:
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier()
model_val(model,X,y)

RandomForestClassifier() accuracy is 0.7657657657657657
RandomForestClassifier() Avg. cross val score is : 0.7848812448812449


In [450]:
model_df

{LogisticRegression(): 80.48,
 SVC(): 79.39,
 DecisionTreeClassifier(): 71.98,
 RandomForestClassifier(): 78.49}

### 15. Gradient Boosting Classifier

In [455]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model_val(model,X,y)

GradientBoostingClassifier() accuracy is 0.7927927927927928
GradientBoostingClassifier() Avg. cross val score is : 0.7794103194103194


In [457]:
model_df

{LogisticRegression(): 80.48,
 SVC(): 79.39,
 DecisionTreeClassifier(): 71.98,
 RandomForestClassifier(): 78.49,
 GradientBoostingClassifier(): 77.94}

### 16. Hyperparameter Tuning

**We have trained our models with default parameters, but we can tune them.**

**In ML we have:**

1. Model parameters (learned by the model from the data during training, not set by hand)

2. Hyperparameters (set before training; they must be tuned to obtain a model with optimum performance)

**How do we find optimal hyperparameters?**

1. GridSearchCV (tries every combination of the listed parameter values, which makes it computationally very expensive)

2. RandomizedSearchCV (addresses this drawback by evaluating only a fixed number of randomly sampled combinations from the grid)
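
The cost difference can be made concrete by counting candidates. The sketch below uses a grid with the same sizes as the random forest grid defined later in this notebook (with `'log2'` substituted as an illustrative second value for `max_features`); `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that enumerate and sample such grids.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    'n_estimators': list(range(10, 1000, 10)),   # 99 values
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 3, 5, 10, 20, 30],
    'min_samples_split': [2, 5, 20, 50, 100],
    'min_samples_leaf': [1, 2, 5, 10],
}

n_grid = len(ParameterGrid(grid))                              # every combination
sampled = list(ParameterSampler(grid, n_iter=20, random_state=0))

print(n_grid, len(sampled))
print(n_grid * 5, len(sampled) * 5)   # model fits needed with 5-fold CV
```

An exhaustive search over this grid would need 23760 candidates (118800 fits with 5 folds), while the randomized search with `n_iter=20` needs 100 fits.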


In [462]:
from sklearn.model_selection import RandomizedSearchCV

### Logistic Regression (C, solver)

In [465]:
log_reg_grid = {'C':np.logspace(-4,4,20),'solver':['liblinear']}

In [467]:
rs_log_reg= RandomizedSearchCV(LogisticRegression(),  # estimator (ML model is passed)
                               param_distributions = log_reg_grid,
                               n_iter=20,cv=5,verbose=True)

In [469]:
rs_log_reg.fit(X,y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [471]:
rs_log_reg.best_score_

0.8047829647829647

In [473]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 0.23357214690901212}

### SVC (C and kernel)

In [476]:
svc_grid = {'C':[0.25,0.50,0.75,1],'kernel':['linear']}

In [478]:
rs_svc= RandomizedSearchCV(svm.SVC(),
                          param_distributions = svc_grid,
                          cv=5,
                          n_iter=20,
                          verbose=True)

In [480]:
rs_svc.fit(X,y)



Fitting 5 folds for each of 4 candidates, totalling 20 fits


In [482]:
rs_svc.best_score_

0.8066011466011467

In [484]:
rs_svc.best_params_

{'kernel': 'linear', 'C': 0.25}

In [491]:
import warnings 
warnings.filterwarnings('ignore')

### RF Classifier (n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf)

In [495]:
rf_grid = {
    'n_estimators' : np.arange(10,1000,10),
    'max_features' : ['auto','sqrt'],  # 'auto' was removed in scikit-learn 1.3; drop it on newer versions
    'max_depth' : [None,3,5,10,20,30],
    'min_samples_split' : [2,5,20,50,100],
    'min_samples_leaf' : [1,2,5,10]

}

In [497]:
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                          param_distributions = rf_grid,
                          cv=5,
                          n_iter=20,
                          verbose=True)

In [499]:
rs_rf.fit(X,y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [501]:
rs_rf.best_score_

0.8084193284193285

In [503]:
rs_rf.best_params_

{'n_estimators': 70,
 'min_samples_split': 20,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 3}

## Model Performance Before and After Hyperparameter Tuning

---

### Logistic Regression
- **Before Tuning**:
  - Accuracy: **0.8018**
  - Cross-Validation Score: **0.8048**

- **After Tuning**:
  - Best Cross-Validation Score: **0.8048**

---

### Support Vector Classifier (SVC)
- **Before Tuning**:
  - Accuracy: **0.7928**
  - Cross-Validation Score: **0.7939**

- **After Tuning**:
  - Best Cross-Validation Score: **0.8066**

---

### Random Forest Classifier
- **Before Tuning**:
  - Accuracy: **0.7658**
  - Cross-Validation Score: **0.7849**

- **After Tuning**:
  - Best Cross-Validation Score: **0.8084**

---

## Summary:
- **SVC** and **Random Forest Classifier** saw improvements after hyperparameter tuning.
- **Logistic Regression** maintained similar performance despite tuning.


### 17. Save the Model

**We can choose either SVC or RFC since their tuned scores are close; in this case let's go with RFC.**

**We have to train the model on the entire dataset with the best parameters.**
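
A small sketch of two ways to obtain a final model from a finished search, shown on synthetic data with a deliberately tiny grid (all names ending in `_demo` and the grid values are assumptions for illustration). With the default `refit=True`, `RandomizedSearchCV` already refits the best configuration on all the data passed to `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': [10, 20, 30], 'max_depth': [3, 5, None]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_demo, y_demo)

# Option 1: rebuild a model from the winning settings and fit it on all rows
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X_demo, y_demo)

# Option 2: reuse the estimator the search already refitted on all rows
same_settings = search.best_estimator_
print(search.best_params_)
```

Unpacking `best_params_` avoids copying values by hand, which also keeps the final model in sync with whatever the search actually reported.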

In [512]:
X = data.drop('Loan_Status',axis = 1)
y = data['Loan_Status']

In [514]:
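# Note: these values differ from the best_params_ printed above; RandomizedSearchCV
# samples at random, so a different run can report a different best combination.
rf = RandomForestClassifier(n_estimators=880,
                            min_samples_split=50,
                            min_samples_leaf=5,
                            max_features='sqrt',
                            max_depth=5)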

In [516]:
rf.fit(X,y)

In [518]:
import joblib

In [520]:
joblib.dump(rf,'loan_status_predict')

['loan_status_predict']

In [522]:
model = joblib.load('loan_status_predict') # loaded the model

### Testing RFC on new data

In [525]:
import pandas as pd

df = pd.DataFrame({   # for given values we will predict whether loan is approved or not ?
    'Gender': 1,
    'Married': 1,
    'Dependents': 2,
    'Education': 0,
    'Self_Employed': 0,
    'ApplicantIncome': 2889,
    'CoapplicantIncome': 0.0,
    'LoanAmount': 45,
    'Loan_Amount_Term': 180,
    'Credit_History': 0,
    'Property_Area': 1
}, index=[0])


In [527]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1,1,2,0,0,2889,0.0,45,180,0,1


In [529]:
result = model.predict(df)

In [531]:
if result[0] == 1:  # predict() returns an array; take its single element
    print('Loan Approved !')
else:
    print('Not Approved !') # the tuned models scored roughly 80-81% in cross-validation

Not Approved !


### GUI

In [534]:
from tkinter import *
import joblib
import pandas as pd

In [None]:

# Function to process inputs and predict loan status
def show_entry():
    # Collecting inputs
    p1 = float(e1.get())
    p2 = float(e2.get())
    p3 = float(e3.get())
    p4 = float(e4.get())
    p5 = float(e5.get())
    p6 = float(e6.get())
    p7 = float(e7.get())
    p8 = float(e8.get())
    p9 = float(e9.get())
    p10 = float(e10.get())
    p11 = float(e11.get())
    
    # Load pre-trained model
    model = joblib.load('loan_status_predict')
    
    # Create a DataFrame for prediction
    df = pd.DataFrame({
        'Gender': p1,
        'Married': p2,
        'Dependents': p3,
        'Education': p4,
        'Self_Employed': p5,
        'ApplicantIncome': p6,
        'CoapplicantIncome': p7,
        'LoanAmount': p8,
        'Loan_Amount_Term': p9,
        'Credit_History': p10,
        'Property_Area': p11
    }, index=[0])
    
    # Prediction
    result = model.predict(df)
    
    # Display result
    if result == 1:
        Label(master, text="Loan Approved", fg="green").grid(row=31)
    else:
        Label(master, text="Loan Not Approved", fg="red").grid(row=31)

# Setting up the GUI
master = Tk()
master.title("Loan Status Prediction")
label = Label(master, text="Loan Status Prediction", bg="black", fg="white")
label.grid(row=0, columnspan=2)

# Input Labels
Label(master, text="Gender (1:Male, 0:Female)").grid(row=1)
Label(master, text="Married (1:Yes, 0:No)").grid(row=2)
Label(master, text="Dependents [0,1,2,4] (4 = 3+)").grid(row=3)
Label(master, text="Education (1:Graduate, 0:Not Graduate)").grid(row=4)
Label(master, text="Self_Employed (1:Yes, 0:No)").grid(row=5)
Label(master, text="ApplicantIncome").grid(row=6)
Label(master, text="CoapplicantIncome").grid(row=7)
Label(master, text="LoanAmount").grid(row=8)
Label(master, text="Loan_Amount_Term").grid(row=9)
Label(master, text="Credit_History (1:Yes, 0:No)").grid(row=10)
Label(master, text="Property_Area (0:Rural, 1:Urban, 2:Semiurban)").grid(row=11)

# Entry fields
e1 = Entry(master)
e2 = Entry(master)
e3 = Entry(master)
e4 = Entry(master)
e5 = Entry(master)
e6 = Entry(master)
e7 = Entry(master)
e8 = Entry(master)
e9 = Entry(master)
e10 = Entry(master)
e11 = Entry(master)

# Positioning Entry fields
e1.grid(row=1, column=1)
e2.grid(row=2, column=1)
e3.grid(row=3, column=1)
e4.grid(row=4, column=1)
e5.grid(row=5, column=1)
e6.grid(row=6, column=1)
e7.grid(row=7, column=1)
e8.grid(row=8, column=1)
e9.grid(row=9, column=1)
e10.grid(row=10, column=1)
e11.grid(row=11, column=1)

# Predict Button
Button(master, text="Predict", command=show_entry).grid()

# Start the GUI loop
mainloop()
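
The GUI above passes whatever is typed straight to `float()`, so an empty or mistyped field raises an exception, and an out-of-range code (for example Dependents = 3) is silently sent to the model. The following is a sketch of a validation step that could run before the prediction. `parse_inputs`, `COLUMNS` and `ALLOWED` are hypothetical names introduced here; the allowed values mirror the encodings used earlier in this notebook.

```python
import math

# Column order used when the model was trained
COLUMNS = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
           'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
           'Loan_Amount_Term', 'Credit_History', 'Property_Area']

# Hypothetical lookup of valid codes for the encoded columns
ALLOWED = {
    'Gender': {0, 1}, 'Married': {0, 1}, 'Dependents': {0, 1, 2, 4},
    'Education': {0, 1}, 'Self_Employed': {0, 1},
    'Credit_History': {0, 1}, 'Property_Area': {0, 1, 2},
}

def parse_inputs(raw):
    """Convert a dict of entry-field strings to floats, or raise ValueError."""
    values = {}
    for name in COLUMNS:
        text = raw.get(name, '').strip()
        try:
            number = float(text)
        except ValueError:
            raise ValueError(f'{name}: expected a number, got {text!r}') from None
        if not math.isfinite(number):
            raise ValueError(f'{name}: must be a finite number')
        if name in ALLOWED:
            if number not in ALLOWED[name]:
                raise ValueError(f'{name}: expected one of {sorted(ALLOWED[name])}')
        elif number < 0:
            raise ValueError(f'{name}: must not be negative')
        values[name] = number
    return values

# Same values as the test row used earlier, typed in as strings
example = {'Gender': '1', 'Married': '1', 'Dependents': '2', 'Education': '0',
           'Self_Employed': '0', 'ApplicantIncome': '2889',
           'CoapplicantIncome': '0.0', 'LoanAmount': '45',
           'Loan_Amount_Term': '180', 'Credit_History': '0',
           'Property_Area': '1'}
parsed = parse_inputs(example)
print(parsed)
```

In `show_entry`, the strings from `e1.get()` through `e11.get()` could be collected into such a dict, and a caught `ValueError` shown in a label instead of crashing the callback. The returned dict keeps the training column order, so it can be wrapped in `pd.DataFrame(parsed, index=[0])` as before.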
