# Objective: 
### Predict whether a loan application will be approved or rejected based on the applicant's financial and personal information.

**Personal and Demographic Features:**

1. Gender: Male/Female (categorical)
2. Married: Whether the applicant is married (Yes/No)
3. Dependents: Number of dependents (numerical or categorical)
4. Education: Education level of the applicant (Graduate/Not Graduate)
5. Self_Employed: Whether the applicant is self-employed (Yes/No)

**Financial Features:**

6. ApplicantIncome: Monthly or yearly income of the applicant (numerical)
7. CoapplicantIncome: Income of the co-applicant, if applicable (numerical)
8. LoanAmount: Total loan amount requested (numerical)
9. Loan_Amount_Term: Term of the loan in months (e.g., 360 for 30 years)
10. Credit_History: History of loan repayment (0 = No, 1 = Yes)

**Loan Features:**

11. Loan_ID: Unique identifier for each loan application (usually dropped as it’s not predictive).
12. Property_Area: Area where the applicant resides (Urban, Semi-Urban, Rural)
13. Loan_Status: Target variable indicating loan approval or rejection (Y = Approved, N = Rejected).

In [6]:
import pandas as pd

In [7]:
data = pd.read_csv('loan_prediction.csv')

### 1. Display top 5 rows

In [9]:
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### 2. Display last 5 rows

In [11]:
data.tail()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y
613,LP002990,Female,No,0,Graduate,Yes,4583,0.0,133.0,360.0,0.0,Semiurban,N


### 3. Shape of dataset (No. of rows & columns)

In [13]:
data.shape  # shape is an attribute of a pandas DataFrame, not a method

(614, 13)

In [14]:
print("No. of Rows: ", data.shape[0])
print("No. of Columns: ", data.shape[1])

No. of Rows:  614
No. of Columns:  13


### 4. Info about dataset

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


Gender has 601 non-null values → 13 missing values (614 - 601).

Married has 611 non-null values → 3 missing values.

Dependents has 599 non-null values → 15 missing values.

Non-null count: Number of rows where the value is not null (non-missing).

Total rows = 614.

If the non-null count is less than 614, the remaining rows have missing values (NaN).

**The blank cells in our CSV file are null (missing) values. When the file is loaded into a pandas DataFrame, these blank cells are automatically read as NaN (Not a Number), which is the standard pandas/NumPy marker for missing data.**
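As a small illustration of this behaviour, the sketch below reads a tiny in-memory CSV with two blank cells. The rows are made up for the example and are not taken from `loan_prediction.csv`.

```python
import io
import pandas as pd

# Tiny in-memory CSV (made-up rows) with two blank cells:
# LoanAmount is blank in the first row, Gender in the second.
csv_text = "Loan_ID,Gender,LoanAmount\nA1,Male,\nA2,,128\n"
demo = pd.read_csv(io.StringIO(csv_text))

print(demo)               # blank cells show up as NaN
print(demo.isna().sum())  # one missing value in Gender, one in LoanAmount
```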

### 5. Check Null Values

In [21]:
# Check for missing values in each column
data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [22]:
data.isnull().sum()*100 / len(data) # % of missing values

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

**Drop rows with missing values in columns where less than 5% of the values are missing.**

**The 5% threshold is a practical choice that balances data quality and quantity:**

- Small enough to minimize data loss.
- Large enough to remove most of the missing values that could otherwise hurt model performance.
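The threshold rule can also be applied programmatically instead of listing columns by hand. The following is a minimal sketch on a made-up frame; the column names `a`, `b`, `c` and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'a' has 1/40 missing (2.5%), 'b' has 4/40 missing (10%), 'c' has none.
demo = pd.DataFrame({
    "a": [np.nan] + [1.0] * 39,
    "b": [np.nan] * 4 + [2.0] * 36,
    "c": [3.0] * 40,
})

# Percentage of missing values per column.
missing_pct = demo.isnull().mean() * 100

# Columns with some missing values, but fewer than 5%.
low_missing = missing_pct[(missing_pct > 0) & (missing_pct < 5)].index.tolist()

# Drop only the rows that are incomplete in those columns.
cleaned = demo.dropna(subset=low_missing)
print(low_missing, len(cleaned))
```

Columns above the threshold (here `b`) keep their remaining missing values and can be imputed afterwards.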

### 6. Handling Missing Values

In [25]:
data.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [26]:
data = data.drop('Loan_ID',axis = 1)
data.head(1)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y


In [27]:
data.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [28]:
columns =  ['Gender','Dependents','LoanAmount','Loan_Amount_Term']

In [29]:
data = data.dropna(subset = columns)

In [30]:
data.isnull().sum()*100 / len(data) # % of missing values

Gender               0.000000
Married              0.000000
Dependents           0.000000
Education            0.000000
Self_Employed        5.424955
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           0.000000
Loan_Amount_Term     0.000000
Credit_History       8.679928
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

**Practical Strategy:**

The 5% rule is a guideline, not a hard rule. By dropping only the rows with missing values for these selected columns:

1. You avoid significant data loss.

2. You focus on features that are critical for your analysis.

3. You leave room to impute missing values in other columns where dropping data would be less efficient.

**Why Were Rows Dropped Only for These Columns? ('Gender', 'Dependents', 'LoanAmount', 'Loan_Amount_Term')**

**Columns Below the Threshold:**

Gender (2.12%), Dependents (2.44%), LoanAmount (3.58%), and Loan_Amount_Term (2.28%) each have fewer than 5% missing values, so dropping the affected rows costs little data and leaves these variables clean.

**Selective Dropping:**

Calling `dropna()` without a subset would remove every row with a missing value in any column, including the 8.14% of rows that lack Credit_History, which is unnecessary data loss.
Restricting the drop to the listed columns retains more data. Married (0.49% missing) is not in the list, but the output above shows 0% missing for it afterwards, so its few incomplete rows were removed along with the others.

**Imputation for Other Columns: (Self_Employed, Credit_History)**

Self_Employed (5.21% missing) and Credit_History (8.14% missing) exceed the threshold, so their rows are kept and the missing values are imputed with the mode below.

In [34]:
data['Self_Employed'].mode()[0]

'No'

In [35]:
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])


In [36]:
data.isnull().sum()*100 / len(data) # % of missing values

Gender               0.000000
Married              0.000000
Dependents           0.000000
Education            0.000000
Self_Employed        0.000000
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           0.000000
Loan_Amount_Term     0.000000
Credit_History       8.679928
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

In [37]:
data['Credit_History'].unique()

array([ 1.,  0., nan])

In [38]:
data['Self_Employed'].unique()

array(['No', 'Yes'], dtype=object)

In [39]:
data['Credit_History'].mode()[0]

1.0

In [40]:
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].mode()[0])

In [41]:
data.isnull().sum()*100 / len(data) # % of missing values

Gender               0.0
Married              0.0
Dependents           0.0
Education            0.0
Self_Employed        0.0
ApplicantIncome      0.0
CoapplicantIncome    0.0
LoanAmount           0.0
Loan_Amount_Term     0.0
Credit_History       0.0
Property_Area        0.0
Loan_Status          0.0
dtype: float64

### 7. Handling Categorical Columns (Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status)

In [43]:
data.sample(5)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
154,Male,No,0,Graduate,No,3254,0.0,50.0,360.0,1.0,Urban,Y
374,Female,No,0,Graduate,No,2764,1459.0,110.0,360.0,1.0,Urban,Y
141,Male,No,0,Graduate,No,5417,0.0,168.0,360.0,1.0,Urban,Y
158,Male,No,0,Graduate,No,2980,2083.0,120.0,360.0,1.0,Rural,Y
595,Male,No,0,Not Graduate,No,3833,0.0,110.0,360.0,1.0,Rural,Y


In [44]:
data['Dependents'].unique()

array(['1', '0', '2', '3+'], dtype=object)

**Dependents contains the label `3+`, which we replace with `4` so the column can be treated as numeric.**

In [46]:
data['Dependents'] = data['Dependents'].replace(to_replace='3+', value = '4')

In [47]:
data['Dependents'].unique()

array(['1', '0', '2', '4'], dtype=object)

**Simplification for Machine Learning:**

Machine learning models require numerical input, so a label like `3+` has to be converted into a number.

Choosing 4 is an arbitrary stand-in for "three or more": it keeps the category ordered above 2 but does not reflect the true number of dependents.
This simplification lets the model treat all values consistently. Note that the values are still strings at this point (`dtype=object`), as the output above shows.
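If an explicit integer column is preferred over digit strings, the replacement can be followed by a cast. This is a minimal sketch on a stand-alone Series with the same four labels, not a change to the notebook's `data` frame.

```python
import pandas as pd

# Same four labels as in the Dependents column.
dependents = pd.Series(["1", "0", "2", "3+"])

# Replace the open-ended label, then cast the digit strings to integers.
dependents_num = dependents.replace("3+", "4").astype(int)

print(dependents_num.tolist())  # [1, 0, 2, 4]
```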

In [49]:
data['Gender'].unique() # categorical

array(['Male', 'Female'], dtype=object)

In [50]:
data['Gender'] = data['Gender'].map({'Male': 1,'Female': 0}).astype('int')

In [51]:
data['Gender'].unique() # numeric

array([1, 0])

In [52]:
data['Married'].unique()  # categorical

array(['Yes', 'No'], dtype=object)

In [53]:
data['Education'].unique() # categorical

array(['Graduate', 'Not Graduate'], dtype=object)

In [54]:
data['Self_Employed'].unique() # categorical

array(['No', 'Yes'], dtype=object)

In [55]:
data['Property_Area'].unique() # categorical

array(['Rural', 'Urban', 'Semiurban'], dtype=object)

In [56]:
data['Loan_Status'].unique() # categorical

array(['N', 'Y'], dtype=object)

In [57]:
# converting to numeric

data['Married'] = data['Married'].map({'Yes': 1,'No': 0}).astype('int')
data['Education'] = data['Education'].map({'Graduate': 1,'Not Graduate': 0}).astype('int')
data['Self_Employed'] = data['Self_Employed'].map({'Yes': 1,'No': 0}).astype('int')
data['Property_Area'] = data['Property_Area'].map({'Rural': 0,'Urban': 1,'Semiurban':2}).astype('int')
data['Loan_Status'] = data['Loan_Status'].map({'N': 0,'Y': 1}).astype('int')

In [58]:
data['Married'].unique()  # numeric

array([1, 0])

In [59]:
data['Education'].unique()  # numeric

array([1, 0])

In [60]:
data['Self_Employed'].unique()  # numeric

array([0, 1])

In [61]:
data['Property_Area'].unique()  # numeric

array([0, 1, 2])

In [62]:
data['Loan_Status'].unique()  # numeric

array([0, 1])
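Mapping Property_Area to 0, 1, 2 implies an order (Rural < Urban < Semiurban) that the categories do not really have. One common alternative, shown here as a sketch rather than as what this notebook does, is one-hot encoding with `pd.get_dummies`; the four rows are made up.

```python
import pandas as pd

# Made-up rows using the three Property_Area categories.
area = pd.DataFrame({"Property_Area": ["Rural", "Urban", "Semiurban", "Urban"]})

# One indicator column per category, with no implied ordering.
one_hot = pd.get_dummies(area, columns=["Property_Area"], dtype=int)
print(one_hot)
```

Tree-based models usually cope well with the integer mapping, so the difference tends to matter more for linear models and SVMs.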

In [105]:
data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1,1


### 8. Store Features (Independent Variables) in X & Target (Dependent Variable) in y

In [109]:
X = data.drop('Loan_Status',axis = 1)

In [111]:
X

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,2900,0.0,71.0,360.0,1.0,0
610,1,1,4,1,0,4106,0.0,40.0,180.0,1.0,0
611,1,1,1,1,0,8072,240.0,253.0,360.0,1.0,1
612,1,1,2,1,0,7583,0.0,187.0,360.0,1.0,1


In [113]:
X.shape

(553, 11)

In [115]:
y = data['Loan_Status']

In [117]:
y

1      0
2      1
3      1
4      1
5      1
      ..
609    1
610    1
611    1
612    1
613    0
Name: Loan_Status, Length: 553, dtype: int32

In [121]:
y.shape

(553,)

**The last index label is 613 even though only 553 rows remain because rows were dropped during preprocessing.**

**Pandas does not reset the index by default when rows are removed, so the original index labels are kept.**

**If you want the index to start from 0 and run sequentially, you can reset it with `data.reset_index(drop=True, inplace=True)`.**
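The index behaviour described above can be seen on a small made-up frame:

```python
import pandas as pd

# Made-up column with two missing values.
demo = pd.DataFrame({"x": [10, None, 30, None, 50]})

kept = demo.dropna()
print(kept.index.tolist())        # [0, 2, 4]  (original labels are kept)

renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```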

### 9. Feature Scaling

**Feature scaling (FS) is important for ML algorithms that compute distances between samples or are sensitive to feature magnitude: K-Nearest Neighbors, SVM, linear and logistic regression (when regularized or fit by gradient descent), and neural networks. Without scaling, features with a large value range dominate. Tree-based models (Decision Tree, Random Forest) and Naive Bayes are largely unaffected by feature scaling.**

**If features like ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term are not scaled, their large magnitudes can dominate smaller-scaled features, resulting in biased or poor model performance.**
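To make the transformation concrete: standardization subtracts each column's mean and divides by its standard deviation. The sketch below checks a manual computation against `StandardScaler` on four ApplicantIncome values taken from the `data.head()` output; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four ApplicantIncome values from data.head(), as a single-column array.
income = np.array([[2583.0], [3000.0], [4583.0], [6000.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (income - income.mean(axis=0)) / income.std(axis=0)

scaled = StandardScaler().fit_transform(income)
print(np.allclose(manual, scaled))  # True
```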

In [127]:
data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1,1


**Gender, Married, Dependents, Education, and Self_Employed are already in a similar small range, but ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term are not, so we need to scale them.**

We perform feature scaling only on the columns ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term.

In [139]:
cols = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']

In [141]:
from sklearn.preprocessing import StandardScaler
st = StandardScaler()
X[cols] = st.fit_transform(X[cols])

In [143]:
X # scaled features

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,1,1,1,1,0,-0.128694,-0.049699,-0.214368,0.279961,1.0,0
2,1,1,0,1,1,-0.394296,-0.545638,-0.952675,0.279961,1.0,1
3,1,1,0,0,0,-0.464262,0.229842,-0.309634,0.279961,1.0,1
4,1,0,0,1,0,0.109057,-0.545638,-0.059562,0.279961,1.0,1
5,1,1,2,1,1,0.011239,0.834309,1.440866,0.279961,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,-0.411075,-0.545638,-0.893134,0.279961,1.0,0
610,1,1,4,1,0,-0.208727,-0.545638,-1.262287,-2.468292,1.0,0
611,1,1,1,1,0,0.456706,-0.466709,1.274152,0.279961,1.0,1
612,1,1,2,1,0,0.374659,-0.545638,0.488213,0.279961,1.0,1
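Fitting the scaler on all of `X` means the test rows and cross-validation folds influence the mean and standard deviation used for scaling. One way to avoid that, sketched here on synthetic data rather than the loan dataset, is to put the scaler inside a pipeline so it is refit on each training fold. The synthetic columns and the names `X_demo`, `y_demo`, `pipe` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (not the real data).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "ApplicantIncome": rng.normal(5000, 2000, n),
    "LoanAmount": rng.normal(140, 60, n),
    "Credit_History": rng.integers(0, 2, n),
})
y_demo = (X_demo["Credit_History"] + rng.normal(0, 0.5, n) > 0.5).astype(int)

# Scale only the large-magnitude columns; pass the rest through unchanged.
scale_cols = ["ApplicantIncome", "LoanAmount"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), scale_cols)],
    remainder="passthrough",
)

# The scaler is now fit inside each CV training fold only.
pipe = make_pipeline(preprocess, LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```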


### 10. Split data into Training & Testing Set & Apply K-Fold Cross Validation

In [157]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np

In [159]:
# Train the model on X_train, y_train, predict on X_test (unseen samples),
# and compare the predictions against y_test.

# Alongside the train/test split we use K-fold cross-validation, which lets us
# compare different ML algorithms and get a sense of how well they work in practice.

model_df = {}  # maps each model to its mean CV accuracy (in %)
def model_val(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'{model} accuracy is {accuracy_score(y_test, y_pred)}')
    score = cross_val_score(model, X, y, cv=5)  # 5-fold CV
    print(f'{model} Avg. cross val score is: {np.mean(score)}')
    model_df[model] = round(np.mean(score)*100, 2)  # key: model, value: mean CV score in %
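To show what `cross_val_score(model, X, y, cv=5)` does for a classifier, the sketch below reproduces it by hand with `StratifiedKFold` on synthetic data from `make_classification`. The dataset and variable names are assumptions for illustration, not part of the loan notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for X, y).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression()

# Manual 5-fold CV: train on 4 folds, score on the held-out fold, repeat.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], fold_pred))

# For a classifier and an integer cv, the library call uses the same stratified folds.
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(np.allclose(manual_scores, library_scores))
```

Averaging the five fold scores gives the "Avg. cross val score" printed by `model_val`.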

### 11. Logistic Regression

In [161]:
data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,1,1


In [163]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model_val(model,X,y)

LogisticRegression() accuracy is 0.8018018018018018
LogisticRegression() Avg. cross val score is: 0.8047829647829647


In [165]:
model_df

{LogisticRegression(): 80.48}

### 12. SVC

In [168]:
from sklearn import svm
model = svm.SVC()
model_val(model,X,y)

SVC() accuracy is 0.7927927927927928
SVC() Avg. cross val score is: 0.7938902538902539


In [170]:
model_df

{LogisticRegression(): 80.48, SVC(): 79.39}

### 13. DT Classifier

In [173]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model_val(model,X,y)

DecisionTreeClassifier() accuracy is 0.7657657657657657
DecisionTreeClassifier() Avg. cross val score is: 0.7161834561834561


In [175]:
model_df

{LogisticRegression(): 80.48, SVC(): 79.39, DecisionTreeClassifier(): 71.62}

### 14. RF Classifier

In [178]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model_val(model,X,y)

RandomForestClassifier() accuracy is 0.7747747747747747
RandomForestClassifier() Avg. cross val score is: 0.792088452088452


In [180]:
model_df

{LogisticRegression(): 80.48,
 SVC(): 79.39,
 DecisionTreeClassifier(): 71.62,
 RandomForestClassifier(): 79.21}

### 15. Gradient Boosting Classifier

In [184]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model_val(model,X,y)

GradientBoostingClassifier() accuracy is 0.7927927927927928
GradientBoostingClassifier() Avg. cross val score is: 0.7721539721539721


In [186]:
model_df

{LogisticRegression(): 80.48,
 SVC(): 79.39,
 DecisionTreeClassifier(): 71.62,
 RandomForestClassifier(): 79.21,
 GradientBoostingClassifier(): 77.22}

### 16. Hyperparameter Tuning

**So far we have trained our models with default parameters; tuning them may improve performance.**

**In ML there are two kinds of parameters: 1. Model parameters (learned from the data during the training phase) 2. Hyperparameters (set before training; they must be tuned to obtain a model with optimal performance).**

**How do we find good hyperparameters? 1. GridSearchCV (evaluates every combination of the supplied parameter values, which makes it computationally very expensive) 2. RandomizedSearchCV (addresses this drawback by evaluating only a fixed number of randomly sampled combinations).**
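The cost difference can be made concrete by counting candidates. The sketch below uses part of the random forest grid that appears later in this notebook; `ParameterGrid` enumerates every combination, while `ParameterSampler` draws a fixed number of them, which is what `RandomizedSearchCV` evaluates.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Subset of the random forest grid used later in this notebook.
grid = {
    "n_estimators": np.arange(10, 1000, 10),
    "max_depth": [None, 3, 5, 10, 20, 30],
    "min_samples_split": [2, 5, 20, 50, 100],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Exhaustive search would try every combination: 99 * 6 * 5 * 4.
n_grid = len(ParameterGrid(grid))

# Randomized search only evaluates n_iter sampled combinations.
sampled = list(ParameterSampler(grid, n_iter=20, random_state=0))

print(n_grid, len(sampled))  # 11880 20
```

With 5-fold cross-validation, that is 59,400 model fits for the full grid against 100 for the randomized search.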

In [194]:
from sklearn.model_selection import RandomizedSearchCV

### Logistic Regression (C ,solver)

In [199]:
log_reg_grid = {'C': np.logspace(-4,4,20),
                'solver': ['liblinear']} 

In [201]:
rs_log_reg = RandomizedSearchCV(LogisticRegression(), # estimator (ML model is passed)
                   param_distributions = log_reg_grid, 
                   n_iter=20, cv=5, verbose=True)   

In [203]:
rs_log_reg.fit(X,y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [205]:
rs_log_reg.best_score_

0.8047829647829647

In [207]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 0.23357214690901212}

### SVC (C, kernel)

In [211]:
svc_grid = {'C': [0.25,0.50,0.75,1], 'kernel': ['linear']}

In [213]:
rs_svc = RandomizedSearchCV(svm.SVC(),
                              param_distributions = svc_grid,
                              cv = 5 ,
                              n_iter = 20 , 
                              verbose = True)

In [215]:
rs_svc.fit(X,y)



Fitting 5 folds for each of 4 candidates, totalling 20 fits


In [217]:
rs_svc.best_score_

0.8066011466011467

In [219]:
rs_svc.best_params_

{'kernel': 'linear', 'C': 0.25}

### RF Classifier (n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf)

In [223]:
rf_grid = {
    'n_estimators': np.arange(10, 1000, 10),
    'max_features': ['auto', 'sqrt'],
    'max_depth': [None, 3, 5, 10, 20, 30],
    'min_samples_split': [2, 5, 20, 50, 100],
    'min_samples_leaf': [1, 2, 5, 10]
}

In [225]:
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                              param_distributions = rf_grid,
                              cv = 5 ,
                              n_iter = 20 , 
                              verbose = True)

In [227]:
rs_rf.fit(X,y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


45 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\dipes\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\dipes\anaconda3\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\dipes\anaconda3\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\dipes\anaconda3\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParamete
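The truncated traceback above does not name the offending parameter. However, 45 failed fits out of 100 (9 of the 20 candidates across 5 folds) would be consistent with candidates that drew `max_features='auto'`, a value that recent scikit-learn releases no longer accept for `RandomForestClassifier`. Treat that diagnosis as an assumption. Under it, a grid without `'auto'` should run without failed fits, as in this sketch on synthetic data (the reduced grid sizes and the names `rf_grid_fixed`, `X_demo`, `y_demo` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with 11 features, like the loan feature matrix.
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

rf_grid_fixed = {
    "n_estimators": [50, 100],         # kept small so the sketch runs quickly
    "max_features": ["sqrt", "log2"],  # assumption: 'auto' was the invalid value
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5, 20],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=rf_grid_fixed,
    n_iter=5,
    cv=3,
    random_state=0,
    error_score="raise",  # surface any invalid parameter immediately
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Setting `error_score="raise"`, as the warning itself suggests, makes an invalid candidate stop the search instead of silently scoring NaN.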

In [229]:
rs_rf.best_score_

0.8066011466011467

In [231]:
rs_rf.best_params_

{'n_estimators': 880,
 'min_samples_split': 50,
 'min_samples_leaf': 5,
 'max_features': 'sqrt',
 'max_depth': 5}

## Model Performance Before and After Hyperparameter Tuning

---

### Logistic Regression
- **Before Tuning**:
  - Test-Set Accuracy: **0.8018**
  - Cross-Validation Score: **0.8048**

- **After Tuning**:
  - Best Cross-Validation Score: **0.8048**

---

### Support Vector Classifier (SVC)
- **Before Tuning**:
  - Test-Set Accuracy: **0.7928**
  - Cross-Validation Score: **0.7939**

- **After Tuning**:
  - Best Cross-Validation Score: **0.8066**

---

### Random Forest Classifier
- **Before Tuning**:
  - Test-Set Accuracy: **0.7748**
  - Cross-Validation Score: **0.7921**

- **After Tuning**:
  - Best Cross-Validation Score: **0.8066**

---

## Summary:
- **SVC** and **Random Forest Classifier** saw improvements after hyperparameter tuning.
- **Logistic Regression** maintained similar performance despite tuning.


### 17. Save The Model

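**SVC and RFC reached the same best cross-validation score, so either is a reasonable choice; here we go with RFC.**

**We retrain the model on the entire dataset with the best parameters. `X` is rebuilt from `data` below, so the features are unscaled; this is fine for a random forest, which is insensitive to feature scaling, and it lets us pass raw values at prediction time.**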

In [241]:
X = data.drop('Loan_Status',axis = 1)
y = data['Loan_Status']

In [243]:
rf = RandomForestClassifier(n_estimators= 880,
 min_samples_split= 50,
 min_samples_leaf= 5,
 max_features= 'sqrt',
 max_depth= 5)

In [245]:
rf.fit(X,y)

In [247]:
import joblib

In [249]:
joblib.dump(rf,'loan_status_predict')

['loan_status_predict']

In [251]:
model = joblib.load('loan_status_predict') # loaded the model
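A quick sanity check after saving is to confirm that the reloaded model predicts exactly what the in-memory model does. The sketch below does this round trip with a small forest on synthetic data and a temporary file; the file name `demo_model.joblib` and the variable names are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem and model (stand-in for the tuned random forest).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(rf_demo, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(rf_demo.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```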

### Testing RFC on new data

In [254]:
import pandas as pd

df = pd.DataFrame({   # for these values we predict whether the loan is approved
    'Gender': 1,
    'Married': 1,
    'Dependents': 2,
    'Education': 0,
    'Self_Employed': 0,
    'ApplicantIncome': 2889,
    'CoapplicantIncome': 0.0,
    'LoanAmount': 45,
    'Loan_Amount_Term': 180,
    'Credit_History': 0,
    'Property_Area': 1
}, index=[0])


In [256]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1,1,2,0,0,2889,0.0,45,180,0,1


In [260]:
result = model.predict(df)

In [264]:
if result[0] == 1:
    print('Loan Approved !')
else:
    print('Not Approved !')  # the tuned model's best cross-validation score was about 80.66%

Not Approved !
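The model expects the same numeric encodings and column order that were used in training. A small helper can translate human-readable inputs into that form. The function `encode_applicant` and the constants below are hypothetical names introduced for this sketch; the mappings themselves are copied from the encoding step earlier in this notebook.

```python
import pandas as pd

# Mappings copied from the encoding step earlier in this notebook.
ENCODINGS = {
    "Gender": {"Male": 1, "Female": 0},
    "Married": {"Yes": 1, "No": 0},
    "Dependents": {"0": 0, "1": 1, "2": 2, "3+": 4},
    "Education": {"Graduate": 1, "Not Graduate": 0},
    "Self_Employed": {"Yes": 1, "No": 0},
    "Property_Area": {"Rural": 0, "Urban": 1, "Semiurban": 2},
}

# Column order of the training features.
FEATURE_ORDER = [
    "Gender", "Married", "Dependents", "Education", "Self_Employed",
    "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area",
]

def encode_applicant(raw):
    """Turn a dict of raw applicant values into a one-row feature frame."""
    row = {}
    for col in FEATURE_ORDER:
        value = raw[col]
        # Categorical columns are mapped; numeric columns pass through unchanged.
        row[col] = ENCODINGS[col][value] if col in ENCODINGS else value
    return pd.DataFrame([row], columns=FEATURE_ORDER)

# Same applicant as in the cell above, written with the original labels.
applicant = encode_applicant({
    "Gender": "Male", "Married": "Yes", "Dependents": "2",
    "Education": "Not Graduate", "Self_Employed": "No",
    "ApplicantIncome": 2889, "CoapplicantIncome": 0.0, "LoanAmount": 45,
    "Loan_Amount_Term": 180, "Credit_History": 0, "Property_Area": "Urban",
})
print(applicant)
```

The resulting frame can be passed to `model.predict` in place of the hand-built `df`.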


### GUI

In [267]:
from tkinter import Tk, Label, Entry, Button, mainloop
import joblib
import pandas as pd

In [269]:

# Function to process inputs and predict loan status
def show_entry():
    # Collecting inputs
    p1 = float(e1.get())
    p2 = float(e2.get())
    p3 = float(e3.get())
    p4 = float(e4.get())
    p5 = float(e5.get())
    p6 = float(e6.get())
    p7 = float(e7.get())
    p8 = float(e8.get())
    p9 = float(e9.get())
    p10 = float(e10.get())
    p11 = float(e11.get())
    
    # Load pre-trained model
    model = joblib.load('loan_status_predict')
    
    # Create a DataFrame for prediction
    df = pd.DataFrame({
        'Gender': p1,
        'Married': p2,
        'Dependents': p3,
        'Education': p4,
        'Self_Employed': p5,
        'ApplicantIncome': p6,
        'CoapplicantIncome': p7,
        'LoanAmount': p8,
        'Loan_Amount_Term': p9,
        'Credit_History': p10,
        'Property_Area': p11
    }, index=[0])
    
    # Prediction
    result = model.predict(df)
    
    # Display result
    if result == 1:
        Label(master, text="Loan Approved", fg="green").grid(row=31)
    else:
        Label(master, text="Loan Not Approved", fg="red").grid(row=31)

# Setting up the GUI
master = Tk()
master.title("Loan Status Prediction")
label = Label(master, text="Loan Status Prediction", bg="black", fg="white")
label.grid(row=0, columnspan=2)

# Input Labels
Label(master, text="Gender (1:Male, 0:Female)").grid(row=1)
Label(master, text="Married (1:Yes, 0:No)").grid(row=2)
Label(master, text="Dependents [0,1,2,4] (4 = 3+)").grid(row=3)
Label(master, text="Education (1:Graduate, 0:Not Graduate)").grid(row=4)
Label(master, text="Self_Employed (1:Yes, 0:No)").grid(row=5)
Label(master, text="ApplicantIncome").grid(row=6)
Label(master, text="CoapplicantIncome").grid(row=7)
Label(master, text="LoanAmount").grid(row=8)
Label(master, text="Loan_Amount_Term").grid(row=9)
Label(master, text="Credit_History (1 or 0)").grid(row=10)
Label(master, text="Property_Area (0:Rural, 1:Urban, 2:Semiurban)").grid(row=11)

# Entry fields
e1 = Entry(master)
e2 = Entry(master)
e3 = Entry(master)
e4 = Entry(master)
e5 = Entry(master)
e6 = Entry(master)
e7 = Entry(master)
e8 = Entry(master)
e9 = Entry(master)
e10 = Entry(master)
e11 = Entry(master)

# Positioning Entry fields
e1.grid(row=1, column=1)
e2.grid(row=2, column=1)
e3.grid(row=3, column=1)
e4.grid(row=4, column=1)
e5.grid(row=5, column=1)
e6.grid(row=6, column=1)
e7.grid(row=7, column=1)
e8.grid(row=8, column=1)
e9.grid(row=9, column=1)
e10.grid(row=10, column=1)
e11.grid(row=11, column=1)

# Predict Button
Button(master, text="Predict", command=show_entry).grid()

# Start the GUI loop
mainloop()
