# ***Assignment on Machine Learning***
The primary objective of this project is to develop accurate and robust machine learning models that can predict the creditworthiness of a borrower.
The goal is to identify individuals who are most likely to repay loans while reducing the risk for financial institutions.
You will be judged on your model’s ability to make accurate predictions, your approach to feature engineering, and how well you handle the trade-off between precision and recall.

### First steps
* Import the relevant Python libraries (NumPy, Pandas, Matplotlib, scikit-learn, etc.), which make the tasks below easier to perform.
* Read the CSV file into a Pandas DataFrame.
* Perform some data cleaning, i.e. put the data into a form that the libraries can work with smoothly.

In [2]:
# Importing Relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR

In [3]:
path = r"C:\Users\Anthony\PORA ACADEMY\Data Science Class\Materials and Notebooks\Loan Data Train.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0,ID,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Total_Income
0,74768,LP002231,1,1,0,1,0,8328,0.0,17,363,1,2,1,6000
1,79428,LP001448,1,1,0,0,0,150,3857.458782,188,370,1,1,0,6000
2,70497,LP002231,0,0,0,0,0,4989,314.472511,17,348,1,0,0,6000
3,87480,LP001385,1,1,0,0,0,150,0.0,232,359,1,1,1,3750
4,33964,LP002231,1,1,1,0,0,8059,0.0,17,372,1,0,1,3750


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5898 entries, 0 to 5897
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 5898 non-null   int64  
 1   Loan_ID            5898 non-null   object 
 2   Gender             5898 non-null   int64  
 3   Married            5898 non-null   int64  
 4   Dependents         5898 non-null   object 
 5   Education          5898 non-null   int64  
 6   Self_Employed      5898 non-null   int64  
 7   ApplicantIncome    5898 non-null   int64  
 8   CoapplicantIncome  5898 non-null   float64
 9   LoanAmount         5898 non-null   int64  
 10  Loan_Amount_Term   5898 non-null   int64  
 11  Credit_History     5898 non-null   int64  
 12  Property_Area      5898 non-null   int64  
 13  Loan_Status        5898 non-null   int64  
 14  Total_Income       5898 non-null   int64  
dtypes: float64(1), int64(12), object(2)
memory usage: 691.3+ KB


In [8]:
group = df.groupby(['Gender', 'Married'])['ID'].count()
print(group)

Gender  Married
0       0            69
        1           457
1       0           789
        1          4583
Name: ID, dtype: int64


In [10]:
df.keys()

Index(['ID', 'Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'Total_Income'],
      dtype='object')

In [14]:
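# Columns to one-hot encode and columns to scale.
# Loan_Amount_Term and Credit_History appear in neither list, so a ColumnTransformer
# built from these lists with the default remainder='drop' discards them.
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Total_Income']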

In [44]:
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_columns),
    ('cat', OneHotEncoder(), categorical_columns)
])

In [46]:
X = df.drop(['Loan_Status', 'Loan_ID','ID'], axis=1)
y = df['Loan_Status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [48]:
# Random Forest Classifier Model
Rnd_For_class = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])

# Logistic Regression Model
log_Reg = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

In [50]:
# Fitting RandomForestClassifier
Rnd_For_class.fit(X_train,y_train)

In [24]:
# Fitting LogisticRegression
log_Reg.fit(X_train,y_train)

In [26]:
# Model score and prediction
print(f"\nTrain score from Random Forest Classifier Model: {Rnd_For_class.score(X_train,y_train)}")
print(f"Test score from Random Forest Classifier Model: {Rnd_For_class.score(X_test,y_test)}")
Rnd_For_class_pred = Rnd_For_class.predict(X_test)
Rnd_For_class_pred


Train score from Random Forest Classifier Model: 0.9980924120389996
Test score from Random Forest Classifier Model: 0.8169491525423729


array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [28]:
# Model Score and prediction
print(f"\nTrain score from LogisticRegression Model: {log_Reg.score(X_train,y_train)}")
print(f"Test score from LogisticRegression Model: {log_Reg.score(X_test,y_test)}")
log_Reg_pred = log_Reg.predict(X_test)
log_Reg_pred


Train score from LogisticRegression Model: 0.8308605341246291
Test score from LogisticRegression Model: 0.8415254237288136


array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [32]:
from sklearn.metrics import classification_report, roc_auc_score

# Evaluation metrics
print("Random Forest Classifier:\n", classification_report(y_test, Rnd_For_class_pred))
print("Logistic Regression:\n", classification_report(y_test, log_Reg_pred))


print("\nROC AUC Random Forest Classifier: ", roc_auc_score(y_test, Rnd_For_class.predict_proba(X_test)[:, 1]))
print("ROC AUC Logistic Regression: ", roc_auc_score(y_test, log_Reg.predict_proba(X_test)[:, 1]))

Random Forest Classifier:
               precision    recall  f1-score   support

           0       0.16      0.04      0.06       187
           1       0.84      0.96      0.90       993

    accuracy                           0.82      1180
   macro avg       0.50      0.50      0.48      1180
weighted avg       0.73      0.82      0.77      1180

Logistic Regression:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       187
           1       0.84      1.00      0.91       993

    accuracy                           0.84      1180
   macro avg       0.42      0.50      0.46      1180
weighted avg       0.71      0.84      0.77      1180


ROC AUC Random Forest Classifier:  0.5083041181317349
ROC AUC Logistic Regression:  0.5396438168785779




|**Accuracy** |**Precision** |**Recall (Sensitivity)**|**F1-score**|
|--------------|---------------|---------------------|------------------|
|Measures how often the model's predictions are correct: the ratio of the number of correct predictions to the total number of predictions. It is most useful when the class distribution is balanced. |Evaluates how many of the predicted positives are actually positive: the ratio of true positives (correctly predicted positive cases) to the sum of true positives and false positives. High precision means that few of the predicted positives are false positives. |Measures how well the model identifies positive cases: the ratio of true positives to the sum of true positives and false negatives. High recall means the model captures most of the actual positives. |The harmonic mean of precision and recall. It balances the two metrics and is especially useful when the class distribution is uneven or when a balance between precision and recall is needed. |
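
The definitions in the table can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `metrics_from_counts` and the confusion-matrix counts are hypothetical and do not come from the loan data. Undefined ratios (0/0) are returned as 0.0, which is also the convention scikit-learn uses by default in its reports.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class.

    tp, fp, fn, tn are confusion-matrix counts. Ratios with a zero
    denominator are returned as 0.0 instead of raising an error.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = metrics_from_counts(tp=80, fp=20, fn=10, tn=90)
for name, value in example.items():
    print(f"{name}: {value:.3f}")

# A degenerate case: the positive class is never predicted (tp = fp = 0),
# so precision is 0/0 and falls back to 0.0.
never_predicted = metrics_from_counts(tp=0, fp=0, fn=5, tn=5)
print(never_predicted)
```

In the degenerate case accuracy is still 0.5 even though precision, recall and F1 are all 0, which shows why accuracy on its own can hide a class the model never predicts.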

## Interpretation of the Evaluation Metrics
### 1. Random Forest Classifier
- **Precision:** Only 16% of the loans that the model predicted as denied were actually denied, so most of its denial predictions are wrong. On the other hand, 84% of the loans that the model predicted as approved were actually approved. That figure matches the share of approved loans in the test set (993 of 1,180), so it largely reflects the class balance.

- **Recall:** The model correctly identifies only 4% of all actual loan denials, so it misses most loans that should be denied. In contrast, it correctly identifies 96% of all actual loan approvals, which shows it captures most of the approved loans.

- **F1-score:** The low F1-score for denials (6%) indicates poor performance in balancing precision and recall for that class, whereas the high F1-score for approvals (90%) reflects strong performance in balancing precision and recall for loan approvals.

- **Support:** There are 187 instances in the dataset where loans were actually denied and there are 993 instances in the dataset where loans were actually approved.
  
- **Accuracy (0.82)**: The model correctly predicts the loan status 82% of the time. However, due to the imbalance in loan status, this high accuracy is largely driven by the model's performance on loan approvals.
#### Summary
- The Random Forest Classifier performs well in predicting loan approvals but struggles to predict loan denials accurately. This is probably because the majority class (loan approvals) dominates the model's predictions.
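
One common way to handle the precision/recall trade-off for the minority class is to move the decision threshold applied to the predicted probabilities. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` (not the loan dataset), and the names `X_demo`, `y_demo`, `forest` and the thresholds 0.5, 0.7 and 0.9 are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: roughly 16% class 0 ("denied")
# and 84% class 1 ("approved"). This is NOT the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.16, 0.84], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_approved = forest.predict_proba(X_te)[:, 1]

denial_recall = {}
for threshold in (0.5, 0.7, 0.9):
    # Approve only when the estimated approval probability reaches the threshold.
    pred = (proba_approved >= threshold).astype(int)
    denial_recall[threshold] = recall_score(y_te, pred, pos_label=0)
    denial_precision = precision_score(y_te, pred, pos_label=0, zero_division=0)
    print(
        f"threshold={threshold}: denial recall={denial_recall[threshold]:.2f}, "
        f"denial precision={denial_precision:.2f}"
    )
```

Raising the approval threshold can only move predictions from "approved" to "denied", so recall for denials never decreases as the threshold rises; the usual cost is lower precision for denials, because more approved loans get flagged too. Which threshold is appropriate depends on how costly a missed denial is compared with a wrongly rejected applicant.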

### 2. Logistic Regression
- **Precision:** The reported precision of 0 for denials does not mean that the model's denial predictions were wrong: the model never predicted a denial on the test set. Its 993 correct predictions out of 1,180 (accuracy 0.8415) are exactly the 993 approved loans, so every test loan was predicted as approved. Precision for the denied class is therefore undefined (0/0) and is reported as 0. For approvals, 84% of the loans predicted as approved were actually approved, which is simply the share of approved loans in the test set.

- **Recall:** The model did not identify any of the actual loan denials. A recall of 0 indicates that all loans that should have been denied were missed. The model correctly identifies all actual loan approvals; a recall of 1 indicates perfect detection of approved loans.

- **F1-score:** The F1-score for denials is 0 because no denial was correctly identified, indicating a complete failure on that class. In contrast, the high F1-score for approvals (91%) reflects strong performance in balancing precision and recall for loan approvals.
  
- **Support:** There are 187 instances in the dataset where loans were actually denied and there are 993 instances in the dataset where loans were actually approved.

- **Accuracy (0.84)**: The model correctly predicts the loan status 84% of the time. However, this high accuracy is misleading because it is driven entirely by the model’s performance on loan approvals.

#### Summary
- The Logistic Regression model performs well in predicting loan approvals but completely fails at predicting loan denials. Its high accuracy is therefore misleading and is a result of the imbalanced dataset.
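
A quick way to see how much of an accuracy figure comes from class imbalance alone is to compare it with a majority-class baseline. The sketch below is an illustration, not part of the original analysis: it uses scikit-learn's `DummyClassifier` on placeholder labels that have the same class counts as the test-set support reported above (993 approved, 187 denied). The names `X_demo`, `y_demo` and `baseline` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels with the same class counts as the test-set support:
# 993 approved (1) and 187 denied (0). The features are placeholders,
# because a most-frequent baseline ignores them.
y_demo = np.array([1] * 993 + [0] * 187)
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_pred)
# Balanced accuracy averages the recall of each class, so it is not
# inflated by the majority class.
baseline_balanced = balanced_accuracy_score(y_demo, baseline_pred)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline balanced accuracy: {baseline_balanced:.4f}")
```

With these counts the baseline reaches an accuracy of 993/1180, about 0.8415, and a balanced accuracy of 0.5 without learning anything. A model is only informative on this split if it improves on those two numbers, which is why metrics such as balanced accuracy, per-class recall, or ROC AUC are worth reporting next to plain accuracy.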

In [34]:
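# Creating new features on a copy, so the original DataFrame stays unchanged
df_plus_featurs = df.copy()
# Total income divided by household size (dependents + the applicant); '3+' is treated as 3
df_plus_featurs['Income_Per_Person'] = df['Total_Income'] / (df['Dependents'].replace('3+', 3).astype(int) + 1)
# Loan amount per unit of loan term, labeled here as Equated Daily Installment (EDI)
df_plus_featurs['EDI'] = df['LoanAmount'] / df['Loan_Amount_Term']
df_plus_featurs.head()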

Unnamed: 0,ID,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Total_Income,Income_Per_Person,EDI
0,74768,LP002231,1,1,0,1,0,8328,0.0,17,363,1,2,1,6000,6000.0,0.046832
1,79428,LP001448,1,1,0,0,0,150,3857.458782,188,370,1,1,0,6000,6000.0,0.508108
2,70497,LP002231,0,0,0,0,0,4989,314.472511,17,348,1,0,0,6000,6000.0,0.048851
3,87480,LP001385,1,1,0,0,0,150,0.0,232,359,1,1,1,3750,3750.0,0.64624
4,33964,LP002231,1,1,1,0,0,8059,0.0,17,372,1,0,1,3750,1875.0,0.045699
