<a href="https://colab.research.google.com/github/shivam0988/Bliss_Browser_ActionServerPagesDotNET/blob/Bliss_Browser_ActionServerPagesDOTNET_Main-dev/Loan_Fraud_Detection_final_17_may_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -** Shivain Chhibber 2210992303
##### **Team Member 2 -** Shivam Chaudhary 2210992307
##### **Team Member 3 -** Shreshth Verma 2210992339
##### **Team Member 4 -** Shresth Kumar 2210992340

# **Project Summary -**

The project aims to develop a machine learning model for loan fraud detection using historical loan application data. The dataset consists of information such as applicant demographics, financial history, loan details, and loan status (approved or denied). Leveraging this dataset, the project seeks to build a predictive model capable of distinguishing between legitimate and fraudulent loan applications.

Key Steps:

Data Collection: Gather historical loan application data from the financial institution's database or other reliable sources.

Data Preprocessing: Clean the dataset by handling missing values, outliers, and performing feature engineering if necessary. Convert categorical variables into numerical format using techniques like one-hot encoding or label encoding.

Exploratory Data Analysis (EDA): Conduct exploratory data analysis to gain insights into the distribution of variables, identify patterns, and detect correlations between features and loan status.

Feature Selection: Select relevant features that are highly predictive of loan fraud while eliminating irrelevant or redundant features.

Model Building: Employ various machine learning algorithms such as logistic regression, random forest, or support vector machines to build the predictive model. Train the model using the training dataset and evaluate its performance using appropriate metrics such as accuracy, precision, recall, and F1-score.

Model Evaluation: Assess the model's performance using techniques like cross-validation and hyperparameter tuning to optimize its predictive ability.

Deployment: Deploy the trained model into a production environment where it can be integrated into the financial institution's loan application system for real-time fraud detection.

Monitoring and Maintenance: Continuously monitor the model's performance and retrain it periodically to adapt to evolving trends and emerging fraud patterns.

Expected Outcome:
The developed machine learning model will enable the financial institution to automate the process of loan fraud detection, thereby enhancing efficiency, reducing manual effort, and minimizing financial losses associated with fraudulent loan applications. By effectively identifying and rejecting fraudulent loan applications, the institution can safeguard its assets and maintain trust with legitimate customers.
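
The key steps above can be sketched as a single scikit-learn pipeline. This is only an illustration under stated assumptions: the `toy` DataFrame is randomly generated, it borrows a few column names from the loan dataset, and its target rule is made up, so the printed score says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data that mimics a few columns of the loan dataset
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "ApplicantIncome": rng.integers(1500, 10000, n).astype(float),
    "LoanAmount": rng.integers(50, 400, n).astype(float),
    "Credit_History": rng.choice([0.0, 1.0], n, p=[0.2, 0.8]),
    "Property_Area": rng.choice(["Urban", "Semiurban", "Rural"], n),
})
# Made-up target rule: approval loosely follows credit history
toy["Loan_Status"] = np.where(
    (toy["Credit_History"] == 1.0) & (rng.random(n) > 0.2), "Y", "N"
)
toy.loc[::15, "LoanAmount"] = np.nan  # inject some missing values

numeric_cols = ["ApplicantIncome", "LoanAmount", "Credit_History"]
categorical_cols = ["Property_Area"]

# Preprocessing: impute and scale numeric columns, impute and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Model building: preprocessing and classifier are fitted together
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = toy.drop(columns="Loan_Status")
y = toy["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model evaluation on the held-out split
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Hold-out accuracy on toy data: {test_accuracy:.2f}")
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are fitted on the training split only, which avoids leaking test-set statistics into the model.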

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Loan fraud is a significant concern for financial institutions as it can result in substantial financial losses. Detecting fraudulent loan applications is crucial to mitigate risks and maintain the integrity of the lending process. The objective of this project is to develop a robust machine learning model that can effectively identify fraudulent loan applications based on various applicant attributes and historical data.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/Merged-file.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()

duplicate_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = df.isnull().sum().reset_index()
null_values

In [None]:
# Visualizing the missing values
plt.figure(figsize = (20,7))
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

Answer Here
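
One way to answer this is to collect the separate checks above (dtypes, missing values, unique counts) into a single overview table. The sketch below assumes a hypothetical helper named `summarize` and a small hand-made `sample` frame; on the real data it would be called as `summarize(df)`.

```python
import numpy as np
import pandas as pd

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing percent, unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(1),
        "unique": frame.nunique(),
    })

# Hypothetical sample with a few loan-dataset column names
sample = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
})

overview = summarize(sample)
print(overview)
```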

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique().reset_index()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.isnull().sum().reset_index()

In [None]:
# Fill missing LoanAmount values with the mean of the column
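# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify df
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())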

# Group by 'Property_Area' and calculate the mean of 'LoanAmount'
average_loan_amount_per_area = df.groupby('Property_Area')['LoanAmount'].mean()
print("\nAverage Loan Amount for each Property Area:\n")
print(average_loan_amount_per_area)


In [None]:
# Define a threshold for high income
high_income_threshold = 5000

# Filter DataFrame to include only applicants with ApplicantIncome greater than the threshold
high_income_applicants = df[df['ApplicantIncome'] > high_income_threshold]
print("\nApplicants with high income:\n")
print(high_income_applicants)


In [None]:
# Filter DataFrame for self-employed applicants
self_employed_applicants = df[df['Self_Employed'] == 'Yes']

# Calculate the average income for self-employed applicants
average_self_employed_income = self_employed_applicants['ApplicantIncome'].mean()
print("\nAverage income of self-employed applicants:\n")
print(average_self_employed_income)



In [None]:
# Filter for loans that were approved
approved_loans = df[df['Loan_Status'] == 'Y']

# Group by 'Education' and count the number of approved loans
approved_loans_per_education = approved_loans.groupby('Education')['Loan_ID'].count()
print("\nNumber of loans approved for each education level:\n")
print(approved_loans_per_education)


### What all manipulations have you done and insights you found?

Answer Here.
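
The approved-loan counts above are easier to interpret as rates, because groups differ in size. A minimal sketch, assuming a hand-made `sample` frame in place of the real data:

```python
import pandas as pd

# Hypothetical sample; replace with the real DataFrame
sample = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Not Graduate",
                  "Graduate", "Not Graduate", "Graduate"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Number of applications and share of approvals per education level
approval = (
    sample.assign(approved=sample["Loan_Status"].eq("Y"))
    .groupby("Education")["approved"]
    .agg(applications="count", approval_rate="mean")
)
print(approval)
# Graduate: 4 applications, approval rate 0.75
# Not Graduate: 2 applications, approval rate 0.50
```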

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Plot 1: Histogram of Applicant Income
plt.figure(figsize=(8,6))
plt.hist(df['ApplicantIncome'], bins=10, color='skyblue', edgecolor='black')
plt.title('Distribution of Applicant Income')
plt.xlabel('Applicant Income')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Plot 2: Scatterplot of Applicant Income vs Loan Amount
plt.figure(figsize=(8,6))
plt.scatter(df['ApplicantIncome'], df['LoanAmount'], color='green')
plt.title('Applicant Income vs Loan Amount')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plot 3: Pie chart of Loan Status
plt.figure(figsize=(8,6))
loan_status_counts = df['Loan_Status'].value_counts()
plt.pie(loan_status_counts, labels=loan_status_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Loan Status')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Plot 4: Pair plot of numerical features
# sns.pairplot creates its own figure, so the size is set with `height` instead of plt.figure
sns.pairplot(df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']], height=2)
plt.suptitle('Pair Plot of Numerical Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Plot 5: Heatmap of correlations
# plt.figure(figsize=(10,8))
correlation_matrix = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Plot 6: Bar plot of Loan Status by Education
loan_status_by_education = df.groupby('Education')['Loan_Status'].value_counts().unstack()
# DataFrame.plot creates its own figure, so figsize is passed here instead of calling plt.figure first
loan_status_by_education.plot(kind='bar', stacked=True, color=['orange', 'skyblue'], figsize=(8, 6))
plt.title('Loan Status by Education')
plt.xlabel('Education')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Plot 7: Box plot of Applicant Income by Property Area
plt.figure(figsize=(8,6))
sns.boxplot(x='Property_Area', y='ApplicantIncome', data=df, palette='Set2')
plt.title('Applicant Income by Property Area')
plt.xlabel('Property Area')
plt.ylabel('Applicant Income')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Plot 8: Line graph of Applicant Income
plt.figure(figsize=(8,6))
plt.plot(df['ApplicantIncome'], marker='o', linestyle='-', color='purple')
plt.title('Applicant Income over Index')
plt.xlabel('Index')
plt.ylabel('Applicant Income')
plt.grid(True)
plt.show()

#  Define a list of colors
# colors = ['purple', 'blue', 'green', 'orange', 'red']

# plt.figure(figsize=(8,6))
# for i, income in enumerate(df['ApplicantIncome']):
#     plt.plot(i, income, marker='o', linestyle='-', color=colors[i % len(colors)], markersize=8, linewidth=2)

# # Connecting the points with lines
# plt.plot(df['ApplicantIncome'], linestyle='-', color='grey', alpha=0.5)

# plt.title('Applicant Income over Index')
# plt.xlabel('Index')
# plt.ylabel('Applicant Income')
# plt.grid(True)
# plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Plot 9: Histogram of Loan Amount
plt.figure(figsize=(8,6))
plt.hist(df['LoanAmount'], bins=10, color='coral', edgecolor='black')
plt.title('Distribution of Loan Amount')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Plot 10: Scatterplot of Coapplicant Income vs Loan Amount
plt.figure(figsize=(8,6))
plt.scatter(df['CoapplicantIncome'], df['LoanAmount'], color='blue')
plt.title('Coapplicant Income vs Loan Amount')
plt.xlabel('Coapplicant Income')
plt.ylabel('Loan Amount')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,6))
plt.fill_between(df.index, df['ApplicantIncome'], color='skyblue', alpha=0.5, label='Applicant Income')
plt.fill_between(df.index, df['CoapplicantIncome'], color='orange', alpha=0.5, label='Coapplicant Income')
plt.title('Area Graph of Applicant and Coapplicant Income')
plt.xlabel('Index')
plt.ylabel('Income')
plt.legend()
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(8,6))
property_area_counts = df['Property_Area'].value_counts()
plt.pie(property_area_counts, labels=property_area_counts.index, autopct='%1.1f%%', startangle=140, colors=['lightblue', 'lightgreen', 'lightcoral'])
plt.title('Distribution of Property Area')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10,6))
# `fill` replaces the deprecated `shade` argument of sns.kdeplot
sns.kdeplot(df['LoanAmount'], fill=True, color='r')
plt.title('Density Plot of Loan Amount')
plt.xlabel('Loan Amount')
plt.ylabel('Density')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Correlations are computed over the numeric columns only
plt.figure(figsize=(12,8))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='viridis')
plt.title('Correlation Heatmap')
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Convert categorical columns to numeric for better pair plotting
df_encoded = df.copy()
df_encoded['Married'] = df_encoded['Married'].map({'Yes': 1, 'No': 0})
df_encoded['Education'] = df_encoded['Education'].map({'Graduate': 1, 'Not Graduate': 0})
df_encoded['Self_Employed'] = df_encoded['Self_Employed'].map({'Yes': 1, 'No': 0})
df_encoded['Property_Area'] = df_encoded['Property_Area'].map({'Urban': 1, 'Semiurban': 2, 'Rural': 0})
df_encoded['Loan_Status'] = df_encoded['Loan_Status'].map({'Y': 1, 'N': 0})

sns.pairplot(df_encoded, hue='Loan_Status', vars=['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'])
plt.suptitle('Pair Plot with Loan Status Hue', y=1.02)
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
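
For a statement that relates two categorical variables, for example "loan status is independent of credit history", a chi-square test of independence is one common choice. The sketch below uses made-up counts, not the project data, and the 0.05 significance level is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts: 40 applicants with credit history, 20 without
sample = pd.DataFrame({
    "Credit_History": [1.0] * 40 + [0.0] * 20,
    "Loan_Status": ["Y"] * 32 + ["N"] * 8 + ["Y"] * 4 + ["N"] * 16,
})

# Contingency table of observed counts
contingency = pd.crosstab(sample["Credit_History"], sample["Loan_Status"])

# H0: Loan_Status is independent of Credit_History
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.5f}, reject H0: {reject_null}")
```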

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
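
For a statement that compares a numeric variable between two groups, for example "applicant income differs between approved and rejected loans", a Mann-Whitney U test is a reasonable option because income is usually skewed. The income lists below are made up for illustration.

```python
from scipy.stats import mannwhitneyu

# Hypothetical incomes for the two Loan_Status groups
approved_income = [5849, 3000, 2583, 6000, 3200, 3000, 4100, 5200]
rejected_income = [4583, 4500, 2567, 3900, 3300, 4800]

# H0: both groups come from the same income distribution
statistic, p_value = mannwhitneyu(
    approved_income, rejected_income, alternative="two-sided"
)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"U={statistic}, p-value={p_value:.3f}, reject H0: {reject_null}")
```

If the groups look roughly normal, `scipy.stats.ttest_ind(..., equal_var=False)` (Welch's t-test) is an alternative.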

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
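
For a statement about two numeric variables, for example "loan amount increases with applicant income", a rank correlation test is one option that does not assume a linear relationship. The values below are made up for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical paired observations
applicant_income = [2583, 3000, 3200, 4583, 5849, 6000]
loan_amount = [95, 66, 110, 128, 141, 200]

# H0: there is no monotonic association between the two variables
rho, p_value = spearmanr(applicant_income, loan_amount)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"Spearman rho={rho:.3f}, p-value={p_value:.4f}, reject H0: {reject_null}")
```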

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
null_values = df.isnull().sum().reset_index()
null_values

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.
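
The notebook fills `LoanAmount` with the column mean. A possible alternative, sketched here on a hand-made `sample` frame, is the median for skewed numeric columns and the most frequent value for categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one extreme loan amount and one missing category
sample = pd.DataFrame({
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 700.0],
    "Self_Employed": ["No", None, "Yes", "No", "No"],
})

imputed = sample.copy()

# Median is less sensitive to the extreme value (700) than the mean
imputed["LoanAmount"] = imputed["LoanAmount"].fillna(imputed["LoanAmount"].median())

# Most frequent category for a categorical column
imputed["Self_Employed"] = imputed["Self_Employed"].fillna(
    imputed["Self_Employed"].mode()[0]
)

print(imputed)
# The missing LoanAmount becomes 124.0 (the mean would give 253.5)
```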

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments


# Define numerical columns
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

# Visualize outliers using box plots
plt.figure(figsize=(12, 8))
df[numerical_cols].boxplot(vert=False)
plt.title('Box Plot of Numerical Columns')
plt.xlabel('Value')
plt.show()

# Function to remove outliers using the IQR method
# Note: rows with a missing value in any of `cols` are also dropped, because comparisons with NaN are False
def remove_outliers_iqr(data, cols):
    Q1 = data[cols].quantile(0.25)
    Q3 = data[cols].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_no_outliers = data.copy()
    for col in cols:
        df_no_outliers = df_no_outliers[(df_no_outliers[col] >= lower_bound[col]) & (df_no_outliers[col] <= upper_bound[col])]
    return df_no_outliers

# Remove outliers
df_no_outliers = remove_outliers_iqr(df, numerical_cols)

# Visualize box plot after outlier removal
plt.figure(figsize=(12, 8))
df_no_outliers[numerical_cols].boxplot(vert=False)
plt.title('Box Plot of Numerical Columns (Outliers Removed)')
plt.xlabel('Value')
plt.show()



##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
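
Dropping rows is not the only treatment. A possible alternative is to cap values at the IQR fences, which keeps every row. The helper name `clip_outliers_iqr` and the income values below are assumptions for illustration.

```python
import pandas as pd

def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] instead of removing them."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical incomes with one extreme value
incomes = pd.Series(
    [2500.0, 3000.0, 3200.0, 4500.0, 5800.0, 81000.0], name="ApplicantIncome"
)

clipped = clip_outliers_iqr(incomes)
print(clipped)
# 81000 is pulled down to the upper fence (9112.5 for this sample)
```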

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Define categorical columns to be encoded
categorical_cols = ['Loan_Status']

# Perform Label Encoding for each categorical column
for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col])

# Print the encoded DataFrame
print(df)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
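
Label encoding suits a binary target such as `Loan_Status`. For nominal features with more than two categories, such as `Property_Area`, one-hot encoding avoids implying an order. A minimal sketch on a hand-made `sample` frame:

```python
import pandas as pd

# Hypothetical sample
sample = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# One-hot encode the nominal feature: one 0/1 column per category
features = pd.get_dummies(sample[["Property_Area"]], columns=["Property_Area"], dtype=int)

# Map the binary target explicitly so the meaning of 0 and 1 is documented
target = sample["Loan_Status"].map({"N": 0, "Y": 1})

print(features)
print(target.tolist())
```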

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.
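
As an illustration of feature manipulation, the income and loan columns can be combined into ratios. The derived names `TotalIncome`, `LoanPerTerm`, and `LoanToIncome` are hypothetical, and whether they help should be checked against validation scores.

```python
import pandas as pd

# Hypothetical sample using the dataset's column names
sample = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "LoanAmount": [128.0, 128.0, 66.0],
    "Loan_Amount_Term": [360.0, 360.0, 360.0],
})

engineered = sample.assign(
    # Combined household income
    TotalIncome=lambda d: d["ApplicantIncome"] + d["CoapplicantIncome"],
    # Loan amount spread over the loan term
    LoanPerTerm=lambda d: d["LoanAmount"] / d["Loan_Amount_Term"],
)
# Loan size relative to income
engineered["LoanToIncome"] = engineered["LoanAmount"] / engineered["TotalIncome"]

print(engineered[["TotalIncome", "LoanPerTerm", "LoanToIncome"]])
```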

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?
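
Income-like columns are often right-skewed, and a log transform is one common remedy. The sketch below uses randomly generated lognormal values as a stand-in for `ApplicantIncome`; the distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=8.3, sigma=0.6, size=500), name="ApplicantIncome")

# log1p computes log(1 + x), which also handles zero values safely
log_income = np.log1p(income)

raw_skew = income.skew()
log_skew = log_income.skew()
print(f"Skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```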

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?
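
A minimal sketch of standardization, assuming randomly generated features. The scaler is fitted on the training split only and then applied to the test split, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features and binary labels
rng = np.random.default_rng(0)
X = rng.normal(loc=5000, scale=2000, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```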

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
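
A stratified split keeps the class proportions the same in both splits. The sketch below assumes a made-up 70/30 class distribution and an 80/20 split; `row_id` is a placeholder feature.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 approved (1) and 30 rejected (0)
labels = pd.Series([1] * 70 + [0] * 30, name="Loan_Status")
features = pd.DataFrame({"row_id": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Train positive share: {y_train.mean():.2f}")
print(f"Test positive share:  {y_test.mean():.2f}")
```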

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.
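
Besides oversampling with SMOTE, class weights are a possible way to handle imbalance without creating synthetic rows. The sketch below assumes a made-up 70/30 label distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 70 majority (1) and 30 minority (0)
y = np.array([1] * 70 + [0] * 30)

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# The same weighting can be requested directly from the estimator
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
```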

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report

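# Creating an extended DataFrame to ensure a minimum number of samples per class
# Note: this hand-typed 9-row sample replaces the `df` loaded from Merged-file.csv,
# so the metrics printed below describe these 9 rows only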
data_extended = {
    'Loan_ID': ['LP001002', 'LP001003', 'LP001005', 'LP001006', 'LP001008', 'LP001009', 'LP001010', 'LP001011', 'LP001012'],
    'Gender': ['Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Married': ['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
    'Dependents': [0, 1, 0, 0, 0, 0, 0, 1, 0],
    'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate', 'Graduate'],
    'Self_Employed': ['No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No'],
    'ApplicantIncome': [5849, 4583, 3000, 2583, 6000, 4500, 3200, 2567, 3000],
    'CoapplicantIncome': [0.0, 1508.0, 0.0, 2358.0, 0.0, 0.0, 0.0, 1000.0, 2000.0],
    'LoanAmount': [np.nan, 128.0, 66.0, 120.0, 141.0, 200.0, 110.0, 95.0, 120.0],
    'Loan_Amount_Term': [360.0, 360.0, 360.0, 360.0, 360.0, 360.0, 360.0, 360.0, 360.0],
    'Credit_History': [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban', 'Rural', 'Semiurban', 'Urban', 'Rural'],
    'Loan_Status': ['Y', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y']
}

df = pd.DataFrame(data_extended)

# Handle missing values by imputing with mean (assign back instead of inplace=True on a column selection)
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())

# Encode categorical variables
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
df['Married'] = le.fit_transform(df['Married'])
df['Education'] = le.fit_transform(df['Education'])
df['Self_Employed'] = le.fit_transform(df['Self_Employed'])
df['Property_Area'] = le.fit_transform(df['Property_Area'])
df['Loan_Status'] = le.fit_transform(df['Loan_Status'])

# Define features and target variable
X = df.drop(columns=['Loan_ID', 'Loan_Status'])
y = df['Loan_Status']

# Split the data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features; fit the scaler on the training split only to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logistic_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Logistic Regression:", accuracy)

# Print classification report with zero_division parameter
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, zero_division=1))





#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
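
A minimal sketch of grid search with stratified cross-validation, assuming synthetic data from `make_classification` and an illustrative grid over the regularization strength `C`. The scaler sits inside the pipeline so that it is refitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced binary classification data
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.3, 0.7], random_state=42,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```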

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE



# Encode categorical variables
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
df['Married'] = le.fit_transform(df['Married'])
df['Education'] = le.fit_transform(df['Education'])
df['Self_Employed'] = le.fit_transform(df['Self_Employed'])
df['Property_Area'] = le.fit_transform(df['Property_Area'])
df['Loan_Status'] = le.fit_transform(df['Loan_Status'])

# Define features and target variable
X = df.drop(columns=['Loan_ID', 'Loan_Status'])
y = df['Loan_Status']

# Split the data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features; fit the scaler on the training split only to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Apply SMOTE to balance the classes with a reduced number of neighbors
smote = SMOTE(random_state=42, k_neighbors=1)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Create and train the Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train_smote, y_train_smote)

# Make predictions on the test set
y_pred = decision_tree_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Decision Tree:", accuracy)

# Print classification report
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report



# Encode categorical variables
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
df['Married'] = le.fit_transform(df['Married'])
df['Education'] = le.fit_transform(df['Education'])
df['Self_Employed'] = le.fit_transform(df['Self_Employed'])
df['Property_Area'] = le.fit_transform(df['Property_Area'])
df['Loan_Status'] = le.fit_transform(df['Loan_Status'])

# Define features and target variable
X = df.drop(columns=['Loan_ID', 'Loan_Status'])
y = df['Loan_Status']

# Split the data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features; fit the scaler on the training split only to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("SVM Model Accuracy:", accuracy)

# Print classification report with zero_division parameter
print("\nSVM Model Classification Report:\n")
print(classification_report(y_test, y_pred, zero_division=1))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
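
Accuracy alone can hide costly mistakes, so it helps to read precision and recall from the confusion matrix. The labels below are made up; class 1 stands for the positive class, as in the encoded `Loan_Status`.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

A false positive and a false negative rarely cost the business the same amount, so the metric to optimize should follow from which error is more expensive.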

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
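
Permutation importance is one model-agnostic explainability tool available in scikit-learn. The sketch below assumes synthetic data and placeholder feature names (`feature_0` to `feature_3`); with the real model, the fitted estimator and the held-out split would be passed instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with placeholder feature names
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in the held-out score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

ranking = pd.Series(result.importances_mean, index=feature_names).sort_values(ascending=False)
print(ranking)
```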

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.
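
A minimal sketch of saving and reloading a model with `joblib`, assuming a logistic regression fitted on synthetic data and a hypothetical file name `loan_model.joblib` in a temporary directory. The final check confirms that the restored model gives the same predictions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical fitted model standing in for the best-performing one
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "loan_model.joblib")
    joblib.dump(model, model_path)      # save the fitted model
    restored = joblib.load(model_path)  # load it back

# Sanity check: the restored model predicts exactly like the original
predictions_match = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", predictions_match)
```

If the model depends on a scaler or encoder, saving the whole preprocessing pipeline rather than the bare estimator keeps the two in sync at prediction time.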


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***