# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Team
##### **Team Member 1 -**  2210992457
##### **Team Member 2 -**  2210992467
##### **Team Member 3 -**  2210992272
##### **Team Member 4 -**  2210992499

# **Project Summary -**

The project aims to develop a robust predictive model to assess the risk of loan default for borrowers by leveraging various features, including credit score, income, employment history, and more. This end-to-end project involves crucial steps such as data preprocessing, feature engineering, and training a binary classification model using advanced algorithms like logistic regression, decision trees, or gradient boosting.

**Introduction:**
In the modern financial landscape, effective risk assessment is essential for lending institutions to make informed decisions about loan approvals and minimize the risk of default. By leveraging machine learning techniques, this project seeks to automate and enhance the loan approval process, providing a more accurate and efficient means of evaluating borrower risk.

**Data Preprocessing:**
The first step involves thorough data preprocessing to ensure the quality and reliability of the dataset. This includes handling missing values, addressing outliers, and encoding categorical variables. Robust preprocessing is crucial to build a reliable and accurate predictive model.
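
As a minimal sketch of two of these steps, the snippet below clips outliers to the 1.5 × IQR fences and one-hot encodes a categorical column. The `toy` DataFrame and its values are made up for illustration; only the column names mirror the loan dataset, and the IQR rule is one common choice rather than the method this project necessarily uses.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the loan dataset, values are invented.
toy = pd.DataFrame({
    "Income": [30_000, 42_000, 51_000, 58_000, 1_000_000],
    "EmploymentType": ["Full-time", "Part-time", "Full-time", "Unemployed", "Part-time"],
})

# Clip numeric outliers to the 1.5 * IQR fences (an assumed, commonly used rule).
q1, q3 = toy["Income"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["Income"] = toy["Income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# One-hot encode the categorical column so models can consume it.
encoded = pd.get_dummies(toy, columns=["EmploymentType"])
print(encoded)
```

Clipping keeps every row while limiting the influence of extreme values; dropping the rows instead is an alternative when the extremes are known to be data errors.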

**Feature Engineering:**
Feature engineering plays a pivotal role in extracting valuable information from the available dataset. In this project, features like credit score, income, employment history, and other relevant attributes will be carefully crafted to enhance the predictive power of the model. This may involve creating new features, transforming existing ones, and selecting the most influential variables for the model.

**Exploratory Data Analysis (EDA):**
Before diving into model development, it's essential to gain insights into the data through exploratory data analysis. This involves creating visualizations to understand the distribution of key features, identifying patterns, and uncovering potential correlations between variables. EDA informs subsequent decisions in feature selection and model building.

**Model Selection:**
For binary classification tasks like predicting loan default risk, several algorithms can be considered. Logistic regression and decision trees are popular for their interpretability, while gradient boosting often offers stronger predictive performance at the cost of some transparency. The selection of the most suitable algorithm depends on the dataset's characteristics and the desired balance between interpretability and predictive power.

**Model Training and Evaluation:**
The chosen algorithm will be trained on a subset of the dataset, and its performance will be evaluated using relevant metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC). The model will undergo iterative refinement to enhance its predictive accuracy.
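
To make the metrics concrete, here is a small sketch that computes them with scikit-learn on hand-written labels and scores. The values in `y_true` and `y_score` are hypothetical, and the 0.5 decision threshold is an assumption chosen for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.35, 0.80, 0.90, 0.65]

# Turn probabilities into class labels with an assumed 0.5 threshold.
y_pred = [int(score >= 0.5) for score in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # share of correct predictions
    "precision": precision_score(y_true, y_pred),  # predicted defaults that really defaulted
    "recall": recall_score(y_true, y_pred),        # actual defaults that were caught
    "auc_roc": roc_auc_score(y_true, y_score),     # ranking quality, threshold-free
}
print(metrics)
```

Precision and recall depend on the chosen threshold, whereas AUC-ROC is computed from the raw scores, which is why the two can tell different stories on the same model.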

**Hyperparameter Tuning:**
To optimize model performance, hyperparameter tuning will be conducted. This involves fine-tuning the parameters of the chosen algorithm to achieve the best possible predictive outcomes. Techniques such as grid search or randomized search will be employed to explore the hyperparameter space.
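
A grid search could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the loan dataset, and the grid over the regularization strength `C` is an illustrative assumption, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data (assumption: real features would replace X, y).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Illustrative grid over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # rank candidates by AUC-ROC
    cv=5,               # 5-fold cross-validation per candidate
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` has a very similar interface and is usually the cheaper option when the grid grows large.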

**Model Validation:**
The model's robustness and generalizability will be assessed through rigorous validation techniques, such as k-fold cross-validation. This ensures that the model's performance is consistent across different subsets of the data and guards against overfitting.
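
The sketch below illustrates how k-fold cross-validation can expose overfitting by comparing training and validation scores across folds. It uses synthetic data and an unconstrained decision tree purely as an example of a model that memorizes its training set; neither is taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (assumption: stands in for the loan dataset).
X, y = make_classification(n_samples=400, n_features=8, flip_y=0.1, random_state=0)

# Stratified folds keep the default / non-default ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    DecisionTreeClassifier(random_state=0),  # no depth limit, so it can overfit
    X, y,
    cv=cv,
    scoring="roc_auc",
    return_train_score=True,
)
train_auc = scores["train_score"].mean()
valid_auc = scores["test_score"].mean()
print(f"train AUC: {train_auc:.3f}, validation AUC: {valid_auc:.3f}")
```

A large gap between the two averages is the signal to simplify the model or regularize it more strongly.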

**Interpretability and Explainability:**
Understanding the decisions made by the model is crucial for building trust in its predictions. Interpretability and explainability techniques will be applied to make the model's output more transparent, providing stakeholders with insights into how different features contribute to the risk assessment.
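
One simple way to illustrate feature contributions is to read logistic regression coefficients as odds ratios. The sketch below does this on simulated data; the data-generating rule (lower credit score and higher DTI raise default odds) is a hypothetical assumption made only so the example has a known answer, and other techniques such as permutation importance or SHAP could be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Simulated features: names mirror the loan dataset, values are invented.
X = pd.DataFrame({
    "CreditScore": rng.normal(600, 80, n),
    "DTIRatio": rng.uniform(0.1, 0.9, n),
})

# Hypothetical rule used to simulate defaults.
logit = -0.02 * (X["CreditScore"] - 600) + 3.0 * (X["DTIRatio"] - 0.5)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so each coefficient refers to a one-standard-deviation change.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# exp(coefficient) = multiplicative change in default odds per standard deviation.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios)
```

An odds ratio below 1 means the feature lowers the odds of default as it increases; a value above 1 means it raises them.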

**Deployment and Monitoring:**
Upon achieving satisfactory performance, the predictive model will be deployed for real-world use. Continuous monitoring will be implemented to assess the model's ongoing accuracy and effectiveness, allowing for adjustments if the underlying data distribution or patterns change over time.

**Conclusion:**
This project encapsulates the end-to-end process of developing a predictive model for loan default risk assessment. From data preprocessing and feature engineering to model training and deployment, each step contributes to building a robust and reliable tool for lending institutions. The ultimate goal is to empower these institutions with a predictive model that enhances decision-making, streamlines loan approval processes, and minimizes the risk of default in their portfolios.

# **GitHub Link -**

https://github.com/Vansh6224/AIML_project

# **Problem Statement**


Develop a model to predict the risk of loan default for borrowers based on features such as credit score, income, employment history, etc. This project involves data preprocessing, feature engineering, and training a binary classification model using algorithms like logistic regression, decision trees, or gradient boosting.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/Loan_default.csv")
df.sample(10)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
du=df.duplicated().value_counts()
du

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
m=df.isnull().sum()
m

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=True, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

### What did you know about your dataset?

The dataset describes loan applicants and whether they defaulted. It contains 18 columns: 'LoanID', 'Age', 'Income', 'LoanAmount', 'CreditScore',
       'MonthsEmployed', 'NumCreditLines', 'InterestRate', 'LoanTerm',
       'DTIRatio', 'Education', 'EmploymentType', 'MaritalStatus',
       'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner',
       'Default'. The 'Default' column is the target variable.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for {column}:\n{unique_values}\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Data Wrangling Code
# Derive a new feature: loan amount relative to the borrower's income
df['LoanToIncomeRatio'] = df['LoanAmount'] / df['Income']
print("Updated dataset with new column:")
print(df.head())

### What all manipulations have you done and insights you found?

**Loading the Dataset:**

The initial step involves loading the dataset into a Pandas DataFrame to facilitate analysis.
No manipulations are performed at this stage.

**Checking for Missing Values:**

Identify and quantify missing values in each column.
Impute missing values using statistical measures (mean, median) or drop rows/columns with missing values based on the nature and extent of missingness.

**Checking for Duplicate Rows:**

Identify and quantify duplicate rows in the dataset.
Drop duplicate rows to avoid redundancy and ensure uniqueness.

**Creating a Derived Feature:**

A `LoanToIncomeRatio` column is added by dividing `LoanAmount` by `Income`, expressing the size of each loan relative to the borrower's income.
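
The missing-value and duplicate checks described above can be sketched as follows. The `toy` DataFrame is hypothetical, and median/mode imputation is one reasonable choice assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one duplicated row and a few missing values.
toy = pd.DataFrame({
    "LoanID": ["A1", "A2", "A2", "A3", "A4"],
    "Income": [40_000, np.nan, np.nan, 55_000, 70_000],
    "Education": ["Bachelor's", "High School", "High School", None, "High School"],
})

# Quantify missing values per column and fully duplicated rows.
missing_counts = toy.isnull().sum()
duplicate_count = int(toy.duplicated().sum())

# Drop duplicates first so they do not bias the imputation statistics.
cleaned = toy.drop_duplicates().copy()

# Impute numeric gaps with the median and categorical gaps with the mode.
cleaned["Income"] = cleaned["Income"].fillna(cleaned["Income"].median())
cleaned["Education"] = cleaned["Education"].fillna(cleaned["Education"].mode()[0])
print(missing_counts, duplicate_count, cleaned, sep="\n\n")
```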

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 age distribution
sns.histplot(df['Age'], bins=20, kde=True)
plt.title('Age Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay was chosen because it shows how individuals are concentrated across age groups. For example, it reveals whether the dataset is skewed towards younger or older individuals, or whether ages are spread relatively evenly across the range.

##### 2. What is/are the insight(s) found from the chart?

The "Age Distribution" histogram is a powerful tool for understanding the age composition of your dataset and gaining insights into the age-related characteristics of the individuals represented in the data. It is a common exploratory data analysis (EDA) technique to uncover patterns and trends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the "Age Distribution" chart can potentially have a positive business impact, but whether they lead to negative growth depends on the specific context of your business and the industry. Here are some scenarios to consider:

**Positive Business Impact:**

**Targeted Marketing:** If there is a significant concentration of individuals in a specific age group, it allows for more targeted marketing efforts. Understanding the age distribution helps tailor products, services, and marketing messages to better appeal to the predominant demographic.

**Product Development**: Knowledge of the age distribution can guide product development. For instance, if there is a sizable population in a certain age range, the company might develop products that cater specifically to the needs and preferences of that age group.

**Potential Negative Impact:**

**Limited Demographic Reach:** If the age distribution is heavily skewed towards a narrow age range, the business might be overly reliant on a specific demographic. This could be a risk if market conditions change, or if the business wants to expand its reach to a broader audience.

**Failure to Adapt:** If the business fails to adapt its strategies and offerings based on the age distribution, it may miss out on opportunities. Ignoring the preferences and needs of certain age groups can result in declining customer satisfaction and market share.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.histplot(df['Income'], bins=20, kde=True)
plt.title('Income Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram for the "Income Distribution" because it provides a visual representation of how income levels are distributed in the dataset. A histogram is suitable for displaying the distribution of a continuous variable (in this case, income) and helps identify patterns, outliers, and the overall shape of the distribution.

##### 2. What is/are the insight(s) found from the chart?

By examining the "Income Distribution" chart, you can gain insights into the spread of income levels among the individuals in the dataset. Key points to observe include:

**Central Tendency:** Look for the central tendency or the most common income range.

**Outliers:** Identify any extreme income values that might be considered outliers.

**Skewness:** Assess if the distribution is skewed to one side.
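
These three checks can also be quantified numerically alongside the histogram. The sketch below uses a hypothetical `income` Series and flags high-end outliers with the 1.5 × IQR rule, which is an assumed convention rather than something this notebook prescribes.

```python
import pandas as pd

# Hypothetical income values with one very large entry.
income = pd.Series(
    [25_000, 32_000, 38_000, 41_000, 45_000, 52_000, 60_000, 250_000],
    name="Income",
)

# Central tendency: the median is robust to extreme values.
median_income = income.median()

# Skewness: positive values indicate a longer right tail.
skewness = income.skew()

# Outliers: values above the upper 1.5 * IQR fence.
q1, q3 = income.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = income[income > upper_fence]

print(median_income, round(skewness, 2), outliers.tolist())
```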

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the "Income Distribution" chart can have positive business implications:

**Targeted Marketing:** Businesses can use income insights to tailor their marketing strategies for different income brackets.

**Risk Assessment:** If the dataset is related to loans or financial services, understanding income distribution helps in assessing the risk associated with different income levels.

One potential negative impact could be if there is a significant concentration of individuals with very low income levels. In the context of a business, especially in the financial sector, a high proportion of individuals with low incomes might indicate a higher risk of defaults on loans, leading to negative growth or financial losses. This emphasizes the importance of understanding the income distribution for effective risk management and decision-making.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.histplot(df['LoanAmount'], bins=20, kde=True)
plt.title('LoanAmount Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram for the "LoanAmount Distribution" because it provides a visual representation of how loan amounts are distributed in the dataset. A histogram is suitable for displaying the distribution of a continuous variable (in this case, loan amounts) and helps identify patterns, ranges, and potential outliers.

##### 2. What is/are the insight(s) found from the chart?

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.histplot(df['CreditScore'], bins=20, kde=True)
plt.title('CreditScore Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram for the "CreditScore Distribution" because it provides a visual representation of how credit scores are distributed in the dataset. A histogram is suitable for displaying the distribution of a continuous variable (in this case, credit scores) and helps identify patterns, outliers, and the overall shape of the distribution.

##### 2. What is/are the insight(s) found from the chart?

**Central Tendency:** Look for the central tendency or the most common credit score range.

**Outliers:** Identify any extreme credit score values that might be considered outliers.

**Skewness:** Assess if the distribution is skewed to one side.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the "CreditScore Distribution" chart can have positive business implications:

**Risk Assessment:** Businesses, especially in finance, can use credit score insights to assess the risk associated with different credit score levels. Higher credit scores might indicate lower risk, influencing decisions on loan approvals and interest rates.

**Customer Segmentation**: Understanding the distribution helps in segmenting customers based on creditworthiness, allowing for targeted marketing strategies and personalized services.

One potential negative impact could be if there is a concentration of individuals with very low credit scores. In the context of a business, especially in the financial sector, a high proportion of individuals with low credit scores might indicate a higher risk of defaults on loans. This could lead to negative growth if not managed properly, as it might result in financial losses and increased default rates.

#### Chart - 5

In [None]:
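# Chart - 5 visualization code
# Mean credit score per age; reset_index() keeps 'Age' as a regular column for seaborn
age_credit_stats = df.groupby('Age')['CreditScore'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='Age', y='CreditScore', data=age_credit_stats)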
plt.title('Mean CreditScore by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Mean CreditScore')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

The specific chart chosen for displaying the mean credit score by age group is a bar plot. The choice of a bar plot is based on several considerations:

**Comparison Across Categories:** Bar plots are well-suited for comparing the values of a variable across different categories. In this case, you can easily compare the mean credit scores for each age group by looking at the heights of the bars.

**Readability:** Bar plots are easy to read and interpret. Each bar represents a category, and the height of the bar corresponds to the value of the variable being measured (mean credit score).

**Mean as a Summary Statistic:** The mean (average) is a commonly used summary statistic that provides a measure of central tendency. It gives a sense of the typical credit score within each age group.

##### 2. What is/are the insight(s) found from the chart?

The insight(s) obtained from the chart displaying the mean credit score by age group depend on the specific patterns or trends observed in the data. Here are some potential insights that can be derived:

**Age-Related Trends:** Look for any discernible trends or patterns related to age. For instance, is there a noticeable increase or decrease in mean credit scores as age increases or decreases?

**Targeted Strategies:** Identify age groups with lower mean credit scores that may require targeted financial education or credit-building programs. Conversely, recognize age groups with higher mean credit scores that might be potential candidates for premium financial services.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the mean credit score by age group can potentially help create a positive business impact, but it depends on the nature of the insights and the specific goals of the business. Here's an analysis:

**Positive Business Impact:**

**Targeted Marketing Strategies:**

**Insight:** If certain age groups show higher mean credit scores, the business can tailor marketing strategies to attract and serve customers in those age brackets.

**Positive Impact:** Targeted marketing can lead to increased engagement and conversion rates, potentially contributing to positive business growth.
Risk Assessment and Customized Products:

**Potential Negative Growth:**

**Concentration of Risk:**

**Insight:** If a business heavily relies on age groups with lower mean credit scores, there may be a concentration of risk, especially if those customers are more likely to default on loans.

**Negative Impact:** A concentration of risk can lead to financial losses and negative growth if default rates increase, impacting the business's overall financial health.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Calculate mean credit score for each education level;
# reset_index() keeps 'Education' as a regular column for seaborn
education_credit_stats = df.groupby('Education')['CreditScore'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='Education', y='CreditScore', data=education_credit_stats)
plt.title('Mean CreditScore by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Mean CreditScore')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot for displaying the mean credit score by education level for several reasons:

**Categorical Data:** The 'Education' column represents categorical data, and bar plots are well-suited for visualizing the distribution of a numerical variable (mean credit score) across different categories.

**Comparison Across Categories:** Bar plots allow for easy comparison of mean credit scores across different education levels. Each bar represents a specific education category, making it straightforward to identify variations.



##### 2. What is/are the insight(s) found from the chart?

The insights gained from the chart displaying the mean credit score by education level depend on the patterns or trends observed in the data. Here are potential insights that can be derived:

**Educational Impact on Credit Scores:**

**Insight:** Examining the mean credit scores across different education levels allows for understanding whether there is a correlation between education and creditworthiness.

**Interpretation:** Higher mean credit scores in certain education levels may suggest that individuals with higher education tend to have better credit scores on average.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:** If insights reveal that individuals with higher education levels have higher mean credit scores, the business can tailor marketing strategies and financial products to cater to this segment.

**Negative Impact:** Making decisions solely based on education-level mean credit scores may lead to stereotyping or discrimination.

#### Chart - 7

In [None]:
sns.countplot(x='Education', data=df)
plt.title('Education Level Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

The count plot is suitable for visualizing the distribution of categorical data, such as the education levels in this case. It helps in understanding the frequency or count of each category, providing a clear overview of the distribution.

##### 2. What is/are the insight(s) found from the chart?

The count plot reveals the distribution of education levels in the dataset. You can observe which education level has the highest count and compare the frequencies of different levels. For example, you might find that a significant portion of the dataset has a particular education level.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights could be valuable for various business decisions. For instance, if a particular education level dominates the dataset, it may influence marketing strategies, product development, or customer targeting. Understanding the education distribution can help tailor business strategies to better suit the prevalent demographic.

**Potential Negative Growth:**

Negative growth might occur if there is an imbalanced distribution, and the majority of the data falls into a category that is not favorable for the business goals. For instance, if the majority of customers have a lower education level, and the business aims to target a more educated demographic, it could be a challenge. The insights gained from the chart could signal areas where the business needs to focus efforts to attract the desired customer base.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
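# Count borrowers for each (Education, HasMortgage) combination
education_mortgage_counts = df.groupby(['Education', 'HasMortgage']).size().unstack()
# Pass figsize to DataFrame.plot; a separate plt.figure() call would leave an empty figure behind
education_mortgage_counts.plot(kind='bar', stacked=True, figsize=(16, 6))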
plt.title('Stacked Bar Chart: Education vs HasMortgage')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

The stacked bar chart is chosen because it effectively displays the distribution of the 'HasMortgage' variable within each category of the 'Education' variable. It allows for a clear comparison of the proportion of individuals with and without mortgages across different education levels.

##### 2. What is/are the insight(s) found from the chart?

The chart provides insights into how the presence of a mortgage is distributed among different education levels. It helps to identify patterns and trends, such as whether certain education levels have a higher likelihood of having a mortgage or if there are notable differences in mortgage distribution across education categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can influence business strategies. For example, if there's a correlation between higher education levels and a higher likelihood of having a mortgage, it might impact marketing efforts or product offerings. Understanding these relationships can help businesses tailor their approach to specific customer segments.

**Potential Negative Growth:**

Negative growth could occur if the business heavily relies on a particular education level for its target market, and this group has a lower likelihood of having a mortgage. It would be important to assess whether this aligns with the business goals and, if not, consider adjustments in strategies to attract the desired customer base.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
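# Count borrowers for each (MaritalStatus, LoanPurpose) combination
marital_loan_counts = df.groupby(['MaritalStatus', 'LoanPurpose']).size().unstack()
# Pass figsize to DataFrame.plot; a separate plt.figure() call would leave an empty figure behind
marital_loan_counts.plot(kind='bar', stacked=True, figsize=(12, 6))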
plt.title('Stacked Bar Chart: MaritalStatus vs LoanPurpose')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

The stacked bar chart is chosen because it effectively displays the distribution of loan purposes within each category of marital status. This type of chart helps to compare the composition of different loan purposes for each marital status category.

##### 2. What is/are the insight(s) found from the chart?

The chart provides insights into how the distribution of loan purposes varies across different marital status categories. You can observe the proportion of each loan purpose (stacked) within each marital status group. This could reveal patterns or trends in the types of loans preferred by individuals based on their marital status.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can have implications for business decisions. For example, if there are specific loan purposes that are more common for certain marital statuses, it could influence marketing strategies, product development, or customer targeting. Understanding these relationships can help tailor business approaches to better suit the preferences of different marital status groups.

**Potential Negative Growth:**

Negative growth might occur if the distribution of loan purposes is not aligned with the business objectives. For instance, if a significant market segment has a lower likelihood of taking certain types of loans, and those loans are a key part of the business strategy, it could impact growth. The business might need to reassess its loan offerings or marketing strategies based on these insights.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
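# Mean credit score for each (Education, EmploymentType) combination
education_employment_credit = df.groupby(['Education', 'EmploymentType'])['CreditScore'].mean().unstack()
# Pass figsize to DataFrame.plot; a separate plt.figure() call would leave an empty figure behind
education_employment_credit.plot(kind='bar', figsize=(12, 6))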
plt.title('Grouped Bar Chart: Education vs EmploymentType (Mean CreditScore)')
plt.xlabel('Education Level')
plt.ylabel('Mean CreditScore')
plt.show()


##### 1. Why did you pick the specific chart?

The grouped bar chart is selected as it effectively displays the mean credit scores for different education levels and employment types. This type of chart allows for a clear comparison of mean credit scores across multiple categories.

##### 2. What is/are the insight(s) found from the chart?

The chart provides insights into how mean credit scores vary among different education levels and employment types. You can observe and compare the average credit scores for each combination of education level and employment type, identifying any patterns or trends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can be valuable for business decisions. For example, if certain education levels or employment types are associated with higher or lower mean credit scores, this information can inform credit risk assessment, financial product offerings, or customer targeting strategies. Understanding these relationships can help tailor business approaches to better suit the credit profile of different customer segments.

**Potential Negative Growth:**

Negative growth could occur if the mean credit scores reveal patterns that are not aligned with the business objectives. For instance, if a significant market segment with lower mean credit scores is targeted, and the business relies heavily on creditworthiness, it could impact the business's financial performance. The business might need to consider adjusting its credit policies or developing targeted strategies for improving credit scores within specific segments.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Calculate the count of individuals for each loan purpose
loan_purpose_counts = df['LoanPurpose'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(loan_purpose_counts, labels=loan_purpose_counts.index, autopct='%1.1f%%')
plt.title('Pie Chart: Distribution of Individuals by Loan Purpose')
plt.show()


##### 1. Why did you pick the specific chart?

The pie chart is chosen to represent the proportion of individuals for each loan purpose. It provides a clear visual representation of the distribution of individuals across different categories, making it easy to observe the relative sizes of each segment.

##### 2. What is/are the insight(s) found from the chart?

The pie chart provides a quick overview of the distribution of individuals based on loan purposes. You can easily identify the most prevalent loan purposes and see the relative contribution of each category to the overall distribution.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can have implications for business decisions. For example, if there's a dominant loan purpose, it could influence marketing strategies, product development, or resource allocation. Understanding the distribution of loan purposes helps in tailoring business approaches to better meet the needs of the majority of customers.

**Potential Negative Growth:**
Negative growth might occur if the business heavily relies on a particular loan purpose that has a declining share in the market. If the distribution reveals a shrinking segment for a key loan purpose, the business might need to adapt its strategies to diversify its offerings or focus on other growing segments.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
education_income_mean = df.groupby('Education')['Income'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='Education', y='Income', data=education_income_mean)
plt.title('Bar Chart: Average Income by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Income')
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is a suitable choice for comparing the average income across different education levels. It allows for a clear comparison of quantitative values (average income) for each category (education level).

##### 2. What is/are the insight(s) found from the chart?

The chart provides insights into how average income levels vary among different education levels. You can observe and compare the average income for each education level, identifying any trends or disparities. This information can be valuable for understanding the relationship between education and income in your dataset.
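
A mean on its own can mislead when the groups differ in size or when income is skewed. The following is a minimal sketch of reporting group size and median next to the mean. It assumes the `Education` and `Income` columns used in the chart; the small `demo` frame and its category labels are made-up data that stand in for `df`.

```python
import pandas as pd

# Made-up stand-in for df; only the column names come from the notebook.
demo = pd.DataFrame({
    "Education": ["Bachelor's", "Bachelor's", "Master's", "Master's", "PhD"],
    "Income": [40000, 60000, 70000, 90000, 120000],
})

# Report group size and median alongside the mean for each education level.
income_summary = (
    demo.groupby("Education")["Income"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(income_summary)
```

A level with a very small `count` deserves caution before its bar is compared with the others.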

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can influence business decisions, especially in areas such as product pricing, targeted marketing, or customer segmentation. Understanding the correlation between education and income helps in tailoring business strategies to different customer segments based on their income levels.

**Potential Negative Growth:**

Negative growth might occur if there is a significant disparity in average income levels across education categories, and the business heavily relies on customers from a specific education level with lower incomes. In such a case, the business might need to adapt its strategies to address the financial preferences and constraints of its target audience.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.countplot(x='Default', data=df)
plt.title('Default vs Non-Default')
plt.show()

##### 1. Why did you pick the specific chart?

The count plot is a suitable choice for comparing the frequency or count of different categories in a categorical variable. It allows for a straightforward representation of the number of instances for each category.

##### 2. What is/are the insight(s) found from the chart?

The chart provides insights into the distribution of defaults and non-defaults in your dataset. You can observe and compare the count of instances for each category, giving you a quick overview of the balance or imbalance between default and non-default cases.
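
The balance visible in the count plot can also be expressed as numbers. This is a minimal sketch, assuming `Default` is coded as 1 for default and 0 for non-default; the series below is made-up data standing in for `df['Default']`.

```python
import pandas as pd

# Made-up stand-in for df["Default"]; the 0/1 coding is an assumption.
default_flags = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Default")

# Share of each class, and how many majority cases exist per minority case.
class_share = default_flags.value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

A ratio far above 1 suggests that accuracy alone would be a weak evaluation metric for the later models.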

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can have significant implications for business decisions, particularly in areas related to risk assessment, financial planning, and strategies for managing default risk. Understanding the distribution of defaults is crucial for businesses in industries such as banking, lending, or finance, where managing and mitigating default risks is a key concern.

**Potential Negative Growth:**

Negative growth is a risk if defaults make up a large share of the loans, because losses on defaulted loans offset the interest earned from borrowers who repay. In such cases, the business might need to reassess its risk management strategies, adjust lending criteria, or implement measures to reduce the occurrence of defaults.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numerical_attributes = ['Age', 'Income', 'LoanAmount', 'CreditScore', 'DTIRatio']
correlation_df = df[numerical_attributes]
correlation_matrix = correlation_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of Numerical Attributes')
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap is a suitable choice for visualizing the correlation matrix, especially when dealing with numerical attributes. It provides an easy-to-read color-coded representation of the strength and direction of the relationships between pairs of variables.

##### 2. What is/are the insight(s) found from the chart?

The heatmap provides insights into the correlation between different numerical attributes. The color intensity and the annotation values help in identifying which pairs of variables are positively or negatively correlated, and the strength of those correlations. For example, a high positive correlation between Income and LoanAmount may indicate that people with higher incomes tend to take larger loans.
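
Reading the strongest pairs off a heatmap by eye gets harder as the number of attributes grows. The sketch below ranks attribute pairs by absolute correlation. The data is synthetic, and `LoanAmount` is deliberately constructed to correlate with `Income`, so the output illustrates the technique only and is not a finding about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df[numerical_attributes]; values are random, not real results.
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 200)
demo = pd.DataFrame({
    "Income": income,
    "LoanAmount": income * 2 + rng.normal(0, 5_000, 200),  # built to correlate with Income
    "Age": rng.integers(18, 70, 200),
})

corr = demo.corr()

# Keep only the upper triangle so each pair appears once, then sort by |r|.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()
ranked_pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked_pairs)
```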

#### Chart - 15 - Pair Plot

In [None]:
simple_numerical_attributes = ['Age', 'Income', 'LoanAmount']

# Creating a pair plot
sns.pairplot(df[simple_numerical_attributes])
plt.suptitle('Simple Pairplot of Numerical Attributes', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

The pair plot with histograms is suitable for understanding the univariate distributions of each numerical attribute and, at the same time, providing a glimpse of potential bivariate relationships. It is particularly useful for exploring the spread and shape of the distributions.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows the individual distribution of each numerical attribute as a histogram on the diagonal and the bivariate relationships between attributes as scatter plots in the off-diagonal panels. You can identify patterns and see how the values are distributed for each variable, for instance the distributions of Age, Income, and LoanAmount, which are the three attributes plotted here.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
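
As a hedged illustration only: if the statement compared mean `Income` between defaulters and non-defaulters, Welch's two-sample t-test would be one reasonable choice. The sketch below uses synthetic data in place of `df`; the group means, sizes, and the significance level are invented, so the printed p-value says nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the Income values of non-defaulters and defaulters.
rng = np.random.default_rng(42)
income_non_default = rng.normal(loc=52_000, scale=10_000, size=300)
income_default = rng.normal(loc=45_000, scale=10_000, size=100)

# H0: both groups have the same mean income. H1: the means differ.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(income_non_default, income_default, equal_var=False)

alpha = 0.05  # assumed significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```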

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
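
As a hedged illustration only: if the statement concerned an association between two categorical columns, such as `Education` and `Default`, a chi-square test of independence would be one option. The counts below are invented and stand in for `df`; the category labels are assumptions.

```python
import pandas as pd
from scipy import stats

# Made-up stand-in for df[["Education", "Default"]].
demo = pd.DataFrame({
    "Education": ["High School"] * 100 + ["Master's"] * 100,
    "Default": [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90,
})

# H0: Education and Default are independent. H1: they are associated.
contingency = pd.crosstab(demo["Education"], demo["Default"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
```

The test is usually considered reliable only when the expected counts in `expected` are not too small, so it is worth printing that array as well.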

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
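
As a hedged illustration only: if the statement concerned a linear relationship between two numerical columns, such as `Income` and `LoanAmount`, a Pearson correlation test would be one option. The data below is synthetic and constructed to be correlated, so the result is not a finding about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for df["Income"] and df["LoanAmount"].
rng = np.random.default_rng(7)
income = rng.normal(50_000, 10_000, 250)
loan_amount = 1.5 * income + rng.normal(0, 20_000, 250)

# H0: the two variables are linearly uncorrelated. H1: the correlation is non-zero.
r, p_value = stats.pearsonr(income, loan_amount)
print(f"r = {r:.2f}, p = {p_value:.4g}")
```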

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
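
One common approach, shown here as a sketch rather than the project's chosen method, is to fill numerical gaps with the median and categorical gaps with the most frequent value. The `demo` frame is made-up data; whether the real `df` contains missing values at all should be checked first with `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df with one missing value per column.
demo = pd.DataFrame({
    "Income": [40000.0, np.nan, 60000.0, 80000.0],
    "Education": ["Bachelor's", "Master's", None, "Master's"],
})

print(demo.isna().sum())  # missing values per column

numeric_cols = demo.select_dtypes(include="number").columns
categorical_cols = demo.columns.difference(numeric_cols)

imputed = demo.copy()
# Median is less sensitive to extreme values than the mean.
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
# Most frequent category for non-numerical columns.
for col in categorical_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```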

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
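
A minimal sketch of one possible treatment, capping values outside the 1.5 × IQR fences. The series is made-up data standing in for a column such as `df['Income']`, and the 1.5 multiplier is a conventional choice, not something the notebook prescribes.

```python
import pandas as pd

# Made-up stand-in for a numerical column; 500 is the planted outlier.
income = pd.Series([30, 35, 40, 45, 50, 55, 60, 500], name="Income")

q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

n_outliers = ((income < lower) | (income > upper)).sum()
# Cap rather than drop, so no rows are lost.
capped = income.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}], outliers found: {n_outliers}")
print(capped.tolist())
```

Capping keeps every row, which matters when the outlying rows include defaults that the model needs to learn from.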

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
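
One-hot encoding is a common choice for nominal columns such as `LoanPurpose`. The sketch below uses `pd.get_dummies` on made-up data; the category labels are assumptions, and `drop_first=True` is an optional choice that avoids a redundant column for linear models.

```python
import pandas as pd

# Made-up stand-in for df; category labels are illustrative.
demo = pd.DataFrame({
    "LoanPurpose": ["Auto", "Home", "Education", "Auto"],
    "Income": [40000, 60000, 50000, 45000],
})

# One indicator column per category, minus the first (alphabetical) one.
encoded = pd.get_dummies(demo, columns=["LoanPurpose"], drop_first=True, dtype=int)
print(encoded)
```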

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
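
A minimal sketch of a stratified split, which keeps the share of defaults the same in both parts. The 80/20 ratio, the feature names, and the data are illustrative assumptions, not the project's actual choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and target standing in for the prepared dataset.
X = pd.DataFrame({"Income": np.arange(100), "CreditScore": np.arange(100) + 500})
y = pd.Series([1] * 20 + [0] * 80, name="Default")

# stratify=y preserves the default rate in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(f"Default rate: train = {y_train.mean():.2f}, test = {y_test.mean():.2f}")
```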

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
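
If the target turns out to be imbalanced, class weighting is one option that needs no resampling. The sketch below computes "balanced" weights with scikit-learn on a made-up target; resampling techniques are an alternative and are not shown here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up training target: 80 non-defaults, 20 defaults.
y_train = np.array([0] * 80 + [1] * 20)

classes = np.unique(y_train)
# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Many scikit-learn classifiers accept such a dictionary, or the string `"balanced"`, through their `class_weight` parameter.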

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
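
The project summary names logistic regression as a candidate algorithm, so a baseline could look like the sketch below. Everything here is an assumption for illustration: the data is synthetic, the pipeline steps and `class_weight="balanced"` are one possible configuration, and the printed scores do not describe the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a target driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling inside the pipeline is fitted on the training data only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Fit the algorithm
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC: {auc:.3f}")
```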

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
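
A minimal sketch of the save, load, and sanity-check round trip using `joblib`. The model, the data, and the file name `loan_default_model.joblib` are hypothetical; a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data and a throwaway model standing in for the best model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "loan_default_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)          # save the fitted model
    loaded_model = joblib.load(model_path)  # load it back

# Sanity check: the loaded model should reproduce the original predictions.
unseen = rng.normal(size=(5, 3))
same_predictions = np.array_equal(model.predict(unseen), loaded_model.predict(unseen))
print("Loaded model reproduces predictions:", same_predictions)
```

Saving the whole preprocessing pipeline together with the estimator, rather than the estimator alone, avoids having to repeat the preprocessing steps by hand at prediction time.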

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***