<a href="https://colab.research.google.com/github/tanvi-dataenthusiast/EDA-PROJECT-/blob/main/EDA_PROJECT_PAISABAZAAR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PROJECT NAME- PAISABAZAAR BANKING FRAUD ANALYSIS**

**Project Type - EDA**

**Contribution - Individual**

**Name         -** **Tanvi Rastogi**


# **Project Summary -**

# **BUSINESS PROBLEM OVERVIEW**

Paisabazaar, a leading financial services company, plays a key role in helping customers choose and apply for credit cards, loans, and other banking products. One of the most critical parts of their operations is assessing a customer’s creditworthiness. This is mainly done using the credit score, which serves as an important indicator of how likely a customer is to repay borrowed funds.

A wrong assessment of credit scores could either increase loan default risks or prevent eligible customers from accessing the right products. Hence, building a robust and accurate system for credit score classification becomes vital for Paisabazaar.

**WHAT ARE THE CHALLENGES FACED BY PAISABAZAAR?**

*   How to assess creditworthiness accurately
*   Which variables should drive the banking products recommended to customers
*   How to mitigate the risk of loan defaults
*   How to strengthen lending decisions for banks and NBFCs
*   How to deliver personalized financial advice and better product recommendations
*   How to calculate a reliable credit score
*   Which attributes to take into consideration when deciding creditworthiness

**NOW THE QUESTION IS HOW TO OVERCOME THESE CHALLENGES?**

*   Accumulating client data for analysis
*   Analysing all types of attributes related to the client's financial habits
*   Comparing variables with each other and gaining insights from the comparison
*   Ensuring every step toward the conclusion is data-driven
*   Applying statistical/behavioral methods to calculate and manage the risk profile
*   Grouping the outcomes into categories ranging from low to high risk
*   Visualizing the outcomes together and converting them into a predictive model
*   Lastly, using this predictive model to serve clients efficiently

# **GitHub Link -**

# **Problem Statement** **-**

"Can we accurately predict an individual's credit score using their income, spending, and repayment behavior, to help Paisabazaar reduce loan risks and deliver smarter, personalized financial solutions?"

#### **Define Your Business Objective?**

The primary business objective of this project is to develop a reliable credit score classification system that enables Paisabazaar to:



1.   **Analyze customer financial and behavioral data** (income, loans, repayment history, outstanding debt, credit utilization, etc.).

2.   **Identify key factors that influence credit scores** through exploratory data analysis and visualizations.

3.   **Build a predictive classification model** that accurately categorizes customers into credit score ranges (e.g., Good, Standard, Poor).

4.   **Minimize loan default risks** by enabling Paisabazaar to assess customer creditworthiness more precisely.

5.   **Support better lending decisions** for banks and financial institutions by providing reliable credit insights.

6.   **Enhance customer experience** by offering personalized financial advice and product recommendations based on predicted credit scores.

# **General Guidelines : -**

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px

### Dataset Loading

In [None]:
#METHOD 1-
#Loading dataset from local PC
import pandas as pd
from google.colab import files

# Upload file
uploaded = files.upload()

# Load into DataFrame (replace with your filename)
df = pd.read_csv("dataset_paisabazaar.csv")

# Check the first rows
df.head()


In [None]:
#METHOD 2-
#LOADING DATASET FROM GOOGLE DRIVE
from google.colab import drive
drive.mount('/content/drive')


In [None]:
#explore your Google Drive directory to find where your dataset is stored.
import os
os.listdir('/content/drive/MyDrive')         # top-level view
!ls -lh "/content/drive/MyDrive/datasets"    # shell listing (use quotes if spaces)


In [None]:
#Load Dataset into Pandas

path = '/content/drive/MyDrive/datasets/dataset_paisabazaar.csv'
print(os.path.exists(path))   # True if path is correct
df = pd.read_csv(path)
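
Both loading methods above depend on Colab-specific paths. As a minimal sketch (not part of the original workflow), the helper below tries a list of candidate locations and loads the first CSV that exists. The function name `load_first_existing` and the `CANDIDATE_PATHS` list are hypothetical and only illustrate the idea.

```python
from pathlib import Path

import pandas as pd


def load_first_existing(candidates):
    """Return a DataFrame read from the first candidate CSV path that exists."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return pd.read_csv(path)
    raise FileNotFoundError(f"None of these paths exist: {list(candidates)}")


# Hypothetical locations; adjust to wherever the CSV actually lives.
CANDIDATE_PATHS = [
    "dataset_paisabazaar.csv",
    "/content/drive/MyDrive/datasets/dataset_paisabazaar.csv",
]
# df = load_first_existing(CANDIDATE_PATHS)
```

Raising `FileNotFoundError` with the full list of attempted paths makes a wrong path easier to diagnose than a bare `read_csv` failure.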



**Dataset First View**

In [None]:
df.head() # RETURNS FIRST 5 ROWS OF DATAFRAME BY DEFAULT

In [None]:
# Display all columns in the output (disable column truncation in DataFrame display)
pd.set_option('display.max_columns', None) #None means all columns

In [None]:
# FULL COLUMNS ARE VISIBLE
df.head()

### Dataset Rows & Columns count

In [None]:
# EDA
df.shape #DEFINING HOW MANY ROWS AND COLUMNS ARE THERE. RETURNS TUPLE (ROWS,COLUMNS)



### Dataset Information

In [None]:
df.info() # GIVES YOU SUMMARY OF DATAFRAME USED

**Datatype info**

In [None]:
# GIVES DATA TYPES
df.dtypes

**Statistical information from dataset**

In [None]:
# Summarizes numeric, categorical and datetime columns (count, min, max, std, etc.), giving a quick statistical view of every column
df.describe(include= 'all')

**Finding Null values present or not**

In [None]:
df.isnull()

**Missing values/Null values count**

In [None]:
# isnull() marks each cell True if the value is missing (NaN, None, NaT) and False otherwise;
# sum() then counts the True values per column (True = 1, False = 0)
df.isnull().sum()

**DUPLICATE VALUES**

In [None]:
# df.duplicated() returns a boolean Series (one value per row):
# True if the row is a duplicate of an earlier row, False if it is the first occurrence.

df.duplicated().sum()
# The line above counts the duplicate rows in the DataFrame.
# The output shows there are no duplicate rows in the dataset.
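
The missing-value and duplicate checks above can be combined into one small report. The sketch below runs on a toy DataFrame rather than the real dataset; `quality_report` is a hypothetical helper name, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def quality_report(frame):
    """Summarise missing values per column and count fully duplicated rows."""
    missing = frame.isnull().sum()
    report = pd.DataFrame({
        "missing_count": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })
    return report, int(frame.duplicated().sum())


# Toy data: one missing Age and one repeated row
toy = pd.DataFrame({
    "Age": [25, 40, np.nan, 25],
    "Occupation": ["Lawyer", "Engineer", "Writer", "Lawyer"],
})
report, n_duplicates = quality_report(toy)
print(report)
print("duplicate rows:", n_duplicates)
```

Calling `quality_report(df)` on the loaded dataset would give the same two checks in one place.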

# What did you know about your dataset?


The given dataset is from the Paisabazaar banking fraud analysis project. We have to analyze it and develop a prediction model for credit scores based on customer data. Such a model can improve decision-making when recommending personalized financial products and mitigating risk, which contributes to better customer satisfaction.

The dataset consists of 100,000 customer records (rows) and 28 columns detailing various financial and personal attributes. There are no missing values and no duplicate rows in the dataset. Key features include:

* Demographic Information: Age, Occupation, SSN, Name
* Financial Attributes: Annual Income, Monthly In-hand Salary, Number of Bank Accounts, Credit Cards, Loans, Interest Rate
* Credit Information: Credit Utilization Ratio, Outstanding Debt, Credit History Age, Credit Mix
* Behavioral Metrics: Payment Behavior, Delayed Payments, Credit Inquiries, Payment of Minimum Amount
* Target Variable: Credit Score (Good, Standard, Poor)

# **2. Understanding Your Variables**

**Showing columns**

In [None]:
# IT SHOWS THE NAME OF ALL COLUMNS IN THE DATASET. WITH THIS WE CAN GET THE IDEA OF WHAT ALL INFO WE HAVE FOR ANALYSIS
df.columns

# Dataset Describe

In [None]:
df.describe()

**Variable Description**

1) ID: unique identifier for each row

2) Customer_ID: Unique ID for each customer

3) Month: Time/Month of record

4) Name: Full name of customer

5) Age: Age of the customer in years

6) SSN: Social security number

7) Occupation: Profession of customer

8) Annual_Income: Total yearly income

9) Monthly_Inhand_Salary: Net Salary received per month

10) Num_Bank_Accounts: Number of active bank accounts customer holds

11) Num_Credit_Card: Total number of credit cards customer holds

12) Interest_Rate: Average interest rate on running loans

13) Num_of_Loan: Total number of loans taken by the customer

14) Type_of_loan: Types of loans taken by the customer

15) Delay_from_due_date: Average number of days the customer delayed their EMI payments.

16) Num_of_Delayed_Payment: Count of delayed EMI payments.

17) Changed_Credit_Limit: How much the customer's credit limit has changed (in ₹ or %).

18) Num_Credit_Inquiries: Number of credit inquiries made recently.

19) Credit_Mix: Categorical rating of the customer's credit profile (e.g., Good, Bad, Standard).

20) Outstanding_Debt: Total outstanding loan or credit card balance (in ₹).

21) Credit_Utilization_Ratio: Percentage of available credit being used (important risk metric).

22) Credit_History_Age: Total duration of credit history in months.

23) Payment_of_Min_Amount: Whether the customer pays only the minimum amount due (Yes/No), similar to the revolving facility in credit card payments.

24) Total_EMI_per_month: Total EMI paid across all loans per month (in ₹).

25) Amount_invested_monthly: Monthly amount invested (in SIPs, mutual funds, PPFs, stocks etc) by the customer (in ₹).

26) Payment_Behaviour: Categorized behavior pattern (e.g. spending high value amount, low value amount).

27) Monthly_Balance: Remaining balance in customer account at present/on particular date (in ₹).

28) Credit_Score: Target variable. Category assigned to the customer based on their credit behaviour (Good, Standard, Poor).


**Check Unique Values for each variable.**

In [None]:
df.nunique()

## 3. ***Data Wrangling***

Data wrangling (also called data cleaning or data preprocessing) is the process of transforming raw, messy data into a clean and usable format before analysis.
It allows us to handle missing data, remove duplicates, correct data types, deal with outliers, and calculate new columns.
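
To make these steps concrete, here is a minimal sketch on a toy DataFrame that chains duplicate removal, missing-value handling, a dtype correction and a derived column. The values are made up, and `Monthly_Income` is a hypothetical column created only for this example.

```python
import pandas as pd

raw = pd.DataFrame({
    "Age": ["25", "40", "40", None],
    "Annual_Income": [30000.0, 52000.0, 52000.0, 41000.0],
})

clean = (
    raw.drop_duplicates()          # remove repeated rows
       .dropna(subset=["Age"])     # drop rows where Age is missing
       .astype({"Age": "int64"})   # correct the data type
       .assign(                    # derive a new column
           Monthly_Income=lambda d: (d["Annual_Income"] / 12).round(2)
       )
)
print(clean)
```

Each method returns a new DataFrame, which is why the whole chain is assigned to `clean` instead of relying on in-place changes.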

### Data Wrangling Code

In [None]:
df.info() #get information about data

In [None]:
# data wrangling
## Creating a copy of the current dataset and assigning to new_df
new_df=df.copy()

In [None]:
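# Drop identifier columns: they only identify rows/customers and cannot be related to other variables for analysis
drop_columns = ['ID', 'Customer_ID', 'Name', 'SSN']
# drop() returns a new DataFrame, so assign the result back to keep the change
new_df = new_df.drop(columns=drop_columns)
new_df.head()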

In [None]:
#converting data types
new_df['Age'] = new_df['Age'].astype('int64')
new_df['Num_Bank_Accounts'] = new_df['Num_Bank_Accounts'].astype('int64')
new_df['Num_Credit_Card'] = new_df['Num_Credit_Card'].astype('int64')
new_df['Num_of_Loan'] = new_df['Num_of_Loan'].astype('int64')
new_df['Num_of_Delayed_Payment'] = new_df['Num_of_Delayed_Payment'].astype('int64')
new_df['Credit_History_Age'] = new_df['Credit_History_Age'].astype('int64')
new_df['Delay_from_due_date'] = new_df['Delay_from_due_date'].astype('int64')
new_df['Num_Credit_Inquiries'] = new_df['Num_Credit_Inquiries'].astype('int64')
new_df['Payment_of_Min_Amount'] = new_df['Payment_of_Min_Amount'].astype('category')
new_df['Occupation'] = new_df['Occupation'].astype('category')
new_df['Credit_Mix'] = new_df['Credit_Mix'].astype('category')
new_df['Payment_Behaviour'] = new_df['Payment_Behaviour'].astype('category')
new_df['Credit_Score'] = new_df['Credit_Score'].astype('category')
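
`astype('int64')` raises an error if a column contains missing or non-numeric entries. The dataset here is reported to be clean, so the cell above works as written. As a hedged alternative for messier inputs, the sketch below uses `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype. `to_int_column` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd


def to_int_column(series):
    """Convert to a nullable integer column; unparseable entries become <NA>."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.round().astype("Int64")


toy_age = pd.Series(["23", "41", "unknown", None])
converted = to_int_column(toy_age)
print(converted)
```

Checking `converted.isna().sum()` afterwards shows how many entries could not be parsed, instead of letting the conversion fail.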

In [None]:
# Round the float columns to 2 decimal places
new_df['Annual_Income'] = new_df['Annual_Income'].round(2)
new_df['Monthly_Inhand_Salary'] = new_df['Monthly_Inhand_Salary'].round(2)
new_df['Interest_Rate'] = new_df['Interest_Rate'].round(2)
new_df['Total_EMI_per_month'] = new_df['Total_EMI_per_month'].round(2)
new_df['Outstanding_Debt'] = new_df['Outstanding_Debt'].round(2)
new_df['Changed_Credit_Limit'] = new_df['Changed_Credit_Limit'].round(2)
new_df['Amount_invested_monthly'] = new_df['Amount_invested_monthly'].round(2)
new_df['Monthly_Balance'] = new_df['Monthly_Balance'].round(2)
new_df['Credit_Utilization_Ratio'] = new_df['Credit_Utilization_Ratio'].round(2)
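
The column-by-column rounding above can also be written once for every float column. This is a sketch on a toy DataFrame with made-up values, assuming all float columns should be rounded the same way.

```python
import pandas as pd

toy = pd.DataFrame({
    "Annual_Income": [19114.1234, 34847.8456],
    "Credit_Utilization_Ratio": [26.82262, 31.94496],
    "Num_of_Loan": [4, 1],
})

# Select only the float columns, then round them in one step
float_cols = toy.select_dtypes(include="float").columns
toy[float_cols] = toy[float_cols].round(2)
print(toy)
```

Integer columns such as `Num_of_Loan` are left untouched because `select_dtypes` filters them out.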

In [None]:
#check whether changes implemented or not
new_df.head()

In [None]:
# To check unique info in Column "Payment_of_Min_Amount"
new_df['Payment_of_Min_Amount'].unique()

**Replacing "NM" with "No"**

The value "NM" does not support any analysis; it is effectively a missing value, so we replace it with "No". This helps us create cleaner visualizations.



In [None]:
#Firstly check how many counts are there for each value

new_df['Payment_of_Min_Amount'].value_counts()

In [None]:
# Secondly, replace "NM" with "No" in the 'Payment_of_Min_Amount' column


new_df['Payment_of_Min_Amount']= new_df['Payment_of_Min_Amount'].replace('NM','No')
new_df['Payment_of_Min_Amount'].unique()


In [None]:
new_df['Payment_of_Min_Amount'].value_counts()
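
Mapping "NM" to "No" folds those rows into an existing category. For comparison only, the sketch below shows the approach used in this notebook next to an alternative that keeps "NM" as a missing value, so its share stays visible. The toy Series is made up and does not reflect the real counts.

```python
import numpy as np
import pandas as pd

toy_min_amount = pd.Series(["Yes", "No", "NM", "Yes", "NM"])

as_no = toy_min_amount.replace("NM", "No")         # approach used in this notebook
as_missing = toy_min_amount.replace("NM", np.nan)  # alternative: keep it as missing

print(as_no.value_counts())
print(as_missing.value_counts(dropna=False))
```

Which option is appropriate depends on what "NM" means in the source system, which the dataset itself does not document.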

In [None]:
#Checking values in column "Credit_Mix"
new_df['Credit_Mix'].value_counts()

In [None]:
#Checking the values in column "Credit_Score"
new_df['Credit_Score'].value_counts()


As we can see, the values in column "Credit_Mix" are [Standard, Good, Bad], while the values in column "Credit_Score" are [Standard, Good, Poor]. To keep the labels consistent, we change 'Bad' to 'Poor' in column "Credit_Mix".

In [None]:
#Changing 'Bad' to 'Poor' in column "Credit_Mix"
new_df['Credit_Mix']= new_df['Credit_Mix'].replace('Bad','Poor')
new_df['Credit_Mix'].value_counts()
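
Because `Credit_Mix` was converted to the `category` dtype earlier, the label can also be renamed at the category level. The sketch below shows this on a toy Series with made-up values; it is an alternative to `replace`, not the method used in the cell above.

```python
import pandas as pd

toy_mix = pd.Series(["Good", "Bad", "Standard", "Bad"], dtype="category")

# Rename the category label itself; every row carrying it changes with it
renamed = toy_mix.cat.rename_categories({"Bad": "Poor"})
print(renamed.value_counts())
```

The result keeps the `category` dtype, and "Bad" no longer appears among the categories.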

### What all manipulations have you done and insights you found?

1. Rounded the continuous numerical columns to 2 decimal places.

2. Dropped unnecessary identifier columns: 'ID', 'Customer_ID', 'Name' and 'SSN'.

3. Changed the dtypes of several columns (counts to integers, labels to categories) so that the analysis gives accurate results.

4. While checking unique values, we found that Payment_of_Min_Amount has 3 unique values. The extra value 'NM' was changed to 'No', because missing or wrong values distort the analysis.

5. The Credit_Mix column has the categories Good, Standard and Bad. We changed Bad to Poor so that its labels match Credit_Score. This makes it easier to analyse any association between credit mix and credit score and keeps the visualizations consistent.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **CHART 1- DISTRIBUTION OF DIFFERENT OCCUPATIONS**-**BAR CHART(Univariate Analysis)**

Univariate analysis looks at one variable at a time to understand its distribution, patterns, and basic characteristics.


In [None]:
# Chart 1 visualization code
# Calculate the count of each occupation
occupation_counts = new_df['Occupation'].value_counts().reset_index()
occupation_counts.columns = ['Occupation', 'Number of Individuals']


category_count = px.bar(occupation_counts, y='Occupation', x='Number of Individuals',
    title="Distribution of Occupation")

category_count.update_layout( height=600,width=900,title_font=dict(size=20, color='darkblue'),
    font=dict(size=14,color='darkblue'),
    bargap=0.2)

category_count.show()

 **1. Why did you pick the specific chart?**

  Ans:- A bar chart is ideal for visualizing categorical variables like "Occupation." It effectively shows the frequency or count of individuals in each occupation, making it easy to compare participation across different professions.

**2. What is/are the insight(s) found from the chart?**
  
  Ans:- Lawyers form the largest group of loan/banking product applicants, with over 7,000 individuals.
  Engineers and Architects also show strong participation, suggesting that technical professions are actively engaged in financial services.
  Writers and Musicians have comparatively fewer applicants, but the difference is not drastic, indicating that all listed occupations have a fair level of financial product adoption.

**3. Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific reason.**


   Ans:- Yes, the insights can create a positive business impact by helping financial institutions tailor their products, marketing, and risk assessment strategies to each occupation.

   For example:

   High applicant volumes from Lawyers, Engineers, and Architects suggest stable professional groups that may be targeted for premium products or pre-approved loans.

   Although Writers and Musicians have lower representation, the gap is not significant. This implies a diverse customer base across professions, enabling inclusive financial strategies.

   No clear signs of negative growth were observed, as all professions show meaningful levels of engagement.

## **CHART 2- DISTRIBUTION OF CREDIT SCORE TYPES-PIE CHART(Univariate Analysis)**


In [None]:


# Count how many individuals fall into each Credit_Score category
credit_counts = new_df['Credit_Score'].value_counts().reset_index()
credit_counts.columns = ['Credit_Score_type', 'Count']

# Create a pie chart
fig_creditscore_pie = px.pie(
    credit_counts,
    names='Credit_Score_type',
    values='Count',
    title="Distribution of Credit Scores",
    hole=0.4  # makes it a donut chart, remove if you want full pie
)

# Customize layout
fig_creditscore_pie.update_traces(textinfo='percent+label')  # show % and labels
fig_creditscore_pie.update_layout(
    title_font=dict(size=20, color='darkblue'),
    font=dict(size=14, color='darkblue'),
    height=600,
    width=800
)

# Show chart
fig_creditscore_pie.show()


**1. Why did you pick the specific chart?**

   Ans:- Pie charts are effective for visualizing proportions in categorical data. Since the goal was to analyze the distribution of credit scores across the dataset, the pie chart was ideal to illustrate how much each credit score category (Standard, Poor, Good) contributes to the whole.

**2. What is/are the insight(s) found from the chart?**

  Ans:-

  Standard = 53.17% :- The majority fall within an average creditworthiness range.

  Poor = 29% :- This suggests a significant risk of loan defaults.

  Good = 17.83% :- Creditworthy individuals make up less than 20% of the dataset, which is not a good sign.

  The dataset is skewed towards the "Standard" credit score class. This imbalance can lead to biased predictions, where a model favors the majority class and underperforms in identifying "Poor" and "Good" credit scores.
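
One common way to quantify and counter this kind of imbalance is to compute class shares and inverse-frequency weights. The sketch below uses a toy Series with made-up labels, not the real counts, and applies the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn uses for `class_weight="balanced"`.

```python
import pandas as pd

toy_scores = pd.Series(["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2)

# Share of each class in the data
shares = toy_scores.value_counts(normalize=True)

# "Balanced" weighting: n_samples / (n_classes * class_count)
counts = toy_scores.value_counts()
class_weights = len(toy_scores) / (len(counts) * counts)

print(shares)
print(class_weights.round(3))
```

Rarer classes receive larger weights, so a model trained with them is penalised more for misclassifying "Good" or "Poor" customers.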

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

  Ans:- Yes, these insights can inform better targeting strategies, such as offering customized financial products to different segments. However, the class imbalance and the high proportion of "Poor" scores raise red flags. Since only 17.83% of individuals show signs of creditworthiness, the institution can revise its lending policy or its applicant-selection policy. If not addressed, the imbalance can negatively affect credit risk modeling, leading to poor decision-making and financial losses.

## **CHART 3- DISTRIBUTION OF CREDIT MIX TYPES-BAR CHART(Univariate Analysis)**

In [None]:
# Count how many individuals fall into each Credit_Mix category
credit_mix_counts = new_df['Credit_Mix'].value_counts().reset_index()
credit_mix_counts.columns = ['Credit_Mix_type', 'Count']

# Create a bar chart
fig_creditmix_bar = px.bar(
    credit_mix_counts,
    x='Credit_Mix_type',
    y='Count',
    text='Count',
    title="Distribution of Individuals by Credit Mix Type",
    color='Credit_Mix_type',
    height=600,
    width=800
)

# Customize layout
fig_creditmix_bar.update_traces(textposition='outside')  # show counts above bars
fig_creditmix_bar.update_layout(
    title_font=dict(size=20, color='darkblue'),
    font=dict(size=14, color='darkblue'),
    xaxis_title="Credit Mix Type",
    yaxis_title="Number of Individuals",
    showlegend=False
)

# Show chart
fig_creditmix_bar.show()


**1. Why did you pick the specific chart?**

   I chose a bar chart because it is more effective than a pie chart when comparing categorical data across distinct groups. Since the goal was to examine how individuals are distributed across different Credit Mix types (e.g., Standard, Good, Poor), the bar chart provides a clearer, side-by-side comparison of the counts, making it easy to identify which category dominates and by how much.

**2. What is/are the insight(s) found from the chart?**

Standard Credit Mix dominates the dataset, meaning most individuals fall into this mid-level credit mix range.

Good Credit Mix is relatively smaller, showing fewer individuals maintain a consistently good balance of credit accounts.

Poor Credit Mix exists in non-negligible proportion, indicating a risky group with poor handling of credit.

The imbalance across categories (with "Standard" being far larger) suggests potential bias in predictive modeling, similar to the credit score imbalance.

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive Impact:

Paisabazaar can design risk-adjusted loan products by identifying individuals with "Poor" Credit Mix and applying stricter eligibility or higher interest rates.
Individuals with "Good" Credit Mix can be targeted for premium loan offers or higher credit limits, boosting customer satisfaction and business growth.

Potential Negative Growth:

The dataset's heavy tilt toward the "Standard" Credit Mix reduces granularity in identifying truly "Good" or "Poor" customers.
If not handled (e.g., with balancing techniques in modeling), predictive models may underperform in capturing risky (Poor) or highly reliable (Good) credit customers, leading to higher default risks or missed business opportunities.

## **CHART 4-DISTRIBUTION OF ANNUAL INCOME-HISTOGRAM(Univariate Analysis)**

In [None]:
# Histogram to show distribution of Annual Income
fig_income = px.histogram(
    new_df,
    x="Annual_Income",
    nbins=50,
    title="Distribution of Annual Income",
    color_discrete_sequence=['skyblue']
)

# Customize layout
fig_income.update_layout(
    height=600, width=900,
    title_font=dict(size=20, color='darkblue'),
    font=dict(size=14, color='darkblue')
)

fig_income.show()


**1. Why did you pick the specific chart?**

  I chose a histogram because income is a numerical variable, and histograms are the most effective way to visualize the spread of continuous values.

*   px.histogram(): shows the distribution (spread) of Annual Income across all individuals.
*   nbins=50: breaks the income range into 50 bins for better granularity.

  This helps us see the income ranges where most customers lie.

**2. What is/are the insight(s) found from the chart?**

Most customers in the dataset fall in the lower-to-mid income range (₹20K–₹70K), with fewer people in very high-income categories. This suggests the dataset is slightly skewed towards middle-class earners.

*   25% of individuals earn below ₹20K.
*   A small percentage earn very high incomes (>₹1.1L), which pulls the mean (₹50K) higher than the median (₹37K).
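
The gap between mean and median can be checked numerically. The sketch below uses a made-up toy income Series, not the dataset's values, to show how a few very high earners pull the mean above the median and produce positive skewness.

```python
import pandas as pd

toy_income = pd.Series([18000, 22000, 30000, 37000, 45000, 60000, 150000])

summary = toy_income.describe()  # includes 25%, 50% (median) and 75%
print(summary[["mean", "25%", "50%", "75%"]])
print("skewness:", round(toy_income.skew(), 2))
```

Running `new_df['Annual_Income'].describe()` and `.skew()` on the real column would give the corresponding figures for this dataset.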

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive: Knowing that most customers are in a moderate income bracket, businesses can design affordable financial products (e.g., lower EMI loans, mid-tier credit cards).
Negative: The lack of high-income individuals means premium products may not perform well, leading to wasted marketing spend.

## **CHART 5- DISTRIBUTION OF LOAN APPLICANTS AGEWISE-HISTOGRAM(Univariate Analysis)**

In [None]:
# Histogram: distribution of loan applicants by Age
fig_age_loan = px.histogram(
    new_df,
    x="Age",
    nbins=20,  # divide age into 20 bins
    title="Age-wise Distribution of Loan Applicants",
    color_discrete_sequence=["seagreen"]
)

# Customize layout
fig_age_loan.update_layout(
    height=600, width=900,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    bargap=0.2
)

fig_age_loan.show()

**1. Why did you pick the specific chart?**

I chose a histogram because Age is a continuous numerical variable. A histogram makes it easy to see which age groups dominate loan applications. I set nbins=20 to split age into 20 bins for a clear view.

**2. What is/are the insight(s) found from the chart?**

Very few applicants are below age 20, likely because younger individuals have not yet reached financial independence. The bulk of applicants appear from age 20 onwards: loan demand rises sharply from the mid-20s, peaks between 30 and 50 (the prime earning years), and then declines after around 55.
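
The age ranges discussed above can be turned into explicit bands with `pd.cut`. This is a sketch on a toy Series; the ages are made up, and the band edges and labels are hypothetical choices that mirror the ranges mentioned in the text.

```python
import pandas as pd

toy_age = pd.Series([17, 22, 28, 34, 41, 47, 52, 58])

# Hypothetical band edges chosen to mirror the ranges discussed above
bands = pd.cut(
    toy_age,
    bins=[0, 20, 30, 50, 55, 120],
    labels=["<20", "20-29", "30-49", "50-54", "55+"],
    right=False,  # each band includes its lower edge and excludes its upper edge
)
print(bands.value_counts().sort_index())
```

Counting applicants per band gives a table that can sit alongside the histogram when comparing age groups.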

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive: Businesses should target 30–50 year olds for loan products (personal loans, home loans, etc.) as they are the most active borrowers.
Negative: Younger customers (under 20) and older customers (above 55) show weak loan demand, meaning marketing spend on these groups may lead to negative ROI. Instead, they can be engaged with different products (e.g., credit cards for youth, retirement planning for seniors).

## **CHART 6- RELATION BETWEEN OCCUPATION Vs CREDIT SCORE-GROUPED BAR CHART(Bivariate Analysis)**

Bivariate analysis is the statistical study of the relationship between two variables.
"Bi" → two
"Variate" → variables
It helps you understand how one variable changes with respect to another.

In [None]:
# Grouped Bar chart to compare credit scores across occupations

fig_occ_credit = px.histogram(
    new_df,
    x="Occupation",
    color="Credit_Score",
    barmode="group",
    title="Credit Score Distribution Across Occupations",
    color_discrete_map={
        "Poor": "red",
        "Good": "lightgreen",
        "Standard": "skyblue"
    }
)

# Layout customization
fig_occ_credit.update_layout(
    height=600,
    width=1000,
    title_font=dict(size=20, color='darkblue'),
    font=dict(size=14, color='darkblue'),
    xaxis_tickangle=-45
)

fig_occ_credit.show()



**1. Why did you pick the specific chart?**

 I chose a grouped bar chart because it allows me to compare two categorical variables: Occupation and Credit Score. This type of chart makes it easy to see how different job types are distributed across Good, Standard, and Poor credit categories.

**2. What is/are the insight(s) found from the chart?**

Across most occupations, the majority of individuals fall in the "Standard" credit score group. For example, among Accountants, Engineers, and Scientists, more than 50% fall in Standard, and only a small percentage fall under Good.

Some occupations, such as Scientists and Engineers, have a somewhat higher share of Good credit scores, while certain other jobs have relatively more Poor scores. On average, only ~18% of individuals in these professions have a "Good" score, while ~30% are in the "Poor" segment.

So occupation does influence credit health, but the difference between professions is not extreme: all professions display a similar pattern of about half Standard, about one-third Poor, and less than one-fifth Good.
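
Percentages like these can be read off a row-normalised cross-tabulation instead of being estimated from the chart. The sketch below uses a toy DataFrame with made-up rows to show the pattern.

```python
import pandas as pd

toy = pd.DataFrame({
    "Occupation": ["Lawyer", "Lawyer", "Lawyer", "Lawyer", "Writer", "Writer"],
    "Credit_Score": ["Standard", "Standard", "Poor", "Good", "Standard", "Poor"],
})

# normalize="index" makes each occupation's row sum to 1; multiply by 100 for percent
share = pd.crosstab(toy["Occupation"], toy["Credit_Score"], normalize="index") * 100
print(share.round(1))
```

Applied to `new_df`, the same call would give the share of Good, Standard and Poor scores within each occupation.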

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive: Businesses can prioritize safer professions when giving credit (e.g., scientists, engineers), reducing default risk.

Negative: Being biased against some occupations may limit business reach. The company should balance risk with inclusivity.

## **CHART 7- INCOME VS PAYMENT BEHAVIOUR PATTERN-BOXPLOT(Bivariate Analysis)**

In [None]:
# Boxplot to show income distribution across payment behaviors
fig_payment_income = px.box(
    new_df,
    x="Payment_Behaviour",
    y="Annual_Income",
    color="Payment_Behaviour",
    title="Relationship between Payment Behaviour and Annual Income"
)

# Customize layout
fig_payment_income.update_layout(
    height=600, width=950,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    xaxis_title="Payment Behaviour",
    yaxis_title="Annual Income (₹)"
)

fig_payment_income.show()


**1. Why did you pick the specific chart?**

I chose a boxplot because it clearly shows how income is distributed across different payment behaviors. This is more effective than bar charts since income is a continuous variable with wide variation.

**2. What is/are the insight(s) found from the chart?**

*High_spent_Large_value_payments-*

*   Median income (middle line inside the box): quite high (~₹65K+).
*   Box (IQR range): most individuals in this group earn between ~₹40K and ₹100K.
*   Whiskers: some go as high as ~₹170K (very high income earners).

Interpretation: These individuals are high earners who also spend big, making them both high-potential customers and possibly risky if spending exceeds repayment.

*Low_spent_Small_value_payments-*

*   Median income: ~₹25K (much lower than others).
*   Box: the majority earn between ₹15K and ₹40K.
*   Whiskers: a few outliers exist with high income (>₹120K), but they are very rare.

Interpretation: These are low earners with conservative spending, meaning they may be safer for small credit products but not ideal for premium ones.

*Comparing Groups-*

*   High spenders (large/medium payments) tend to have higher median incomes and wider variation.
*   Low spenders (small payments) cluster around lower income levels, with fewer high earners.

This shows a clear relationship between spending/payment behaviour and income.


From the box plot, we see that income levels are closely tied to spending behavior. Customers making "high-spent large-value payments" usually earn much more (median ~₹65K+), whereas those with "low-spent small-value payments" cluster around lower income levels (median ~₹25K). However, each group has outliers, meaning income alone does not define payment behavior completely.
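
The medians and box ranges quoted above are read from the chart. They can also be computed directly per group. The sketch below does this on a toy DataFrame with made-up incomes. Note that pandas and Plotly may use slightly different quartile conventions, so the values need not match the plotted box exactly.

```python
import pandas as pd

toy = pd.DataFrame({
    "Payment_Behaviour": ["High_spent_Large_value_payments"] * 4
                         + ["Low_spent_Small_value_payments"] * 4,
    "Annual_Income": [40000, 60000, 70000, 100000, 15000, 20000, 30000, 40000],
})

# Quartiles per payment-behaviour group, one column per quantile
stats = (
    toy.groupby("Payment_Behaviour")["Annual_Income"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
stats["IQR"] = stats[0.75] - stats[0.25]
print(stats)
```

The `0.5` column is the median of each group, and `IQR` is the height of the box.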

This insight suggests businesses can use both income and payment patterns together to design credit offerings:

Offer premium, high-limit products to high-spent, high-income groups.

Offer basic, low-limit credit to low-income, low-spent groups, reducing default risk.

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive: Businesses can identify which income segments have responsible payment habits, helping design safer loan/credit card strategies.
Negative: If high-income individuals also show poor payment behavior, targeting them just based on income may increase default risk. This highlights the need for behavioral scoring in addition to income checks.

**CHART 8- NUMBER OF DELAYED PAYMENTS Vs CREDIT SCORE-BOXPLOT(Bivariate Analysis)**

In [None]:
# Boxplot: Delayed Payments vs Credit Score
fig_delay_credit = px.box(
    new_df,
    x="Credit_Score",
    y="Num_of_Delayed_Payment",
    color="Credit_Score",
    title="Number of Delayed Payments vs Credit Score",
    color_discrete_map={"Poor": "red", "Standard": "blue", "Good": "green"}
)

# Layout customization
fig_delay_credit.update_layout(
    height=600, width=900,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    xaxis_title="Credit Score Category",
    yaxis_title="Number of Delayed Payments"
)

fig_delay_credit.show()


**1. Why did you pick the specific chart?**

I chose a boxplot because it compares a numeric variable (number of delayed payments) with a categorical variable (credit score). Since repayment behavior is one of the strongest indicators of credit health, this visualization clearly shows the distribution of late payments across different credit score categories.

**2. What is/are the insight(s) found from the chart?**

Individuals with Poor credit scores have the highest median number of delayed payments, often with a very wide spread (chronic late payers).
The Good credit score group shows very few delayed payments — most fall near zero.
Standard scorers are in between, but still lean closer to Poor than Good.
This confirms that repayment punctuality is directly tied to credit score health.

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive:

Direct business impact: Paisabazaar can use delayed payments as a strong feature for predictive modeling.
Individuals with repeated late payments can be flagged early and either denied loans or charged higher risk premiums.

Negative:

If the model is too strict, occasional late payers (but otherwise good customers) may be penalized unfairly. This could reduce customer satisfaction.
Hence, the insight must be used carefully alongside other features (like income, debt, and loan count).

**CHART 9- OUTSTANDING DEBT Vs CREDIT SCORE-BOXPLOT(Bivariate Analysis)**

In [None]:
# Boxplot of Outstanding Debt by Credit Score
fig_debt_credit = px.box(
    new_df,
    x="Credit_Score",
    y="Outstanding_Debt",
    color="Credit_Score",
    title="Outstanding Debt vs Credit Score",
    color_discrete_map={"Poor": "red", "Standard": "blue", "Good": "green"}
)

# Customize layout
fig_debt_credit.update_layout(
    height=600, width=900,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    xaxis_title="Credit Score Category",
    yaxis_title="Outstanding Debt (₹)"
)

fig_debt_credit.show()


**1. Why did you pick the specific chart?**

I chose a boxplot because we are comparing a numerical variable (Outstanding Debt) against a categorical variable (Credit Score). Boxplots help clearly visualize median debt levels and variability across groups.

**2. What is/are the insight(s) found from the chart?**
Median outstanding debt increases as credit score decreases.
Individuals with a Poor Credit Score generally have higher outstanding debt and wider variation, meaning many owe significant amounts.
Those with a Good Credit Score tend to have lower median debt. There are more outliers in the Good credit score group, suggesting that while most Good scorers have low to moderate debt, a small segment has unusually high debts.
The Standard Credit Score group sits in between, as expected.
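
To put a number on the outlier observation, the points beyond the box-plot whiskers can be counted per group. This is a minimal sketch using the common 1.5 × IQR rule on made-up data; the helper name `count_iqr_outliers` is hypothetical, and the quartile convention may differ slightly from the one the plotting library uses.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points that fall outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Illustrative values only: one Good scorer carries unusually high debt.
toy = pd.DataFrame({
    "Credit_Score": ["Good"] * 6 + ["Poor"] * 6,
    "Outstanding_Debt": [200, 250, 300, 350, 400, 4000,
                         1500, 2000, 2500, 3000, 3500, 4000],
})

outliers_per_group = (
    toy.groupby("Credit_Score")["Outstanding_Debt"].apply(count_iqr_outliers)
)
print(outliers_per_group)
```

Note that a debt of 4000 is an outlier inside the Good group but an ordinary value inside the Poor group, because the fences are computed per group.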

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Positive: This confirms that high outstanding debt is strongly linked to poor credit scores. Businesses can use this insight to flag high-debt customers as risky.
Negative: If businesses reject all high-debt customers, they may lose potential clients who can still repay (e.g., high-income individuals temporarily holding debt). Hence, debt must be considered alongside income and payment history for fair lending. The Good score group also contains a notable number of high-debt outliers; these cases should be investigated promptly, otherwise they can lead to wrong credit decisions and more defaults.

**CHART 10 & 11- CORRELATION BETWEEN NO_OF_LOANS Vs CREDIT SCORE AND CUR Vs CREDIT SCORE -BOXPLOT(Bivariate Analysis)**

In [None]:
# Set figure size for side-by-side plots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# --- Chart 1: Credit Utilization Ratio vs Credit Score ---
sns.boxplot(
    data=new_df,
    x="Credit_Score",
    y="Credit_Utilization_Ratio",
    palette={"Poor": "red", "Standard": "blue", "Good": "green"},
    ax=axes[0]
)
axes[0].set_title("Credit Utilization Ratio vs Credit Score", fontsize=14, color="darkblue")
axes[0].set_xlabel("Credit Score Category")
axes[0].set_ylabel("Credit Utilization Ratio (%)")

# --- Chart 2: Total Number of Loans vs Credit Score ---
sns.boxplot(
    data=new_df,
    x="Credit_Score",
    y="Num_of_Loan",
    palette={"Poor": "red", "Standard": "blue", "Good": "green"},
    ax=axes[1]
)
axes[1].set_title("Number of Loans vs Credit Score", fontsize=14, color="darkblue")
axes[1].set_xlabel("Credit Score Category")
axes[1].set_ylabel("Number of Loans")

plt.tight_layout()
plt.show()


**1.Why did you pick the specific chart?**

I chose boxplots because we're comparing numeric variables (CUR, Loan Count) against a categorical variable (Credit Score). This makes it easy to compare distributions across categories.

**2.What is/are the insight(s) found from the chart?**
  
  Number of loans (Num_of_Loan) is a strong discriminator of poor credit: people with >4 loans are much more likely to have a Poor credit score.

Credit Utilization Ratio (CUR) is not a strong discriminator in this dataset: CUR medians and means are almost the same across Good / Standard / Poor, and the practical effect size is negligible even though a very large sample makes small differences statistically significant.
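
The point about a negligible practical effect size can be illustrated with eta squared, the share of total variance explained by group membership. This is my own choice of measure for the sketch, not necessarily the one used in the analysis, and the data below is synthetic: three groups whose means differ by only 0.3 points against a spread of 5 points.

```python
import numpy as np
import pandas as pd

def eta_squared(df: pd.DataFrame, group_col: str, value_col: str) -> float:
    """Share of the variance in value_col explained by group_col (0 to 1)."""
    grand_mean = df[value_col].mean()
    groups = df.groupby(group_col)[value_col]
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((df[value_col] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

# Synthetic data (an assumption for illustration, not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Credit_Score": np.repeat(["Good", "Standard", "Poor"], 5000),
    "Credit_Utilization_Ratio": np.concatenate([
        rng.normal(32.3, 5, 5000),
        rng.normal(32.0, 5, 5000),
        rng.normal(31.7, 5, 5000),
    ]),
})

effect = eta_squared(toy, "Credit_Score", "Credit_Utilization_Ratio")
print(f"eta squared = {effect:.4f}")
```

With 15,000 rows even such a tiny gap in means can pass a significance test, yet the groups explain well under 1% of the variance, which is what "statistically significant but practically negligible" means.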

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**

Yes, the insights from these charts can drive positive business impact in several ways:

Number of loans appears to have a greater impact on credit score than the Credit Utilization Ratio. This is counterintuitive, since utilization is normally expected to matter for credit scores; the result may be an artifact of data imbalance or feature dependencies in this dataset.
As a precautionary measure, we can use flagging:
treat customers with Num_of_Loan > 4 as higher risk by default.

**CHART 12- PAYMENT OF MIN AMOUNT Vs ANNUAL INCOME -BAR CHART (Bivariate Analysis)**

In [None]:
income_summary = new_df.groupby("Payment_of_Min_Amount")["Annual_Income"].median().reset_index()

fig = px.bar(
    income_summary,
    x="Payment_of_Min_Amount",
    y="Annual_Income",
    color="Payment_of_Min_Amount",
    title="Median Annual Income by Payment of Minimum Amount",
    text="Annual_Income"
)

fig.update_traces(texttemplate='%{text:.2s}', textposition="outside")

fig.update_layout(
    height=600, width=900,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    yaxis_title="Median Annual Income"
)

fig.show()



First, "new_df.groupby("Payment_of_Min_Amount")" builds a GroupBy object that groups rows by the unique values in the Payment_of_Min_Amount column. No aggregation happens yet.

"new_df.groupby("Payment_of_Min_Amount")["Annual_Income"]" selects the Annual_Income column from that GroupBy object. Internally this becomes a SeriesGroupBy (groups of numeric values).

"new_df.groupby("Payment_of_Min_Amount")["Annual_Income"].median()" is the aggregation step, which calculates the median value for each category.

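The three groupby steps can be seen one at a time in a small sketch. The frame below is made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Made-up rows, just enough to show each intermediate object.
toy = pd.DataFrame({
    "Payment_of_Min_Amount": ["Yes", "No", "Yes", "No", "Yes"],
    "Annual_Income": [30_000, 50_000, 34_000, 40_000, 32_000],
})

grouped = toy.groupby("Payment_of_Min_Amount")          # step 1: nothing computed yet
income_groups = grouped["Annual_Income"]                # step 2: pick one column
income_summary = income_groups.median().reset_index()   # step 3: aggregate per category

print(type(grouped).__name__)
print(type(income_groups).__name__)
print(income_summary)
```

The `reset_index()` call turns the group labels back into an ordinary column, which is the shape `px.bar` expects for its `x` and `y` arguments.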
**1. Why did you pick the specific chart?**

I used a bar chart of medians because it provides a clear and direct comparison of income levels across the two categories of Payment_of_Min_Amount (Yes, No). Since incomes are numeric and skewed, the median gives a better idea of the “typical” customer than the mean (less distorted by very high incomes). A bar chart is straightforward for business decision-making because it highlights which group earns more or less.

**2. What is/are the insight(s) found from the chart?**

Yes (Pay Minimum):
Median income ≈ ₹32k — these customers earn the least on average.
No (Do Not Pay Minimum):
Median income ≈ ₹43k — this group has the highest incomes.

Lower-income customers are more likely to pay only the minimum. This is a warning sign: they may struggle to repay in full and carry higher risk of debt accumulation.

Higher-income customers are less likely to stick to minimum payments, showing better repayment ability and healthier credit behavior.

**3.Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific** **reason.**
Positive impacts (actionable use):

When underwriting loans, the company should flag applicants who pay only the minimum and also have lower income — these customers are at higher risk of default or long-term debt accumulation.
Conversely, customers who do not pay only the minimum and earn higher incomes are more financially reliable. Such individuals can be offered better credit terms, lower interest rates, or higher credit limits.

Negative implications / risks:

Over-reliance on income alone could exclude some low-income but disciplined payers, reducing inclusivity.

In [None]:
print(income_summary)

### **CHART 13- RELATION BETWEEN CREDIT SCORE AND AGE GROUPS -BARCHART(Bivariate Analysis)**

In [None]:
# Create Age-wise groups
bins = [18, 25, 35, 45, 55, 65, 120]  # 120 is upper bound for safety
labels = ["18-25", "25-35", "35-45", "45-55", "55-65", "65+"]

new_df["Age_Group"] = pd.cut(new_df["Age"], bins=bins, labels=labels, right=False)
new_df['Age_Group'].value_counts()


In [None]:
fig = px.histogram(
    new_df,
    x='Age_Group',
    color='Credit_Score',
    barmode='group',  # Grouped bars instead of stacked
    title='Clustered Bar Chart: Age-wise Distribution Across Credit Score Categories',
    labels={'Age_Group': 'Age Group', 'count': 'No. of Individuals'},
    category_orders={'Age_Group': ['18-25', '25-35', '35-45', '45-55', '55-65', '65+']}
)

fig.update_layout(
    xaxis_title='Age Group',
    yaxis_title='No. of Individuals',
    bargap=0.2
)

fig.show()

**1. Why did you pick the specific chart?**

We wanted to analyze the relationship between demographics (Age Groups) and credit health (Credit Score categories).
Both variables are categorical, so a grouped bar chart is the best fit.
Grouped bars allow us to compare counts of Good, Standard, and Poor credit scores side by side within each age group, which makes it very easy to spot differences across age categories.

**2. What is/are the insight(s) found from the chart?**
18–25 Age Group:
Very few individuals, and the majority fall into Standard or Poor scores.
Lending to this group is high risk, since they are young, have limited credit history, and repayment capacity may be weak.
Decision: Approve only small-ticket loans with stricter eligibility checks.

25–35 Age Group:
Large number of applicants.
A healthy proportion have Good scores, though there’s still a visible chunk in Standard and Poor.
Decision: Good opportunity segment — target them with personal loans, credit cards, and auto loans, but still screen carefully for those in Poor.

35–45 Age Group:
One of the strongest segments — majority show Good scores, very few in Poor.
Decision: Safest lending category, can be offered higher credit limits or preferential rates.

45–55 Age Group:
Still financially stable, but the share of Standard/Poor slightly rises compared to 35–45.
Decision: Medium risk, but still reliable borrowers, especially for home loans or top-up loans.

55–65 Age Group:
Numbers shrink, but still a mix of Standard and Good.
Decision: Approve cautiously — nearing retirement age means income stability may decline.

65+ Age Group:
Very few applicants, with noticeable Standard/Poor representation.
Decision: High risk; lend sparingly, only against strong collateral.

Overall:
Credit health improves with age until mid-career (35–45), where Good scores dominate.
Young borrowers (18–25) are the riskiest, with a low Good score share, likely due to lack of credit history.
Older borrowers (>55) show mixed credit health, since income instability in retirement years affects repayment ability.
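
Because the age groups differ a lot in size, raw counts can hide the share of Good, Standard, and Poor scores within each group. The sketch below, on a handful of made-up rows, reuses the same bins and then row-normalises a cross-tabulation so that every age group sums to 1. It also shows how `right=False` assigns boundary ages.

```python
import pandas as pd

# Made-up ages and scores for illustration only.
toy = pd.DataFrame({
    "Age": [18, 24, 25, 30, 34, 35, 44, 45, 66],
    "Credit_Score": ["Poor", "Standard", "Good", "Standard", "Good",
                     "Good", "Good", "Standard", "Poor"],
})

bins = [18, 25, 35, 45, 55, 65, 120]
labels = ["18-25", "25-35", "35-45", "45-55", "55-65", "65+"]
# right=False makes each bin [left, right): a 25-year-old lands in "25-35".
toy["Age_Group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Row-normalised shares, so groups of different sizes are directly comparable.
share = pd.crosstab(toy["Age_Group"], toy["Credit_Score"], normalize="index")
print(share.round(2))
```

Applying the same `pd.crosstab(..., normalize="index")` call to `new_df` would give the exact Good share per age group to support the decisions listed above.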

**3.Will the gained insights help creating a positive business impact?**

Positive Impacts:
Lenders can focus on 25–45 age groups where Good scores are dominant → ensures lower default rates and healthier loan books.
Risk segmentation: Young (18–25) and Old (65+) can still be served with secured loans or small credit lines, keeping inclusivity while managing risk.
Helps in personalized product design — e.g., starter credit cards for young borrowers, home loans for 30–40 segment, pension-backed loans for 55+.

Negative Impacts (Risks):
Over-reliance on age alone can cause age bias in lending. A disciplined 22-year-old may still be a safe borrower, while a 50-year-old with poor credit may default.
Excluding older groups may reduce customer base in niche segments like retirement loans.

### **CHART 14- MONTHWISE DISTRIBUTION OF CREDIT SCORE -GROUPED BARCHART(Bivariate Analysis)**

In [None]:
# Define proper month order
month_order = [
    "January", "February", "March",
    "April", "May", "June",
    "July", "August", "September",
    "October", "November", "December"]

# Plot histogram for month-wise distribution
fig = px.histogram(
    new_df,
    x="Month",
    color="Credit_Score",
    barmode="group",  # grouped by credit score
    category_orders={"Month": month_order},
    title="Month-wise Distribution of Credit Scores",
    labels={"Month": "Month", "count": "Number of Applicants"}
)

fig.update_layout(
    xaxis_title="Month",
    yaxis_title="Number of Individuals",
    bargap=0.2,
    height=600, width=950,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue")
)

fig.show()


**1. Why did you pick the specific chart?**

I used a grouped bar chart because it allows comparison of two categorical variables at once — Month and Credit_Score.
It shows both seasonality trends (month-wise loan applications) and creditworthiness (score distribution).
Month-wise analysis is important since financial behavior often changes with seasons (festivals, tax deadlines, bonus cycles, etc.).

**2. What is/are the insight(s) found from the chart?**

Application Volume Trend: Some months show higher loan applications than others, indicating possible seasonal peaks (e.g., festive seasons or year-start planning).
Good Credit Scores: There are months where the proportion of Good scores is higher — lenders can treat these as relatively safer months for disbursement.
Poor Credit Scores: In some months, Poor scores rise. This signals increased lending risk during those times.
Stability of Standard Scores: Across many months, Standard scores dominate, showing that a majority of applicants fall in the average-risk category.
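
To name the specific months behind these observations, the share of each credit score per month can be tabulated. This is a minimal sketch on made-up data with a shortened month list; on the real data the full `month_order` list from the plotting cell would be used instead.

```python
import pandas as pd

month_order = ["January", "February", "March", "April"]  # shortened for the toy data

# Made-up rows: four records per month.
toy = pd.DataFrame({
    "Month": ["January"] * 4 + ["February"] * 4 + ["March"] * 4 + ["April"] * 4,
    "Credit_Score": ["Good", "Standard", "Standard", "Poor",
                     "Good", "Good", "Standard", "Poor",
                     "Standard", "Standard", "Poor", "Poor",
                     "Good", "Standard", "Standard", "Standard"],
})

# Share of each credit score within a month, with rows in calendar order.
monthly_share = (
    toy.groupby("Month")["Credit_Score"]
       .value_counts(normalize=True)
       .unstack(fill_value=0.0)
       .reindex(month_order)
)

good_share = monthly_share["Good"]
print(monthly_share.round(2))
print("Highest Good share:", good_share.idxmax())
print("Lowest Good share:", good_share.idxmin())
```

If the gap between the highest and lowest monthly share turns out to be small on the real data, the seasonal reading should be treated with caution.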


**3.Will the gained insights help creating a positive business impact?**

Positive Impact:

Lenders can forecast demand month by month and prepare loan products accordingly.
Can target promotions in months with higher Good scorers to maximize safe lending.
Helps with resource planning (staffing, collections, risk checks) during high-volume months.

Negative Impact:

If lenders over-prioritize only safe months, they may miss out on opportunities in risky months where loan demand is high.
Seasonal biases can create exclusion risks if not balanced with individual applicant assessment.


### **CHART 15- CREDIT INQUIRIES Vs CREDIT SCORE- BOXPLOT(Bivariate Analysis)**

In [None]:
# Boxplot: Credit Inquiries vs Credit Score
fig_inquiries_credit = px.box(
    new_df,
    x="Credit_Score",
    y="Num_Credit_Inquiries",
    color="Credit_Score",
    title="Number of Credit Inquiries vs Credit Score",
    color_discrete_map={"Poor": "red", "Standard": "blue", "Good": "green"}
)

# Layout customization
fig_inquiries_credit.update_layout(
    height=600, width=900,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    xaxis_title="Credit Score Category",
    yaxis_title="Number of Credit Inquiries"
)

fig_inquiries_credit.show()


**1. Why did you pick the specific chart?**

I picked a boxplot because it reveals the relationship between a numeric variable (number of credit inquiries) and a categorical outcome (credit score). Frequent credit inquiries often signal financial stress or aggressive loan-seeking behavior, which directly impacts credit risk.

**2. What is/are the insight(s) found from the chart?**

Poor scorers show the highest number of credit inquiries, confirming over-borrowing or credit desperation.
Good scorers usually have very few inquiries, reflecting financial stability.
Standard scorers lie in the middle, but lean toward Poor with moderate inquiry frequency.
This shows a clear negative correlation: more inquiries → lower credit score.
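
The negative relationship can be quantified with a rank correlation. The sketch below assumes an ordinal encoding Poor < Standard < Good (my assumption for illustration) and computes Spearman's coefficient as the Pearson correlation of ranks, on made-up values.

```python
import pandas as pd

# Ordinal encoding is an assumption made for this sketch: Poor < Standard < Good.
score_rank = {"Poor": 0, "Standard": 1, "Good": 2}

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor"],
    "Num_Credit_Inquiries": [1, 2, 4, 3, 8, 9],
})

encoded = toy["Credit_Score"].map(score_rank)

# Spearman correlation equals the Pearson correlation of the ranked values.
spearman = encoded.rank().corr(toy["Num_Credit_Inquiries"].rank())
print(f"Spearman correlation: {spearman:.2f}")
```

A value close to -1 means that higher inquiry counts go with lower score categories almost monotonically; a value near 0 on the real data would weaken the claim.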

**3.Will the gained insights help creating a positive business impact?**

Positive:

Insights can help in early risk detection: customers making too many inquiries within a short time frame can be flagged as risky.
It can prevent loan defaults by reducing approvals to financially stressed applicants.

Negative:

Not all inquiries are bad — some may be due to rate shopping by financially savvy customers. If treated too harshly, Paisabazaar may lose high-quality clients.
Hence, insights should be combined with repayment history and income stability before decision-making.

### **CHART 16- ANNUAL INCOME Vs NO OF LOANS TAKEN COLOURED BY CREDIT SCORE(BUBBLE SIZE- OUTSTANDING DEBT)-SCATTER CHART(Multivariate Analysis)**

Multivariate analysis is the statistical study of more than two variables at the same time to understand their combined relationships and influence on outcomes.
"Multi" → many
"Variate" → variables
It looks beyond pairwise comparisons (bivariate) and explores how several features interact together.

In [None]:
# --- Scatter:Annual Income vs Number of Loans Colored by Credit Score (Bubble Size = Outstanding_Debt)

fig = px.scatter(
    new_df,
    x="Annual_Income",
    y="Num_of_Loan",
    color="Credit_Score",
    size="Outstanding_Debt",
    hover_data=["Monthly_Balance"],
    title="Annual Income vs Number of Loans Colored by Credit Score (Bubble Size = Outstanding_Debt)"
)

# Layout customization
fig.update_layout(
    height=650, width=1010,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    xaxis_title="Annual Income",
    yaxis_title="Num of Loans"
)

fig.show()


**1. Why did you pick the specific chart?**

A scatter plot with bubble size and color lets us analyze 4 key variables in one visualization:
X-axis: Annual Income (earning capacity)
Y-axis: Number of Loans (credit burden)
Bubble size: Outstanding Debt (obligation level)
Color: Credit Score (risk classification)

**2. What is/are the insight(s) found from the chart?**

A high number of loans strongly correlates with poor credit health.
Poor Credit Score Borrowers (High Risk)-
Mostly in the upper-left region → many loans, low income, majority having loans more than 4.
These customers are overleveraged — they borrow more than they can handle.
They pose the highest default risk, and should be flagged for stricter checks or offered loan consolidation instead of new credit.

Standard Credit Score Borrowers (Moderate Risk)-
Spread across middle bands of loans (2–4) with medium-sized bubbles.
This group is the "grey zone". They are not in immediate danger but not fully safe either. They are good candidates for closer monitoring (watchlist segment).
Lenders should offer smaller ticket loans or gradual limit increases, but avoid aggressive exposure.

Good Credit Score Borrowers (Low Risk / Prime Segment)-
Mostly in the bottom-right → high income, few loans, small bubbles.
These are ideal customers: stable incomes, low borrowing, and good repayment history.
They should be prioritized for pre-approved loans, higher credit limits, or premium products.
Business growth can come from upselling to this group safely.

**3.Will the gained insights help creating a positive business impact?**

Positive Business Impacts:
Targeted loan restructuring or advisory for Poor credit customers with high loans and high debt.
Upsell opportunities to customers with Standard scores and manageable loans.
Cross-sell or loyalty programs for Good score individuals to retain quality customers.

Insights Leading to Potential Negative Growth:
Many high-income customers are still in the Poor credit segment with high loan counts. This indicates:

Even high-income customers show high levels of debt, indicating over-reliance on credit.
High earners are not immune to missing payments or accumulating bad debt.
If not addressed, this could lead to increased bad debt and losses, despite a seemingly strong customer base.


**CHART 17- CORRELATION HEAT MAP OF KEY FINANCIAL VARIABLES(Multivariate Analysis)**

In [None]:
# Select key financial variables
cols = [
    "Annual_Income", "Outstanding_Debt", "Credit_Utilization_Ratio",
    "Num_of_Loan", "Monthly_Balance", "Total_EMI_per_month"
]

plt.figure(figsize=(10, 6))
sns.heatmap(
    new_df[cols].corr(),
    annot=True,
    cmap="coolwarm",
    fmt=".2f",
    linewidths=0.5
)
plt.title("Correlation Heatmap of Key Financial Variables", fontsize=14, color="darkblue")
plt.show()


**1. Why did you pick the specific chart?**
I used a heatmap because it is the most effective way to visualize correlations between multiple numeric variables at once. Instead of reading a large correlation table full of numbers, the heatmap shows the strength of relationships through color intensity, which makes patterns stand out instantly.

**2. What is/are the insight(s) found from the chart?**

Key Correlations in this Heatmap

Annual Income ↔ Monthly Balance (0.63, strong positive)
Higher income individuals maintain a healthier monthly balance.
👉 Business: Safer borrowers, lower default risk.

Num_of_Loan ↔ Outstanding Debt (0.64, strong positive)
More loans = higher outstanding debt.
👉 Business: Multiple loans increase credit risk.

Annual Income ↔ Total EMI per month (0.44, moderate positive)
Higher income borrowers generally have higher EMIs (they can afford bigger loans).
👉 Business: Normal, but needs balance — EMI load shouldn’t cross repayment capacity.

Num_of_Loan ↔ Monthly Balance (-0.43, moderate negative)
More loans = lower monthly balance left after expenses.
👉 Business: This is a warning sign — borrowers juggling multiple loans may face liquidity issues.

Outstanding Debt ↔ Monthly Balance (-0.32, moderate negative)
Higher debt reduces monthly balance.
👉 Business: Indicates repayment stress for highly indebted customers.

Credit Utilization Ratio ↔ Other Variables (very weak correlations)
Credit utilization ratio doesn’t strongly correlate with debt, income, or loans here.
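👉 Business: CUR behaves largely independently of these variables, so it adds information they do not capture; its predictive value for credit score still has to be checked directly (Charts 10 & 11 suggest it discriminates weakly in this dataset).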
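
Rather than reading the strongest pairs off the heatmap by eye, they can be ranked programmatically. This is a minimal sketch on synthetic data, where debt is constructed to rise with loan count and CUR is independent noise; on the real data `new_df[cols].corr()` would take the place of `toy.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the real dataset).
rng = np.random.default_rng(42)
n = 500
loans = rng.integers(0, 9, n)
toy = pd.DataFrame({
    "Num_of_Loan": loans,
    # Debt is built to rise with loan count; CUR is independent noise.
    "Outstanding_Debt": loans * 400 + rng.normal(0, 800, n),
    "Credit_Utilization_Ratio": rng.normal(32, 5, n),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute strength, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

The result is a Series indexed by variable pair, which makes it easy to quote the top correlations in the write-up or to filter by a threshold such as |r| > 0.3.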

**3.Will the gained insights help creating a positive business impact?**
Yes, the insights will drive positive business impact in multiple ways:

Better Risk Segmentation
By knowing that borrowers with multiple loans and high debt usually have low monthly balances, the business can tighten lending policies for such profiles.
This reduces chances of default.

Possible Negative Impact if Ignored:
If the company lends freely to multi-loan borrowers with low balances, defaults will rise.
Misinterpreting correlations (e.g., assuming high EMI always means risk, when it might just mean high income) could limit business unnecessarily.

**CHART 18- DISTRIBUTION OF ANNUAL INCOME ACROSS CREDIT SCORES-VIOLIN CHART(Bivariate Analysis)**

In [None]:
import plotly.express as px

# Violin plot to show income distribution by credit score
fig = px.violin(
    new_df,
    x="Credit_Score",
    y="Annual_Income",
    color="Credit_Score",
    box=True,  # add mini boxplot inside
    points="all",  # show individual points
    title="Distribution of Annual Income Across Credit Score Categories"
)

fig.update_layout(
    height=650, width=950,
    title_font=dict(size=20, color="darkblue"),
    font=dict(size=14, color="darkblue"),
    xaxis_title="Credit Score Category",
    yaxis_title="Annual Income"
)

fig.show()


**1. Why did you pick the specific chart?**

I chose a violin plot because, unlike a bar or box plot, its shape shows where most borrowers are concentrated. It shows distribution and density together.


**2. What is/are the insight(s) found from the chart?**

Poor Credit Score Group:
Most borrowers fall in the lower income range, with very few high-income individuals. This confirms that low income is strongly associated with poor credit behavior.

Standard Credit Score Group:
Distribution is wider. Borrowers are spread across low to middle incomes, suggesting that this group is financially diverse and not as predictable purely by income.

Good Credit Score Group:
Majority of borrowers fall in middle-to-high income ranges with higher concentration around stable income bands. This indicates that steady or higher income supports good credit discipline.

**3.Will the gained insights help creating a positive business impact?**

Yes, definitely.
Better Credit Segmentation:
Since income distribution varies strongly by credit score, the business can set clear income thresholds for lending (e.g., avoid risky low-income segments for unsecured loans).
Negative Risk if Misused:
If the company assumes that all high-income borrowers are safe, it may miss cases of high-income but poor scorers (outliers seen in the violin plot). These must be carefully screened.


In [None]:
print(df.columns.tolist())


**CHART 19- ANNUAL INCOME Vs OUTSTANDING DEBT(BUBBLE SIZE- NO OF LOANS)COLOUR- DELAYED PAYMENTS- SCATTER CHART(Multivariate Analysis)**

In [None]:
# Select required columns and drop rows with missing values
df_plot = df[['Annual_Income', 'Outstanding_Debt', 'Num_of_Loan', 'Num_of_Delayed_Payment']].dropna()

# Create scatter plot
plt.figure(figsize=(15,8))
scatter = plt.scatter(
    df_plot['Annual_Income'],
    df_plot['Outstanding_Debt'],
    s=df_plot['Num_of_Loan']*20,   # Bubble size = number of loans
    c=df_plot['Num_of_Delayed_Payment'],   # Color = delayed payments
    cmap='viridis',
    alpha=0.7,
    edgecolor='k'
)

# Add colorbar
plt.colorbar(scatter, label="Number of Delayed Payments")

# Labels and title
plt.xlabel("Annual Income")
plt.ylabel("Outstanding Debt")
plt.title("Annual Income vs Outstanding Debt \n(Bubble size = Number of Loans, Color = Delayed Payments)")

plt.show()


**1. Why did you pick the specific chart?**

I chose this chart because it shows 4 dimensions at once: position, bubble size, and color together encode income, debt, loan count, and delayed payments in a single visual.

**2. What is/are the insight(s) found from the chart?**

Loan count strongly drives debt.
Num_of_Loan and Outstanding_Debt correlation ≈ +0.64. Larger bubbles tend to appear higher on the Y-axis → more loans → more outstanding debt.
Higher loan counts concentrate in lower-income bands.
Debt burden is skewed toward lower incomes.
Delayed payments (color) identify behavioral risk on top of exposure.

**3.Will the gained insights help creating a positive business impact?**

Yes. We can flag risky borrower profiles based on the credit patterns visible in the chart, for example:
More than 4-5 loans is associated with higher outstanding debt.
Customers with annual income below 50k and more than 4 loans should be red-flagged and reviewed if they request further credit.
These flags can also feed monitoring and early-warning processes.
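
The red-flag rule described above can be sketched as a simple boolean filter. The thresholds come from the text; the column name `Red_Flag` and the sample rows are hypothetical, and the cut-offs should be treated as starting points to tune rather than fixed policy.

```python
import pandas as pd

# Thresholds taken from the text above; treat them as starting points to tune.
INCOME_THRESHOLD = 50_000
LOAN_THRESHOLD = 4

# Made-up applicants for illustration only.
toy = pd.DataFrame({
    "Annual_Income": [30_000, 45_000, 80_000, 48_000],
    "Num_of_Loan": [6, 2, 7, 5],
})

# Flag only when BOTH conditions hold: low income and many loans.
toy["Red_Flag"] = (
    (toy["Annual_Income"] < INCOME_THRESHOLD)
    & (toy["Num_of_Loan"] > LOAN_THRESHOLD)
)
print(toy)
```

The third applicant has seven loans but is not flagged because income is above the threshold, which shows why the rule combines both conditions instead of using loan count alone.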



**CHART 20- RELATIONSHIP ACROSS VARIOUS FINANCIAL VARIABLES-PAIRPLOT CHART(Multivariate Analysis)**

In [None]:
sns.pairplot(new_df[['Age','Annual_Income','Monthly_Balance','Outstanding_Debt',
                     'Credit_Utilization_Ratio','Num_Credit_Inquiries',]],
             diag_kind="kde", plot_kws={"alpha":0.6})
plt.suptitle("Pair Plot of Selected Financial Variables", y=1.02)
plt.show()

**1. Why did you pick the specific chart?**

The pair plot helps to visualize patterns and outliers in the data and to conduct multivariate analysis. Hence, a pair plot was chosen to explore relationships between key variables such as Age, Income, Balance, Debt, Utilization, and Inquiries.

**2. What is/are the insight(s) found from the chart?**

Age shows no visible relationship with credit utilization ratio or annual income, as the points are dispersed across the graph.

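Credit Utilization Ratio and Outstanding Debt appear positively related in the pair plot, which would suggest that higher outstanding debt goes with higher credit utilization. However, the correlation heatmap in Chart 17 shows only a very weak linear correlation between them, so this pattern should be treated with caution.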

The relation between Age and Number of Credit Inquiries shows that younger individuals tend to have more credit inquiries, which may affect their credit scores negatively.

## **5. Solution to Business Objective**


The business objective was to analyze customer credit-related data and build a reliable approach to classify credit scores into categories (Good, Standard, Poor).

1- Through EDA (Exploratory Data Analysis), we identified the most influential factors impacting credit scores — such as number of delayed payments, outstanding debt, annual income, number of credit inquiries, and credit utilization ratio.

2- Using statistical analysis and multivariate visualizations, we observed strong behavioral patterns:

Higher delayed payments and more credit inquiries strongly correlate with Poor credit scores.

Lower outstanding debt and controlled credit utilization are characteristics of Good scorers.

Majority of customers fall in the Standard category, showing potential for targeted improvement.

3- By applying these insights, we can develop predictive models that classify individuals' credit scores accurately. This helps Paisabazaar minimize default risks, improve lending precision, and personalize product offerings.

Thus, the solution to the business objective is the creation of a data-driven credit score classification system that enhances creditworthiness assessment.

**What do you suggest to the client to achieve Business Objective ?**


**To fully leverage these insights, Paisabazaar should:**

* Adopt Predictive Modeling in Credit Assessment
    * Deploy ML models trained on key features (income, delayed payments, inquiries, utilization ratio).
    * Continuously retrain with new customer data to improve accuracy.

* Risk Segmentation & Lending Policies
    * Flag high-risk applicants (frequent delayed payments + high inquiries).
    * Offer low-risk clients better interest rates and personalized products.
    * Create custom loan terms for “Standard” customers to move them towards “Good.”

* Customer Education & Engagement
    * Provide financial literacy tools showing how behaviors (timely payments, reduced inquiries) can improve their credit scores.
    * Use personalized dashboards for customers to monitor their own creditworthiness.

* Business Growth Alignment
    * Lower defaults = better lender trust = higher business growth.
    * Smarter segmentation = better targeting of financial products.
    * Improved customer satisfaction = long-term client retention.

**In brief:**
Paisabazaar can achieve the business objective by integrating predictive analytics with customer engagement. This ensures smarter credit risk decisions, reduced defaults, and stronger financial product recommendations — creating a positive business impact for both the company and its customers.


# **Conclusion**

This project successfully addressed Paisabazaar's key challenge of improving credit score classification by leveraging customer financial data.

Through detailed exploratory and multivariate analysis, we identified critical factors influencing creditworthiness, listed below:
* income
* delayed payments
* outstanding debt
* credit utilization

The insights highlight that while the majority of individuals fall into the “Standard” credit score range, a significant portion still lies in the “Poor” category, signaling both risks and opportunities. By implementing predictive models and adopting data-driven lending strategies, Paisabazaar can:
* minimize loan defaults,
* optimize lending decisions,
* design tailored financial products,
* target the right customers through eligibility criteria.

Furthermore, customer-focused education on repayment behaviors can help improve creditworthiness over time. Ultimately, these strategies empower Paisabazaar to reduce financial risks, enhance customer trust, and drive sustainable business growth.