## **Dataset Description & Understanding**

This dataset contains financial and demographic information for 20,000 individuals.
It records income, living conditions, and detailed monthly spending categories such as rent, groceries, transport, and entertainment.
It also includes each person's desired savings and per-category potential savings indicators.

## **Objective**

The goal of this analysis is to understand whether financial success depends more on income level or spending behaviour.
The analysis will identify overspending patterns, savings feasibility, and lifestyle factors affecting financial health.



#### **Personal Details**

Income — Monthly income of the individual  
Age — Age of the person  
Dependents — Number of people financially supported by the individual  
Occupation — Employment type or job category  
City_Tier — Living area classification (Tier-1 high cost → Tier-3 low cost)  

#### **Monthly Expenses**
Rent, Loan_Repayment, Insurance, Groceries, Transport, Eating_Out, Entertainment, Utilities, Healthcare, Education, and Miscellaneous each record the monthly amount spent in that category.

#### **Financial Goals & Savings**
Desired_Savings_Percentage — Target saving ratio of income  
Desired_Savings — Planned amount to save monthly  
Disposable_Income — Remaining money after expenses

#### **Potential Savings Indicators**

Estimated reducible spending amounts:  
Potential_Savings_Groceries  
Potential_Savings_Transport  
Potential_Savings_Eating_Out  
Potential_Savings_Entertainment  
Potential_Savings_Utilities  
Potential_Savings_Healthcare  
Potential_Savings_Education  
Potential_Savings_Miscellaneous  
These represent areas where spending can be optimized.
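
To make the indicators concrete, the sketch below adds up the `Potential_Savings_*` columns and compares the total with `Desired_Savings`. It runs on made-up toy rows rather than the real file, and `Total_Potential_Savings` and `Goal_Reachable` are hypothetical helper columns, not part of the dataset. It also assumes the indicators can simply be summed, which the dataset description does not confirm.

```python
import pandas as pd

# Toy rows (made-up values) standing in for the real dataset.
toy = pd.DataFrame({
    "Desired_Savings": [500.0, 300.0],
    "Potential_Savings_Groceries": [200.0, 50.0],
    "Potential_Savings_Transport": [100.0, 25.0],
    "Potential_Savings_Eating_Out": [250.0, 75.0],
})

# Select every column that follows the Potential_Savings_* naming pattern.
potential_cols = [c for c in toy.columns if c.startswith("Potential_Savings_")]

# Hypothetical helper columns: total reducible spending, and whether
# trimming it alone would cover the desired savings amount.
toy["Total_Potential_Savings"] = toy[potential_cols].sum(axis=1)
toy["Goal_Reachable"] = toy["Total_Potential_Savings"] >= toy["Desired_Savings"]

print(toy[["Desired_Savings", "Total_Potential_Savings", "Goal_Reachable"]])
```

On the full dataset the same pattern would apply to `df` with all eight `Potential_Savings_*` columns.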

**Load Data**



In [2]:
import pandas as pd
df = pd.read_csv(r"C:\Users\risla\OneDrive\Desktop\EDA raw\data.csv")
df.head()

| | Income | Age | Dependents | Occupation | City_Tier | Rent | Loan_Repayment | Insurance | Groceries | Transport | ... | Desired_Savings | Disposable_Income | Potential_Savings_Groceries | Potential_Savings_Transport | Potential_Savings_Eating_Out | Potential_Savings_Entertainment | Potential_Savings_Utilities | Potential_Savings_Healthcare | Potential_Savings_Education | Potential_Savings_Miscellaneous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44637.249636 | 49 | 0 | Self_Employed | Tier_1 | 13391.174891 | 0.0 | 2206.490129 | 6658.768341 | 2636.970696 | ... | 6200.537192 | 11265.627707 | 1685.696222 | 328.895281 | 465.769172 | 195.15132 | 678.292859 | 67.682471 | 0.0 | 85.735517 |
| 1 | 26858.596592 | 34 | 2 | Retired | Tier_2 | 5371.719318 | 0.0 | 869.522617 | 2818.44446 | 1543.018778 | ... | 1923.176434 | 9676.818733 | 540.306561 | 119.347139 | 141.866089 | 234.131168 | 286.668408 | 6.603212 | 56.306874 | 97.388606 |
| 2 | 50367.605084 | 35 | 1 | Student | Tier_3 | 7555.140763 | 4612.103386 | 2201.80005 | 6313.222081 | 3221.396403 | ... | 7050.360422 | 13891.450624 | 1466.073984 | 473.549752 | 410.857129 | 459.965256 | 488.383423 | 7.290892 | 106.653597 | 138.542422 |
| 3 | 101455.600247 | 21 | 0 | Self_Employed | Tier_3 | 15218.340037 | 6809.441427 | 4889.418087 | 14690.149363 | 7106.130005 | ... | 16694.965136 | 31617.953615 | 1875.93277 | 762.020789 | 1241.017448 | 320.190594 | 1389.815033 | 193.502754 | 0.0 | 296.041183 |
| 4 | 24875.283548 | 52 | 4 | Professional | Tier_2 | 4975.05671 | 3112.609398 | 635.90717 | 3034.329665 | 1276.155163 | ... | 1874.099434 | 6265.700532 | 788.953124 | 68.160766 | 61.712505 | 187.17375 | 194.11713 | 47.294591 | 67.38812 | 96.557076 |



**Initial Inspection**


In [3]:
df.shape



(20000, 27)

In [4]:
df.columns

Index(['Income', 'Age', 'Dependents', 'Occupation', 'City_Tier', 'Rent',
       'Loan_Repayment', 'Insurance', 'Groceries', 'Transport', 'Eating_Out',
       'Entertainment', 'Utilities', 'Healthcare', 'Education',
       'Miscellaneous', 'Desired_Savings_Percentage', 'Desired_Savings',
       'Disposable_Income', 'Potential_Savings_Groceries',
       'Potential_Savings_Transport', 'Potential_Savings_Eating_Out',
       'Potential_Savings_Entertainment', 'Potential_Savings_Utilities',
       'Potential_Savings_Healthcare', 'Potential_Savings_Education',
       'Potential_Savings_Miscellaneous'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 27 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Income                           20000 non-null  float64
 1   Age                              20000 non-null  int64  
 2   Dependents                       20000 non-null  int64  
 3   Occupation                       20000 non-null  object 
 4   City_Tier                        20000 non-null  object 
 5   Rent                             20000 non-null  float64
 6   Loan_Repayment                   20000 non-null  float64
 7   Insurance                        20000 non-null  float64
 8   Groceries                        20000 non-null  float64
 9   Transport                        20000 non-null  float64
 10  Eating_Out                       20000 non-null  float64
 11  Entertainment                    20000 non-null  float64
 12  Utilities         

In [6]:
df.nunique()

Income                             20000
Age                                   47
Dependents                             5
Occupation                             4
City_Tier                              3
Rent                               20000
Loan_Repayment                      7970
Insurance                          20000
Groceries                          20000
Transport                          20000
Eating_Out                         20000
Entertainment                      20000
Utilities                          20000
Healthcare                         20000
Education                          15940
Miscellaneous                      20000
Desired_Savings_Percentage         20000
Desired_Savings                    19889
Disposable_Income                  20000
Potential_Savings_Groceries        20000
Potential_Savings_Transport        20000
Potential_Savings_Eating_Out       20000
Potential_Savings_Entertainment    20000
Potential_Savings_Utilities        20000
Potential_Saving

**Statistical Summary**

In [34]:
df.describe()

| | Income | Age | Dependents | Rent | Loan_Repayment | Insurance | Groceries | Transport | Eating_Out | Entertainment | ... | Desired_Savings | Disposable_Income | Potential_Savings_Groceries | Potential_Savings_Transport | Potential_Savings_Eating_Out | Potential_Savings_Entertainment | Potential_Savings_Utilities | Potential_Savings_Healthcare | Potential_Savings_Education | Potential_Savings_Miscellaneous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | ... | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 |
| mean | 41585.5 | 41.03145 | 1.99595 | 9115.494629 | 2049.800292 | 1455.028761 | 5205.667493 | 2704.466685 | 1461.856982 | 1448.853658 | ... | 4982.878416 | 10647.367257 | 912.197183 | 473.04265 | 254.96328 | 254.031058 | 436.332808 | 41.524964 | 62.417083 | 144.904987 |
| std | 40014.54 | 13.578725 | 1.417616 | 9254.228188 | 4281.789941 | 1492.938435 | 5035.953689 | 2666.345648 | 1481.660811 | 1489.01927 | ... | 7733.468188 | 11740.637289 | 1038.884968 | 537.222853 | 296.047943 | 299.97359 | 503.200658 | 53.152458 | 98.842656 | 169.160951 |
| min | 1301.187 | 18.0 | 0.0 | 235.365692 | 0.0 | 30.002012 | 154.07824 | 81.228584 | 39.437523 | 45.421469 | ... | 0.0 | -5400.788673 | 16.575501 | 8.268076 | 3.797926 | 3.12161 | 6.200297 | 0.001238 | 0.0 | 2.091973 |
| 25% | 17604.88 | 29.0 | 1.0 | 3649.422246 | 0.0 | 580.204749 | 2165.426419 | 1124.578012 | 581.011801 | 581.632906 | ... | 1224.932636 | 3774.894323 | 317.811 | 161.913751 | 84.50687 | 84.56209 | 148.013618 | 11.037421 | 4.92621 | 47.637307 |
| 50% | 30185.38 | 41.0 | 2.0 | 6402.751824 | 0.0 | 1017.124681 | 3741.091535 | 1933.845509 | 1029.109726 | 1020.198376 | ... | 2155.356763 | 7224.890977 | 607.038735 | 307.045856 | 164.92766 | 164.740232 | 285.739582 | 25.202124 | 33.127987 | 93.090257 |
| 75% | 51765.45 | 53.0 | 3.0 | 11263.940492 | 2627.14232 | 1787.160895 | 6470.892718 | 3360.597508 | 1807.075251 | 1790.104082 | ... | 6216.309609 | 13331.950716 | 1128.681837 | 588.419602 | 313.39824 | 310.927935 | 538.983703 | 52.353736 | 80.946145 | 178.257981 |
| max | 1079728.0 | 64.0 | 4.0 | 215945.674703 | 123080.682009 | 38734.932935 | 119816.898124 | 81861.503457 | 34406.100166 | 38667.368308 | ... | 245504.485208 | 377060.218482 | 34894.644404 | 12273.258242 | 5573.036433 | 6222.200913 | 8081.799518 | 1394.531049 | 3647.244243 | 4637.951137 |
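
In the summary above, mean Income (41,585.5) sits well above the median (30,185.38) and the maximum is 1,079,728, which hints at a right-skewed distribution. The sketch below shows one way that could be quantified, using a mean-to-median ratio and pandas' sample skewness. The income values are made up for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Toy incomes (made-up values): four modest earners and one very high earner.
income = pd.Series([10_000, 20_000, 30_000, 40_000, 400_000], name="Income")

# A ratio above 1 means the mean is pulled up by a long right tail.
mean_to_median = income.mean() / income.median()

# Series.skew() returns the sample skewness; positive values indicate right skew.
skewness = income.skew()

print(f"mean/median = {mean_to_median:.2f}, skew = {skewness:.2f}")
```

On the real data, the same two lines could be applied to `df['Income']` or to any of the expense columns.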


### **Column Type Identification**


In [40]:
numerical_columns = df.select_dtypes(include='number').columns
numerical_columns

Index(['Income', 'Age', 'Dependents', 'Rent', 'Loan_Repayment', 'Insurance',
       'Groceries', 'Transport', 'Eating_Out', 'Entertainment', 'Utilities',
       'Healthcare', 'Education', 'Miscellaneous',
       'Desired_Savings_Percentage', 'Desired_Savings', 'Disposable_Income',
       'Potential_Savings_Groceries', 'Potential_Savings_Transport',
       'Potential_Savings_Eating_Out', 'Potential_Savings_Entertainment',
       'Potential_Savings_Utilities', 'Potential_Savings_Healthcare',
       'Potential_Savings_Education', 'Potential_Savings_Miscellaneous'],
      dtype='object')

In [37]:
categorical_columns = df.select_dtypes(include=['object']).columns
categorical_columns

Index(['Occupation', 'City_Tier'], dtype='object')
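
With only two categorical columns, a natural next step is to see how the rows are distributed across their levels. The following is a minimal sketch on made-up toy rows. It selects non-numeric columns with `exclude='number'` (a swapped-in alternative to `include=['object']` that does not depend on how strings are stored), and the `shares` dictionary is a hypothetical helper.

```python
import pandas as pd

# Toy rows (made-up values) using category labels that appear in the dataset.
toy = pd.DataFrame({
    "Income": [1000.0, 2000.0, 1500.0, 3000.0],
    "Occupation": ["Student", "Retired", "Student", "Professional"],
    "City_Tier": ["Tier_1", "Tier_2", "Tier_2", "Tier_2"],
})

# Everything that is not numeric is treated as categorical here.
categorical_columns = toy.select_dtypes(exclude="number").columns

# Relative frequency of each level, per categorical column.
shares = {col: toy[col].value_counts(normalize=True) for col in categorical_columns}

for col, share in shares.items():
    print(f"\n{col}")
    print(share)
```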

**Missing Value Analysis**

In [38]:
df.isnull().sum()
# no missing values found

Income                             0
Age                                0
Dependents                         0
Occupation                         0
City_Tier                          0
Rent                               0
Loan_Repayment                     0
Insurance                          0
Groceries                          0
Transport                          0
Eating_Out                         0
Entertainment                      0
Utilities                          0
Healthcare                         0
Education                          0
Miscellaneous                      0
Desired_Savings_Percentage         0
Desired_Savings                    0
Disposable_Income                  0
Potential_Savings_Groceries        0
Potential_Savings_Transport        0
Potential_Savings_Eating_Out       0
Potential_Savings_Entertainment    0
Potential_Savings_Utilities        0
Potential_Savings_Healthcare       0
Potential_Savings_Education        0
Potential_Savings_Miscellaneous    0
dtype: int64

**Data Quality Checks**

In [47]:
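# Count negative values in each numeric column.
# Total_Expense is listed in the output, which suggests this cell was run after that column was created.
negative_counts = (df.select_dtypes(include='number') < 0).sum()
negative_counts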

Income                               0
Age                                  0
Dependents                           0
Rent                                 0
Loan_Repayment                       0
Insurance                            0
Groceries                            0
Transport                            0
Eating_Out                           0
Entertainment                        0
Utilities                            0
Healthcare                           0
Education                            0
Miscellaneous                        0
Desired_Savings_Percentage           0
Desired_Savings                      0
Disposable_Income                  112
Potential_Savings_Groceries          0
Potential_Savings_Transport          0
Potential_Savings_Eating_Out         0
Potential_Savings_Entertainment      0
Potential_Savings_Utilities          0
Potential_Savings_Healthcare         0
Potential_Savings_Education          0
Potential_Savings_Miscellaneous      0
Total_Expense            

In [46]:
# Expense vs Income Logic
expense_cols = [
    'Rent','Loan_Repayment','Insurance','Groceries','Transport',
    'Eating_Out','Entertainment','Utilities','Healthcare','Education','Miscellaneous'
]

df['Total_Expense'] = df[expense_cols].sum(axis=1)
(df['Total_Expense'] > df['Income']).sum()



np.int64(112)
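
The count above says how many people overspend, but not by how much. One option is an expense-to-income ratio, sketched below on made-up toy rows with a shortened expense list. `Expense_Ratio` is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Toy rows (made-up values); the real analysis uses all eleven expense columns.
toy = pd.DataFrame({
    "Income": [1000.0, 2000.0],
    "Rent": [400.0, 1500.0],
    "Groceries": [300.0, 700.0],
})
expense_cols = ["Rent", "Groceries"]  # shortened list for the toy example

toy["Total_Expense"] = toy[expense_cols].sum(axis=1)

# Hypothetical helper column: share of income that is spent each month.
toy["Expense_Ratio"] = toy["Total_Expense"] / toy["Income"]

# A ratio above 1 means expenses exceed income.
overspenders = toy[toy["Expense_Ratio"] > 1]
print(overspenders[["Income", "Total_Expense", "Expense_Ratio"]])
```

Because the ratio is scale-free, it also allows low-income and high-income individuals to be compared directly, which fits the stated objective of separating income level from spending behaviour.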

In [54]:
# Disposable Income Consistency
(df['Income'] - df['Total_Expense'] - df['Disposable_Income']).abs().describe()


count    2.000000e+04
mean     1.717402e-12
std      4.044296e-12
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      1.818989e-12
max      8.731149e-11
dtype: float64
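
The largest difference above is about 8.7e-11, which is floating-point rounding noise rather than a real discrepancy. A tolerance-based comparison turns this into an explicit pass/fail flag per row. The sketch below uses made-up toy rows, and the `1e-6` tolerance is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); the third row is deliberately inconsistent.
toy = pd.DataFrame({
    "Income": [1000.0, 2000.0, 1500.0],
    "Total_Expense": [700.1, 2200.2, 900.3],
    "Disposable_Income": [299.9, -200.2, 650.0],
})

expected = toy["Income"] - toy["Total_Expense"]

# Compare with an absolute tolerance so rounding noise does not count as a mismatch.
consistent = pd.Series(
    np.isclose(
        expected.to_numpy(),
        toy["Disposable_Income"].to_numpy(),
        rtol=0,
        atol=1e-6,  # assumed tolerance
    ),
    index=toy.index,
)

print("Inconsistent rows:", int((~consistent).sum()))
print(toy[~consistent])
```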

In [55]:
# Savings Feasibility
(df['Desired_Savings'] > df['Income']).sum()


np.int64(0)
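
Comparing `Desired_Savings` with `Income` is a loose test, since expenses have to be paid first. A stricter variant compares the target with `Disposable_Income`, which the dataset describes as the money remaining after expenses. This is a sketch under that reading, on made-up toy rows. `Savings_Feasible` and `Savings_Shortfall` are hypothetical helper columns.

```python
import pandas as pd

# Toy rows (made-up values).
toy = pd.DataFrame({
    "Desired_Savings": [200.0, 300.0, 100.0],
    "Disposable_Income": [500.0, 250.0, -50.0],
})

# Feasible if the money left after expenses covers the savings target.
toy["Savings_Feasible"] = toy["Desired_Savings"] <= toy["Disposable_Income"]

# Amount still missing to reach the target (0 when the target is already covered).
toy["Savings_Shortfall"] = (
    toy["Desired_Savings"] - toy["Disposable_Income"]
).clip(lower=0)

print(toy)
```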

### **Data Quality Issue Log**

No missing values were found in any column.

Data types are appropriate: financial values are numeric, and Occupation and City_Tier are stored as object (categorical).

Negative values appear only in Disposable_Income (112 rows), which indicates overspending behaviour rather than a data error.

Total expenses exceed income in the same number of rows (112), and Income minus Total_Expense reproduces Disposable_Income to within floating-point error (maximum difference about 8.7e-11), confirming logical consistency.

Desired_Savings never exceeds Income, so no savings target is impossible on its face.
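
The individual checks can also be gathered into one reusable function, so the log can be regenerated whenever the data changes. The sketch below is one possible shape for that. `quality_log` and its dictionary keys are hypothetical names, and the toy rows are made up.

```python
import pandas as pd

def quality_log(df, expense_cols):
    """Return simple data-quality counts for a budget DataFrame (hypothetical helper)."""
    total_expense = df[expense_cols].sum(axis=1)
    negatives = (df.select_dtypes(include="number") < 0).sum()
    return {
        "missing_values": int(df.isnull().sum().sum()),
        # Keep only the columns that actually contain negative values.
        "negative_values": {col: int(n) for col, n in negatives.items() if n > 0},
        "expense_exceeds_income": int((total_expense > df["Income"]).sum()),
        "savings_exceed_income": int((df["Desired_Savings"] > df["Income"]).sum()),
    }

# Toy rows (made-up values) with a shortened expense list.
toy = pd.DataFrame({
    "Income": [1000.0, 2000.0],
    "Rent": [400.0, 1500.0],
    "Groceries": [300.0, 700.0],
    "Desired_Savings": [100.0, 200.0],
    "Disposable_Income": [300.0, -200.0],
})

log = quality_log(toy, expense_cols=["Rent", "Groceries"])
print(log)
```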