# Analyzing Credit Score

This project investigates how a customer's marital status and number of children influence the probability of defaulting on a loan. The bank already has some data on the creditworthiness of its customers.

This report will be taken into account when making credit assessments for prospective customers. Credit assessments are used to evaluate the ability of potential borrowers to repay their loans.

<b>Hypotheses:</b>
    
   1. There is a correlation between having children and making timely payments of debts
   2. There is a correlation between family status and making timely payments of debts
   3. There is a correlation between income level and making timely payments of debts
   4. The purpose of a loan affects the default rate

<b>Stages:</b>

1. [Data Overview](#start)
    - [Possible causes of missing values in the data](#cause)
2. [Data Pre-processing](#pre-process)
    - [Check each column](#column)
    - [Duplicate values](#duplicate)
    - [Missing values](#missing)
3. [Data Categorization](#category)
    - [Categorical column type](#categorical)
    - [Numeric column type](#numerical)
4. [Hypothesis Test](#test)
    - [Hypothesis 1: correlation between having children and making timely payments of debts](#1)
    - [Hypothesis 2: correlation between family status and making timely payments of debts](#2)
    - [Hypothesis 3: correlation between income level and making timely payments of debts](#3)
    - [Hypothesis 4: how the purpose of a credit affects the level of failure](#4)
5. [General Conclusion](#conclusion)

## Data Overview <a id="start"></a>

In [3]:
# importing libraries
import pandas as pd

In [4]:
# Dataset
df = pd.read_csv('Y:\\Online Course\\Practicum\\Jupyter Notebook\\2 Project\\credit_scoring_eng.csv')

In [5]:
# General information of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [6]:
# Dataset size
df.shape

(21525, 12)

In [7]:
# Sample data
df.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


Documentation:
- `children` - number of children in the family
- `days_employed` - the number of days of customer's work experience
- `dob_years` - the customer's age in years
- `education` - customer's education level
- `education_id` - identifier for customer's education level
- `family_status` - customer's marital status
- `family_status_id` - identifier for customer's marital status
- `gender` - customer's gender
- `income_type` - customer's type of employment
- `debt` - whether the customer has a loan payment debt
- `total_income` - monthly income
- `purpose` - the purpose of getting a loan
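- `age_group` - age category (derived column, not in the raw file)
- `purpose_group` - loan purpose category (derived column, not in the raw file)
- `total_income_group` - income category (derived column, not in the raw file)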

In [8]:
# Filter for missing values in the `days_employed` column
df.loc[df['days_employed'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [9]:
# Filter for missing values in the `days_employed` and `total_income` columns
df_null = df.loc[(df['days_employed'].isna()) & (df['total_income'].isna())]

In [10]:
# Count the missing values per column in the filtered table
df_null.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [11]:
# Count the missing values per column in the full dataset
df.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

As you can see, both columns have 2,174 missing values in the full dataset, and the filtered table (rows where both values are missing) also has 2,174 rows. The missing values therefore occur in the same rows: whenever `days_employed` is missing, `total_income` is missing too.
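
The same conclusion can be checked directly by comparing the missing-value masks of the two columns. The snippet below is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in the same rows of both columns
toy = pd.DataFrame({
    'days_employed': [-8437.7, np.nan, -4024.8, np.nan],
    'total_income': [40620.1, np.nan, 17932.8, np.nan],
})

# Boolean masks marking the missing entries of each column
missing_days = toy['days_employed'].isna()
missing_income = toy['total_income'].isna()

# True only if every row is missing in both columns or in neither
same_rows = missing_days.equals(missing_income)
print(same_rows)
```

On the real data the same comparison would use `df` in place of `toy`.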
    
A further check looks at whether the data distribution is symmetrical, by comparing the *mean, median, and mode* values as follows:

In [12]:
# Look at the description of the dataset to find out whether the dataset has a symmetrical distribution or not
df.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


In [13]:
# View the mean, median, and mode of the `children` column
# to judge whether its distribution is symmetrical
central_tendency = {}
central_tendency['Mean'] = df['children'].mean()
central_tendency['Median'] = df['children'].median()
central_tendency['Mode'] = df['children'].mode()[0]

central_tendency

{'Mean': 0.5389082462253194, 'Median': 0.0, 'Mode': 0}

A data distribution is considered symmetric if its *mean, median*, and *mode* are almost the same.

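In the `children` column the *mean* (0.54), *median* (0), and *mode* (0) are not too far apart, so the distribution is treated here as roughly symmetrical. This is only a rough check: closeness of these values does not by itself rule out *outliers*, and the `describe()` output above shows suspicious extremes (for example, `children` ranges from -1 to 20) that are examined during pre-processing.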
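
Comparing the three measures by eye can be complemented with the sample skewness, which is close to 0 for a symmetric distribution and positive when there is a long right tail. A minimal sketch on made-up values (not the dataset itself):

```python
import pandas as pd

# Toy sample (hypothetical values): most customers have no children
children = pd.Series([0, 0, 0, 0, 0, 1, 1, 2, 3, 5])

summary = {
    'mean': children.mean(),
    'median': children.median(),
    'mode': children.mode()[0],
    # Sample skewness: about 0 if symmetric, > 0 for a right tail
    'skew': children.skew(),
}
print(summary)
```

Here the mean sits above the median and the skewness is positive, which is the pattern to look for before treating a column as symmetric.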

In [14]:
# Percentage of rows with missing values relative to the entire dataset
(len(df_null)/len(df)) * 100

10.099883855981417

About 10% of the rows have missing values, which is a sizeable share, so these values need to be handled by filling them in.
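
The share of missing values can also be computed for every column at once, since the mean of a boolean mask is the fraction of `True` entries. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one of five incomes is missing
toy = pd.DataFrame({
    'children': [0, 1, 2, 0, 1],
    'total_income': [100.0, np.nan, 150.0, 120.0, 90.0],
})

# isna() gives a boolean frame; its column means are the missing fractions
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```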

### Possible causes of missing values in the data <a id="cause"></a>

In [15]:
# Examine the rows with missing values by customer characteristic,
# starting with income type
df_null['income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64

In [16]:
# Compare to the dataset
df['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

As you can see, the filtered `income_type` column contains only 5 distinct values, compared with 8 in the full dataset.
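
Raw counts are hard to compare between tables of different sizes, so it can help to compare shares instead. The sketch below uses made-up data and hypothetical variable names to show one way of putting the two distributions side by side:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): income type and a partly missing income
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'business', 'retiree',
                    'employee', 'business', 'retiree', 'employee'],
    'total_income': [100.0, np.nan, 200.0, np.nan,
                     120.0, np.nan, 90.0, 110.0],
})

# Share of each income type overall and among the rows with missing income
share_all = toy['income_type'].value_counts(normalize=True)
share_missing = (toy.loc[toy['total_income'].isna(), 'income_type']
                 .value_counts(normalize=True))

# Align the two series on income type; absent categories get a share of 0
comparison = pd.DataFrame({'all': share_all, 'missing': share_missing}).fillna(0)
print(comparison)
```

In the counts shown above, employees make up about 52% of all rows (11119 of 21525) and about 51% of the rows with missing values (1105 of 2174), which is the kind of agreement this comparison is meant to reveal.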

In [17]:
# Other characteristics such as the level of education of the customer
df_null['education'].value_counts()

secondary education    1408
bachelor's degree       496
SECONDARY EDUCATION      67
Secondary Education      65
some college             55
Bachelor's Degree        25
BACHELOR'S DEGREE        23
primary education        19
Some College              7
SOME COLLEGE              7
Primary Education         1
PRIMARY EDUCATION         1
Name: education, dtype: int64

In [18]:
# Compare to the dataset
df['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

In [19]:
# Other characteristics such as the gender of the customer
df_null['gender'].value_counts()

F    1484
M     690
Name: gender, dtype: int64

In [20]:
# Compare to the dataset
df['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

**Tentative conclusions**

*Findings*:
1. The `days_employed` column is inconsistent: it contains both negative values and implausibly large positive values,
2. The `education` and `purpose` columns have values that have the same meaning but are written in different ways,
3. Missing values:
    - The columns that have missing values may not contain outliers, although this still needs to be confirmed,
    - The number of rows in the filtered table matches the number of missing values in each column, indicating that the missing values occur in the same rows,
    - The filtered `income_type` column has fewer unique values (5) than the original dataset (8). It is not yet clear whether this difference is random or follows a pattern,
    - The filtered `education` column also has fewer unique values (12) than the original dataset (15),
    - The filtered `gender` column also has fewer unique values (2) than the original dataset (3).

## Data Pre-processing <a id="pre-process"></a>

In [21]:
# Working variable (note: this is a reference to `df`, not a copy; use `df.copy()` for an independent checkpoint)
df_edit = df
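
Plain assignment binds a second name to the same DataFrame, so in-place edits made through `df_edit` are also visible through `df`. The sketch below, with made-up names and values, contrasts this with `DataFrame.copy()`:

```python
import pandas as pd

original = pd.DataFrame({'children': [1, 20, -1]})

alias = original            # same object: no data is copied
snapshot = original.copy()  # independent copy of the data

# Editing through the alias also changes `original`
alias.loc[alias['children'] == 20, 'children'] = 2

print(original['children'].tolist())  # reflects the edit
print(snapshot['children'].tolist())  # keeps the initial values
```

This matters wherever `len(df)` is used as the denominator after rows have been dropped from `df_edit`.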

### Check each column <a id="column"></a>

#### `'education'` column.

In [22]:
# Unique values from education column
df_edit['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [23]:
# Normalize the values to lowercase
df_edit['education'] = df_edit['education'].str.lower()

In [24]:
# Check the results
df_edit['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

The column contained the same categories written with different capitalization. Converting everything to lowercase leaves five distinct values, each with its own meaning.

#### `children` column.

In [25]:
# Distribution of values in the `children` column
df_edit['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

There are 47 rows in the `'children'` column with a value of -1 and 76 rows with a value of 20. A negative number of children is impossible, and 20 children is implausible.

There are two options: delete these rows or correct the values. The choice can be made after checking what share of the dataset they represent.

In [26]:
# The percentage of values compared to the entire dataset
print(len(df_edit.loc[df_edit['children']==20])/len(df)*100)
print(len(df_edit.loc[df_edit['children']==-1])/len(df)*100)

0.3530778164924506
0.2183507549361208


As you can see, the percentages are very small, so the rows could be either deleted or corrected. Assuming the values are data-entry errors that can be repaired (20 in place of 2, -1 in place of 1), correcting them is preferable to deleting them.

In [27]:
# Correct the values, assuming 20 is a mistyped 2 and -1 a mistyped 1
df_edit.loc[df_edit['children']==20, 'children'] = 2
df_edit.loc[df_edit['children']==-1, 'children'] = 1

In [28]:
# Check the results
df_edit['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

The `'children'` column has been fixed by changing the value 20 to 2 and -1 to 1.
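
The same correction can be written as a single mapping with `Series.replace`, which keeps the assumed substitutions in one place. A minimal sketch on a made-up column:

```python
import pandas as pd

# Toy column (hypothetical values) containing the two implausible codes
children = pd.Series([0, 1, 20, -1, 3])

# Assumption carried over from the text: 20 is a mistyped 2, -1 a mistyped 1
corrections = {20: 2, -1: 1}
children_fixed = children.replace(corrections)
print(children_fixed.tolist())
```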

#### `days_employed` column.

In [29]:
# Look for problematic data in `days_employed`
df_edit['days_employed'].value_counts(normalize=True)

-8437.673028      0.000052
-3507.818775      0.000052
 354500.415854    0.000052
-769.717438       0.000052
-3963.590317      0.000052
                    ...   
-1099.957609      0.000052
-209.984794       0.000052
 398099.392433    0.000052
-1271.038880      0.000052
-1984.507589      0.000052
Name: days_employed, Length: 19351, dtype: float64

The `'days_employed'` column contains negative values, whereas a person's length of service would normally be recorded as a positive number.

In [30]:
# Convert the values in the 'days_employed' column from days to years
df_edit['days_employed'] = df_edit['days_employed'] / 365
df_edit['days_employed'].describe()

count    19351.000000
mean       172.730131
std        385.828252
min        -50.380685
25%         -7.527188
50%         -3.296903
75%         -0.797523
max       1100.699727
Name: days_employed, dtype: float64

As can be seen, the maximum corresponds to a customer who has worked for more than 1,100 years, which is impossible. Assuming an average retirement age between 62 and 65, a full career amounts to about 42 years of service, not 48 or 52 years, so values above 42 years are treated as abnormal.

In [31]:
# Calculate the percentage of negative values
print(len(df_edit.loc[df_edit['days_employed']<0])/len(df) *100)

# and of abnormally long employment (more than 42 years)
print(len(df_edit.loc[df_edit['days_employed']>42])/len(df) *100)

73.89547038327527
16.004645760743323


Because the share of problematic data is very large, these rows cannot be deleted and must be corrected instead. It would also be worth re-checking the data collection process.

In [32]:
# Repair the dataset
# assume a person starts working at the age of 23
value_replace = df_edit.loc[df_edit['days_employed']>42, 'dob_years'] - 23
df_edit.loc[df_edit['days_employed']>42, 'days_employed'] = value_replace

In [33]:
# Check the results (rows still above 42 belong to customers older than 65, since 65 - 23 = 42)
df_edit.loc[df_edit['days_employed']>42, 'days_employed']

25       44.0
35       45.0
128      44.0
150      48.0
168      44.0
         ... 
21405    43.0
21419    44.0
21450    44.0
21504    45.0
21521    44.0
Name: days_employed, Length: 565, dtype: float64

In [34]:
# Fix the negative values by taking the absolute value
df_edit['days_employed'] = df_edit['days_employed'].abs()

In [35]:
# Check the results
df_edit[df_edit['days_employed']<0]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


As you can see, the `'days_employed'` column has been fixed: abnormally large values were replaced with the customer's age from `'dob_years'` minus the assumed age at which a person starts working (23), and all negative values were converted to positive ones.
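
The three repair steps can be sketched end to end on a small made-up frame. The constants `START_AGE` and `MAX_YEARS` and the column `years_employed` are names introduced for this illustration only; the thresholds are the assumptions stated above:

```python
import pandas as pd

# Toy data (hypothetical values): two negative entries and one huge one
toy = pd.DataFrame({
    'days_employed': [-3650.0, 365000.0, -730.0],
    'dob_years': [40, 60, 30],
})

START_AGE = 23  # assumed age at which a person starts working
MAX_YEARS = 42  # assumed upper bound for a full career

# Step 1: convert days to years
years = toy['days_employed'] / 365

# Step 2: replace implausibly long careers with age minus the start age
too_long = years > MAX_YEARS
years = years.mask(too_long, toy['dob_years'] - START_AGE)

# Step 3: turn the remaining negative values into positive ones
toy['years_employed'] = years.abs()
print(toy)
```

Writing the result to a separate column keeps the original values available for comparison.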

#### `'dob_years'` column.

In [36]:
# Check the `dob_years` column
df_edit['dob_years'].value_counts()

35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
22    183
66    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

As you can see, 101 rows have a value of 0, which is not a valid age for a borrower.

In [37]:
# Calculate the percentage of rows with an age of 0
len(df_edit.loc[df_edit['dob_years']==0])/len(df)*100

0.4692218350754936

Because the percentage is very small and the true age cannot be recovered, these rows can be deleted.

In [39]:
# Drop the rows with an age of 0
index_drop = df_edit[df_edit['dob_years']==0].index
df_edit.drop(index_drop, inplace=True)

In [40]:
# Checking the result
df_edit.loc[df_edit['dob_years']==0]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


The `'dob_years'` column has been successfully fixed by deleting every row with a value of 0.

#### `family_status` column.

In [232]:
# Examine `family_status` column
df_edit['family_status'].value_counts()

married              12331
civil partnership     4156
unmarried             2797
divorced              1185
widow / widower        955
Name: family_status, dtype: int64

As you can see there is no problem with this column.

#### `gender` column.

In [41]:
# Examine the `gender` column
df_edit['gender'].value_counts()

F      14164
M       7259
XNA        1
Name: gender, dtype: int64

The `'gender'` column contains the value `XNA`, which is not a valid gender code; the column is expected to contain only `F` or `M`. Since only 1 row has this value, it can be deleted.

In [42]:
# Looking for XNA data
df_edit.loc[df_edit['gender']=='XNA']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,6.461919,24,some college,2,civil partnership,1,XNA,business,0,32624.825,buy real estate


In [43]:
# Deleting rows
df_edit.drop(10701, axis=0, inplace=True)

In [44]:
# Checking the result
df_edit['gender'].value_counts()

F    14164
M     7259
Name: gender, dtype: int64

As you can see, the `'gender'` column has been fixed by deleting the row with the value `XNA`, using that row's *index*.

#### `income_type` column.

In [45]:
# Examine `income_type` column
df_edit['income_type'].value_counts()

employee                       11064
business                        5064
retiree                         3836
civil servant                   1453
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

As you can see there is no problem with this column.

### Duplicate values <a id="duplicate"></a>

In [46]:
# Checking for duplicates
df_edit.loc[df_edit.duplicated()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,,41,secondary education,1,married,0,F,employee,0,,purchase of the house for my family
3290,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
4182,1,,34,bachelor's degree,0,civil partnership,1,F,employee,0,,wedding ceremony
4851,0,,60,secondary education,1,civil partnership,1,F,retiree,0,,wedding ceremony
5557,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0,,64,secondary education,1,married,0,F,retiree,0,,supplementary education
21032,0,,60,secondary education,1,married,0,F,retiree,0,,to become educated
21132,0,,47,secondary education,1,married,0,F,employee,0,,housing renovation
21281,1,,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


There are 71 rows that have duplicate values.

In [47]:
# Percentage of duplicate values against the entire dataset
len(df_edit.loc[df_edit.duplicated()])/len(df)*100

0.33141950240395834

Because the percentage is small, duplicate rows can be deleted.
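
Counting and removing exact duplicates can be sketched on a small made-up frame as follows (the values are illustrative only):

```python
import pandas as pd

# Toy data (hypothetical values): the second row repeats the first
toy = pd.DataFrame({
    'children': [0, 0, 1],
    'purpose': ['wedding ceremony', 'wedding ceremony', 'car purchase'],
})

# duplicated() flags every repeat of an earlier row
n_dupes = int(toy.duplicated().sum())

# Drop the repeats and rebuild a continuous index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```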

In [48]:
# Drop duplicate rows and rebuild the index
df_edit = df_edit.drop_duplicates().reset_index(drop=True)

In [49]:
# Checking the result
df_edit.duplicated().sum()

0

In [50]:
# Check the size of the dataset after the first round of cleaning
print('Shape:', df_edit.shape)

Shape: (21352, 12)


**Tentative conclusions**

Previously the *dataset* had 21525 rows and 12 columns.

After checking and fixing what needed to be fixed, the new *dataset* has 21352 rows and 12 columns.

In [55]:
# Checkpoint variable (a reference to `df_edit`, not a copy)
df_clean = df_edit

In [51]:
# Percentage of rows removed by the cleaning steps
((21525-21352)/21525)*100

0.8037166085946574

The new *dataset* has about 0.8% fewer rows than the initial *dataset*, which is a small change.

### Missing values <a id="missing"></a>

#### `'total_income'` column.

The `'total_income'` column has missing values in about 10% of the rows, as described above, so they need to be filled in.

Because this column is quantitative, the missing values can be filled with the *mean* or the *median*.

In [52]:
# Function to assign an age category
def age_group(age):
    '''
    Return the age category for a given age.

    age:
        customer's age in years
    '''
    try:
        if age <= 19:
            return 'child'
        elif 20 <= age <= 30:
            return 'young'
        elif 31 <= age <= 45:
            return 'adult'
        else:
            return 'old'
    except TypeError:
        # Non-numeric input (for example a string) cannot be compared
        return 'Unidentified'

In [53]:
# Function test
print(age_group(17))
print(age_group(30))
print(age_group(45))
print(age_group(70))
print(age_group('70'))

child
young
adult
old
Unidentified


In [56]:
# Implement the function and create a new column
df_clean['age_group'] = df_clean['dob_years'].apply(age_group)

In [57]:
# Checking the result
df_clean['age_group'].unique()

array(['adult', 'old', 'young', 'child'], dtype=object)
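
For integer ages, the same categories can also be produced without a row-by-row `apply` by binning with `pd.cut`. This is an alternative sketch, not the method used in this notebook; the bin edges below are chosen to mirror the thresholds in `age_group`:

```python
import pandas as pd

# Sample ages (the same values used in the function test above)
ages = pd.Series([17, 30, 45, 70])

# Right-inclusive bins: (-inf, 19], (19, 30], (30, 45], (45, inf)
bins = [float('-inf'), 19, 30, 45, float('inf')]
labels = ['child', 'young', 'adult', 'old']

groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())
```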

Next, examine the factors that income is likely to depend on, to decide whether the *mean* or the *median* should replace the missing values.

In [58]:
# Keep only the rows where `total_income` is not missing
df_notnull = df_clean[df_clean['total_income'].notnull()]

# Sample data
df_notnull.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
0,1,23.116912,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult
1,1,11.02686,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult
2,0,15.406637,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult
3,3,11.300677,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult
4,0,30.0,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,old
5,0,2.537495,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,young
6,0,7.888225,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,adult
7,0,0.418574,50,secondary education,1,married,0,M,employee,0,21731.829,education,old
8,2,18.985932,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,adult
9,0,5.996593,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,adult


In [59]:
# Mean income grouped by the `education` column
df_notnull.groupby('education')['total_income'].mean()

education
bachelor's degree      33172.428387
graduate degree        27960.024667
primary education      21144.882211
secondary education    24600.353617
some college           29035.057865
Name: total_income, dtype: float64

In [60]:
# Median income grouped by the `education` column
df_notnull.groupby('education')['total_income'].median()

education
bachelor's degree      28054.5310
graduate degree        25161.5835
primary education      18741.9760
secondary education    21839.4075
some college           25608.7945
Name: total_income, dtype: float64

In [61]:
# Mean income grouped by the `income_type` column
df_notnull.groupby('income_type')['total_income'].mean()

income_type
business                       32397.307219
civil servant                  27361.316126
employee                       25824.679592
entrepreneur                   79866.103000
paternity / maternity leave     8612.661000
retiree                        21939.310393
student                        15712.260000
unemployed                     21014.360500
Name: total_income, dtype: float64

In [62]:
# Median income grouped by the `income_type` column
df_notnull.groupby('income_type')['total_income'].median()

income_type
business                       27563.0285
civil servant                  24083.5065
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18969.1490
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64

In [63]:
# Mean income grouped by the `gender` column
df_notnull.groupby('gender')['total_income'].mean()

gender
F    24664.752169
M    30905.772981
Name: total_income, dtype: float64

In [64]:
# Median income grouped by the `gender` column
df_notnull.groupby('gender')['total_income'].median()

gender
F    21469.0015
M    26819.5670
Name: total_income, dtype: float64

If the data distribution is symmetrical, either the *mean* or the *median* can be used; if it is not symmetrical, the *median* is the safer choice because it is less affected by extreme values.

Across every factor the *median* and *mean* are of similar magnitude, so the distribution is treated here as approximately symmetrical and the *mean* is used to fill in the missing values. Note that the mean is consistently somewhat higher than the median (for example, 33172 versus 28055 for `bachelor's degree`), which hints at a right skew; the *median* would be the more conservative choice.

In [65]:
# Function to fill in missing values
def fill_missing_value(data, agg_column, value_column, button=1):
    '''
    Fill missing values in one column using a per-group statistic
    computed over the groups of another column.

    data:
        DataFrame to fill (modified in place and returned)
    agg_column:
        column whose values define the groups
    value_column:
        column to fill in (contains missing values)
    button:
        1 means the missing values are filled with the median
        2 means the missing values are filled with the mean
        any other value means they are filled with the mode
    '''
    # Choose the statistic according to `button`
    if button == 1:
        grouped_values = data.groupby(agg_column)[value_column].median().reset_index()
    elif button == 2:
        grouped_values = data.groupby(agg_column)[value_column].mean().reset_index()
    else:
        grouped_values = data.groupby(agg_column)[value_column].apply(pd.Series.mode).reset_index()

    # Number of rows in the grouped table
    size = len(grouped_values)

    # Fill the missing values group by group
    for i in range(size):
        group = grouped_values[agg_column][i]
        value = grouped_values[value_column][i]
        data.loc[(data[agg_column]==group) & (data[value_column].isna()), value_column] = value
    return data

In [66]:
# Examine `total_income` column
df_clean[['total_income']]

Unnamed: 0,total_income
0,40620.102
1,17932.802
2,23341.752
3,42820.568
4,25378.572
...,...
21347,35966.698
21348,24959.969
21349,14347.610
21350,39054.888


In [71]:
# Implement function
df_clean = fill_missing_value(data=df_clean, agg_column='income_type', value_column='total_income', button=2)

In [72]:
# Checking the result
df_clean.isna().sum()

children               0
days_employed       2093
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
age_group              0
dtype: int64

In [73]:
# Inspect the sorted unique values for other possible errors
sorted(df_clean['total_income'].unique())

[3306.762,
 3392.845,
 3418.824,
 3471.216,
 3503.298,
 3595.641,
 3815.153,
 3913.227,
 4036.463,
 4049.374,
 4212.77,
 4245.348,
 4386.4,
 4444.179,
 4465.254,
 4494.861,
 4592.45,
 4650.812,
 4664.644,
 4672.012,
 4708.271,
 4759.97,
 4812.103,
 4818.545999999999,
 4860.001,
 4919.749,
 5002.295,
 5028.623,
 5029.439,
 5037.321,
 5045.56,
 5053.838,
 5090.55,
 5112.186,
 5137.573,
 5148.514,
 5167.9940000000015,
 5168.082,
 5172.669,
 5195.285,
 5208.353,
 5217.0340000000015,
 5220.544,
 5259.254,
 5274.611,
 5288.165,
 5290.465,
 5330.769,
 5331.621,
 5335.014,
 5402.85,
 5409.738,
 5430.683000000001,
 5443.908,
 5452.4940000000015,
 5461.996,
 5464.092,
 5478.583000000001,
 5490.018,
 5496.834,
 5514.581,
 5515.539000000002,
 5529.334,
 5531.2040000000015,
 5562.874,
 5577.521,
 5579.965,
 5591.44,
 5604.991999999998,
 5622.0790000000015,
 5630.865,
 5639.846,
 5651.584,
 5703.853,
 5768.392,
 5772.8780000000015,
 5801.651,
 5803.271,
 5820.374,
 5826.733,
 5831.255,
 5837.099,
 ...

In [74]:
# Round income to two decimal places to remove floating-point noise
df_clean['total_income'] = df_clean['total_income'].round(2)

In [75]:
# Checking the number of dataset entries
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21352 entries, 0 to 21351
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21352 non-null  int64  
 1   days_employed     19259 non-null  float64
 2   dob_years         21352 non-null  int64  
 3   education         21352 non-null  object 
 4   education_id      21352 non-null  int64  
 5   family_status     21352 non-null  object 
 6   family_status_id  21352 non-null  int64  
 7   gender            21352 non-null  object 
 8   income_type       21352 non-null  object 
 9   debt              21352 non-null  int64  
 10  total_income      21352 non-null  float64
 11  purpose           21352 non-null  object 
 12  age_group         21352 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


The missing values in the `'total_income'` column have been filled with the *mean* `'total_income'` of each customer's `income_type` group.
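
A group-wise fill of this kind can also be expressed without an explicit loop, using `groupby(...).transform` to broadcast each group's statistic back to its rows. The following is a sketch of that alternative on made-up data, not the function used above:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing income per group
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [20000.0, 30000.0, np.nan, 18000.0, np.nan],
})

# Mean income of each row's own group, aligned with the original index
group_mean = toy.groupby('income_type')['total_income'].transform('mean')

# Only the missing entries are replaced; existing values are kept
toy['total_income'] = toy['total_income'].fillna(group_mean)
print(toy)
```

Swapping `'mean'` for `'median'` gives the median-based variant.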

#### `'days_employed'` column.

In [76]:
# Median of `days_employed` grouped by the `age_group` column
df_notnull.groupby('age_group')['days_employed'].median()

age_group
adult     4.769766
child     1.984911
old      16.059749
young     2.875400
Name: days_employed, dtype: float64

In [77]:
# Average of `days_employed` grouped by `age_group`
df_notnull.groupby('age_group')['days_employed'].mean()

age_group
adult     6.352219
child     1.736104
old      20.223445
young     3.511801
Name: days_employed, dtype: float64

In [78]:
# Average of `days_employed` grouped by `income_type`
df_notnull.groupby('income_type')['days_employed'].mean()

income_type
business                        5.788341
civil servant                   9.283585
employee                        6.379736
entrepreneur                    1.426981
paternity / maternity leave     9.032219
retiree                        36.431115
student                         1.585621
unemployed                     15.000000
Name: days_employed, dtype: float64

In [79]:
# Median of `days_employed` grouped by `income_type`
df_notnull.groupby('income_type')['days_employed'].median()

income_type
business                        4.241123
civil servant                   7.324397
employee                        4.317994
entrepreneur                    1.426981
paternity / maternity leave     9.032219
retiree                        37.000000
student                         1.585621
unemployed                     15.000000
Name: days_employed, dtype: float64

In [80]:
# Average of `days_employed` grouped by `family_status`
df_notnull.groupby('family_status')['days_employed'].mean()

family_status
civil partnership    10.778576
divorced             12.047446
married              11.711988
unmarried             8.925047
widow / widower      25.036825
Name: days_employed, dtype: float64

In [81]:
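# Intended as the median of `days_employed` grouped by `family_status`.
# Note: this cell calls .mean(), so the output below repeats the averages from In [80];
# replace .mean() with .median() to obtain the medians.
df_notnull.groupby('family_status')['days_employed'].mean()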

family_status
civil partnership    10.778576
divorced             12.047446
married              11.711988
unmarried             8.925047
widow / widower      25.036825
Name: days_employed, dtype: float64

As discussed for the `'total_income'` column, if the data distribution is symmetrical, either the *mean* or the *median* can be used, but if the distribution is not symmetrical, the *median* is the better choice.

The *median* and *mean* values of each parameter differ, which shows that the data distribution is not symmetrical. Missing values can therefore be filled in using the *median*, because the *median* is more robust to outliers than the *mean*.
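
As a small illustration of why the median is preferred for skewed data, the sketch below compares both statistics on a made-up sample (the numbers are hypothetical and are not taken from the dataset). A single extreme value shifts the mean noticeably, while the median stays with the typical values.

```python
import pandas as pd

# Hypothetical right-skewed sample: four typical values and one extreme value
sample = pd.Series([2.0, 3.0, 4.0, 5.0, 36.0])

mean_value = sample.mean()      # pulled upward by the extreme value
median_value = sample.median()  # stays at the centre of the typical values

print(f"mean:   {mean_value}")
print(f"median: {median_value}")
```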

In [84]:
# Replace missing values
df_clean = fill_missing_value(data=df_clean, agg_column='income_type', value_column='days_employed', button=2)

In [85]:
# Check all columns
df_clean.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
age_group           0
dtype: int64

The missing values in the `'days_employed'` column have been filled in successfully, using the *median* of `'days_employed'` grouped by `'income_type'`.
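
The helper `fill_missing_value` is defined earlier in the notebook and is not reproduced here. As a rough sketch of the same idea, assuming it fills each gap with the median of the row's own group, the fill could be written with `groupby(...).transform('median')`. The `toy` frame and its values are hypothetical; only the column names come from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data; only the column names come from the report
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree', 'retiree'],
    'days_employed': [4.0, 6.0, np.nan, 36.0, 38.0, np.nan],
})

# Median of days_employed within each income_type, aligned to the original rows
group_median = toy.groupby('income_type')['days_employed'].transform('median')

# Fill only the missing entries with the median of their own group
toy['days_employed'] = toy['days_employed'].fillna(group_median)

print(toy)
```

Because `transform` returns a result aligned with the original index, `fillna` can take it directly without a merge.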

### Data Categorization <a id="category"></a>

Several values in the `'purpose'` column have the same meaning but are worded differently, so they can be grouped into categories to make the analysis easier.

In [86]:
# Checkpoint variable (a second name for df_clean, not an independent copy)
df_category = df_clean
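
One point worth keeping in mind with checkpoint variables: plain assignment binds a second name to the same DataFrame, so columns added through one name also appear under the other. The sketch below uses a small hypothetical frame (`original`, `alias`, and `snapshot` are illustrative names) to show the difference between assignment and `.copy()`.

```python
import pandas as pd

original = pd.DataFrame({'purpose': ['car', 'wedding']})

alias = original            # same object, no data is copied
snapshot = original.copy()  # independent copy of the data

# Adding a column through the alias also adds it to `original`
alias['purpose_group'] = alias['purpose']

print('purpose_group' in original.columns)  # True
print('purpose_group' in snapshot.columns)  # False
```

If an untouched checkpoint is needed, `.copy()` is the safer choice; if the goal is only a more descriptive name, plain assignment is enough.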

#### Categorical column type <a id="categorical"></a>

In [87]:
# Displays the selected data value for categorization
# The first process includes text data
df_category[['purpose']]

Unnamed: 0,purpose
0,purchase of the house
1,car purchase
2,purchase of the house
3,supplementary education
4,to have a wedding
...,...
21347,housing transactions
21348,purchase of a car
21349,property
21350,buying my own car


In [88]:
# Checking for unique values
sorted(df_category['purpose'].unique())

['building a property',
 'building a real estate',
 'buy commercial real estate',
 'buy real estate',
 'buy residential real estate',
 'buying a second-hand car',
 'buying my own car',
 'buying property for renting out',
 'car',
 'car purchase',
 'cars',
 'construction of own property',
 'education',
 'getting an education',
 'getting higher education',
 'going to university',
 'having a wedding',
 'housing',
 'housing renovation',
 'housing transactions',
 'profile education',
 'property',
 'purchase of a car',
 'purchase of my own house',
 'purchase of the house',
 'purchase of the house for my family',
 'real estate transactions',
 'second-hand car purchase',
 'supplementary education',
 'to become educated',
 'to buy a car',
 'to get a supplementary education',
 'to have a wedding',
 'to own a car',
 'transactions with commercial real estate',
 'transactions with my real estate',
 'university education',
 'wedding ceremony']

As the `'purpose'` column shows, several values have the same meaning but are written differently, which inflates the number of unique values.

In [89]:
# Create a new column
df_category['purpose_group'] = df_category['purpose']
df_category

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,purpose_group
0,1,23.116912,42,bachelor's degree,0,married,0,F,employee,0,40620.10,purchase of the house,adult,purchase of the house
1,1,11.026860,36,secondary education,1,married,0,F,employee,0,17932.80,car purchase,adult,car purchase
2,0,15.406637,33,secondary education,1,married,0,M,employee,0,23341.75,purchase of the house,adult,purchase of the house
3,3,11.300677,32,secondary education,1,married,0,M,employee,0,42820.57,supplementary education,adult,supplementary education
4,0,30.000000,53,secondary education,1,civil partnership,1,F,retiree,0,25378.57,to have a wedding,old,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21347,1,12.409087,43,secondary education,1,civil partnership,1,F,business,0,35966.70,housing transactions,adult,housing transactions
21348,0,44.000000,67,secondary education,1,married,0,F,retiree,0,24959.97,purchase of a car,old,purchase of a car
21349,1,5.789991,38,secondary education,1,civil partnership,1,M,employee,1,14347.61,property,adult,property
21350,3,8.527347,38,secondary education,1,married,0,M,employee,1,39054.89,buying my own car,adult,buying my own car


In [90]:
# Function to categorize data based on a general topic
def string_group(dtframe, column_name, str_pattern, define_str):
    '''
    Definition:
    -----------
        Function to categorize data based on a general topic
    -----------
        dtframe:
            the DataFrame to update (modified in place)
        column_name:
            the column to search
        str_pattern:
            the string (regex pattern) to search for
        define_str:
            the string that replaces matching values
    '''
    # Rows whose value in column_name matches the pattern
    mask = dtframe[column_name].str.contains(str_pattern, regex=True)
    dtframe.loc[mask, column_name] = define_str

    return dtframe

In [91]:
# Function check (string_group updates df_category in place, so `test` is the same object)
test = string_group(dtframe=df_category, column_name='purpose_group',
                    str_pattern='real|property', define_str='real estate/property')
sorted(test['purpose_group'].unique())

['buying a second-hand car',
 'buying my own car',
 'car',
 'car purchase',
 'cars',
 'education',
 'getting an education',
 'getting higher education',
 'going to university',
 'having a wedding',
 'housing',
 'housing renovation',
 'housing transactions',
 'profile education',
 'purchase of a car',
 'purchase of my own house',
 'purchase of the house',
 'purchase of the house for my family',
 'real estate/property',
 'second-hand car purchase',
 'supplementary education',
 'to become educated',
 'to buy a car',
 'to get a supplementary education',
 'to have a wedding',
 'to own a car',
 'university education',
 'wedding ceremony']

In [94]:
# Apply the function to the newly created column
df_category = string_group(dtframe=df_category, column_name='purpose_group',
                           str_pattern='hou|real|property', define_str='real estate/property/house')
df_category = string_group(dtframe=df_category, column_name='purpose_group',
                           str_pattern='car', define_str='car')
df_category = string_group(dtframe=df_category, column_name='purpose_group',
                           str_pattern='edu|uni', define_str='education')
df_category = string_group(dtframe=df_category, column_name='purpose_group',
                           str_pattern='wed', define_str='wedding')

In [95]:
# Count the values in the new column
df_category['purpose_group'].value_counts()

real estate/property/house    10763
car                            4284
education                      3995
wedding                        2310
Name: purpose_group, dtype: int64

The `'purpose'` column was categorized successfully, and the results are stored in a new column named `'purpose_group'`.
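
The four calls above can also be expressed as one loop over a pattern-to-label mapping. The following is a minimal sketch rather than the notebook's own method: `rules`, `purposes`, and `groups` are illustrative names, and the sample values are a handful taken from the unique list shown earlier. Matching against the raw purposes (instead of the column being overwritten) means a label written in one step cannot be matched again by a later pattern.

```python
import pandas as pd

# A few raw values taken from the unique list of purposes
purposes = pd.Series(['buy real estate', 'housing renovation', 'to own a car',
                      'going to university', 'wedding ceremony'])

# Pattern -> label pairs mirroring the four string_group calls
rules = {
    'hou|real|property': 'real estate/property/house',
    'car': 'car',
    'edu|uni': 'education',
    'wed': 'wedding',
}

# Start with a placeholder so unmatched purposes are easy to spot
groups = pd.Series('uncategorized', index=purposes.index)
for pattern, label in rules.items():
    groups[purposes.str.contains(pattern, regex=True)] = label

print(groups.value_counts())
```

Checking that no `'uncategorized'` values remain is a quick way to confirm that the patterns cover every purpose.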

#### Numeric column type <a id="numerical"></a>

In [96]:
# View all numeric data selected for categorization
df_category['total_income'].head(10)

0    40620.10
1    17932.80
2    23341.75
3    42820.57
4    25378.57
5    40922.17
6    38484.16
7    21731.83
8    15337.09
9    23108.15
Name: total_income, dtype: float64

In [97]:
# Get summary statistics for the column
df_category['total_income'].describe()

count     21352.000000
mean      26795.979981
std       15707.677378
min        3306.760000
25%       17223.822500
50%       24291.745000
75%       32397.310000
max      362496.640000
Name: total_income, dtype: float64

The summary shows that the values in the `'total_income'` column span a wide range, so they can be grouped into income ranges to make the analysis easier.

In [98]:
# A function that categorizes a value into numeric groups based on ranges
def number_group(value):
    '''
    Definition:
    -----------
        A function that categorizes a value into numeric groups based on a range
    -----------
        value:
            value to be categorized
    '''
    # Note: the bounds are integers, so a value that falls between two ranges
    # (for example 10000.50) matches none of them and is returned as 'high'.
    try:
        if 0 <= value <= 10000:
            return 'very low'
        elif 10001 <= value <= 15000:
            return 'low'
        elif 15001 <= value <= 25000:
            return 'middle'
        else:
            return 'high'

    except TypeError:
        # Non-numeric input, such as a string, cannot be compared with numbers
        return 'unidentified'

In [99]:
# Function test
print(number_group(5000))
print(number_group(15000))
print(number_group(25000))
print(number_group(70000))
print(number_group('5000'))

very low
low
middle
high
unidentified


In [100]:
# Implement function
df_category['total_income_group'] = df_category['total_income'].apply(number_group)

In [101]:
# View the results
df_category['total_income_group'].value_counts()

high        10272
middle       7358
low          2801
very low      921
Name: total_income_group, dtype: int64

In [102]:
# Checkpoint variable
df_final = df_category

The `'total_income'` column was categorized successfully, and the results are stored in a new column named `'total_income_group'`.
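
Because `'total_income'` holds decimal values, a value such as 10000.50 lies between the integer bounds used in `number_group` and ends up in `'high'`. A possible alternative, shown here only as a sketch with hypothetical sample values, is `pd.cut` with contiguous bin edges, which leaves no gaps between groups. The bin edges and labels are assumptions chosen to mirror the ranges above.

```python
import pandas as pd

# Hypothetical incomes, including values that sit between the integer bounds
incomes = pd.Series([5000.0, 10000.0, 10000.5, 15000.5, 25000.0, 70000.0])

# Contiguous, right-inclusive bins: [0, 10000], (10000, 15000], (15000, 25000], (25000, inf)
income_group = pd.cut(
    incomes,
    bins=[0, 10000, 15000, 25000, float('inf')],
    labels=['very low', 'low', 'middle', 'high'],
    include_lowest=True,
)

print(income_group.astype(str).tolist())
```

Applying this to the full column may shift a few customers between groups compared with the counts shown above, so the results would need to be recomputed.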

## Hypothesis Test <a id="test"></a>

### Hypothesis 1: correlation between having children and making timely payments of debts <a id="1"></a>

In [103]:
# Check children data and debt data
df_final[['children', 'debt']]

Unnamed: 0,children,debt
0,1,0
1,1,0
2,0,0
3,3,0
4,0,0
...,...,...
21347,1,0
21348,0,0
21349,1,1
21350,3,1


In [104]:
# Calculating defaults based on the number of children
df_final.groupby(['children', 'debt']).size()

children  debt
0         0       12963
          1        1058
1         0        4397
          1         442
2         0        1912
          1         202
3         0         301
          1          27
4         0          37
          1           4
5         0           9
dtype: int64

**Tentative conclusions**

*Findings*:

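1. Customers who do not have children tend to repay their loans: only about 7.5% of them (1058 of 14021) have a debt, the lowest share among the sizeable groups.
2. Customers with children default more often, at roughly 8% to 10%, but the rate does not rise steadily with each additional child (9.1% with one child, 9.6% with two, 8.2% with three), and the groups with four or five children are very small (41 and 9 customers).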
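
Raw counts are hard to compare because the groups differ greatly in size, so it helps to convert them into default rates. The sketch below rebuilds the table from the counts printed above and computes the percentage of customers with a debt per group; the names `counts`, `repaid`, `defaulted`, and `default_rate_pct` are illustrative.

```python
import pandas as pd

# Counts copied from the groupby output above (debt == 0 and debt == 1 per number of children)
counts = pd.DataFrame(
    {'repaid': [12963, 4397, 1912, 301, 37, 9],
     'defaulted': [1058, 442, 202, 27, 4, 0]},
    index=pd.Index([0, 1, 2, 3, 4, 5], name='children'),
)

counts['total'] = counts['repaid'] + counts['defaulted']
counts['default_rate_pct'] = (counts['defaulted'] / counts['total'] * 100).round(2)

print(counts)
```

On the full dataset the same rate is available directly, because `'debt'` is coded as 0 or 1: `df_final.groupby('children')['debt'].mean()` returns the share of customers with a debt in each group.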

### Hypothesis 2: correlation between family status and making timely payments of debts <a id="2"></a>

In [105]:
# Checking family status data and debt data
df_final[['family_status', 'debt']]

Unnamed: 0,family_status,debt
0,married,0
1,married,0
2,married,0
3,married,0
4,civil partnership,0
...,...,...
21347,civil partnership,0
21348,married,0
21349,civil partnership,1
21350,married,1


In [106]:
# Calculating default based on family status
df_final.groupby(['family_status', 'debt']).size()

family_status      debt
civil partnership  0        3743
                   1         386
divorced           0        1100
                   1          85
married            0       11363
                   1         927
unmarried          0        2521
                   1         273
widow / widower    0         892
                   1          62
dtype: int64

**Tentative conclusions**

*Findings*:

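1. Customers with `married` status form the largest group and repay more loans than any other status in absolute terms (11363), with about 7.5% having a debt. By rate, `widow / widower` (6.5%) and `divorced` (7.2%) customers default least often, while `unmarried` (9.8%) and `civil partnership` (9.3%) customers default most often.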
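
To compare statuses of very different sizes, the counts can be turned into row-wise shares with `pd.crosstab(..., normalize='index')`. The sketch below runs on a tiny hypothetical frame (`toy`) so that the result is easy to verify by hand; on the real data the two arguments would be `df_final['family_status']` and `df_final['debt']`.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the report
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married', 'unmarried', 'unmarried'],
    'debt': [0, 0, 0, 1, 0, 1],
})

# Each row sums to 1: share of customers without (0) and with (1) a debt per status
shares = pd.crosstab(toy['family_status'], toy['debt'], normalize='index')

print(shares)
```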

### Hypothesis 3: correlation between income level and making timely payments of debts <a id="3"></a>

In [107]:
# Checking income level data and debt data
df_final[['total_income', 'total_income_group', 'debt']]

Unnamed: 0,total_income,total_income_group,debt
0,40620.10,high,0
1,17932.80,middle,0
2,23341.75,middle,0
3,42820.57,high,0
4,25378.57,high,0
...,...,...,...
21347,35966.70,high,0
21348,24959.97,middle,0
21349,14347.61,low,1
21350,39054.89,high,1


In [108]:
# Calculating defaults based on income level
df_final.groupby(['total_income_group', 'debt']).size()

total_income_group  debt
high                0       9465
                    1        807
low                 0       2562
                    1        239
middle              0       6729
                    1        629
very low            0        863
                    1         58
dtype: int64

**Tentative conclusions**

*Findings*:

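1. Customers in every income group have a high probability of repaying the loan: the share with a debt stays between about 6.3% (`very low`) and 8.5% (`low` and `middle`), with no consistent trend as income rises.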

### Hypothesis 4: how the purpose of a credit affects the level of failure <a id="4"></a>

In [110]:
# Calculating defaults based on credit purpose
df_final.groupby(['purpose_group', 'debt']).size()

purpose_group               debt
car                         0       3884
                            1        400
education                   0       3625
                            1        370
real estate/property/house  0       9984
                            1        779
wedding                     0       2126
                            1        184
dtype: int64

**Tentative conclusions**

*Findings*:

1. Customers who borrow for *real estate/property/house* are the most likely to pay off their loans: about 7.2% of them have a debt, compared with about 8.0% for *wedding* and about 9.3% for *car* and *education*.
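
The `groupby(...).size()` result above is a Series with a two-level index, which can be reshaped into a table and ranked by default rate. The sketch below rebuilds that Series from the printed counts; `sizes`, `table`, `default_rate`, and `ranking` are illustrative names.

```python
import pandas as pd

# Counts copied from the output above, in the same order (debt 0, then debt 1, per purpose)
sizes = pd.Series(
    [3884, 400, 3625, 370, 9984, 779, 2126, 184],
    index=pd.MultiIndex.from_product(
        [['car', 'education', 'real estate/property/house', 'wedding'], [0, 1]],
        names=['purpose_group', 'debt'],
    ),
)

table = sizes.unstack('debt')                # one row per purpose, columns 0 and 1
default_rate = table[1] / table.sum(axis=1)  # share of customers with debt == 1
ranking = default_rate.sort_values()         # lowest default rate first

print((ranking * 100).round(2))
```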

## General Conclusion <a id="conclusion"></a>

The following four hypotheses were tested:

1. There is a correlation between having children and making repayments on time
2. There is a correlation between family status and timely repayment
3. There is a correlation between the level of income and paying back on time
4. Credit goals affect default rates

After analyzing the data, it can be concluded:

1. There is a correlation between having children and making payments on time: customers with children are less likely to make payments on time than customers without children.

The first hypothesis can be accepted.

2. There is a correlation between family status and timely repayment: customers with married status repay the largest number of loans compared to other statuses.

The second hypothesis can be accepted. It should be remembered that customers with married status also account for the largest number of defaults (927).

3. Income level has little influence on paying loans on time: in every income category, most customers repay their loans.

The third hypothesis is not accepted.

4. Credit purpose does not significantly affect the default rate: in every category, most customers make the payments on their loans.

The fourth hypothesis is not accepted.