# Analysis of the risk of default of borrowers

The goal of this project is to prepare a report for the lending division of a bank. The report must determine whether a customer's marital status and number of children affect whether he or she will default on a loan. The bank already has some data on the creditworthiness of customers.

A **credit score** of the potential customer will be created for the report. The **credit score** is used to assess the ability of a potential borrower to repay his or her loan.

It will be necessary to pre-process the data so that it is cleaner and more concise for later analysis.

This work addresses four questions:
- Is there a correlation between income level and on-time payment?
- Is there a correlation between family status and on-time payment?
- Is there a correlation between having children or not and on-time payment?
- Is there a correlation between the purpose of the loan and on-time payment?
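
Each of these questions can be approached by comparing the share of defaulted loans across groups. The following is a minimal sketch on made-up rows, assuming that `debt` equals 1 for a customer who defaulted and 0 otherwise. The helper name `default_rate_by` and the toy data are illustrative and not part of the bank's dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from credit_scoring_eng.csv
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'unmarried', 'unmarried', 'divorced'],
    'debt': [0, 1, 1, 1, 0],
})

def default_rate_by(df, column):
    """Return the share of rows with debt == 1 for each value of `column`."""
    # The mean of a 0/1 column is the fraction of ones in each group
    return df.groupby(column)['debt'].mean()

rates = default_rate_by(toy, 'family_status')
print(rates)
```

The same helper could be applied to any grouping column once the data has been cleaned.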

## Loading the data file

In [1]:
# Loading Pandas library
import pandas as pd

# Load the data
default_risk = pd.read_csv('credit_scoring_eng.csv')

## Data Exploration

**Data Description**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - age of the customer in years
- `education` - education of the customer
- `education_id` - education identifier
- `family_status` - marital status of the customer
- `family_status_id` - marital status identifier
- `gender` - gender of the customer
- `income_type` - type of employment
- `debt` - was there any debt in repayment of the loan
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

In [2]:
# Checking data information
default_risk.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [3]:
# Displaying the first 60 rows of data
display(default_risk.head(60))


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


The data in the "days_employed" column are inconsistent:
- many values are negative, although work experience cannot be negative;
- the positive values are unrealistic, since they correspond to hundreds of years of work (340266 days is about 932 years), and they appear where the income type is "retiree".

The "education" column mixes upper-case, lower-case, and capitalized spellings of the same value, which makes the data harder to analyze.

Missing values in the "days_employed" column appear to coincide with missing values in the monthly income column ("total_income").

In the "purpose" column, several descriptions say the same thing in different words, so they need to be consolidated for a cleaner dataset.

These issues should be studied and addressed so that they do not distort the results of the final analyses.

In [4]:
# Missing value information about data
display(default_risk.isna().sum())


children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

The dataset has 21525 rows, and two columns contain missing values: "days_employed" and "total_income". Each of them has 19351 filled rows and 2174 missing values.

In [5]:
# Rows with a missing value in the 'days_employed' column
default_risk_filtered = (default_risk.loc[default_risk['days_employed'].isna()])
display(default_risk_filtered)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


The sample suggests that the missing values in the "days_employed" column and in the "total_income" column occur in the same rows. The next step verifies whether a missing value in the first column coincides with a missing value in the second in 100% of cases.

In [6]:
# Analyze whether the missing values coincide between the two columns.
# Rows missing at least one of the two values
missing_values = default_risk.loc[(default_risk['days_employed'].isna()) | (default_risk['total_income'].isna())]
# Rows missing only 'days_employed'
missing_values1 = default_risk.loc[(default_risk['days_employed'].isna()) & (~default_risk['total_income'].isna())]
# Rows missing only 'total_income'
missing_values2 = default_risk.loc[(~default_risk['days_employed'].isna()) & (default_risk['total_income'].isna())]

print(missing_values.shape[0])
print()
print(missing_values1.shape[0])
print()
print(missing_values2.shape[0])

2174

0

0


**Intermediate Conclusion 1**

The number of rows with at least one missing value (2174) equals the number of missing values in each column, and no row is missing only one of the two values.

So whenever the "days_employed" column has a missing value, the "total_income" column is missing a value as well.

Next, the share of missing values is compared with the total number of records in the `default_risk` DataFrame.

In [7]:
# Share of customers with a missing value in the "days_employed" column.
days_employed_total = default_risk['days_employed'].isna().sum()
record_total = len(default_risk)

days_employed_perc = (days_employed_total / record_total) * 100

print(days_employed_perc)

10.099883855981417


In [8]:
# Distribution of the rows with missing values by income type
display(default_risk_filtered['income_type'].value_counts())
print()

income_type
employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: count, dtype: int64




In this dataset, approximately **10.1%** of the rows have missing values in the "days_employed" and "total_income" columns, which is a significant share of the data.

These rows are distributed as follows by the customers' income type:
- 1105 employees;
- 508 businesses;
- 413 retirees;
- 147 civil servants;
- 1 entrepreneur.
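
Raw counts per income type are hard to judge without the size of each group. The sketch below shows one way the share of missing values within each group could be compared. It runs on toy data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree', 'retiree', 'business', 'business'],
    'total_income': [100.0, np.nan, np.nan, 200.0, 300.0, 400.0],
})

# Share of all rows that lack an income value
missing_share = toy['total_income'].isna().mean()

# Share of missing income values within each income type
missing_by_type = toy['total_income'].isna().groupby(toy['income_type']).mean()

print(f'{missing_share:.1%}')
print(missing_by_type)
```

If the shares are similar across groups, the gaps are less likely to be tied to a particular type of employment.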

**Possible reasons for missing values in the data**

The following are possible reasons for these gaps:
- the customer has never worked;
- the customer worked but could not prove it because the employment was not formally registered;
- the information was not declared;
- technical problems.

The missing information occurs in the following columns:
- "days_employed";
- "total_income".

In [9]:
# Comparing the entire dataset with the rows that have missing values
display(default_risk)
print()
display(missing_values)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car





Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


**Intermediate Conclusion 2**

The rows with missing values look similar to the rest of the dataset, and a single pattern can be observed: whenever the "days_employed" column has no value, the "total_income" column has none either.

This pattern was verified by filtering the rows that have a missing value in either column and checking the other column.

These missing values will be addressed later, because the data transformations that follow may reveal a pattern that is not visible at this stage. If so, it will be taken into account to ensure a cleaner analysis of the data.

The other columns are analyzed next. The steps are:

- analysis of the data for duplicates;
- analysis of problematic data;
- filling in missing values;
- categorization of the data.
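
The method for filling in the missing values is not fixed at this point. One common option is to fill each gap with the median of a comparable group, sketched below on toy data. Grouping by `income_type` is an assumption made for illustration, not a choice stated in this report.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [10.0, 30.0, np.nan, 50.0, np.nan],
})

# Median income of each row's own group, aligned to the original index
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill only the missing entries; existing values are left untouched
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

The median is less sensitive to extreme incomes than the mean, which is why it is often preferred for this kind of column.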

## Data transformation

In [10]:
# Checking the values in the "education" column for spellings that need to be corrected
display(sorted(default_risk['education'].unique()))

print(default_risk.loc[default_risk['education']=='graduate degree'])

["BACHELOR'S DEGREE",
 "Bachelor's Degree",
 'GRADUATE DEGREE',
 'Graduate Degree',
 'PRIMARY EDUCATION',
 'Primary Education',
 'SECONDARY EDUCATION',
 'SOME COLLEGE',
 'Secondary Education',
 'Some College',
 "bachelor's degree",
 'graduate degree',
 'primary education',
 'secondary education',
 'some college']

       children  days_employed  dob_years        education  education_id  \
6551          0   -5352.038180         58  graduate degree             4   
12021         3   -5968.075884         36  graduate degree             4   
12786         0  376276.219531         62  graduate degree             4   
21519         1   -2351.431934         37  graduate degree             4   

      family_status  family_status_id gender    income_type  debt  \
6551        married                 0      M       employee     0   
12021       married                 0      F  civil servant     0   
12786       married                 0      F        retiree     0   
21519      divorced                 3      M       employee     0   

       total_income                      purpose  
6551      42945.794          going to university  
12021     17822.757        purchase of the house  
12786     40868.031  buy residential real estate  
21519     18551.846   buy commercial real estate  


Let's convert the spellings to lowercase so that the records are consistent, and let's change the option 'graduate degree' to 'bachelor's degree'.

With this, the corresponding value in the 'education_id' column must be updated as well (4 becomes 0).

This makes the later analysis and categorization easier.
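
After a change like this, it is worth confirming that each education label still maps to exactly one identifier and vice versa. The following is a hedged sketch on toy rows, and the variable names are illustrative.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'education': ["Bachelor's Degree", "bachelor's degree", 'SECONDARY EDUCATION',
                  'secondary education', 'graduate degree'],
    'education_id': [0, 0, 1, 1, 4],
})

# Same normalization as described above: lowercase, then merge the two labels
toy['education'] = toy['education'].str.lower().replace('graduate degree', "bachelor's degree")
toy['education_id'] = toy['education_id'].replace(4, 0)

# Each label should map to exactly one id, and each id to exactly one label
ids_per_label = toy.groupby('education')['education_id'].nunique()
labels_per_id = toy.groupby('education_id')['education'].nunique()
is_consistent = bool((ids_per_label == 1).all() and (labels_per_id == 1).all())
print(is_consistent)
```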

In [11]:
# Correcting records for columns 'education' and 'education_id'
default_risk['education'] = default_risk['education'].str.lower()
default_risk['education'] = default_risk['education'].replace('graduate degree', "bachelor's degree")
default_risk['education_id'] = default_risk['education_id'].replace(4,0)

In [12]:
# Checking all values in the column to make sure they are correct
display(sorted(default_risk['education'].unique()))
print()
print(default_risk['education'].value_counts())
print()
print(default_risk.loc[default_risk['education']=='graduate degree'])

["bachelor's degree",
 'primary education',
 'secondary education',
 'some college']


education
secondary education    15233
bachelor's degree       5266
some college             744
primary education        282
Name: count, dtype: int64

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []


Now let's analyze the data in the `purpose` column.

In [13]:
# Checking the distribution of values in the `purpose` column
print(default_risk['purpose'].isna().sum())
print()
print(sorted(default_risk['purpose'].unique()))
print()
print(default_risk['purpose'].value_counts())

0

['building a property', 'building a real estate', 'buy commercial real estate', 'buy real estate', 'buy residential real estate', 'buying a second-hand car', 'buying my own car', 'buying property for renting out', 'car', 'car purchase', 'cars', 'construction of own property', 'education', 'getting an education', 'getting higher education', 'going to university', 'having a wedding', 'housing', 'housing renovation', 'housing transactions', 'profile education', 'property', 'purchase of a car', 'purchase of my own house', 'purchase of the house', 'purchase of the house for my family', 'real estate transactions', 'second-hand car purchase', 'supplementary education', 'to become educated', 'to buy a car', 'to get a supplementary education', 'to have a wedding', 'to own a car', 'transactions with commercial real estate', 'transactions with my real estate', 'university education', 'wedding ceremony']

purpose
wedding ceremony                            797
having a wedding                  

The "purpose" column contains many entries that are worded differently but mean the same thing. This creates unnecessary categories.

Let's merge them into just 8 (eight) categories to make the analysis cleaner.

The categories are:
- building a property (undefined);
- buy a car;
- buy commercial real estate;
- buy or renovate residential properties;
- buy real estate (undefined);
- construction of own property;
- education;
- wedding.
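
Explicit lists of spellings are the safest way to merge these entries. For the groups that share an unambiguous keyword, a keyword lookup is a possible alternative, sketched below. The function `categorize_purpose` and the keyword table are hypothetical, and they cover only the car, education, and wedding groups. Real-estate purposes are returned unchanged, because their wording overlaps too much for a simple keyword rule.

```python
# Keyword -> category pairs, checked in order (illustrative, not exhaustive)
KEYWORD_TO_CATEGORY = [
    ('wedding', 'wedding'),
    ('car', 'buy a car'),
    ('educat', 'education'),
    ('university', 'education'),
]

def categorize_purpose(purpose):
    """Return a merged category for car, education, and wedding purposes."""
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in purpose:
            return category
    # No keyword matched: leave the value as-is for explicit handling
    return purpose

for raw in ['second-hand car purchase', 'to become educated', 'housing renovation']:
    print(raw, '->', categorize_purpose(raw))
```

A rule like this should be checked against the full list of unique values before use, since a short keyword can match unintended text.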

In [14]:
# Correcting the data: each consolidated category lists the raw spellings it absorbs
purpose_groups = {
    'building a property (undefined)': ['building a property', 'building a real estate'],
    'buy real estate (undefined)': ['buy real estate', 'buying property for renting out',
                                    'real estate transactions', 'housing transactions', 'property'],
    'buy commercial real estate': ['transactions with commercial real estate'],
    'buy or renovate residential properties': [
        'buy residential real estate', 'purchase of my own house', 'purchase of the house',
        'purchase of the house for my family', 'housing', 'transactions with my real estate',
        'housing renovation'],
    'buy a car': ['buying my own car', 'car', 'cars', 'car purchase', 'purchase of a car',
                  'to buy a car', 'to own a car', 'buying a second-hand car',
                  'second-hand car purchase'],
    'education': ['getting an education', 'supplementary education', 'to become educated',
                  'to get a supplementary education', 'university education',
                  'getting higher education', 'going to university', 'profile education'],
    'wedding': ['having a wedding', 'to have a wedding', 'wedding ceremony'],
}

# Flatten to {raw spelling: category} and apply all replacements in one pass
purpose_map = {raw: category for category, raws in purpose_groups.items() for raw in raws}
default_risk['purpose'] = default_risk['purpose'].replace(purpose_map)


In [15]:
# Checking the `purpose` column again to make sure everything is correct
print(sorted(default_risk['purpose'].unique()))
print()
print(default_risk['purpose'].value_counts())
print()


['building a property (undefined)', 'buy a car', 'buy commercial real estate', 'buy or renovate residential properties', 'buy real estate (undefined)', 'construction of own property', 'education', 'wedding']

purpose
buy or renovate residential properties    4404
buy a car                                 4315
education                                 4022
buy real estate (undefined)               3240
wedding                                   2348
buy commercial real estate                1315
building a property (undefined)           1246
construction of own property               635
Name: count, dtype: int64



Checking data in `children` column

In [16]:
# Checking the distribution of values in the `children` column
print(sorted(default_risk['children'].unique()))
print()
print(default_risk['children'].value_counts())
print()

values_total = default_risk['children'].count()

print(f'Total data in column = {values_total}.')
print()

values_one_neg = default_risk[default_risk['children'] == -1]['children'].count()
print(f'Total of \"-1\" values in column = {values_one_neg}.')
prob_one_neg_total = (values_one_neg/values_total)
print(f'Total of \"-1\" values in column = {prob_one_neg_total:.2%}.')

print()

values_twenty = default_risk[default_risk['children'] == 20]['children'].count()
print(f'Total of \"20\" values in column = {values_twenty}.')
prob_twenty_total = (values_twenty/values_total)
print(f'Total of \"20\" values in column = {prob_twenty_total:.2%}.')

[np.int64(-1), np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(20)]

children
 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: count, dtype: int64

Total data in column = 21525.

Total of "-1" values in column = 47.
Total of "-1" values in column = 0.22%.

Total of "20" values in column = 76.
Total of "20" values in column = 0.35%.


The "children" column contains 47 entries with the value -1, which is not a possible number of children. They account for 0.22% of the total and do not have a significant impact on the final result.

There are also 76 entries with "20" children, which account for 0.35% of the total.

These problematic values were probably caused by typing errors when the information was entered. They will be changed as follows:
- the value "-1" will be changed to "1" (one);
- the value "20" will be changed to "2" (two).

In [17]:
# Changing the data in the 'children' column
default_risk['children'] = default_risk['children'].replace(-1, 1)
default_risk['children'] = default_risk['children'].replace(20, 2)


In [18]:
# Checking the `children` column again to make sure everything is correct
print(sorted(default_risk['children'].unique()))
print()
print(default_risk['children'].value_counts())
print()


[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5)]

children
0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: count, dtype: int64



Checking data in `days_employed` column.

In [19]:
# Finding possible problematic data in `days_employed`.
print(default_risk['days_employed'].value_counts())
print()

# Checking the mean of negative values
print(default_risk[default_risk['days_employed'] <=0].mean(numeric_only=True))
print()

# Counting the number of negative values
print(default_risk[default_risk['days_employed'] <=0].count())
print()

# Counting the positive values, and those above 20000 days (about 55 years)
print(default_risk[default_risk['days_employed'] > 0].count())
print()
print(default_risk[default_risk['days_employed'] > 20000].count())

days_employed
-8437.673028      1
-3507.818775      1
 354500.415854    1
-769.717438       1
-3963.590317      1
                 ..
-1099.957609      1
-209.984794       1
 398099.392433    1
-1271.038880      1
-1984.507589      1
Name: count, Length: 19351, dtype: int64

children                0.562744
days_employed       -2353.015932
dob_years              39.818245
education_id            0.797372
family_status_id        0.969634
debt                    0.087326
total_income        27837.509634
dtype: float64

children            15906
days_employed       15906
dob_years           15906
education           15906
education_id        15906
family_status       15906
family_status_id    15906
gender              15906
income_type         15906
debt                15906
total_income        15906
purpose             15906
dtype: int64

children            3445
days_employed       3445
dob_years           3445
education           3445
education_id        3445
family_status       3445
f

It was previously observed that the positive values for days worked coincide with the "retiree" income type. This latest analysis found that 100% of them are unrealistic, since they correspond to hundreds of years of work, and in some cases more than 1000 years.

Assuming that 62 years is the average age at which people retire, these high values for days worked will be replaced by 22630 days, which is equivalent to the 62 years mentioned (62 × 365).

There are also many negative values, which do not make sense, since it is not possible to have negative days worked. A technical error probably caused these problematic rows. They account for about 82.2% of the non-missing values (15906 of 19351), and the negative sign will be removed from them.
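
The arithmetic behind these figures can be checked with a small conversion helper. This is a minimal sketch that assumes a 365-day year, matching 62 × 365 = 22630. The function name `days_to_years` is illustrative.

```python
DAYS_PER_YEAR = 365  # assumption: leap years are ignored

def days_to_years(days):
    """Convert a day count to years, ignoring the sign of the input."""
    return abs(days) / DAYS_PER_YEAR

retirement_days = 62 * DAYS_PER_YEAR
print(retirement_days)                          # 22630
print(round(days_to_years(340266.072047)))      # 932
print(round(days_to_years(-8437.673028), 1))    # 23.1
```

The two sample values come from the first rows of the dataset shown earlier: the positive one is far beyond any working life, while the negative one is plausible once the sign is dropped.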

In [20]:
# Replacing the unrealistically high (positive) values of days worked with 22630 days (62 years)
RETIREMENT_DAYS = 62 * 365  # 22630
is_positive = default_risk['days_employed'] > 0
default_risk.loc[is_positive, 'days_employed'] = RETIREMENT_DAYS

# Removing the negative sign from the remaining values (missing values stay missing)
default_risk['days_employed'] = default_risk['days_employed'].abs()

print(default_risk.head(10))


   children  days_employed  dob_years            education  education_id  \
0         1    8437.673028         42    bachelor's degree             0   
1         1    4024.803754         36  secondary education             1   
2         0    5623.422610         33  secondary education             1   
3         3    4124.747207         32  secondary education             1   
4         0   22630.000000         53  secondary education             1   
5         0     926.185831         27    bachelor's degree             0   
6         0    2879.202052         43    bachelor's degree             0   
7         0     152.779569         50  secondary education             1   
8         2    6929.865299         35    bachelor's degree             0   
9         0    2188.756445         41  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1

In [21]:
# Checking the result to make sure they are correct
print(default_risk[default_risk['days_employed'] <=0].count())
print()
print(default_risk[default_risk['days_employed'] > 0].count())

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

children            19351
days_employed       19351
dob_years           19351
education           19351
education_id        19351
family_status       19351
family_status_id    19351
gender              19351
income_type         19351
debt                19351
total_income        19351
purpose             19351
dtype: int64


Let's now check the data about the customers' ages (column `'dob_years'`)

In [22]:
# Checking `dob_years` for suspicious values
print(sorted(default_risk['dob_years'].unique()))
print()
print(default_risk['dob_years'].value_counts())
print()

values_dob_years_total = default_risk['dob_years'].count()
print(f'Total data in column = {values_dob_years_total}.')
print()

values_dob_years_zero = default_risk[default_risk['dob_years'] == 0]['dob_years'].count()
print(f'Total of \"0\" values in column = {values_dob_years_zero}.')
prob_dob_years_zero = (values_dob_years_zero/values_dob_years_total)
print(f'Total of \"0\" values in column = {prob_dob_years_zero:.2%}.')


[np.int64(0), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(40), np.int64(41), np.int64(42), np.int64(43), np.int64(44), np.int64(45), np.int64(46), np.int64(47), np.int64(48), np.int64(49), np.int64(50), np.int64(51), np.int64(52), np.int64(53), np.int64(54), np.int64(55), np.int64(56), np.int64(57), np.int64(58), np.int64(59), np.int64(60), np.int64(61), np.int64(62), np.int64(63), np.int64(64), np.int64(65), np.int64(66), np.int64(67), np.int64(68), np.int64(69), np.int64(70), np.int64(71), np.int64(72), np.int64(73), np.int64(74), np.int64(75)]

dob_years
35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    5

There are 101 customers whose age is recorded as "0" (zero). This may have been a typing error, but there is no way of knowing which ages were meant to be entered. These entries correspond to **0.47%** of the total and will be replaced.

In [23]:
# Checking the mean and median ages.
value_dob_years_mean = (int(default_risk[default_risk['dob_years'] > 0]['dob_years'].mean()))
print(value_dob_years_mean)
print()
value_dob_years_median = (int(default_risk[default_risk['dob_years'] > 0]['dob_years'].median()))
print(value_dob_years_median)


43

43


These 101 ages recorded as "0" (zero) will be replaced by the mean of the other ages (43, which is also the median).
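
The key point is that the mean is computed from the valid ages only, so the zeros do not pull it down. The following is a small sketch on a toy Series, with illustrative variable names.

```python
import pandas as pd

# Toy data for illustration only
ages = pd.Series([0, 30, 40, 50, 0, 41], name='dob_years')

# Mean of the valid (non-zero) ages, truncated to a whole number of years
mean_age = int(ages[ages > 0].mean())

# Replace only the zero entries; all other ages stay unchanged
fixed = ages.mask(ages == 0, mean_age)
print(fixed.tolist())
```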

In [24]:
# Replacing "0" (zero) values in the 'dob_years' column.
default_risk['dob_years'] = default_risk['dob_years'].replace(0, value_dob_years_mean)

In [25]:
# Checking that the values were corrected.
print(default_risk['dob_years'].unique())
print()
print(default_risk['dob_years'].value_counts())


[42 36 33 32 53 27 43 50 35 41 40 65 54 56 26 48 24 21 57 67 28 63 62 47
 34 68 25 31 30 20 49 37 45 61 64 44 52 46 23 38 39 51 59 29 60 55 58 71
 22 73 66 69 19 72 70 74 75]

dob_years
35    617
43    614
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: count, dtype: int64


Checking the data in the `family_status` column.

In [26]:
# Let's see the values of the 'family_status' column.
print(sorted(default_risk['family_status'].unique()))
print()
print(default_risk['family_status'].value_counts())


['civil partnership', 'divorced', 'married', 'unmarried', 'widow / widower']

family_status
married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: count, dtype: int64


In [27]:
# Making a simple adjustment to the 'family_status' column:
# removing the spaces around the slash in 'widow / widower'.
default_risk['family_status'] = default_risk['family_status'].replace('widow / widower', 'widow/widower')


In [28]:
# Checking that the value was adjusted correctly.
print(sorted(default_risk['family_status'].unique()))
print()
print(default_risk['family_status'].value_counts())

['civil partnership', 'divorced', 'married', 'unmarried', 'widow/widower']

family_status
married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow/widower          960
Name: count, dtype: int64


Analyzing data from the `gender` column.

In [29]:
# Checking values in the 'gender' column.
print(sorted(default_risk['gender'].unique()))
print()
print(default_risk['gender'].value_counts())

['F', 'M', 'XNA']

gender
F      14236
M       7288
XNA        1
Name: count, dtype: int64


In [30]:
# Making a simple adjustment to the 'gender' column:
# changing the single 'XNA' record to 'F', the most frequent value.
default_risk['gender'] = default_risk['gender'].replace('XNA', 'F')

In [31]:
# Checking that the value was adjusted correctly.
print(sorted(default_risk['gender'].unique()))
print()
print(default_risk['gender'].value_counts())

['F', 'M']

gender
F    14237
M     7288
Name: count, dtype: int64


Analyzing data from the `income_type` column.

In [32]:
# Checking information in the 'income_type' column.
print(sorted(default_risk['income_type'].unique()))
print()
print(default_risk['income_type'].value_counts())

['business', 'civil servant', 'employee', 'entrepreneur', 'paternity / maternity leave', 'retiree', 'student', 'unemployed']

income_type
employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: count, dtype: int64


In [33]:
# Performing a simple adjustment in this column.
# Changing the term 'paternity / maternity leave'
# to 'paternity/maternity leave' (without the spaces before and after the slash).
default_risk['income_type'] = default_risk['income_type'].replace('paternity / maternity leave', 'paternity/maternity leave')


In [34]:
# Check the result if it is corrected.
print(sorted(default_risk['income_type'].unique()))
print()
print(default_risk['income_type'].value_counts())


['business', 'civil servant', 'employee', 'entrepreneur', 'paternity/maternity leave', 'retiree', 'student', 'unemployed']

income_type
employee                     11119
business                      5085
retiree                       3856
civil servant                 1459
unemployed                       2
entrepreneur                     2
student                          1
paternity/maternity leave        1
Name: count, dtype: int64


At this point we will analyze whether we have duplicates in our data.

In [35]:
# Checking for duplicates in data.
print(default_risk.duplicated().sum())


252


We found 252 duplicates in our data.

The dataset has no column that uniquely identifies each customer, but these rows are unlikely to be coincidental matches: two of the columns hold floating-point values with several decimal digits, so two different customers would rarely share exactly the same values. These columns are:
- 'days_employed';
- 'total_income'.

Since these rows make up only **1.17%** of the total (252 of 21,525), removing them should not materially affect the final analysis. Therefore, we will exclude these rows from the data.
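
The share quoted above and the duplicated rows themselves can be inspected before anything is dropped. This is a minimal sketch: the `toy` frame is made up and only stands in for `default_risk`.

```python
import pandas as pd

# Arithmetic behind the share of duplicates reported in the text.
print(f'{252 / 21525:.2%}')

# Toy frame with one repeated row; hypothetical data, not from the dataset.
toy = pd.DataFrame({
    'children': [1, 0, 1, 2],
    'total_income': [100.5, 200.0, 100.5, None],
})

# duplicated() flags rows that repeat an earlier row; the mean is their share.
dup_share = toy.duplicated().mean() * 100
print(f'{dup_share:.2f}% of rows are duplicates')

# keep=False returns every member of a duplicate group, which helps to see
# whether the repeated rows carry real values or only missing ones.
dup_rows = toy[toy.duplicated(keep=False)]
print(dup_rows)
```

Looking at `dup_rows` before calling `drop_duplicates()` is a cheap way to confirm that the rows being removed are the ones expected.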

In [36]:
# Deleting duplicates
default_risk = default_risk.drop_duplicates()


In [37]:
# Checking if we still have duplicates
print(default_risk.duplicated().sum())

0


In [38]:
# Checking the dataset after these manipulations are performed.
default_risk.info()


<class 'pandas.core.frame.DataFrame'>
Index: 21273 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21273 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21273 non-null  int64  
 3   education         21273 non-null  object 
 4   education_id      21273 non-null  int64  
 5   family_status     21273 non-null  object 
 6   family_status_id  21273 non-null  int64  
 7   gender            21273 non-null  object 
 8   income_type       21273 non-null  object 
 9   debt              21273 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21273 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


At this point, our data set is cleaner so that we can perform the necessary analyses and categorizations.

Most columns have been adjusted, and 252 duplicate rows have been removed from the original total.

## Working with missing values

### Restoring missing values in `total_income`

We have 2 (two) columns with missing values in this dataset:
- "days_employed";
- "total_income".

We will compare income across gender, children, age and education. The averages and medians will be checked to decide which values to insert in place of the missing ones.

A table will be created without the rows containing missing values so that the statistics are computed only from known data.

In [39]:
# Creating a table without missing values
new_table = default_risk.dropna(subset=['days_employed', 'total_income'])

display(new_table.head(60))

| | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8437.673028 | 42 | bachelor's degree | 0 | married | 0 | F | employee | 0 | 40620.102 | buy or renovate residential properties |
| 1 | 1 | 4024.803754 | 36 | secondary education | 1 | married | 0 | F | employee | 0 | 17932.802 | buy a car |
| 2 | 0 | 5623.42261 | 33 | secondary education | 1 | married | 0 | M | employee | 0 | 23341.752 | buy or renovate residential properties |
| 3 | 3 | 4124.747207 | 32 | secondary education | 1 | married | 0 | M | employee | 0 | 42820.568 | education |
| 4 | 0 | 22630.0 | 53 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 25378.572 | wedding |
| 5 | 0 | 926.185831 | 27 | bachelor's degree | 0 | civil partnership | 1 | M | business | 0 | 40922.17 | buy or renovate residential properties |
| 6 | 0 | 2879.202052 | 43 | bachelor's degree | 0 | married | 0 | F | business | 0 | 38484.156 | buy real estate (undefined) |
| 7 | 0 | 152.779569 | 50 | secondary education | 1 | married | 0 | M | employee | 0 | 21731.829 | education |
| 8 | 2 | 6929.865299 | 35 | bachelor's degree | 0 | civil partnership | 1 | F | employee | 0 | 15337.093 | wedding |
| 9 | 0 | 2188.756445 | 41 | secondary education | 1 | married | 0 | M | employee | 0 | 23108.15 | buy or renovate residential properties |


In [40]:
# Analyzing average income values based on some parameters

# Total average value
value_mean_total = new_table['total_income'].mean()
print(f'Average income value of all values is {value_mean_total:.2f}.')
print()

# Average value of women
value_mean_gender_f = new_table.loc[new_table['gender'] == 'F']['total_income'].mean()
print(f'Average income for women is {value_mean_gender_f:.2f}.')

# Average value of men
value_mean_gender_m = new_table.loc[new_table['gender'] == 'M']['total_income'].mean()
print(f'Average income for men is {value_mean_gender_m:.2f}.')
print()

# Average value of women under 40
value_mean_gender_f_age_under40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']<40)]['total_income'].mean()
print(f'The average income of women under 40 is {value_mean_gender_f_age_under40:.2f}.')

# Average value of men under 40
value_mean_gender_m_age_under40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']<40)]['total_income'].mean()
print(f'The average income of men under 40 is {value_mean_gender_m_age_under40:.2f}.')

# Average value for women aged 40 or over
value_mean_gender_f_age_over40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']>=40)]['total_income'].mean()
print(f'Average income for women aged 40 or over is {value_mean_gender_f_age_over40:.2f}.')

# Average value of men aged 40 or over
value_mean_gender_m_age_over40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']>=40)]['total_income'].mean()
print(f'Average income for men aged 40 and over is {value_mean_gender_m_age_over40:.2f}.')
print()

# Average value for women who have children
value_mean_gender_f_child = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['children']>0)]['total_income'].mean()
print(f'Average income of women who have child(ren) is {value_mean_gender_f_child:.2f}.')

# Average value for men who have children
value_mean_gender_m_child = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['children']>0)]['total_income'].mean()
print(f'Average income of men who have children is {value_mean_gender_m_child:.2f}.')
print()

# Average value of women with higher education/graduates
value_mean_gender_f_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')]['total_income'].mean()
print(f'Average income of female graduates is {value_mean_gender_f_f:.2f}.')

# Average value of men with higher education/graduates
value_mean_gender_m_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')]['total_income'].mean()
print(f'Average income of educated men is {value_mean_gender_m_f:.2f}.')
print()

# Average value of women with higher education/graduates and who have children
value_mean_gender_f_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['children']>0)]['total_income'].mean()
print(f'Average income of educated women who have children is {value_mean_gender_f_f_child:.2f}.')

# Average value of men with higher education/graduates and who have children
value_mean_gender_m_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['children']>0)]['total_income'].mean()
print(f'Average income of educated men who have children is {value_mean_gender_m_f_child:.2f}.')
print()

# Average value of women with higher education/graduates, under 40 years old and who have children
value_mean_gender_f_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['total_income'].mean()
print(f'The average income of educated women who have children under 40 years of age is {value_mean_gender_f_f_child_age_under40:.2f}.')

# Average value of men with higher education/graduates, under 40 years old and who have children
value_mean_gender_m_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['total_income'].mean()
print(f'The average income of educated men who have children under 40 years of age is {value_mean_gender_m_f_child_age_under40:.2f}.')
print()

Average income value of all values is 26787.57.

Average income for women is 24656.23.
Average income for men is 30907.14.

The average income of women under 40 is 24815.59.
The average income of men under 40 is 31020.82.
Average income for women aged 40 or over is 24560.69.
Average income for men aged 40 and over is 30799.25.

Average income of women who have child(ren) is 24587.20.
Average income of men who have children is 32423.54.

Average income of female graduates is 30305.84.
Average income of educated men is 38950.76.

Average income of educated women who have children is 29300.42.
Average income of educated men who have children is 41720.44.

The average income of educated women who have children under 40 years of age is 28349.18.
The average income of educated men who have children under 40 years of age is 39910.57.



In [41]:
# Analyzing median income values based on some parameters

# Total median value
value_median_total = new_table['total_income'].median()
print(f'Median income value of all values is {value_median_total:.2f}.')
print()

# Median value of women
value_median_gender_f = new_table.loc[new_table['gender'] == 'F']['total_income'].median()
print(f'Median income for women is {value_median_gender_f:.2f}.')

# Median value of men
value_median_gender_m = new_table.loc[new_table['gender'] == 'M']['total_income'].median()
print(f'Median income for men is {value_median_gender_m:.2f}.')
print()

# Median value for women under 40
value_median_gender_f_age_under40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']<40)]['total_income'].median()
print(f'The median income for women under 40 is {value_median_gender_f_age_under40:.2f}.')

# Median value for men under 40
value_median_gender_m_age_under40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']<40)]['total_income'].median()
print(f'The median income for men under 40 is {value_median_gender_m_age_under40:.2f}.')

# Median value for women aged 40 or over
value_median_gender_f_age_over40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']>=40)]['total_income'].median()
print(f'Median income for women aged 40 and over is {value_median_gender_f_age_over40:.2f}.')

# Median value for men aged 40 or over
value_median_gender_m_age_over40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']>=40)]['total_income'].median()
print(f'Median income for men aged 40 and over is {value_median_gender_m_age_over40:.2f}.')
print()

# Median value for women who have children
value_median_gender_f_child = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['children']>0)]['total_income'].median()
print(f'Median income for women who have children is {value_median_gender_f_child:.2f}.')

# Median value for men who have children
value_median_gender_m_child = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['children']>0)]['total_income'].median()
print(f'Median income for men who have children is {value_median_gender_m_child:.2f}.')
print()

# Median value of women with higher education/graduates
value_median_gender_f_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')]['total_income'].median()
print(f'Median income for women graduates is {value_median_gender_f_f:.2f}.')

# Median value for men with higher education/graduates
value_median_gender_m_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')]['total_income'].median()
print(f'Median income for male graduates is {value_median_gender_m_f:.2f}.')
print()

# Median value for women with higher education/graduates and who have children
value_median_gender_f_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['children']>0)]['total_income'].median()
print(f'Median income for women with degrees who have children is {value_median_gender_f_f_child:.2f}.')

# Median value of men with higher education/graduates and who have children
value_median_gender_m_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['children']>0)]['total_income'].median()
print(f'Median income for men with degrees who have children is {value_median_gender_m_f_child:.2f}.')
print()

# Median value for women with higher education/graduates, under 40 years old and who have children
value_median_gender_f_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['total_income'].median()
print(f'The median income of educated women who have children under 40 years of age is {value_median_gender_f_f_child_age_under40:.2f}.')

# Median value of men with higher education/graduates, under 40 years old and who have children
value_median_gender_m_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['total_income'].median()
print(f'The median income of educated men who have children under 40 years of age is {value_median_gender_m_f_child_age_under40:.2f}.')
print()

Median income value of all values is 23202.87.

Median income for women is 21465.17.
Median income for men is 26834.30.

The median income for women under 40 is 21678.97.
The median income for men under 40 is 27269.98.
Median income for women aged 40 and over is 21327.87.
Median income for men aged 40 and over is 26595.15.

Median income for women who have children is 21481.19.
Median income for men who have children is 27994.07.

Median income for women graduates is 26063.47.
Median income for male graduates is 32623.49.

Median income for women with degrees who have children is 24387.33.
Median income for men with degrees who have children is 33907.39.

The median income of educated women who have children under 40 years of age is 23685.43.
The median income of educated men who have children under 40 years of age is 32689.95.
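
The individual means and medians printed above can also be gathered into one table with `groupby` and `agg`. This is a minimal sketch on made-up data: the `toy` frame and its values are hypothetical and only reuse the column names of the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'education': ["bachelor's degree", 'secondary education',
                  "bachelor's degree", 'secondary education',
                  "bachelor's degree", "bachelor's degree"],
    'total_income': [30000.0, 20000.0, 40000.0, 25000.0, 26000.0, 32000.0],
})

# One row per (gender, education) group with mean, median and group size.
summary = (
    toy.groupby(['gender', 'education'])['total_income']
       .agg(['mean', 'median', 'count'])
)
print(summary)
```

Including `count` next to the statistics makes it easier to notice groups that are too small for their median to be reliable.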



Based on the analyses performed, missing values will be replaced by medians as follows:
- for customers with higher education (graduates), missing values will be filled with the median income of graduates of the same gender;
- for customers without higher education, missing values will be filled with the overall median income of the same gender.

In [42]:
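# Defining a function to fill in missing values.
def fill_missing_values_income(row):
    """Return the row's income, or a median by gender and education if it is missing."""
    total_income = row['total_income']
    gender = row['gender']
    education_id = row['education_id']

    # Keep valid incomes. NaN > 0 evaluates to False, so missing values fall through.
    if total_income > 0:
        return total_income
    # education_id == 0 corresponds to "bachelor's degree" (graduates).
    if education_id == 0:
        if gender == 'M':
            return value_median_gender_m_f
        else:
            return value_median_gender_f_f
    # Any other education level uses the overall median for the gender.
    if education_id > 0:
        if gender == 'M':
            return value_median_gender_m
        else:
            return value_median_gender_f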


In [43]:
# Checking if the function works.
# A 'total_income' of 0 stands in for a missing value in this test.
row_values = [[0, 'M', 3], [0, 'F', 7], [10, 'F', 0], [0, 'M', 0], [0, 'F', 0], [10, 'M', 0]]
row_columns = ['education_id', 'gender', 'total_income']
df_teste = pd.DataFrame(row_values, columns=row_columns)
print(df_teste.head(10))
print()

try:
    df_teste['total_income'] = df_teste.apply(fill_missing_values_income, axis=1)
except Exception as error:
    # Report the failure instead of silently discarding it.
    print(f'Runtime error: {error}')

print(df_teste)

   education_id gender  total_income
0             0      M             3
1             0      F             7
2            10      F             0
3             0      M             0
4             0      F             0
5            10      M             0

   education_id gender  total_income
0             0      M        3.0000
1             0      F        7.0000
2            10      F    21465.1650
3             0      M    32623.4850
4             0      F    26063.4715
5            10      M    26834.2950


In [44]:
# Applying the function to all rows of the dataframe.
default_risk['total_income'] = default_risk.apply(fill_missing_values_income, axis=1)

In [45]:
# Checking if everything went well.
print(default_risk['total_income'].isna().sum())
print()
print(default_risk.loc[default_risk['total_income']==0])
print()
print(default_risk.loc[default_risk['total_income']<0])


0

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []


In [46]:
# Checking the non-null counts to confirm that 'total_income' was actually filled.
default_risk.info()


<class 'pandas.core.frame.DataFrame'>
Index: 21273 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21273 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21273 non-null  int64  
 3   education         21273 non-null  object 
 4   education_id      21273 non-null  int64  
 5   family_status     21273 non-null  object 
 6   family_status_id  21273 non-null  int64  
 7   gender            21273 non-null  object 
 8   income_type       21273 non-null  object 
 9   debt              21273 non-null  int64  
 10  total_income      21273 non-null  float64
 11  purpose           21273 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


### Restoring values in `days_employed`

We will analyze the same parameters that we used to evaluate 'total_income' for decision making.
- gender;
- children;
- age;
- education.

The averages and medians will also be checked to make a decision about which information will be inserted in the missing values of this column.
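
One reason to compare both statistics is that the mean is sensitive to extreme values while the median is not. A minimal illustration with made-up numbers (the `days` Series is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Four plausible values and one extreme value, all invented for illustration.
days = pd.Series([900.0, 1500.0, 2200.0, 3100.0, 340000.0])

# The single extreme value pulls the mean far above the typical value.
print(f'mean:   {days.mean():.2f}')

# The median is the middle value and is not affected by the outlier.
print(f'median: {days.median():.2f}')
```

When the two statistics are far apart, the median is usually the safer choice for filling missing values.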

In [47]:
# Analyzing median values of days worked based on some parameters

# Total median value
value_median_days_employed_total = new_table['days_employed'].median()
print(f'Median value of days worked of all values is {value_median_days_employed_total:.2f}.')
print()

# Median value of women
value_median_days_employed_gender_f = new_table.loc[new_table['gender'] == 'F']['days_employed'].median()
print(f'Median value of days worked by women is {value_median_days_employed_gender_f:.2f}.')

# Median value of men
value_median_days_employed_gender_m = new_table.loc[new_table['gender'] == 'M']['days_employed'].median()
print(f'Median value of days worked by men is {value_median_days_employed_gender_m:.2f}.')
print()

# Median value for women under 40
value_median_days_employed_gender_f_age_under40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']<40)]['days_employed'].median()
print(f'The median number of days worked by women under 40 is {value_median_days_employed_gender_f_age_under40:.2f}.')

# Median value for men under 40
value_median_days_employed_gender_m_age_under40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']<40)]['days_employed'].median()
print(f'The median number of days worked for men under 40 is {value_median_days_employed_gender_m_age_under40:.2f}.')

# Median value for women aged 40 or over
value_median_days_employed_gender_f_age_over40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']>=40)]['days_employed'].median()
print(f'Median number of days worked by women aged 40 or over is {value_median_days_employed_gender_f_age_over40:.2f}.')

# Median value of men aged 40 or over
value_median_days_employed_gender_m_age_over40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']>=40)]['days_employed'].median()
print(f'Median value of days worked for men aged 40 or over is {value_median_days_employed_gender_m_age_over40:.2f}.')
print()

# Median value for women who have children
value_median_days_employed_gender_f_child = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['children']>0)]['days_employed'].median()
print(f'Median value of days worked by women who have child(ren) is {value_median_days_employed_gender_f_child:.2f}.')

# Median value for men who have children
value_median_days_employed_gender_m_child = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['children']>0)]['days_employed'].median()
print(f'Median value of days worked for men who have children is {value_median_days_employed_gender_m_child:.2f}.')
print()

# Median value of women with higher education/graduates
value_median_days_employed_gender_f_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')]['days_employed'].median()
print(f'Median value of days worked by female graduates is {value_median_days_employed_gender_f_f:.2f}.')

# Median value for men with higher education/graduates
value_median_days_employed_gender_m_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')]['days_employed'].median()
print(f'Median value of days worked by graduated men is {value_median_days_employed_gender_m_f:.2f}.')
print()

# Median value for women with higher education/graduates and who have children
value_median_days_employed_gender_f_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['children']>0)]['days_employed'].median()
print(f'Median value of days worked by graduated women who have child(ren) is {value_median_days_employed_gender_f_f_child:.2f}.')

# Median value of men with higher education/graduates and who have children
value_median_days_employed_gender_m_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['children']>0)]['days_employed'].median()
print(f'Median value of days worked for graduated men who have child(ren) is {value_median_days_employed_gender_m_f_child:.2f}.')
print()

# Median value for women with higher education/graduates, under 40 years old and who have children
value_median_days_employed_gender_f_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['days_employed'].median()
print(f'The median number of days worked by women with degrees who have children under 40 years of age is {value_median_days_employed_gender_f_f_child_age_under40:.2f}.')

# Median value of men with higher education/graduates, under 40 years old and who have children
value_median_days_employed_gender_m_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['days_employed'].median()
print(f'The median number of days worked by educated men who have children under 40 years of age is {value_median_days_employed_gender_m_f_child_age_under40:.2f}.')
print()


Median value of days worked of all values is 2194.22.

Median value of days worked by women is 2539.76.
Median value of days worked by men is 1662.37.

The median number of days worked by women under 40 is 1387.33.
The median number of days worked for men under 40 is 1228.64.
Median number of days worked by women aged 40 or over is 4662.14.
Median value of days worked for men aged 40 or over is 2465.40.

Median value of days worked by women who have child(ren) is 1755.00.
Median value of days worked for men who have children is 1553.83.

Median value of days worked by female graduates is 2011.17.
Median value of days worked by graduated men is 1659.31.

Median value of days worked by graduated women who have child(ren) is 1669.09.
Median value of days worked for graduated men who have child(ren) is 1403.58.

The median number of days worked by women with degrees who have children under 40 years of age is 1391.78.
The median number of days worked by educated men who have children under 

In [48]:
# Analyzing average values of days worked based on some parameters

# Total average value
value_mean_days_employed_total = new_table['days_employed'].mean()
print(f'Average value of days worked of all values is {value_mean_days_employed_total:.2f}.')
print()

# Average value of women
value_mean_days_employed_gender_f = new_table.loc[new_table['gender'] == 'F']['days_employed'].mean()
print(f'Average number of days worked by women is {value_mean_days_employed_gender_f:.2f}.')

# Average value of men
value_mean_days_employed_gender_m = new_table.loc[new_table['gender'] == 'M']['days_employed'].mean()
print(f'Average number of days worked by men is {value_mean_days_employed_gender_m:.2f}.')
print()

# Average value of women under 40
value_mean_days_employed_gender_f_age_under40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']<40)]['days_employed'].mean()
print(f'The average number of days worked by women under 40 is {value_mean_days_employed_gender_f_age_under40:.2f}.')

# Average value of men under 40
value_mean_days_employed_gender_m_age_under40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']<40)]['days_employed'].mean()
print(f'The average number of days worked by men under 40 is {value_mean_days_employed_gender_m_age_under40:.2f}.')

# Average value of women aged 40 or over
value_mean_days_employed_gender_f_age_over40 = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['dob_years']>=40)]['days_employed'].mean()
print(f'Average number of days worked by women aged 40 or over is {value_mean_days_employed_gender_f_age_over40:.2f}.')

# Average value of men aged 40 or over
value_mean_days_employed_gender_m_age_over40 = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['dob_years']>=40)]['days_employed'].mean()
print(f'Average number of days worked by men aged 40 or over is {value_mean_days_employed_gender_m_age_over40:.2f}.')
print()

# Average value for women who have children
value_mean_days_employed_gender_f_child = new_table.loc[(new_table['gender'] == 'F')\
                                          &(new_table['children']>0)]['days_employed'].mean()
print(f'Average number of days worked by women who have children is {value_mean_days_employed_gender_f_child:.2f}.')

# Average value for men who have children
value_mean_days_employed_gender_m_child = new_table.loc[(new_table['gender'] == 'M')\
                                           &(new_table['children']>0)]['days_employed'].mean()
print(f'Average number of days worked by men who have children is {value_mean_days_employed_gender_m_child:.2f}.')
print()

# Average value of women with higher education/graduates
value_mean_days_employed_gender_f_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')]['days_employed'].mean()
print(f'Average number of days worked by women graduates is {value_mean_days_employed_gender_f_f:.2f}.')

# Average value of men with higher education/graduates
value_mean_days_employed_gender_m_f = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')]['days_employed'].mean()
print(f'Average number of days worked by trained men is {value_mean_days_employed_gender_m_f:.2f}.')
print()

# Average value of women with higher education/graduates and who have children
value_mean_days_employed_gender_f_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['children']>0)]['days_employed'].mean()
print(f'Average number of days worked by women with degrees and children is {value_mean_days_employed_gender_f_f_child:.2f}.')

# Average value of men with higher education/graduates and who have children
value_mean_days_employed_gender_m_f_child = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['children']>0)]['days_employed'].mean()
print(f'The average number of days worked by educated men who have children is {value_mean_days_employed_gender_m_f_child:.2f}.')
print()

# Average value of women with higher education/graduates, under 40 years old and who have children
value_mean_days_employed_gender_f_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='F')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['days_employed'].mean()
print(f'The average number of days worked by women with degrees who have children under 40 years of age is {value_mean_days_employed_gender_f_f_child_age_under40:.2f}.')

# Average value of men with higher education/graduates, under 40 years old and who have children
value_mean_days_employed_gender_m_f_child_age_under40 = new_table.loc[(new_table['education'] == 'bachelor\'s degree')\
                                       &(new_table['gender']=='M')\
                                       &(new_table['dob_years']<40)\
                                       &(new_table['children']>0)]['days_employed'].mean()
print(f'The average number of days worked by educated men who have children under 40 years of age is {value_mean_days_employed_gender_m_f_child_age_under40:.2f}.')
print()

Average value of days worked of all values is 5962.87.

Average number of days worked by women is 6945.66.
Average number of days worked by men is 4063.26.

The average number of days worked by women under 40 is 1866.71.
The average number of days worked by men under 40 is 1770.16.
Average number of days worked by women aged 40 or over is 9990.61.
Average number of days worked by men aged 40 or over is 6239.84.

Average number of days worked by women who have children is 3254.89.
Average number of days worked by men who have children is 2704.90.

Average number of days worked by women graduates is 4894.34.
Average number of days worked by trained men is 3790.82.

Average number of days worked by women with degrees and children is 2736.23.
The average number of days worked by educated men who have children is 2666.39.

The average number of days worked by women with degrees who have children under 40 years of age is 1819.01.
The average number of days worked by educated men who have chi

Based on the analyses performed, missing values will be replaced by medians as follows:
- for customers with higher education (bachelor's degree), missing values are filled with the median days worked for graduates of the same gender.
- for customers without higher education, missing values are filled with the median days worked for the same gender.
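
The rule above can also be written without a row-wise function. The following is a minimal sketch, assuming missing values are stored as NaN. The frame `toy` and the helper column `is_graduate` are made up for illustration and do not exist in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; NaN marks a missing 'days_employed'.
toy = pd.DataFrame({
    'gender':        ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'education_id':  [0, 0, 0, 1, 0, 0, 1, 1],
    'days_employed': [1000.0, 3000.0, np.nan, 5000.0,
                      2500.0, np.nan, 6000.0, np.nan],
})

# In this dataset education_id 0 corresponds to "bachelor's degree".
toy['is_graduate'] = toy['education_id'] == 0

# Median per gender, broadcast back to every row (NaN values are skipped).
median_by_gender = toy.groupby('gender')['days_employed'].transform('median')

# Median per gender and graduate flag, broadcast back to every row.
median_by_gender_grad = (
    toy.groupby(['gender', 'is_graduate'])['days_employed'].transform('median')
)

# Graduates get the gender-and-graduate median, everyone else the gender median.
fill_value = median_by_gender_grad.where(toy['is_graduate'], median_by_gender)
toy['days_employed'] = toy['days_employed'].fillna(fill_value)

print(toy)
```

Only the rows that were NaN change; existing values are left untouched by `fillna`.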

In [49]:
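# Defining a function to fill in missing values.
def fill_missing_values_days_employed(row):
    """Return 'days_employed', or a group median when the value is missing."""
    days_employed = row['days_employed']
    gender = row['gender']
    education_id = row['education_id']

    # Keep valid values; NaN and 0 fail this test and fall through to the medians.
    if days_employed > 0:
        return days_employed
    # education_id 0 is "bachelor's degree": use the median for graduates of the same gender.
    if education_id == 0:
        if gender == 'M':
            return value_median_days_employed_gender_m_f
        return value_median_days_employed_gender_f_f
    # Any other education level: use the median for the gender as a whole.
    if gender == 'M':
        return value_median_days_employed_gender_m
    return value_median_days_employed_gender_f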


In [50]:
# Checking if the function works
row_values2 = [[0, 'M', 3000], [0, 'F', 7700], [10,'F',0],[0, 'M', 0], [0, 'F', 0],[10, 'M', 0]]
row_columns2 = ['education_id', 'gender', 'days_employed']
df_teste2 = pd.DataFrame(row_values2, columns=row_columns2)
print(df_teste2.head(10))
print()

try:
    df_teste2['days_employed'] = df_teste2.apply(fill_missing_values_days_employed, axis=1)
except Exception as error:
    # Report the failure instead of silently discarding it
    print(f'Runtime error: {error}')
    
print(df_teste2)


   education_id gender  days_employed
0             0      M           3000
1             0      F           7700
2            10      F              0
3             0      M              0
4             0      F              0
5            10      M              0

   education_id gender  days_employed
0             0      M    3000.000000
1             0      F    7700.000000
2            10      F    2539.761232
3             0      M    1659.309688
4             0      F    2011.171809
5            10      M    1662.370103


In [51]:
# Applying the function on the 'days_employed' column.
default_risk['days_employed'] = default_risk.apply(fill_missing_values_days_employed, axis=1)

In [52]:
# Checking if the function worked.
print(default_risk['days_employed'].isna().sum())
print()
print(default_risk.loc[default_risk['days_employed']==0])
print()
print(default_risk.loc[default_risk['days_employed']<0])
print()
display(default_risk.head(60))



0

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []



|   | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8437.673028 | 42 | bachelor's degree | 0 | married | 0 | F | employee | 0 | 40620.102 | buy or renovate residential properties |
| 1 | 1 | 4024.803754 | 36 | secondary education | 1 | married | 0 | F | employee | 0 | 17932.802 | buy a car |
| 2 | 0 | 5623.42261 | 33 | secondary education | 1 | married | 0 | M | employee | 0 | 23341.752 | buy or renovate residential properties |
| 3 | 3 | 4124.747207 | 32 | secondary education | 1 | married | 0 | M | employee | 0 | 42820.568 | education |
| 4 | 0 | 22630.0 | 53 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 25378.572 | wedding |
| 5 | 0 | 926.185831 | 27 | bachelor's degree | 0 | civil partnership | 1 | M | business | 0 | 40922.17 | buy or renovate residential properties |
| 6 | 0 | 2879.202052 | 43 | bachelor's degree | 0 | married | 0 | F | business | 0 | 38484.156 | buy real estate (undefined) |
| 7 | 0 | 152.779569 | 50 | secondary education | 1 | married | 0 | M | employee | 0 | 21731.829 | education |
| 8 | 2 | 6929.865299 | 35 | bachelor's degree | 0 | civil partnership | 1 | F | employee | 0 | 15337.093 | wedding |
| 9 | 0 | 2188.756445 | 41 | secondary education | 1 | married | 0 | M | employee | 0 | 23108.15 | buy or renovate residential properties |


In [53]:
# Checking the number of entries in the column.
# If 'days_employed' was actually populated.
default_risk.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21273 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21273 non-null  int64  
 1   days_employed     21273 non-null  float64
 2   dob_years         21273 non-null  int64  
 3   education         21273 non-null  object 
 4   education_id      21273 non-null  int64  
 5   family_status     21273 non-null  object 
 6   family_status_id  21273 non-null  int64  
 7   gender            21273 non-null  object 
 8   income_type       21273 non-null  object 
 9   debt              21273 non-null  int64  
 10  total_income      21273 non-null  float64
 11  purpose           21273 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


## Data categorization

With the data pre-processed and cleaned, some columns are now grouped into categories that support decision-making and help answer the hypotheses raised.

First, customers are categorized by age range:
- from 10 to 19 years old;
- from 20 to 29 years old;
- from 30 to 39 years old;
- from 40 to 49 years old;
- from 50 to 59 years old;
- from 60 to 69 years old;
- 70 or older.
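
As an alternative to an `if`/`elif` chain, the same ranges can be expressed as bin edges. This is a minimal sketch using `pd.cut` on a made-up series of ages; the names `ages`, `bins` and `labels` are illustrative. One difference to keep in mind: `pd.cut` returns NaN for ages below the first edge (10), whereas a chain that starts with `age < 20` would place them in `'10-19'`.

```python
import pandas as pd

# Made-up ages, one per range.
ages = pd.Series([15, 25, 35, 45, 55, 65, 75], name='dob_years')

# Left-closed bins: [10, 20), [20, 30), ..., [70, inf)
bins = [10, 20, 30, 40, 50, 60, 70, float('inf')]
labels = ['10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

# right=False makes each bin include its lower edge and exclude its upper edge.
age_category = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_category.tolist())
```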

In [54]:
# Defining a function that calculates the age category.
def calc_age(age):
    if age < 20:
        return '10-19'
    elif age < 30:
        return '20-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    else:
        return '70+'


In [55]:
# Testing if the function works

age_test0 = 15
age_test1 = 25
age_test2 = 35
age_test3 = 45
age_test4 = 55
age_test5 = 65
age_test6 = 75

print(calc_age(age_test0))
print(calc_age(age_test1))
print(calc_age(age_test2))
print(calc_age(age_test3))
print(calc_age(age_test4))
print(calc_age(age_test5))
print(calc_age(age_test6))



10-19
20-29
30-39
40-49
50-59
60-69
70+


In [56]:
# Creating a new column based on the function called 'age_category'.
default_risk['age_category'] = default_risk['dob_years'].apply(calc_age)


In [57]:
# Checking if everything went well.
display(default_risk.head(60))

|   | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose | age_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8437.673028 | 42 | bachelor's degree | 0 | married | 0 | F | employee | 0 | 40620.102 | buy or renovate residential properties | 40-49 |
| 1 | 1 | 4024.803754 | 36 | secondary education | 1 | married | 0 | F | employee | 0 | 17932.802 | buy a car | 30-39 |
| 2 | 0 | 5623.42261 | 33 | secondary education | 1 | married | 0 | M | employee | 0 | 23341.752 | buy or renovate residential properties | 30-39 |
| 3 | 3 | 4124.747207 | 32 | secondary education | 1 | married | 0 | M | employee | 0 | 42820.568 | education | 30-39 |
| 4 | 0 | 22630.0 | 53 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 25378.572 | wedding | 50-59 |
| 5 | 0 | 926.185831 | 27 | bachelor's degree | 0 | civil partnership | 1 | M | business | 0 | 40922.17 | buy or renovate residential properties | 20-29 |
| 6 | 0 | 2879.202052 | 43 | bachelor's degree | 0 | married | 0 | F | business | 0 | 38484.156 | buy real estate (undefined) | 40-49 |
| 7 | 0 | 152.779569 | 50 | secondary education | 1 | married | 0 | M | employee | 0 | 21731.829 | education | 50-59 |
| 8 | 2 | 6929.865299 | 35 | bachelor's degree | 0 | civil partnership | 1 | F | employee | 0 | 15337.093 | wedding | 30-39 |
| 9 | 0 | 2188.756445 | 41 | secondary education | 1 | married | 0 | M | employee | 0 | 23108.15 | buy or renovate residential properties | 40-49 |


Next, customers' total income (`total_income`) is categorized.

In [58]:
# Examining numeric data from the 'total_income' column for categorization
display(default_risk['total_income'].head(60))
print()
display(default_risk['total_income'].tail(60))

0      40620.102
1      17932.802
2      23341.752
3      42820.568
4      25378.572
5      40922.170
6      38484.156
7      21731.829
8      15337.093
9      23108.150
10     18230.959
11     12331.077
12     26834.295
13     20873.317
14     26420.466
15     18691.345
16     46272.433
17     14465.694
18      9091.804
19     38852.977
20     33528.423
21     21089.953
22     23948.983
23     20522.515
24     46487.558
25      8818.041
26     26834.295
27     49415.837
28     30058.118
29     21465.165
30     27432.971
31     44077.710
32     22249.194
33     25159.326
34     16745.672
35     12448.908
36     22212.904
37     24660.621
38     30759.568
39    120678.528
40     22858.493
41     21465.165
42     13130.414
43     43673.141
44     16124.879
45     17021.747
46     29229.194
47     57004.465
48     25930.483
49      7134.689
50     14774.837
51     32511.949
52      8439.428
53     49832.576
54     14758.210
55     21465.165
56     23862.567
57      9378.625
58     66304.6




21465    27523.7500
21466    11765.7520
21467     7582.2300
21468    26260.4440
21469    28608.8560
21470    33680.7950
21471    39059.7510
21472    58030.6290
21473    32035.8130
21474    45264.3570
21475     7718.2900
21476    84392.4530
21477    11477.4250
21478    28989.7380
21479    27440.6960
21480    23511.2070
21481    70475.3410
21482    15050.4050
21483    12386.8100
21484    15306.7920
21485    42720.1170
21486    12922.2170
21487    14053.5300
21488    20623.5980
21489    26834.2950
21490     5562.8740
21491    24883.3440
21492    47237.4740
21493    23600.4160
21494    28219.1350
21495    21465.1650
21496    19102.8190
21497    26063.4715
21498    38522.8120
21499    25208.5050
21500    12450.1270
21501    13797.1400
21502    21465.1650
21503    42280.1600
21504    12890.6110
21505    12070.3990
21506    23286.7190
21507    15708.8450
21508    11622.1750
21509    11684.6500
21510    21465.1650
21511    22410.9560
21512    23568.2330
21513    40157.7830
21514    56958.1450


In [59]:
# Getting more information from the 'total_income' column.
print(default_risk['total_income'].min())
print()
print(default_risk['total_income'].max())
print()
print(default_risk['total_income'].mean())
print()
print(default_risk['total_income'].median())
print()

3306.762

362496.645

26591.288944789172

23246.394



We will categorize income into six levels:
- Level 1: income from 0 to 19,999
- Level 2: income from 20,000 to 39,999
- Level 3: income from 40,000 to 59,999
- Level 4: income from 60,000 to 79,999
- Level 5: income from 80,000 to 99,999
- Level 6: income equal to or greater than 100,000

Each level spans 20,000, which keeps the ranges easy to read given that incomes run from about 3,300 to about 362,500 with a median near 23,200. The few incomes of 100,000 or more share one open-ended level. Grouping incomes this way makes it easier to compare defaults across income levels.
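
Because every level is defined only by its upper bound, the lookup can be driven by a sorted list of bounds. This is a minimal sketch using the standard-library `bisect` module; `INCOME_BOUNDS` and `income_level` are hypothetical names, and the function returns the level number rather than a label string.

```python
from bisect import bisect_right

# Exclusive upper bounds of levels 1 to 5; incomes at or above the last
# bound belong to level 6.
INCOME_BOUNDS = [20_000, 40_000, 60_000, 80_000, 100_000]


def income_level(income):
    """Return the income level (1 to 6) for a given income."""
    # bisect_right counts how many bounds are less than or equal to income,
    # which is exactly the number of levels the income has moved past.
    return bisect_right(INCOME_BOUNDS, income) + 1


for sample in (3306.762, 19_999.99, 20_000, 40620.102, 100_000):
    print(sample, '->', income_level(sample))
```

Changing the ranges then means editing one list instead of several `elif` branches.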

In [60]:
# Defining a function that calculates the income category.

def risk_level(income):
    if income < 20000:
        return 'Nível 1'
    elif income < 40000:
        return 'Nível 2'
    elif income < 60000:
        return 'Nível 3'
    elif income < 80000:
        return 'Nível 4'
    elif income < 100000:
        return 'Nível 5'
    else:
        return 'Nível 6'

In [61]:
# Creating column 'risk_level' with income categories.
try:
    default_risk['risk_level'] = default_risk['total_income'].apply(risk_level)
except Exception as error:
    # Report the failure instead of silently discarding it
    print(f'Runtime error: {error}')

# Checking if it worked.
display(default_risk.head())

|   | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose | age_category | risk_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8437.673028 | 42 | bachelor's degree | 0 | married | 0 | F | employee | 0 | 40620.102 | buy or renovate residential properties | 40-49 | Nível 3 |
| 1 | 1 | 4024.803754 | 36 | secondary education | 1 | married | 0 | F | employee | 0 | 17932.802 | buy a car | 30-39 | Nível 1 |
| 2 | 0 | 5623.42261 | 33 | secondary education | 1 | married | 0 | M | employee | 0 | 23341.752 | buy or renovate residential properties | 30-39 | Nível 2 |
| 3 | 3 | 4124.747207 | 32 | secondary education | 1 | married | 0 | M | employee | 0 | 42820.568 | education | 30-39 | Nível 3 |
| 4 | 0 | 22630.0 | 53 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 25378.572 | wedding | 50-59 | Nível 2 |


In [62]:
# Counting the values of each category to see the distribution
display(default_risk['risk_level'].value_counts())


risk_level
Nível 2    11092
Nível 1     7369
Nível 3     2140
Nível 4      450
Nível 5      123
Nível 6       99
Name: count, dtype: int64

## Checking the Hypotheses

**Hypothesis 1: Is there a correlation between income level and timely payment?**

In [63]:
# Checking income level against on-time payment
print(default_risk['debt'].value_counts())

risk_level_debt = default_risk.pivot_table(index='risk_level', values='debt', aggfunc='sum', margins=True)

# Calculating each income level's share of all defaults
# (stored as 'default_rate': defaults in the level divided by total defaults)
total_default = default_risk['debt'].sum()
def tx_default(debt):
    if debt > 0:
        return debt/total_default
    
try:
    risk_level_debt['default_rate'] = risk_level_debt['debt'].apply(tx_default)
except Exception as error:
    # Report the failure instead of silently discarding it
    print(f'Runtime error: {error}')

display(risk_level_debt)

debt
0    19533
1     1740
Name: count, dtype: int64


| risk_level | debt | default_rate |
|---|---|---|
| Nível 1 | 608 | 0.349425 |
| Nível 2 | 938 | 0.53908 |
| Nível 3 | 156 | 0.089655 |
| Nível 4 | 24 | 0.013793 |
| Nível 5 | 8 | 0.004598 |
| Nível 6 | 6 | 0.003448 |
| All | 1740 | 1.0 |


**Conclusion of Hypothesis 1**

Comparing income level with on-time payment points to a relationship between them: most defaults are concentrated in income levels 1 and 2.

These two levels together account for **88.85%** of the recorded defaults:
- Level 1 = **34.94%**
- Level 2 = **53.91%**

These figures are shares of all defaults, not default rates within each level. Levels 1 and 2 also contain most of the customers (18,461 of 21,273, about 86.8%), so part of this concentration reflects group size.
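
The share of all defaults depends on how many customers fall in each level. As a complementary check that is not part of the original analysis, the sketch below divides the defaults by the number of customers in the same level, using the counts printed by cells In [62] and In [63]. The variable names are illustrative.

```python
# Counts copied from the outputs of In [62] (customers) and In [63] (defaults).
customers = {'Nível 1': 7369, 'Nível 2': 11092, 'Nível 3': 2140,
             'Nível 4': 450, 'Nível 5': 123, 'Nível 6': 99}
defaults = {'Nível 1': 608, 'Nível 2': 938, 'Nível 3': 156,
            'Nível 4': 24, 'Nível 5': 8, 'Nível 6': 6}

total_defaults = sum(defaults.values())

# Share of all defaults that falls in each level (what 'default_rate' reports).
share_of_defaults = {level: defaults[level] / total_defaults for level in defaults}

# Defaults divided by the number of customers in the same level.
rate_within_level = {level: defaults[level] / customers[level] for level in defaults}

for level in defaults:
    print(f'{level}: share of defaults {share_of_defaults[level]:.2%}, '
          f'rate within level {rate_within_level[level]:.2%}')
```

With these counts the rate within each level ranges from about 5.3% (level 4) to about 8.5% (level 2), a much narrower spread than the shares of total defaults suggest.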

**Hypothesis 2: Is there a correlation between family status and timely payment?**

In [64]:
# Checking family status against on-time payment
risk_family_status_debt = default_risk.pivot_table(index='family_status', values='debt', aggfunc='sum', margins=True)

# Calculating each family status's share of all defaults
total_default = default_risk['debt'].sum()
def tx_family_status_default(debt):
    if debt > 0:
        return debt/total_default
    
try:
    risk_family_status_debt['tx_family_status_default'] = risk_family_status_debt['debt'].apply(tx_family_status_default)
except Exception as error:
    # Report the failure instead of silently discarding it
    print(f'Runtime error: {error}')

display(risk_family_status_debt)


| family_status | debt | tx_family_status_default |
|---|---|---|
| civil partnership | 388 | 0.222989 |
| divorced | 85 | 0.048851 |
| married | 930 | 0.534483 |
| unmarried | 274 | 0.157471 |
| widow/widower | 63 | 0.036207 |
| All | 1740 | 1.0 |


**Conclusion of Hypothesis 2**

Looking at on-time payment by family status, married people account for **53.45%** of the recorded defaults.

People in a civil partnership account for a further **22.30%**.

Together, people who live with a partner account for **75.75%** of all defaults. These are shares of the total number of defaults, so they also reflect how many customers belong to each family status.
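
To separate the effect of group size from the behavior of each group, the share of all defaults can be reported next to the default rate within the group. The helper below is a hypothetical sketch (`default_summary` is not part of the notebook) and runs on a small made-up frame; it assumes `debt` is 1 for a default and 0 otherwise, as in the dataset.

```python
import pandas as pd


def default_summary(df, column):
    """Per group: defaults, customers, share of all defaults, within-group rate."""
    summary = df.groupby(column)['debt'].agg(defaults='sum', customers='count')
    summary['share_of_defaults'] = summary['defaults'] / summary['defaults'].sum()
    summary['default_rate'] = summary['defaults'] / summary['customers']
    return summary


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married',
                      'unmarried', 'unmarried'],
    'debt':          [1, 0, 0, 0, 1, 0],
})

summary = default_summary(toy, 'family_status')
print(summary)
```

In this toy example both groups hold half of the defaults, yet the rate within the group is 25% for `married` and 50% for `unmarried`. The same helper could be called with `'risk_level'`, `'children'` or `'purpose'` instead of repeating one function per hypothesis.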

**Hypothesis 3: Is there a correlation between having children or not and timely payment?**

In [65]:
# Checking number of children against on-time payment
risk_children_debt = default_risk.pivot_table(index='children', values='debt', aggfunc='sum', margins=True)


# Calculating each group's share of all defaults, by number of children
# (groups with no defaults get no value, shown as NaN in the table)
total_default = default_risk['debt'].sum()
def tx_children_default(debt):
    if debt > 0:
        return debt/total_default
    
try:
    risk_children_debt['tx_children_default'] = risk_children_debt['debt'].apply(tx_children_default)
except Exception as error:
    # Report the failure instead of silently discarding it
    print(f'Runtime error: {error}')

display(risk_children_debt)



| children | debt | tx_children_default |
|---|---|---|
| 0 | 1062 | 0.610345 |
| 1 | 445 | 0.255747 |
| 2 | 202 | 0.116092 |
| 3 | 27 | 0.015517 |
| 4 | 4 | 0.002299 |
| 5 | 0 | NaN |
| All | 1740 | 1.0 |


**Conclusion of Hypothesis 3**

The correlation between having children and on-time payment is confirmed: customers without children account for **61.03%** of the defaults, followed by customers with one child (**25.57%**) and customers with two children (**11.61%**).

These three groups account for **98.22%** of all defaults (1,709 of 1,740).
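
One common way to check whether two categorical columns are related is a chi-square test of independence. The notebook does not use it; the sketch below is an added illustration on made-up data, computing the statistic by hand from a contingency table. With counts this small the test itself would not be reliable, so the example only shows the mechanics.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Observed counts: rows are numbers of children, columns are debt 0 / 1.
observed = pd.crosstab(toy['children'], toy['debt'])

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected counts if 'children' and 'debt' were independent.
expected = pd.DataFrame(
    row_totals[:, None] * col_totals[None, :] / n,
    index=observed.index,
    columns=observed.columns,
)

# Chi-square statistic and degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f'chi2 = {chi2:.3f}, degrees of freedom = {dof}')
```

A larger statistic for the same degrees of freedom means the observed counts sit further from what independence would predict. For a p-value, `scipy.stats.chi2_contingency` accepts the observed table directly.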

**Hypothesis 4: Does the purpose of the credit affect the default rate?**

In [66]:
# Checking the share of defaults for each credit purpose
risk_purpose_debt = default_risk.pivot_table(index='purpose', values='debt', aggfunc='sum', margins=True)


# Calculating each loan purpose's share of all defaults
total_default = default_risk['debt'].sum()
def tx_purpose_default(debt):
    if debt > 0:
        return debt/total_default
    
try:
    risk_purpose_debt['tx_purpose_default'] = risk_purpose_debt['debt'].apply(tx_purpose_default)
except Exception as error:
    # Report the failure instead of silently discarding it
    print(f'Runtime error: {error}')

display(risk_purpose_debt)

| purpose | debt | tx_purpose_default |
|---|---|---|
| building a property (undefined) | 102 | 0.058621 |
| buy a car | 402 | 0.231034 |
| buy commercial real estate | 99 | 0.056897 |
| buy or renovate residential properties | 299 | 0.171839 |
| buy real estate (undefined) | 240 | 0.137931 |
| construction of own property | 42 | 0.024138 |
| education | 370 | 0.212644 |
| wedding | 186 | 0.106897 |
| All | 1740 | 1.0 |


**Conclusion of Hypothesis 4**

The analysis of credit purposes indicates that this parameter does affect the default rate.

Car purchases account for the largest share, **23.10%** of all defaults, followed by education with **21.26%**. Together they represent **44.37%** of defaults (772 of 1,740).

## General Conclusion

The dataset contained inconsistent information, which was addressed during pre-processing.

Missing values were filled with medians, after analyzing the data to decide which grouping was most appropriate.

Inconsistencies in the various columns were evaluated and corrected after studying the actions to be taken.

Duplicates were removed because they were unlikely to refer to different individuals. A total of 252 rows were deleted, about 1.2% of the original 21,525, which is too small a share to materially affect the final results.

The categorizations made the information easier to read and support the company's decision-making.

All four hypotheses tested were confirmed: income level, family status, number of children and loan purpose each show a relationship with defaults.