## Open the data file and have a look at the general information. 



In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np



# Load the data
#data=pd.read_csv('/datasets/credit_scoring_eng.csv')


In [3]:
try:
    data = pd.read_csv('credit_scoring_eng.csv')
except FileNotFoundError:
    # Fall back to the hosted copy if the local file is missing
    data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/credit_scoring_eng.csv')

##  Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan



In [4]:
# Let's see how many rows and columns our dataset has
print(data.shape)



(21525, 12)


In [5]:
# let's print the first N rows

data.head(15)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family




At first glance at the data sample, I can already see missing data, negative numbers that make no sense, and strings written in upper case. I will need to clean these up and investigate the missing data: how many values are missing, whether they overlap, and which columns and rows are affected.

In [6]:
# Get info on data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


At first glance we can see that two columns have missing values. Let's see how many values are missing in each column:

In [7]:
# Count the missing values in each column
data.isnull().sum()



children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64



Both columns have exactly the same number of missing values. I think they might fall in the same rows, but I cannot just assume so, so I will investigate further.

In [8]:
# I want to see the share of missing values in each column
data.isnull().sum()/len(data)


children            0.000000
days_employed       0.100999
dob_years           0.000000
education           0.000000
education_id        0.000000
family_status       0.000000
family_status_id    0.000000
gender              0.000000
income_type         0.000000
debt                0.000000
total_income        0.100999
purpose             0.000000
dtype: float64

<div class="alert alert-block alert-success">
<b>Reviewer's comment v1</b>
 

Great! It is indeed helpful to check not only the total amount of missing values in each column but also look at the percentage of missing values. It helps to understand the overall impact. You can check percentage using, for example, this code: 
    
    data.isnull().sum()/len(data)

Or you can even make a dataframe out of it using to_frame
    
    mis_values = data.isnull().sum().to_frame('missing_values')
    mis_values['%'] = round(data.isnull().sum()/len(data),3)
    mis_values.sort_values(by='%', ascending=False)
    
   

<div class="alert alert-info"> <b>Student comments:</b> Trying the suggestion:</div>   

In [9]:
mis_values = data.isnull().sum().to_frame('missing_values')
mis_values['%'] = round(data.isnull().sum()/len(data),3)
mis_values.sort_values(by='%', ascending=False)

Unnamed: 0,missing_values,%
days_employed,2174,0.101
total_income,2174,0.101
children,0,0.0
dob_years,0,0.0
education,0,0.0
education_id,0,0.0
family_status,0,0.0
family_status_id,0,0.0
gender,0,0.0
income_type,0,0.0


In [10]:
# just trying a different way to see columns with missing values
for i in data:
    if data[i].isnull().sum() >0 :
        print(i)


days_employed
total_income


In [11]:
print(data[data['days_employed'].isna()])

       children  days_employed  dob_years            education  education_id  \
12            0            NaN         65  secondary education             1   
26            0            NaN         41  secondary education             1   
29            0            NaN         63  secondary education             1   
41            0            NaN         50  secondary education             1   
55            0            NaN         54  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN         47  Secondary Education             1   
21495         1            NaN         50  secondary education             1   
21497         0            NaN         48    BACHELOR'S DEGREE             0   
21502         1            NaN         42  secondary education             1   
21510         2            NaN         28  secondary education             1   

           family_status  family_status

In [12]:
print(data[data['total_income'].isna()])

       children  days_employed  dob_years            education  education_id  \
12            0            NaN         65  secondary education             1   
26            0            NaN         41  secondary education             1   
29            0            NaN         63  secondary education             1   
41            0            NaN         50  secondary education             1   
55            0            NaN         54  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN         47  Secondary Education             1   
21495         1            NaN         50  secondary education             1   
21497         0            NaN         48    BACHELOR'S DEGREE             0   
21502         1            NaN         42  secondary education             1   
21510         2            NaN         28  secondary education             1   

           family_status  family_status

At this stage we have confirmed that the columns `days_employed` and `total_income` have missing values. The NaNs appear to fall in the same rows; let's check:


In [13]:
# Temporarily replace missing values in days_employed with the string 'NaN'
data[['days_employed']] = data[['days_employed']].fillna('NaN')
# Compare the two columns row by row: if the missing values are symmetrical,
# every row where total_income is NaN should hold the placeholder 'NaN' in days_employed.
# In that case the expression returns True.
test = data.loc[data['total_income'].isna(), 'days_employed'].eq('NaN').all()
print(test)

True
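The placeholder round trip works, but the same question can be answered by comparing the two missing-value masks directly, without touching the data. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the credit scoring file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'days_employed': [-100.0, np.nan, -250.0, np.nan],
    'total_income': [2000.0, np.nan, 3100.0, np.nan],
})

# Boolean masks: True where the value is missing
days_missing = toy['days_employed'].isna()
income_missing = toy['total_income'].isna()

# True only if every row is missing in both columns or in neither
same_rows = (days_missing == income_missing).all()
print(same_rows)
```

Because nothing is overwritten, there is no need to restore the NaN values afterwards.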


In [14]:
# restoring NaN from string to float
data.replace('NaN', np.nan, inplace=True)
#check if the missing values have been restored to float
print(data['days_employed'].head(25))


0      -8437.673028
1      -4024.803754
2      -5623.422610
3      -4124.747207
4     340266.072047
5       -926.185831
6      -2879.202052
7       -152.779569
8      -6929.865299
9      -2188.756445
10     -4171.483647
11      -792.701887
12              NaN
13     -1846.641941
14     -1844.956182
15      -972.364419
16     -1719.934226
17     -2369.999720
18    400281.136913
19    -10038.818549
20     -1311.604166
21      -253.685166
22     -1766.644138
23      -272.981385
24    338551.952911
Name: days_employed, dtype: float64


We have confirmed that the missing values are in exactly the same rows.
I think this is a pattern: maybe some data were lost systematically at some point, or this is a version of the same survey that did not collect these specific fields. If it looks relevant to my analysis, I may dig further. Otherwise I might just ignore or drop the missing values. I don't think it is a good idea to replace the missing values, because the total number of days of employment is too specific to each person. I could deal with the negatives simply by converting them to positive numbers and then converting days to years, to see whether the data make sense.

In [15]:
# Another way to see whether the missing values overlap:
# filter on both columns at once and inspect the result

filtered = data[data['total_income'].isnull() & data['days_employed'].isnull()]

print(filtered.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 12 to 21510
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          2174 non-null   int64  
 1   days_employed     0 non-null      float64
 2   dob_years         2174 non-null   int64  
 3   education         2174 non-null   object 
 4   education_id      2174 non-null   int64  
 5   family_status     2174 non-null   object 
 6   family_status_id  2174 non-null   int64  
 7   gender            2174 non-null   object 
 8   income_type       2174 non-null   object 
 9   debt              2174 non-null   int64  
 10  total_income      0 non-null      float64
 11  purpose           2174 non-null   object 
dtypes: float64(2), int64(5), object(5)
memory usage: 220.8+ KB
None


**Intermediate conclusion**



The number of rows in the filtered data matches the number of missing values in each column. This means the missing values overlap: they appear in the same rows.

Let's calculate the percentage of missing values relative to the whole dataset. If it is a considerably large share of the data, it is recommended to fill the missing values rather than drop them. To do that, we should first consider whether the missing data could be tied to a specific client characteristic, such as employment type. Second, we should check whether the missing values depend on the values of other columns.


I am going to compare the column with missing data, with other columns:

In [16]:
data[data.days_employed.isnull()]['income_type']

12             retiree
26       civil servant
29             retiree
41       civil servant
55             retiree
             ...      
21489         business
21495         employee
21497         business
21502         employee
21510         employee
Name: income_type, Length: 2174, dtype: object

In [17]:
data[data.days_employed.isnull()]['income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64
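Raw counts are hard to interpret on their own, because the largest income types will dominate any subset. One way to make the comparison fairer is to look at the share of missing values within each income type. A small sketch, assuming a made-up `toy` frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree',
                    'business', 'employee', 'retiree'],
    'total_income': [2000.0, np.nan, 1500.0, np.nan, 2500.0, 1800.0],
})

# Mean of a boolean mask per group = share of missing values in that group
missing_share = toy['total_income'].isna().groupby(toy['income_type']).mean()
print(missing_share)
```

If the shares are similar across groups, the missing values are probably not tied to that characteristic; a group with a much higher share would deserve a closer look.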

A faster way to compare against all columns:


In [18]:
for i in data:
    print(data[data.days_employed.isnull()][i].value_counts())
    print('-------------')

 0     1439
 1      475
 2      204
 3       36
 20       9
 4        7
-1        3
 5        1
Name: children, dtype: int64
-------------
Series([], Name: days_employed, dtype: int64)
-------------
34    69
40    66
42    65
31    65
35    64
36    63
47    59
41    59
30    58
28    57
58    56
57    56
54    55
56    54
38    54
52    53
37    53
33    51
39    51
50    51
43    50
45    50
49    50
51    50
29    50
46    48
55    48
48    46
44    44
53    44
60    39
62    38
61    38
32    37
64    37
23    36
27    36
26    35
59    34
63    29
25    23
24    21
65    20
66    20
21    18
22    17
67    16
0     10
68     9
71     5
69     5
20     5
70     3
72     2
19     1
73     1
Name: dob_years, dtype: int64
-------------
secondary education    1408
bachelor's degree       496
SECONDARY EDUCATION      67
Secondary Education      65
some college             55
Bachelor's Degree        25
BACHELOR'S DEGREE        23
primary education        19
Some College              7
S

I checked the rows with missing values in `days_employed` against every other column, to see whether the missing values depend on any specific client characteristic.

In [19]:
for i in data:
    print(data[data.total_income.isnull()][i].value_counts())
    print('-------------')

 0     1439
 1      475
 2      204
 3       36
 20       9
 4        7
-1        3
 5        1
Name: children, dtype: int64
-------------
Series([], Name: days_employed, dtype: int64)
-------------
34    69
40    66
42    65
31    65
35    64
36    63
47    59
41    59
30    58
28    57
58    56
57    56
54    55
56    54
38    54
52    53
37    53
33    51
39    51
50    51
43    50
45    50
49    50
51    50
29    50
46    48
55    48
48    46
44    44
53    44
60    39
62    38
61    38
32    37
64    37
23    36
27    36
26    35
59    34
63    29
25    23
24    21
65    20
66    20
21    18
22    17
67    16
0     10
68     9
71     5
69     5
20     5
70     3
72     2
19     1
73     1
Name: dob_years, dtype: int64
-------------
secondary education    1408
bachelor's degree       496
SECONDARY EDUCATION      67
Secondary Education      65
some college             55
Bachelor's Degree        25
BACHELOR'S DEGREE        23
primary education        19
Some College              7
S

I looked for the same kind of dependence for the column `total_income`. In both results, I cannot identify any particular connection.

Let's proceed to find subtler missing values: I will look for zeros in columns where a value of 0 does not fit the description of the data.

In [20]:

# Checking for 0s: print each column with the number of rows where its value is 0

for i in data:
    print(i, len(data[data[i]==0]))

    


children 14149
days_employed 0
dob_years 101
education 0
education_id 5260
family_status 0
family_status_id 12380
gender 0
income_type 0
debt 19784
total_income 0
purpose 0
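The same count can be obtained without an explicit loop. A minimal sketch, restricted to numeric columns (an assumption on my part, since zeros are only meaningful there), on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'children': [0, 1, 0, 2],
    'dob_years': [42, 0, 33, 51],
    'education': ['secondary education', "bachelor's degree",
                  'some college', 'secondary education'],
})

# Keep numeric columns, mark the zeros, and count them per column
zero_counts = toy.select_dtypes('number').eq(0).sum()
print(zero_counts)
```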


There are zeros in the client's age (`dob_years`). Someone who requests a loan cannot be a newborn, so these zeros are effectively missing values. I could replace them with the mean or the median.

In [21]:
# Change the 0s in dob_years to NaN so I can compare them with the other missing values
data['dob_years'] = data['dob_years'].replace(0, np.nan)


In [22]:
# compare dob missing values with days_employed
zeros_dob=data[(data.days_employed.isnull())&(data.dob_years.isnull())]
print(zeros_dob.head(25))


       children  days_employed  dob_years            education  education_id  \
1890          0            NaN        NaN    bachelor's degree             0   
2284          0            NaN        NaN  secondary education             1   
4064          1            NaN        NaN  secondary education             1   
5014          0            NaN        NaN  secondary education             1   
6411          0            NaN        NaN    bachelor's degree             0   
6670          0            NaN        NaN    Bachelor's Degree             0   
8574          0            NaN        NaN  secondary education             1   
12403         3            NaN        NaN  secondary education             1   
13741         0            NaN        NaN  secondary education             1   
19829         0            NaN        NaN  secondary education             1   

           family_status  family_status_id gender income_type  debt  \
1890           unmarried                 4      

It seems that we have only 10 entries. Let's verify:

In [23]:
# Check how many rows overlap
print(zeros_dob.info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 1890 to 19829
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          10 non-null     int64  
 1   days_employed     0 non-null      float64
 2   dob_years         0 non-null      float64
 3   education         10 non-null     object 
 4   education_id      10 non-null     int64  
 5   family_status     10 non-null     object 
 6   family_status_id  10 non-null     int64  
 7   gender            10 non-null     object 
 8   income_type       10 non-null     object 
 9   debt              10 non-null     int64  
 10  total_income      0 non-null      float64
 11  purpose           10 non-null     object 
dtypes: float64(3), int64(4), object(5)
memory usage: 1.0+ KB
None


Out of 101 zeros in `dob_years`, only 10 overlap with the other two columns containing missing values. This suggests that, in this dataset, there is no connection between the missing values in `dob_years` and those in the other columns.

In [24]:
# Restore the zeros in dob_years (replace NaN with 0 again)
data['dob_years'] = data['dob_years'].fillna(0)

In [25]:
# Checking the distribution in the whole dataset
data['dob_years'].describe()

count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [26]:
data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


I don't see any other zeros that look like missing values.

In [27]:
# count the values of debt=0 and debt=1
count_zero_debt=(data['debt']==0).sum()
count_debt=(data['debt']==1).sum()
print(count_zero_debt)
print(count_debt)

19784
1741


In [28]:
(data.total_income.isnull().sum())-(count_debt)

433

In [29]:
data.debt.dtype

dtype('int64')

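I don't want to simply drop the rows with missing values, because they may include a meaningful share of the `debt = 1` clients. Note that the subtraction above only compares two totals; it does not show how many of the rows with missing values actually have a debt.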
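To see how debt is actually distributed across rows with and without missing income, the two groups can be compared directly. This is a sketch under stated assumptions: the `toy` frame and the `income_status` label are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'total_income': [2000.0, np.nan, 1500.0, np.nan, 2500.0, 1800.0],
    'debt': [0, 1, 0, 0, 1, 0],
})

# Label each row by whether its income is missing
income_status = (
    toy['total_income'].isna()
    .map({True: 'missing', False: 'present'})
    .rename('income_status')
)

# Debt rate within each group
debt_rate = toy.groupby(income_status)['debt'].mean()

# Number of rows that are both missing income and have a debt
missing_with_debt = int((toy['total_income'].isna() & (toy['debt'] == 1)).sum())

print(debt_rate)
print(missing_with_debt)
```

If the debt rate is similar in both groups, dropping the rows would mostly cost sample size; if it differs, dropping them would also shift the overall debt rate.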




**Possible reasons for missing values in data**



Missing data can occur for several reasons, e.g. interviewer mistakes, anonymization, or survey filters. Data can also go missing because of incomplete or incorrect data entry, equipment malfunctions, lost files, or other technical mistakes.

In my dataset I spotted a pattern: the missing values in two columns overlap. I consider this non-ignorable missing data, so I can't drop the rows with missing values, because that would reduce statistical power and increase estimation bias. 




In [30]:
# Look for empty strings that could be hidden missing data

empty_strings = data[data['education'].str.len() == 0]
print(empty_strings)

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []


In [31]:
#same for the column purpose
empty_strings = data[data['purpose'].str.len() == 0]
print(empty_strings)

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []


In [32]:
# With the following code, I verified that the filter works by using len > 0:

#empty_strings = data[data['purpose'].str.len() > 0]
#print(empty_strings)

In [33]:
data.describe(include='all').T


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
children,21525.0,,,,0.538908,1.381587,-1.0,0.0,0.0,1.0,20.0
days_employed,19351.0,,,,63046.497661,140827.311974,-18388.949901,-2747.423625,-1203.369529,-291.095954,401755.400475
dob_years,21525.0,,,,43.29338,12.574584,0.0,33.0,42.0,53.0,75.0
education,21525.0,15.0,secondary education,13750.0,,,,,,,
education_id,21525.0,,,,0.817236,0.548138,0.0,1.0,1.0,1.0,4.0
family_status,21525.0,5.0,married,12380.0,,,,,,,
family_status_id,21525.0,,,,0.972544,1.420324,0.0,0.0,0.0,1.0,4.0
gender,21525.0,3.0,F,14236.0,,,,,,,
income_type,21525.0,8.0,employee,11119.0,,,,,,,
debt,21525.0,,,,0.080883,0.272661,0.0,0.0,0.0,0.0,1.0


**Intermediate conclusion**

[Can we finally confirm that missing values are accidental? Check for anything else that you think might be important here.]

**Conclusions**


In my dataset I spotted a pattern: the missing values in two columns, `total_income` and `days_employed`, overlap. I consider this non-ignorable missing data, so I can't drop the rows with missing values, because that would reduce statistical power and increase estimation bias. 

Both columns are numerical, so I want to be as accurate as possible and replace the missing values with means or medians calculated from other known data about the clients.

I will replace the zeros in the age column with the mean, and then search for duplicates. I will also further categorize the data in some columns to get a more accurate and easier-to-interpret result.
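The idea of filling missing values from other known client data can be sketched with a group-wise median. This is only an illustration: the choice of `income_type` as the grouping column, the `toy` frame, and the `total_income_filled` column name are assumptions, not the final approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee',
                    'retiree', 'retiree', 'retiree'],
    'total_income': [2000.0, 3000.0, np.nan, 1000.0, np.nan, 1400.0],
})

# Median income of each row's own group, aligned to the original index
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill only the missing entries; known values are left untouched
toy['total_income_filled'] = toy['total_income'].fillna(group_median)
print(toy)
```

The median is less sensitive to extreme incomes than the mean, which is why it is often preferred for skewed columns.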

## Data transformation

Let's go through each column to see what issues we may have in them.



In [33]:
# Let's see all values in education column to check if and what spellings will need to be fixed
print(data['education'].head(10))

0      bachelor's degree
1    secondary education
2    Secondary Education
3    secondary education
4    secondary education
5      bachelor's degree
6      bachelor's degree
7    SECONDARY EDUCATION
8      BACHELOR'S DEGREE
9    secondary education
Name: education, dtype: object


We need to convert all strings to lower case so that we can find duplicates:

In [34]:


data['education'] = data['education'].str.lower()


In [34]:
# Checking all the values in the column to make sure we fixed them

data['education'].head(10)

0      bachelor's degree
1    secondary education
2    Secondary Education
3    secondary education
4    secondary education
5      bachelor's degree
6      bachelor's degree
7    SECONDARY EDUCATION
8      BACHELOR'S DEGREE
9    secondary education
Name: education, dtype: object
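Since `head(10)` only inspects the first ten rows, listing the unique values is a more complete check that the spellings were normalized. A minimal sketch on a made-up Series (the values mimic the spellings seen in the sample, but the Series itself is hypothetical):

```python
import pandas as pd

# Hypothetical toy data for illustration only
education = pd.Series([
    "bachelor's degree",
    'Secondary Education',
    'SECONDARY EDUCATION',
    "BACHELOR'S DEGREE",
])

# Lower-case every string, then list the distinct values that remain
cleaned = education.str.lower()
remaining = sorted(cleaned.unique())
print(remaining)
```

If the cleaning worked, each education level should appear exactly once in the list.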

Check the data in the `children` column

In [35]:
# Let's see the distribution of values in the `children` column
print(data['children'].describe())


count    21525.000000
mean         0.538908
std          1.381587
min         -1.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         20.000000
Name: children, dtype: float64


In [36]:
data['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

There are a few inconsistent values here:

20 children for one person, and -1 children. These look like human mistakes. Since I think they come from typos made while entering the data, I decided to replace -1 with 1 and 20 with 2.

In [37]:
# Fix the data: -1 and 20 look like typos, so replace them with 1 and 2

data['children'] = data['children'].replace(-1, 1)
data['children'] = data['children'].replace(20, 2)

I replaced -1 with 1 and 20 with 2 in those rows.

In [39]:
# Checking the `children` column again to make sure it's all fixed
data['children'].value_counts()


0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

Check the data in the `days_employed` column. I expect to see negative numbers and maybe an excessive or inconsistent number of days. I will take the absolute values of the negatives, convert days to years, and, if the data look reasonable, replace NaN with a median.

In [40]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage
data['days_employed'].value_counts()

-8437.673028      1
-3507.818775      1
 354500.415854    1
-769.717438       1
-3963.590317      1
                 ..
-1099.957609      1
-209.984794       1
 398099.392433    1
-1271.038880      1
-1984.507589      1
Name: days_employed, Length: 19351, dtype: int64

The amount of problematic data is high, which could be due to a technical issue. Maybe the data were entered with a minus sign used as a dash or bullet point.
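`value_counts()` on a continuous column mostly returns counts of 1, so it does not give the percentage of problematic values. One possible way to measure it is to take the mean of a boolean mask. A sketch on a made-up Series (values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
days = pd.Series([-8437.67, -4024.80, 340266.07, np.nan, -926.19],
                 name='days_employed')

# Ignore missing values, then compute the share of negative entries
known = days.dropna()
negative_share = (known < 0).mean()
print(negative_share)
```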

In [41]:
# Address the problematic values, if they exist

#converting all numbers to their absolute value

data['days_employed']=data['days_employed'].abs()

I converted all the negative values in `days_employed` to their absolute values so that I can work with them.

In [42]:
data['days_employed']=abs(data['days_employed'])
print(data['days_employed'].head(20))

0       8437.673028
1       4024.803754
2       5623.422610
3       4124.747207
4     340266.072047
5        926.185831
6       2879.202052
7        152.779569
8       6929.865299
9       2188.756445
10      4171.483647
11       792.701887
12              NaN
13      1846.641941
14      1844.956182
15       972.364419
16      1719.934226
17      2369.999720
18    400281.136913
19     10038.818549
Name: days_employed, dtype: float64


In [43]:
# Check the result - make sure it's fixed
data['days_employed'].value_counts()

8437.673028      1
3507.818775      1
354500.415854    1
769.717438       1
3963.590317      1
                ..
1099.957609      1
209.984794       1
398099.392433    1
1271.038880      1
1984.507589      1
Name: days_employed, Length: 19351, dtype: int64

I want to convert days to years, to see what's going on:

In [44]:
data['years_employed']= data['days_employed'].div(365)
print (data['years_employed'].sort_values(ascending=False).head(20))

6954     1100.699727
10006    1100.591265
7664     1100.479708
2156     1100.477991
7794     1100.448904
4697     1100.369953
13420    1100.327762
17823    1100.313632
10991    1100.251585
8369     1100.247814
1184     1100.206018
4949     1100.202480
15192    1100.155489
5716     1100.066463
10484    1100.047333
16237    1099.963580
15599    1099.887336
701      1099.853279
14356    1099.837902
5762     1099.675989
Name: years_employed, dtype: float64


In [45]:
print (data['years_employed'].sort_values(ascending=False).sample(20))

9291        4.397143
17857       9.005262
4678        2.028091
5585        1.725349
21485     997.250546
10595       2.488144
20589       1.811366
21215     925.392493
6143        6.170858
5323        6.554750
13760    1038.421946
6908        2.200549
6740             NaN
13516       0.296359
9897     1098.968642
11981            NaN
17909    1055.630986
16206       5.361217
6521        0.359487
15084      20.513519
Name: years_employed, dtype: float64


Comparing years employed to age:

In [46]:
test=data.groupby(['years_employed'])['dob_years']
print(test.sample())

17437    31.0
8336     32.0
6157     47.0
9683     43.0
2127     31.0
         ... 
7794     61.0
2156     60.0
7664     61.0
10006    69.0
6954     56.0
Name: dob_years, Length: 19351, dtype: float64


Obviously we are dealing with very inconsistent data here: nobody could have been employed for over a thousand years!
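A simple plausibility rule can make this comparison explicit: nobody can have been employed for longer than they have been alive. The sketch below flags such rows on a made-up frame (the `toy` data and the `implausible` name are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'dob_years': [42, 53, 27, 60],
    'days_employed': [8437.67, 340266.07, 926.19, np.nan],
})

# Convert days to years, then flag rows where employment exceeds age
toy['years_employed'] = toy['days_employed'] / 365
implausible = toy['years_employed'] > toy['dob_years']

print(toy[implausible])
print(implausible.mean())  # share of implausible rows
```

Rows with a missing `days_employed` compare as False, so they are not flagged and can be handled separately.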

In [47]:
# Check the `dob_years` for suspicious values and count the percentage

data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income,years_employed
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0,19351.0
mean,0.479721,66914.728907,43.29338,0.817236,0.972544,0.080883,26787.568355,183.328024
std,0.755528,139030.880527,12.574584,0.548138,1.420324,0.272661,16475.450632,380.906522
min,0.0,24.141633,0.0,0.0,0.0,0.0,3306.762,0.066141
25%,0.0,927.009265,33.0,1.0,0.0,0.0,16488.5045,2.539751
50%,0.0,2194.220567,42.0,1.0,0.0,0.0,23202.87,6.011563
75%,1.0,5537.882441,53.0,1.0,1.0,0.0,32549.611,15.172281
max,5.0,401755.400475,75.0,4.0,4.0,1.0,362496.645,1100.699727


I will replace the 0s in `dob_years` with the mean, as there aren't many missing values and there are no big outliers.

In [48]:
# Address the issues in the `dob_years` column, if they exist
# I will replace the 0s with the mean; there are not many missing values and no big outliers.
data['dob_years'] = data['dob_years'].replace(0, data['dob_years'].mean())

In [49]:
# Check the result - make sure it's fixed
data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income,years_employed
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0,19351.0
mean,0.479721,66914.728907,43.496522,0.817236,0.972544,0.080883,26787.568355,183.328024
std,0.755528,139030.880527,12.218174,0.548138,1.420324,0.272661,16475.450632,380.906522
min,0.0,24.141633,19.0,0.0,0.0,0.0,3306.762,0.066141
25%,0.0,927.009265,34.0,1.0,0.0,0.0,16488.5045,2.539751
50%,0.0,2194.220567,43.0,1.0,0.0,0.0,23202.87,6.011563
75%,1.0,5537.882441,53.0,1.0,1.0,0.0,32549.611,15.172281
max,5.0,401755.400475,75.0,4.0,4.0,1.0,362496.645,1100.699727


It's fixed: the minimum age now is 19 and not 0
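One caveat with this replacement: `data['dob_years'].mean()` is computed while the zeros are still in the column, which pulls the fill value down slightly. A possible alternative, sketched on a made-up Series, is to compute the mean from the valid ages only:

```python
import pandas as pd

# Hypothetical toy data for illustration only
age = pd.Series([0.0, 30.0, 40.0, 50.0], name='dob_years')

# Mean including the placeholder zeros vs. mean of valid ages only
mean_with_zeros = age.mean()
mean_without_zeros = age[age > 0].mean()

# Fill the zeros with the mean of the valid ages
filled = age.replace(0, mean_without_zeros)

print(mean_with_zeros, mean_without_zeros)
print(filled.tolist())
```

With only about a hundred zeros in more than 21,000 rows the difference is small, so this is a refinement rather than a correction.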

Now let's check the `family_status` column. See what kind of values there are and what problems there might be

In [50]:
# Let's see the values for the column

print(data['family_status'].value_counts())


married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64


Everything seems fine.

Now let's check the `gender` column. See what kind of values there are and what problems need to be addressed

In [51]:
# Let's see the values in the column
print(data['gender'].value_counts())

F      14236
M       7288
XNA        1
Name: gender, dtype: int64


In [52]:
# Address the problematic values, if they exist
data['gender'] = data['gender'].replace('XNA', np.nan)

I want to ignore `XNA` because it is only 1 value out of more than 21,000, which is not statistically relevant, so I replaced it with NaN.

In [53]:
# Check the result - make sure it's fixed
print(data['gender'].value_counts())

F    14236
M     7288
Name: gender, dtype: int64


Now let's check the `income_type` column. 

In [54]:
# Let's see the values in the column
print(data['income_type'].value_counts())

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64


In [55]:
# Address the problematic values, if they exist
# i will just add those few values to the value employee, by replacing the string with "employee"
# I think is better to add them to the largest category and not lose those few data in the row
#data['income_type'] = data['income_type'].replace(['entrepreneur',
                                                   #'unemployed', 
                                                   #'paternity / maternity leave', 
                                                   #'student'], 'employee')

I first considered reducing the number of employment categories by adding the six rows from the rare categories to the most populated one, `employee`. In the end I replaced them with NaN instead:

In [56]:
data['income_type'] = data['income_type'].replace(['entrepreneur',
                                                   'unemployed', 
                                                   'paternity / maternity leave', 
                                                   'student'], np.nan)

In [57]:
# Check the result - make sure it's fixed

print(data['income_type'].value_counts())

employee         11119
business          5085
retiree           3856
civil servant     1459
Name: income_type, dtype: int64


I want to ignore those four income categories because together they hold only six values out of more than 21,000, which is not statistically relevant. I replaced them with NaN.
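
The same idea can be written without listing the rare categories by hand. This is only a sketch on made-up data: the `min_count` threshold is a hypothetical value chosen for the example, not something taken from the analysis above.

```python
import pandas as pd

# Toy column: two frequent categories and two rare ones (illustrative only)
income_type = pd.Series(['employee'] * 6 + ['business'] * 4 + ['student', 'unemployed'])

min_count = 3  # hypothetical cut-off below which a category counts as rare
counts = income_type.value_counts()
rare = counts[counts < min_count].index

# Keep values that are not rare; everything else becomes NaN
income_clean = income_type.where(~income_type.isin(rare))
print(income_clean.value_counts())
```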

Now let's see if we have any duplicates in our dataset, and address the issues if there are any:

In [58]:
# Checking duplicates
duplicateRows = data[data.duplicated()]
print(duplicateRows)

       children  days_employed  dob_years            education  education_id  \
2849          0            NaN       41.0  secondary education             1   
4182          1            NaN       34.0    BACHELOR'S DEGREE             0   
4851          0            NaN       60.0  secondary education             1   
5557          0            NaN       58.0  secondary education             1   
7808          0            NaN       57.0  secondary education             1   
8583          0            NaN       58.0    bachelor's degree             0   
9238          2            NaN       34.0  secondary education             1   
9528          0            NaN       66.0  secondary education             1   
9627          0            NaN       56.0  secondary education             1   
10462         0            NaN       62.0  secondary education             1   
10697         0            NaN       40.0  secondary education             1   
10864         0            NaN       62.

There are 54 duplicates (the dataset shrinks from 21525 to 21471 rows once they are dropped). I want to drop them because duplicates are an extreme case of nonrandom sampling and would bias our model. Keeping them would only add redundant data.
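
Because the printed frame gets truncated, counting the duplicates directly is easier to read. A minimal sketch on a made-up frame (column values are illustrative; the `reset_index` call is an optional extra that the notebook itself does not use):

```python
import pandas as pd

# Toy frame: rows 2 and 3 repeat row 0
df = pd.DataFrame({
    'children': [0, 1, 0, 0],
    'education': ['secondary', 'bachelor', 'secondary', 'secondary'],
})

# duplicated() marks every row that is identical to an earlier one
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Drop them; resetting the index keeps labels and positions aligned
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```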

In [59]:
# Address the duplicates, if they exist
data=data.drop_duplicates()

In [60]:
# Last check whether we have any duplicates
duplicateRows = data[data.duplicated()]
print(duplicateRows)

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose, years_employed]
Index: []


I removed the duplicates

In [61]:
# Check the size of the dataset that you now have after the first manipulations with it
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21471 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21471 non-null  float64
 3   education         21471 non-null  object 
 4   education_id      21471 non-null  int64  
 5   family_status     21471 non-null  object 
 6   family_status_id  21471 non-null  int64  
 7   gender            21470 non-null  object 
 8   income_type       21465 non-null  object 
 9   debt              21471 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21471 non-null  object 
 12  years_employed    19351 non-null  float64
dtypes: float64(4), int64(4), object(5)
memory usage: 2.3+ MB


We can now observe a reduced number of entries in our dataset: 21471.

In [62]:
data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income,years_employed
count,21471.0,19351.0,21471.0,21471.0,21471.0,21471.0,19351.0,19351.0
mean,0.480229,66914.728907,43.482727,0.817195,0.973685,0.081086,26787.568355,183.328024
std,0.755892,139030.880527,12.217198,0.548508,1.421082,0.272974,16475.450632,380.906522
min,0.0,24.141633,19.0,0.0,0.0,0.0,3306.762,0.066141
25%,0.0,927.009265,33.5,1.0,0.0,0.0,16488.5045,2.539751
50%,0.0,2194.220567,43.0,1.0,0.0,0.0,23202.87,6.011563
75%,1.0,5537.882441,53.0,1.0,1.0,0.0,32549.611,15.172281
max,5.0,401755.400475,75.0,4.0,4.0,1.0,362496.645,1100.699727




The dataset is smaller (21471 entries instead of the original 21525, about 0.25% fewer), and the minimum age is now 19.
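
The share of rows lost can be worked out from the two row counts printed earlier, 21525 from `data.shape` at load time and 21471 from `data.info()`. A minimal sketch with those numbers typed in by hand:

```python
rows_before = 21525  # row count reported by data.shape right after loading
rows_after = 21471   # row count reported by data.info() after cleaning

removed = rows_before - rows_after
pct_removed = removed / rows_before * 100
print(f"{removed} rows removed ({pct_removed:.2f}% of the original data)")
```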

# Working with missing values

It is useful to build dictionaries that map each identifier to its descriptive string: they serve as a reference, and we can also use them in functions we might want to write.

In [63]:
# Find the dictionaries
data.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,years_employed
0,1,8437.673028,42.0,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,23.116912
1,1,4024.803754,36.0,secondary education,1,married,0,F,employee,0,17932.802,car purchase,11.02686
2,0,5623.42261,33.0,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house,15.406637
3,3,4124.747207,32.0,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,11.300677
4,0,340266.072047,53.0,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,932.235814
5,0,926.185831,27.0,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,2.537495
6,0,2879.202052,43.0,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,7.888225
7,0,152.779569,50.0,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education,0.418574
8,2,6929.865299,35.0,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding,18.985932
9,0,2188.756445,41.0,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,5.996593


In [64]:
# The education/education_id and family_status/family_status_id pairs can be converted into dictionaries
reference_edu=dict(zip(data.education_id, data.education))
reference_fam=dict(zip(data.family_status_id, data.family_status))
print(reference_edu)
print(reference_fam)

{0: "bachelor's degree", 1: 'secondary education', 2: 'some college', 3: 'primary education', 4: 'graduate degree'}
{0: 'married', 1: 'civil partnership', 2: 'widow / widower', 3: 'divorced', 4: 'unmarried'}
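
One caveat with `dict(zip(...))`: when the same id appears with several spellings, the dictionary keeps whichever spelling comes last in the column. The sketch below, on made-up rows, lowercases the labels first so the result does not depend on row order. It is an alternative, not the method used above.

```python
import pandas as pd

# Toy rows with inconsistent capitalisation (illustrative only)
df = pd.DataFrame({
    'education_id': [0, 1, 1, 1],
    'education': ["bachelor's degree", 'secondary education',
                  'Secondary Education', 'SECONDARY EDUCATION'],
})

reference_edu = (
    df.assign(education=df['education'].str.lower())  # normalise the spelling
      .drop_duplicates(subset='education_id')         # one row per id
      .set_index('education_id')['education']
      .to_dict()
)
print(reference_edu)
```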


### Restoring missing values in `total_income`

In [65]:
data_nan=data[data.isnull().any(axis=1)]
print(data_nan)

       children  days_employed  dob_years            education  education_id  \
12            0            NaN       65.0  secondary education             1   
26            0            NaN       41.0  secondary education             1   
29            0            NaN       63.0  secondary education             1   
41            0            NaN       50.0  secondary education             1   
55            0            NaN       54.0  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN       47.0  Secondary Education             1   
21495         1            NaN       50.0  secondary education             1   
21497         0            NaN       48.0    BACHELOR'S DEGREE             0   
21502         1            NaN       42.0  secondary education             1   
21510         2            NaN       28.0  secondary education             1   

           family_status  family_status


I will address only the `total_income` column. I can compare its values with other known characteristics of the client and work out a way to replace the missing values with a mean or median based on those known values.


Start by addressing the missing values in total income. Create an age category for clients and store it in a new column. This strategy can help with calculating values for the total income.


In [66]:
# Let's write a function that calculates the age category
def age_group(age):
    '''Return the age group for a given age in years.
    Ages are assumed to be at least 18; ages of 70 and above fall into '70+'.'''
     
    if 18<= age <= 25 :
        return '18-25'
    elif age < 35:
        return '26-34'
    elif age < 40:
        return '35-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    else:
        return '70+'
    

<div class="alert alert-success" role="alert">
<b>Reviewer's comment v1</b>

Great that you managed to create a function to automate your code.
</div>

In [67]:
# Test if the function works
print(age_group(27)) 
print(age_group(53)) 
print(age_group(25)) 

26-34
50-59
18-25
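
For comparison, the same grouping can be sketched with `pd.cut`, which bins a whole column at once. This assumes whole-number ages of at least 18; the bin edges below are chosen to reproduce the labels of `age_group` under that assumption.

```python
import numpy as np
import pandas as pd

# Toy ages (illustrative only)
ages = pd.Series([27, 53, 25, 70])

# Right-inclusive bins: (17, 25], (25, 34], ..., (69, inf)
bins = [17, 25, 34, 39, 49, 59, 69, np.inf]
labels = ['18-25', '26-34', '35-39', '40-49', '50-59', '60-69', '70+']

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
```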


In [68]:
data['age_group_test1'] = data['dob_years'].apply(age_group)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['age_group_test1'] = data['dob_years'].apply(age_group)


In [69]:
data['age_group_test1'].value_counts()

40-49    5459
26-34    4736
50-59    4662
35-39    2875
60-69    2335
18-25    1233
70+       171
Name: age_group_test1, dtype: int64

In [70]:
data['age_group'] = (data.loc[:,('dob_years')]).apply(age_group)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['age_group'] = (data.loc[:,('dob_years')]).apply(age_group)


In [71]:
# Checking the values in the new column
data['age_group'].value_counts()

40-49    5459
26-34    4736
50-59    4662
35-39    2875
60-69    2335
18-25    1233
70+       171
Name: age_group, dtype: int64

I followed the suggestion from the SettingWithCopyWarning and used `.loc`, but I get the same warning. The function works, though: both columns show identical counts.
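
A possible explanation, stated as an assumption because the relevant cell is not shown here: the warning usually appears when the dataframe was itself created by filtering or slicing another one, so changing the right-hand side of the assignment does not help. A common remedy is to take an explicit copy at the point where the subset is created, as in this sketch on made-up data:

```python
import pandas as pd

# Toy frame (illustrative only)
raw = pd.DataFrame({'dob_years': [0, 27, 53], 'debt': [0, 1, 0]})

# Filtering returns an object pandas may treat as a slice of `raw`;
# .copy() turns it into an independent DataFrame
data_demo = raw[raw['dob_years'] > 0].copy()

# Adding a column now modifies only `data_demo`
data_demo['is_senior'] = data_demo['dob_years'] >= 50
print(data_demo)
```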

Income usually depends on a person's age, gender, education and employment. Let's group by these factors and calculate the mean and median for each group, then use them to replace the missing values.

First, let's create a table that contains only rows without missing values. This data will be used to restore the missing values.

In [72]:
# Create a table without missing values and print a few of its rows to make sure it looks fine
# Boolean indexing: isnull().any(axis=1) is True for every row with at least one missing value
mask= data.isnull().any(axis=1)
dataf1=data[~mask] 
print(dataf1.head(10))

   children  days_employed  dob_years            education  education_id  \
0         1    8437.673028       42.0    bachelor's degree             0   
1         1    4024.803754       36.0  secondary education             1   
2         0    5623.422610       33.0  Secondary Education             1   
3         3    4124.747207       32.0  secondary education             1   
4         0  340266.072047       53.0  secondary education             1   
5         0     926.185831       27.0    bachelor's degree             0   
6         0    2879.202052       43.0    bachelor's degree             0   
7         0     152.779569       50.0  SECONDARY EDUCATION             1   
8         2    6929.865299       35.0    BACHELOR'S DEGREE             0   
9         0    2188.756445       41.0  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1

In [73]:
# Look at the mean values for income based on your identified factors
#total_income_mean2=dataf1['total_income'].mean()
#print(total_income_mean2)


In [74]:
total_income_mean=data['total_income'].mean()
print(total_income_mean)

26787.56835465871


In [75]:

#total_income_bytype_mean2=dataf1.groupby('income_type')['total_income'].mean()
#print(total_income_bytype_mean2)


In [76]:
total_income_bytype_mean=data.groupby('income_type')['total_income'].mean()
print(total_income_bytype_mean)

income_type
business         32386.793835
civil servant    27343.729582
employee         25820.841683
retiree          21940.394503
Name: total_income, dtype: float64


In [77]:
#total_income_bytype_median2=dataf1.groupby('income_type')['total_income'].median()
#print(total_income_bytype_median2)


In [78]:
total_income_bytype_median=data.groupby('income_type')['total_income'].median()
print(total_income_bytype_median)

income_type
business         27577.2720
civil servant    24071.6695
employee         22815.1035
retiree          18962.3180
Name: total_income, dtype: float64


There is a small difference between the means and medians calculated on the masked dataframe and on the original dataframe. In this dataset that difference is not crucial for the final results. I prefer to use the original dataframe, in an attempt to introduce fewer errors.

Let's find the mean and median for different groupings of columns:

In [79]:
# Look at the mean values for income based on identified factors
total_income_debt=data.groupby('debt')['total_income'].mean()
print(total_income_debt)


debt
0    26848.661065
1    26096.143537
Name: total_income, dtype: float64


In [80]:
# Look at the median values for income based on identified factors
total_income_debt_median=data.groupby('debt')['total_income'].median()
print(total_income_debt_median)


debt
0    23225.905
1    22928.480
Name: total_income, dtype: float64


Repeat such comparisons for multiple factors. Income usually depends on a person's age, gender, education and employment, so let's group by these factors and calculate the mean and median for each of them.
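
To avoid repeating two `groupby` calls per factor, both statistics can be requested in one step with `agg`. A minimal sketch on made-up incomes (the `mean_minus_median` column is a hypothetical helper added for the example):

```python
import pandas as pd

# Toy data (illustrative only)
df = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'total_income': [10.0, 20.0, 60.0, 20.0, 30.0, 100.0],
})

# One table with both statistics side by side
summary = df.groupby('gender')['total_income'].agg(['mean', 'median'])

# A large positive gap hints at a right-skewed distribution
summary['mean_minus_median'] = summary['mean'] - summary['median']
print(summary)
```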



In [81]:

edu_mean=data.groupby('education_id')['total_income'].mean()
print(edu_mean)
edu_medi=data.groupby('education_id')['total_income'].median()
print(edu_medi)

education_id
0    33142.802434
1    24594.503037
2    29045.443644
3    21144.882211
4    27960.024667
Name: total_income, dtype: float64
education_id
0    28054.5310
1    21836.5830
2    25618.4640
3    18741.9760
4    25161.5835
Name: total_income, dtype: float64


In [82]:
gender_mean=data.groupby('gender')['total_income'].mean()
print(gender_mean)
gender_medi=data.groupby('gender')['total_income'].median()
print(gender_medi)


gender
F    24655.604757
M    30907.144369
Name: total_income, dtype: float64
gender
F    21464.845
M    26834.295
Name: total_income, dtype: float64


In [83]:
gender_edu_mean=data.groupby(['gender','education_id'])['total_income'].mean()
print(gender_edu_mean)
gender_edu_medi=data.groupby(['gender','education_id'])['total_income'].median()
print(gender_edu_medi)


gender  education_id
F       0               30306.441576
        1               22671.099805
        2               26470.312199
        3               19118.479588
        4               29345.394000
M       0               38981.070503
        1               28296.294264
        2               33209.842210
        3               23798.931664
        4               27267.340000
Name: total_income, dtype: float64
gender  education_id
F       0               26063.4715
        1               20101.2700
        2               22836.0820
        3               17223.9615
        4               29345.3940
M       0               32675.8355
        1               25435.5815
        2               29973.6640
        3               21204.0860
        4               25161.5835
Name: total_income, dtype: float64


Take a look at the minimum and maximum income by known factors. This will help us decide whether to use the mean or the median.

In [84]:
test1=data.groupby(['gender','education_id'])['total_income'].max()
print(test1)
test2=data.groupby(['gender','education_id'])['total_income'].min()
print(test2)

gender  education_id
F       0               228469.514
        1               274402.943
        2               153349.533
        3                65263.983
        4                40868.031
M       0               362496.645
        1               276204.162
        2               131588.163
        3                78410.774
        4                42945.794
Name: total_income, dtype: float64
gender  education_id
F       0                5148.514
        1                3306.762
        2                5831.255
        3                4049.374
        4               17822.757
M       0                6844.452
        1                3392.845
        2                5514.581
        3                5837.099
        4               15800.399
Name: total_income, dtype: float64


In [85]:
min_total_income=data['total_income'].min()
print(min_total_income)
max_total_income=data['total_income'].max()
print(max_total_income)

3306.762
362496.645


In [86]:
type_inc_edu_mean=data.groupby(['income_type','education_id'])['total_income'].mean()
print(type_inc_edu_mean)
type_inc_edu_median=data.groupby(['income_type','education_id'])['total_income'].median()
print(type_inc_edu_median)

income_type    education_id
business       0               38780.136881
               1               28718.435242
               2               31623.893705
               3               26409.124931
civil servant  0               31571.287664
               1               24648.816597
               2               27596.312587
               3               29449.016667
               4               17822.757000
employee       0               30650.288996
               1               24426.079549
               2               27951.531586
               3               21954.056075
               4               31089.653667
retiree        0               27306.878056
               1               21071.829349
               2               22129.937314
               3               17810.387914
               4               28334.215000
Name: total_income, dtype: float64
income_type    education_id
business       0               32285.6640
               1               

I should carry out some further calculations to find out whether the data are skewed and whether there are outliers. For now, looking at the description and at the results for mean, median, min and max, I assume that there are outliers and that the data are not symmetrical. So I will use the MEDIAN, and I will find the median total income based on 4 known factors: age, gender, education and type of income.
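
The further calculations mentioned here could look like the sketch below, run on a made-up income series. It uses `Series.skew()` and the common 1.5 × IQR rule of thumb for outliers; the rule and the toy values are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy incomes with one very large value (illustrative only)
income = pd.Series([10, 11, 12, 12, 13, 100], dtype=float)

# Positive skewness means a long right tail
skewness = income.skew()

# Flag values outside 1.5 * IQR from the quartiles
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(skewness)
print(outliers)
```

When the skewness is clearly positive and outliers are present, the mean is pulled upwards, which supports choosing the median.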

In [87]:
#using transform()
medians = data.groupby(['income_type','education_id','gender', 'age_group'])['total_income'].transform('median')

data['total_income'] = data['total_income'].fillna(medians)
print(data['total_income'])

0        40620.102
1        17932.802
2        23341.752
3        42820.568
4        25378.572
           ...    
21520    35966.698
21521    24959.969
21522    14347.610
21523    39054.888
21524    13127.587
Name: total_income, Length: 21471, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['total_income'] = data['total_income'].fillna(medians)


In [88]:
print(total_income_bytype_median)

income_type
business         27577.2720
civil servant    24071.6695
employee         22815.1035
retiree          18962.3180
Name: total_income, dtype: float64


In [89]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21471 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21471 non-null  float64
 3   education         21471 non-null  object 
 4   education_id      21471 non-null  int64  
 5   family_status     21471 non-null  object 
 6   family_status_id  21471 non-null  int64  
 7   gender            21470 non-null  object 
 8   income_type       21465 non-null  object 
 9   debt              21471 non-null  int64  
 10  total_income      21467 non-null  float64
 11  purpose           21471 non-null  object 
 12  years_employed    19351 non-null  float64
 13  age_group_test1   21471 non-null  object 
 14  age_group         21471 non-null  object 
dtypes: float64(4), int64(4), object(7)
memory usage: 2.6+ MB


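We can see that in column 10 (`total_income`) almost all missing values are gone: 21467 of 21471 entries are now non-null. The 4 remaining gaps are presumably rows whose group has no known income to take a median from. Let's check whether the values were replaced correctly: I will access some of the rows that previously held NaN and see which values they contain now.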

In [90]:
print(data.iloc[41])
print(data.iloc[55]) 
print(data.iloc[26])
print(data.iloc[90])
print(data.iloc[72])



children                                   0
days_employed                            NaN
dob_years                               50.0
education                secondary education
education_id                               1
family_status                        married
family_status_id                           0
gender                                     F
income_type                    civil servant
debt                                       0
total_income                       21093.356
purpose             second-hand car purchase
years_employed                           NaN
age_group_test1                        50-59
age_group                              50-59
Name: 41, dtype: object
children                              0
days_employed                       NaN
dob_years                          54.0
education           secondary education
education_id                          1
family_status         civil partnership
family_status_id                      1
gender               

Trying a different approach, this time filling the remaining gaps with the median by `income_type` only:

In [91]:
total_income_bytype_median_dict=pd.Series(total_income_bytype_median).to_dict()
data['total_income']=data['total_income'].fillna(data.income_type.map(total_income_bytype_median_dict))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['total_income']=data['total_income'].fillna(data.income_type.map(total_income_bytype_median_dict))


In [93]:
display(data.iloc[41])
display(data.iloc[55]) 
display(data.iloc[26])
display(data.iloc[90])
display(data.iloc[72])

children                                   0
days_employed                            NaN
dob_years                               50.0
education                secondary education
education_id                               1
family_status                        married
family_status_id                           0
gender                                     F
income_type                    civil servant
debt                                       0
total_income                       21093.356
purpose             second-hand car purchase
years_employed                           NaN
age_group_test1                        50-59
age_group                              50-59
Name: 41, dtype: object

children                              0
days_employed                       NaN
dob_years                          54.0
education           secondary education
education_id                          1
family_status         civil partnership
family_status_id                      1
gender                                F
income_type                     retiree
debt                                  1
total_income                  18366.459
purpose               to have a wedding
years_employed                      NaN
age_group_test1                   50-59
age_group                         50-59
Name: 55, dtype: object

children                              0
days_employed                       NaN
dob_years                          41.0
education           secondary education
education_id                          1
family_status                   married
family_status_id                      0
gender                                M
income_type               civil servant
debt                                  0
total_income                  27691.871
purpose                       education
years_employed                      NaN
age_group_test1                   40-49
age_group                         40-49
Name: 26, dtype: object

children                               2
days_employed                        NaN
dob_years                           35.0
education              bachelor's degree
education_id                           0
family_status                    married
family_status_id                       0
gender                                 F
income_type                     employee
debt                                   0
total_income                  25091.4555
purpose             housing transactions
years_employed                       NaN
age_group_test1                    35-39
age_group                          35-39
Name: 90, dtype: object

children                                                   1
days_employed                                            NaN
dob_years                                               32.0
education                                  bachelor's degree
education_id                                               0
family_status                                        married
family_status_id                                           0
gender                                                     M
income_type                                    civil servant
debt                                                       0
total_income                                       33935.027
purpose             transactions with commercial real estate
years_employed                                           NaN
age_group_test1                                        26-34
age_group                                              26-34
Name: 72, dtype: object

It seems that, with both methods, `total_income` in the rows with missing values has been filled with the medians of the corresponding categories.
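
Instead of picking row numbers by hand, one option is to remember which rows were missing before the fill and inspect exactly those afterwards. A minimal sketch on made-up data (`was_missing` is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

# Toy frame with two gaps (illustrative only)
df = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree', 'retiree'],
    'total_income': [100.0, np.nan, 40.0, np.nan],
})

# Record the gaps BEFORE filling
was_missing = df['total_income'].isna()

medians = df.groupby('income_type')['total_income'].transform('median')
df['total_income'] = df['total_income'].fillna(medians)

# Show only the rows that were filled
filled_rows = df.loc[was_missing]
print(filled_rows)
```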

In [94]:
data.info()
data.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21471 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21471 non-null  float64
 3   education         21471 non-null  object 
 4   education_id      21471 non-null  int64  
 5   family_status     21471 non-null  object 
 6   family_status_id  21471 non-null  int64  
 7   gender            21470 non-null  object 
 8   income_type       21465 non-null  object 
 9   debt              21471 non-null  int64  
 10  total_income      21470 non-null  float64
 11  purpose           21471 non-null  object 
 12  years_employed    19351 non-null  float64
 13  age_group_test1   21471 non-null  object 
 14  age_group         21471 non-null  object 
dtypes: float64(4), int64(4), object(7)
memory usage: 2.6+ MB


children               0
days_employed       2120
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 1
income_type            6
debt                   0
total_income           1
purpose                0
years_employed      2120
age_group_test1        0
age_group              0
dtype: int64
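
The two fills above, fine-grained groups first and then `income_type` alone, amount to a fallback chain. The sketch below shows that pattern on made-up data, with the overall median added as a last resort. That final step is an assumption for illustration and was not applied in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: the (employee, M) group has no known income at all
df = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'gender': ['F', 'F', 'M', 'F', 'F'],
    'total_income': [100.0, np.nan, np.nan, 50.0, np.nan],
})

# Medians at three levels of detail, all computed from the known values
fine = df.groupby(['income_type', 'gender'])['total_income'].transform('median')
coarse = df.groupby('income_type')['total_income'].transform('median')
overall = df['total_income'].median()

# Try the most specific median first, then fall back to coarser ones
df['total_income'] = df['total_income'].fillna(fine).fillna(coarse).fillna(overall)
print(df)
```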

###  Restoring values in `days_employed`

If possible, I will restore the missing values following the same procedure used for the total income. However, I already suspect that this column has more issues: I have already dealt with negative values scattered throughout it. Let's figure out whether the current values make sense by transforming days into years and comparing them with the age of the client.
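
The comparison with the client's age can be sketched as follows, using three rows copied from the `data.head()` output above (values rounded). The `implausible` column is a hypothetical helper that flags anyone who would have worked longer than they have been alive.

```python
import pandas as pd

# Rows 0, 4 and 5 from the head() output, rounded
df = pd.DataFrame({
    'dob_years': [42, 53, 27],
    'days_employed': [8437.67, 340266.07, 926.19],
})

# Convert days to years and compare with age
df['years_employed'] = df['days_employed'] / 365
df['implausible'] = df['years_employed'] > df['dob_years']
print(df)
```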

In [95]:
# Distribution of `days_employed` medians based on your identified parameters

days_employed_median=data.groupby(['gender','education_id'])['days_employed'].median()
print(days_employed_median)

gender  education_id
F       0                 2009.211057
        1                 2890.224831
        2                 1209.128083
        3               168556.875700
        4               191122.147707
M       0                 1657.056118
        1                 1710.440205
        2                 1197.316592
        3                 1331.017371
        4                 3851.735057
Name: days_employed, dtype: float64


In [96]:
# Distribution of `days_employed` means based on your identified parameters
days_employed_mean=data.groupby(['gender','education_id'])['days_employed'].mean()
print(days_employed_mean)

gender  education_id
F       0                47234.757552
        1                95616.564419
        2                28078.545846
        3               183252.662395
        4               191122.147707
M       0                32373.092864
        1                39456.131517
        2                 8685.270012
        3                61039.444626
        4                86424.371456
Name: days_employed, dtype: float64


In [97]:
data.head(14)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,years_employed,age_group_test1,age_group
0,1,8437.673028,42.0,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,23.116912,40-49,40-49
1,1,4024.803754,36.0,secondary education,1,married,0,F,employee,0,17932.802,car purchase,11.02686,35-39,35-39
2,0,5623.42261,33.0,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house,15.406637,26-34,26-34
3,3,4124.747207,32.0,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,11.300677,26-34,26-34
4,0,340266.072047,53.0,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,932.235814,50-59,50-59
5,0,926.185831,27.0,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,2.537495,26-34,26-34
6,0,2879.202052,43.0,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,7.888225,40-49,40-49
7,0,152.779569,50.0,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education,0.418574,50-59,50-59
8,2,6929.865299,35.0,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding,18.985932,35-39,35-39
9,0,2188.756445,41.0,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,5.996593,40-49,40-49



I will use the median for a more robust estimate, since there are considerable differences among the groups.

In [98]:
# Divide the group means by 365 to express them in years
# (note: despite its name, median_years is built from days_employed_mean)
median_years=days_employed_mean.div(365)
print (median_years.sort_values(ascending=False).head(20))

gender  education_id
F       4               523.622322
        3               502.062089
        1               261.963190
M       4               236.779100
        3               167.231355
F       0               129.410295
M       1               108.098990
        0                88.693405
F       2                76.927523
M       2                23.795260
Name: days_employed, dtype: float64


As previously assessed, the data in the `days_employed` column are corrupted and not relevant to my study, so replacing the missing values is not strictly necessary.

Still, for the sake of practice and to benefit the business, I will replace the missing values in the `days_employed` column:

In [99]:
days_employed_median=data.groupby(['gender','education_id'])['days_employed'].transform('median')
data['days_employed'] = data['days_employed'].fillna(days_employed_median)
print(data['days_employed'])


0          8437.673028
1          4024.803754
2          5623.422610
3          4124.747207
4        340266.072047
             ...      
21520      4529.316663
21521    343937.404131
21522      2113.346888
21523      3112.481705
21524      1984.507589
Name: days_employed, Length: 21471, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['days_employed'] = data['days_employed'].fillna(days_employed_median)
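The warning above usually means that `data` was itself produced by filtering another DataFrame. A minimal sketch of one common way to avoid it, assuming a small made-up frame (`raw`, `clean` and all values are hypothetical, not the bank data): take an explicit `.copy()` after filtering, then fill with the group medians.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
raw = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'education_id': [0, 0, 0, 1, 1, 1],
    'dob_years': [30, 41, 0, 52, 38, 45],
    'days_employed': [100.0, 300.0, np.nan, 50.0, np.nan, 70.0],
})

# Filtering returns a new object; .copy() makes it explicitly independent,
# so later column assignments do not raise SettingWithCopyWarning
clean = raw[raw['dob_years'] > 0].copy()

# Median of each (gender, education_id) group, aligned to the rows of `clean`
group_median = (
    clean.groupby(['gender', 'education_id'])['days_employed'].transform('median')
)
clean['days_employed'] = clean['days_employed'].fillna(group_median)

print(clean)
```

The same pattern should apply to the real `data` frame if the copy is taken at the step where rows were first filtered out.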


In [100]:
data.info()
data.isnull().sum()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21471 non-null  int64  
 1   days_employed     21471 non-null  float64
 2   dob_years         21471 non-null  float64
 3   education         21471 non-null  object 
 4   education_id      21471 non-null  int64  
 5   family_status     21471 non-null  object 
 6   family_status_id  21471 non-null  int64  
 7   gender            21470 non-null  object 
 8   income_type       21465 non-null  object 
 9   debt              21471 non-null  int64  
 10  total_income      21470 non-null  float64
 11  purpose           21471 non-null  object 
 12  years_employed    19351 non-null  float64
 13  age_group_test1   21471 non-null  object 
 14  age_group         21471 non-null  object 
dtypes: float64(4), int64(4), object(7)
memory usage: 2.6+ MB


children               0
days_employed          0
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 1
income_type            6
debt                   0
total_income           1
purpose                0
years_employed      2120
age_group_test1        0
age_group              0
dtype: int64

We can see that the missing values in days_employed have been replaced. The 'years_employed' column that I created still has 2120 missing values, but I don't need it anymore, so I will drop it to simplify the analysis. A few missing values also remain in gender, income_type and total_income.

In [101]:
data=data.drop(columns=['years_employed'])

In [102]:
data.info()
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21471 non-null  int64  
 1   days_employed     21471 non-null  float64
 2   dob_years         21471 non-null  float64
 3   education         21471 non-null  object 
 4   education_id      21471 non-null  int64  
 5   family_status     21471 non-null  object 
 6   family_status_id  21471 non-null  int64  
 7   gender            21470 non-null  object 
 8   income_type       21465 non-null  object 
 9   debt              21471 non-null  int64  
 10  total_income      21470 non-null  float64
 11  purpose           21471 non-null  object 
 12  age_group_test1   21471 non-null  object 
 13  age_group         21471 non-null  object 
dtypes: float64(3), int64(4), object(7)
memory usage: 2.5+ MB


children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              1
income_type         6
debt                0
total_income        1
purpose             0
age_group_test1     0
age_group           0
dtype: int64

## Categorization of data

#### Goal of analysis: check whether the loan default rate differs depending on number of kids, marital status, income level and loan purpose

- Is there a connection between having kids and repaying a loan on time?
- Is there a connection between marital status and repaying a loan on time?
- Is there a connection between income level and repaying a loan on time?
- How do different loan purposes affect on-time loan repayment?

To answer the questions and test the hypotheses, it is better to work with categorized data, which gives clearer information. I will categorize total income and loan purpose so that each can be compared easily with the debt column.

In [103]:
# Print the values for selected data for categorization
print(data['total_income'].value_counts())
print(data['family_status_id'].value_counts())
print(data['children'].value_counts())
print(data['purpose'].value_counts())
print(data['debt'].value_counts())



20178.4195    175
18366.4590    132
17749.0720    128
18986.5500    107
24506.8390     85
             ... 
15843.5710      1
35720.9900      1
17950.7370      1
13535.0140      1
13127.5870      1
Name: total_income, Length: 19411, dtype: int64
0    12344
1     4163
4     2810
3     1195
2      959
Name: family_status_id, dtype: int64
0    14107
1     4856
2     2128
3      330
4       41
5        9
Name: children, dtype: int64
wedding ceremony                            793
having a wedding                            773
to have a wedding                           769
real estate transactions                    675
buy commercial real estate                  662
buying property for renting out             652
housing transactions                        652
transactions with commercial real estate    650
purchase of the house                       646
housing                                     646
purchase of the house for my family         638
construction of own property           

Let's check unique values

In [104]:
# Check the unique values
print(data['purpose'].unique())



['purchase of the house' 'car purchase' 'supplementary education'
 'to have a wedding' 'housing transactions' 'education' 'having a wedding'
 'purchase of the house for my family' 'buy real estate'
 'buy commercial real estate' 'buy residential real estate'
 'construction of own property' 'property' 'building a property'
 'buying a second-hand car' 'buying my own car'
 'transactions with commercial real estate' 'building a real estate'
 'housing' 'transactions with my real estate' 'cars' 'to become educated'
 'second-hand car purchase' 'getting an education' 'car'
 'wedding ceremony' 'to get a supplementary education'
 'purchase of my own house' 'real estate transactions'
 'getting higher education' 'to own a car' 'purchase of a car'
 'profile education' 'university education'
 'buying property for renting out' 'to buy a car' 'housing renovation'
 'going to university']


I can spot 4 main categories: wedding, car, education, and real estate. I will create a new column with those categories and assign each row to one of them using a lemmatization approach.


In [111]:
from pymystem3 import Mystem
from collections import Counter
m = Mystem()

Installing mystem to C:\Users\User/.local/bin\mystem.exe from http://download.cdn.yandex.net/mystem/mystem-3.1-win-64bit.zip


In [112]:
# importing WordNet Lemmatizer:
import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemma = WordNetLemmatizer()

data.purpose.head()

0      purchase of the house
1               car purchase
2      purchase of the house
3    supplementary education
4          to have a wedding
Name: purpose, dtype: object

In [113]:

# Keyword lists that define each purpose category
real_estate_category=['house','housing','real estate','property']
car_category=['car', 'cars']

education_category= ['education', 'educated','university']
           
wedding_category =['wedding']

In [114]:
# Let's write a function to categorize the data based on common topic

def lemmatization_func(line):
  
    words = nltk.word_tokenize(line)
    lemmas = [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
    lemmas=[l.lower() for l in lemmas]
    
    if any(word in lemmas for word in real_estate_category):
        return 'real estate'
    elif  any(word in lemmas for word in car_category):
        return 'car'
    elif  any(word in lemmas for word in education_category):
        return 'education'
    elif  any(word in lemmas for word in wedding_category):
        return 'wedding'
    else:
        
        return 'other'


In [121]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

In [122]:
data['clean_purpose']=data['purpose'].apply(lemmatization_func)

In [123]:
data['clean_purpose'].value_counts()

real estate    6348
other          4466
car            4308
education      4014
wedding        2335
Name: clean_purpose, dtype: int64

In [124]:
#count the total values, to make sure nothing was lost
data['clean_purpose'].value_counts().sum()

21471

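My new column, 'clean_purpose', is ready. Besides the four intended categories, it also contains an 'other' group (4466 rows) for purposes that matched none of the keywords.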
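The 'other' rows appear to be purposes such as 'buy real estate': the function compares single tokens, so the two-word keyword 'real estate' can never match one. A minimal sketch of one alternative, which swaps lemmatization for regular expressions applied to the whole string; `PURPOSE_PATTERNS` and `categorize_purpose` are hypothetical names, not part of the notebook above.

```python
import re

# Hypothetical patterns; \b keeps 'car' from matching inside longer words
PURPOSE_PATTERNS = {
    'real estate': r'\b(house|housing|real estate|property)\b',
    'car': r'\bcars?\b',
    'education': r'\b(education|educated|university)\b',
    'wedding': r'\bwedding\b',
}


def categorize_purpose(purpose):
    """Return the first category whose pattern occurs in the purpose text."""
    text = purpose.lower()
    for category, pattern in PURPOSE_PATTERNS.items():
        if re.search(pattern, text):
            return category
    return 'other'


examples = [
    'buy commercial real estate',
    'housing renovation',
    'to own a car',
    'going to university',
    'wedding ceremony',
]
for purpose in examples:
    print(purpose, '->', categorize_purpose(purpose))
```

It could be applied with `data['purpose'].apply(categorize_purpose)`; the counts per category would then differ from the ones shown above.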

In [125]:
data['children'].value_counts()

0    14107
1     4856
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

The children column already takes only a few discrete values (0 to 5), so it can be used as a category as is.

In [126]:
print(data.groupby('children')['family_status'].value_counts())

children  family_status    
0         married              7473
          civil partnership    2741
          unmarried            2262
          widow / widower       847
          divorced              784
1         married              3004
          civil partnership    1001
          unmarried             454
          divorced              316
          widow / widower        81
2         married              1582
          civil partnership     355
          unmarried              84
          divorced               83
          widow / widower        24
3         married               249
          civil partnership      56
          divorced               11
          unmarried               8
          widow / widower         6
4         married                29
          civil partnership       8
          unmarried               2
          divorced                1
          widow / widower         1
5         married                 7
          civil partnership       2


I could categorize income by household (family status and number of children), but my understanding is that the hypotheses are more straightforward (for example, number of kids versus debt), so I don't think this step is necessary.

Let's categorize income.

In [127]:
data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21471.0,21471.0,21471.0,21471.0,21471.0,21471.0,21470.0
mean,0.480229,60651.763994,43.482727,0.817195,0.973685,0.081086,26476.595986
std,0.755892,133409.875647,12.217198,0.548508,1.421082,0.272974,15748.494239
min,0.0,24.141633,19.0,0.0,0.0,0.0,3306.762
25%,0.0,1024.27409,33.5,1.0,0.0,0.0,17156.08775
50%,0.0,2182.448451,43.0,1.0,0.0,0.0,23219.727
75%,1.0,4815.241499,53.0,1.0,1.0,0.0,31639.309
max,5.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


In [128]:

def total_income_group(income):
    '''Return the income group for an income value, using the following rules:
    - 'low income' for income < 20000
    - 'middle income' for 20000 <= income < 50000
    - 'middle upper income' for 50000 <= income < 100000
    - 'upper income' for 100000 <= income < 200000
    - 'very high income' for higher values
    '''
    if income < 20000:
        return 'low income'
    if income < 50000:
        return 'middle income'
    if income < 100000:
        return 'middle upper income'
    if income < 200000:
        return 'upper income'
    return 'very high income'
        
        

Check that it works:

In [129]:
print(total_income_group(17000))
print(total_income_group(300000))
print(total_income_group(60000))

low income
very high income
middle upper income
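The three spot checks above do not exercise the thresholds themselves. A minimal sketch of boundary checks, restating the function so the block is self-contained; the chosen test values are my own, and the last assertion records how the function treats a missing income.

```python
import math


def total_income_group(income):
    """Same rules as the notebook function, restated for a standalone check."""
    if income < 20000:
        return 'low income'
    if income < 50000:
        return 'middle income'
    if income < 100000:
        return 'middle upper income'
    if income < 200000:
        return 'upper income'
    return 'very high income'


# Each threshold belongs to the higher group because the comparisons are strict
assert total_income_group(19999.99) == 'low income'
assert total_income_group(20000) == 'middle income'
assert total_income_group(50000) == 'middle upper income'
assert total_income_group(100000) == 'upper income'
assert total_income_group(200000) == 'very high income'

# Every comparison with NaN is False, so a missing income falls through
# to the final return
assert total_income_group(math.nan) == 'very high income'
print('all boundary checks passed')
```

Since total_income still has one missing value, that row would be labelled 'very high income' by this logic.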


Apply the function to 'total_income':

In [134]:

data['total_income_groups'] = data['total_income'].apply(total_income_group)

In [135]:
# Getting summary statistics for the column
print(data['total_income_groups'].value_counts())



middle income          12294
low income              7855
middle upper income     1222
upper income              88
very high income          12
Name: total_income_groups, dtype: int64
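The same ranges can also be expressed with `pandas.cut`. This is a sketch on made-up incomes, not a replacement for the function above; `incomes`, `bins`, `labels` and `groups` are hypothetical names. With `right=False` each bin includes its lower edge and excludes its upper edge, which mirrors the `<` comparisons, and a missing income stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample incomes, including one missing value
incomes = pd.Series([17000.0, 20000.0, 60000.0, 150000.0, 300000.0, np.nan])

bins = [-np.inf, 20000, 50000, 100000, 200000, np.inf]
labels = [
    'low income',
    'middle income',
    'middle upper income',
    'upper income',
    'very high income',
]

# right=False -> intervals are [lower, upper), e.g. 20000 is 'middle income'
groups = pd.cut(incomes, bins=bins, labels=labels, right=False)
print(groups)
```

The result is a categorical column whose categories keep the order given in `labels`, which can make later tables easier to read.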



In [136]:
# Inspect the last 30 rows of the new income group column
data['total_income_groups'].tail(30)

21495          middle income
21496             low income
21497          middle income
21498          middle income
21499          middle income
21500             low income
21501             low income
21502          middle income
21503          middle income
21504             low income
21505             low income
21506          middle income
21507             low income
21508             low income
21509             low income
21510             low income
21511          middle income
21512          middle income
21513          middle income
21514    middle upper income
21515             low income
21516    middle upper income
21517          middle income
21518          middle income
21519             low income
21520          middle income
21521          middle income
21522             low income
21523          middle income
21524             low income
Name: total_income_groups, dtype: object

In [137]:
# Preview the data with the new category columns
data.head(30)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group_test1,age_group,clean_purpose,total_income_groups
0,1,8437.673028,42.0,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49,40-49,real estate,middle income
1,1,4024.803754,36.0,secondary education,1,married,0,F,employee,0,17932.802,car purchase,35-39,35-39,car,low income
2,0,5623.42261,33.0,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house,26-34,26-34,real estate,middle income
3,3,4124.747207,32.0,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,26-34,26-34,education,middle income
4,0,340266.072047,53.0,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59,50-59,wedding,middle income
5,0,926.185831,27.0,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,26-34,26-34,real estate,middle income
6,0,2879.202052,43.0,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49,40-49,real estate,middle income
7,0,152.779569,50.0,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education,50-59,50-59,education,middle income
8,2,6929.865299,35.0,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding,35-39,35-39,wedding,low income
9,0,2188.756445,41.0,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49,40-49,real estate,middle income


## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

In [138]:
children_count=data['children'].value_counts()

In [139]:
# Check the children data and paying back on time
grouped_children_debt=data.groupby(['children','debt'])['debt'].size()

print(grouped_children_debt)





children  debt
0         0       13044
          1        1063
1         0        4411
          1         445
2         0        1926
          1         202
3         0         303
          1          27
4         0          37
          1           4
5         0           9
Name: debt, dtype: int64


In [140]:
data_pivot_1= data.pivot_table(index='children', values='debt', aggfunc='sum')

print(data_pivot_1)

          debt
children      
0         1063
1          445
2          202
3           27
4            4
5            0


In [141]:
# add a 'ratio' column

data_pivot_1['default_rate_children']=(data_pivot_1['debt'] / children_count)*100

print(data_pivot_1)

          debt  default_rate_children
children                             
0         1063               7.535266
1          445               9.163921
2          202               9.492481
3           27               8.181818
4            4               9.756098
5            0               0.000000


In [142]:
default_rate_children = data.groupby('children')['debt'].agg(Count='count', Sum='sum', Mean = 'mean')
display(default_rate_children)


Unnamed: 0_level_0,Count,Sum,Mean
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,14107,1063,0.075353
1,4856,445,0.091639
2,2128,202,0.094925
3,330,27,0.081818
4,41,4,0.097561
5,9,0,0.0


We obtained a table with the number of clients with debt per number of children. Since debt is coded as 0 or 1, sum() returns the total number of clients with debt, and mean() returns the share of clients with debt. If we look at the variable grouped_children_debt, we can easily see that, for every number of children, there are far more clients without debt than clients with debt.
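The sum-and-mean logic described above can be wrapped in one helper and reused for any grouping column. A minimal sketch on toy data, assuming a 0/1 `debt` column as in the dataset; `default_rate_by` and the `toy` frame are hypothetical.

```python
import pandas as pd


def default_rate_by(df, column):
    """Count clients and defaulters per group and add the default rate in percent."""
    summary = df.groupby(column)['debt'].agg(
        clients='count',      # rows in the group
        defaulters='sum',     # debt is 0/1, so the sum counts the 1s
        default_rate='mean',  # mean of a 0/1 flag is the share of 1s
    )
    summary['default_rate'] = summary['default_rate'] * 100
    return summary


# Hypothetical toy data, not the bank dataset
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

rates = default_rate_by(toy, 'children')
print(rates)
```

On the real data the same call would be `default_rate_by(data, 'children')`, and likewise for 'family_status', 'total_income_groups' and 'clean_purpose'.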

**Conclusion**


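Based on the default rate per number of children (about 7.5% to 9.8%; the 0% for five children rests on only 9 clients), the differences are small, so I assess that there is no strong relationship between the number of children and defaulting on a loan.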
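A two-proportion z-test is one standard way to check whether a gap between two default rates is larger than sampling noise alone would explain. A minimal sketch, assuming the counts for clients with no children and with one child from the table above; `two_proportion_z_test` is a hypothetical helper and is not part of the original analysis.

```python
import math


def two_proportion_z_test(defaults_a, total_a, defaults_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    rate_a = defaults_a / total_a
    rate_b = defaults_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate
    pooled = (defaults_a + defaults_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the children table: 0 children vs 1 child
z, p_value = two_proportion_z_test(1063, 14107, 445, 4856)
print(f'z = {z:.2f}, p-value = {p_value:.4f}')
```

A small p-value would only say that the gap is unlikely to be pure chance; whether a gap of that size matters to the bank is a separate business question.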


**Is there a correlation between family status and paying back on time?**

I follow the same code and reasoning as before: compare debt and family status in a pivot table, add a default rate column to it, and determine by observation whether there is a connection between the two characteristics.

In [143]:
# Check the family status data and paying back on time
family_status_debt=data.groupby(['family_status','debt'])['debt'].size()

print(family_status_debt)


# Calculating default-rate based on family status



family_status      debt
civil partnership  0        3775
                   1         388
divorced           0        1110
                   1          85
married            0       11413
                   1         931
unmarried          0        2536
                   1         274
widow / widower    0         896
                   1          63
Name: debt, dtype: int64


In [144]:
data_pivot_2 = data.pivot_table(index='family_status', values='debt', aggfunc='sum')

print(data_pivot_2)

                   debt
family_status          
civil partnership   388
divorced             85
married             931
unmarried           274
widow / widower      63


In [145]:
family_status_count=data['family_status'].value_counts()
print(family_status_count)

married              12344
civil partnership     4163
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64


In [146]:
# add a 'ratio' column

data_pivot_2['default_rate_family_status']=(data_pivot_2['debt'] /family_status_count )*100

display(data_pivot_2)

Unnamed: 0_level_0,debt,default_rate_family_status
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1
civil partnership,388,9.320202
divorced,85,7.112971
married,931,7.542126
unmarried,274,9.75089
widow / widower,63,6.569343


In [147]:
default_rate_family_status=data.groupby('family_status')['debt'].agg(Count='count', Sum='sum', Mean = 'mean')
display(default_rate_family_status)

Unnamed: 0_level_0,Count,Sum,Mean
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
civil partnership,4163,388,0.093202
divorced,1195,85,0.07113
married,12344,931,0.075421
unmarried,2810,274,0.097509
widow / widower,959,63,0.065693


**Conclusion**

Here too we don't see a large difference in default rate among the family status categories: the rate ranges from about 6.6% to 9.8%.


**Is there a correlation between income level and paying back on time?**

In [148]:
# Check the income level data and paying back on time

total_income_groups_debt=data.groupby(['total_income_groups','debt'])['debt'].size()

print(total_income_groups_debt)


# Calculating default-rate based on income level



total_income_groups  debt
low income           0        7200
                     1         655
middle income        0       11300
                     1         994
middle upper income  0        1136
                     1          86
upper income         0          83
                     1           5
very high income     0          11
                     1           1
Name: debt, dtype: int64


In [149]:
total_income_groups_count=data['total_income_groups'].value_counts()
print(total_income_groups_count)

middle income          12294
low income              7855
middle upper income     1222
upper income              88
very high income          12
Name: total_income_groups, dtype: int64


In [150]:
data_pivot_3 = data.pivot_table(index='total_income_groups', values='debt', aggfunc='sum')

print(data_pivot_3)

                     debt
total_income_groups      
low income            655
middle income         994
middle upper income    86
upper income            5
very high income        1


In [151]:
data_pivot_3['default_rate_total_income_group']=(data_pivot_3['debt'] /total_income_groups_count)*100

print(data_pivot_3)

                     debt  default_rate_total_income_group
total_income_groups                                       
low income            655                         8.338638
middle income         994                         8.085245
middle upper income    86                         7.037643
upper income            5                         5.681818
very high income        1                         8.333333


In [152]:
default_rate_total_income_groups=data.groupby('total_income_groups')['debt'].agg(Count='count', Sum='sum', Mean = 'mean')
display(default_rate_total_income_groups)

Unnamed: 0_level_0,Count,Sum,Mean
total_income_groups,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
low income,7855,655,0.083386
middle income,12294,994,0.080852
middle upper income,1222,86,0.070376
upper income,88,5,0.056818
very high income,12,1,0.083333


**Conclusion**

From this third pivot table, we can observe that the default rate ranges from about 5.7% to 8.3% across the income categories. I assess that income level does not clearly define a higher or lower risk either; however, I would not underestimate a gap of even one or two percentage points when it is applied to the loan volume of a mid-size or large bank.

**How does credit purpose affect the default rate?**

In [153]:
# Check the percentages for default rate for each credit purpose and analyze them

credit_purpose_debt=data.groupby(['clean_purpose','debt'])['debt'].size()

print(credit_purpose_debt)

clean_purpose  debt
car            0       3905
               1        403
education      0       3644
               1        370
other          0       4130
               1        336
real estate    0       5902
               1        446
wedding        0       2149
               1        186
Name: debt, dtype: int64


In [154]:
credit_purpose_count=data['clean_purpose'].value_counts()
print(credit_purpose_count)

real estate    6348
other          4466
car            4308
education      4014
wedding        2335
Name: clean_purpose, dtype: int64


In [155]:
data_pivot_4 = data.pivot_table(index='clean_purpose', values='debt', aggfunc='sum')

print(data_pivot_4)

               debt
clean_purpose      
car             403
education       370
other           336
real estate     446
wedding         186


In [156]:
data_pivot_4['default_rate_credit_purpose']=(data_pivot_4['debt'] /credit_purpose_count)*100

print(data_pivot_4)

               debt  default_rate_credit_purpose
clean_purpose                                   
car             403                     9.354689
education       370                     9.217738
other           336                     7.523511
real estate     446                     7.025835
wedding         186                     7.965739


In [157]:
default_rate_credit_purpose=data.groupby('clean_purpose')['debt'].agg(Count='count', Sum='sum', Mean = 'mean')
display(default_rate_credit_purpose)

Unnamed: 0_level_0,Count,Sum,Mean
clean_purpose,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
car,4308,403,0.093547
education,4014,370,0.092177
other,4466,336,0.075235
real estate,6348,446,0.070258
wedding,2335,186,0.079657


**Conclusion**

In this last pivot table, we again observe a small range of default rates, from about 7.0% to 9.4%, similar for all credit purposes.


In [158]:
display(data_pivot_1)
display(data_pivot_2)
display(data_pivot_3)
display(data_pivot_4)


Unnamed: 0_level_0,debt,default_rate_children
children,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1063,7.535266
1,445,9.163921
2,202,9.492481
3,27,8.181818
4,4,9.756098
5,0,0.0


Unnamed: 0_level_0,debt,default_rate_family_status
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1
civil partnership,388,9.320202
divorced,85,7.112971
married,931,7.542126
unmarried,274,9.75089
widow / widower,63,6.569343


Unnamed: 0_level_0,debt,default_rate_total_income_group
total_income_groups,Unnamed: 1_level_1,Unnamed: 2_level_1
low income,655,8.338638
middle income,994,8.085245
middle upper income,86,7.037643
upper income,5,5.681818
very high income,1,8.333333


Unnamed: 0_level_0,debt,default_rate_credit_purpose
clean_purpose,Unnamed: 1_level_1,Unnamed: 2_level_1
car,403,9.354689
education,370,9.217738
other,336,7.523511
real estate,446,7.025835
wedding,186,7.965739


# General Conclusion 

If we take a look at the four pivot tables, we can conclude that for all the observed categories (total income, number of children, purpose of the loan and marital status) the loan default rate ranges from about 5.7 to 9.8 percent, leaving aside the nine clients with five children, none of whom defaulted. There aren't any large differences between the groups within each category.

Based on the dataset provided to me by the bank's loan division, I estimate that, in this sample of clients, there aren't considerable differences among the clients' loan default rates.

The general conclusion is that a customer's marital status, number of children, level of income and loan purpose do not have a strong impact on the risk of defaulting on a loan payment.

