# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.


## Open the data file and have a look at the general information. 


In [2]:
# Loading all the libraries
import pandas as pd


# Load the data
try:
    data = pd.read_csv('/credit_scoring_eng.csv')
except FileNotFoundError:
    # Fall back to the alternative location if the first path does not exist
    data = pd.read_csv('/datasets/credit_scoring_eng.csv')

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

[Now let's explore our data. You'll want to see how many columns and rows it has, look at a few rows to check for potential issues with the data.]

In [3]:
# Let's see how many rows and columns our dataset has

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [4]:
# let's print the first N rows

data.head(10)



Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


# Get info on data

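- `days_employed` and `total_income` have the same number of missing values (2174 each), apparently in the same rows.
- `days_employed` has negative values, and even as absolute values many of the numbers are implausibly large (for example, 340266 days).
- `education` contains the same categories written in different letter cases.
- `debt` is 0 in all of the first ten rows, which suggests that most customers have not had a debt on loan repayment.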

[Are there missing values across all columns or just a few? Briefly describe what you see in 1-2 sentences.]

In [5]:
# Let's count the missing values in each column, as absolute numbers and as a share of all rows

print(data.isna().sum())
print()
print(data.isnull().sum()/len(data))
print()
mis_values = data.isnull().sum().to_frame('missing_values')
mis_values['%'] = round(data.isnull().sum()/len(data),3)
mis_values.sort_values(by='%', ascending=False)

# Both days_employed and total_income have same number of missing values


children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

children            0.000000
days_employed       0.100999
dob_years           0.000000
education           0.000000
education_id        0.000000
family_status       0.000000
family_status_id    0.000000
gender              0.000000
income_type         0.000000
debt                0.000000
total_income        0.100999
purpose             0.000000
dtype: float64



Unnamed: 0,missing_values,%
days_employed,2174,0.101
total_income,2174,0.101
children,0,0.0
dob_years,0,0.0
education,0,0.0
education_id,0,0.0
family_status,0,0.0
family_status_id,0,0.0
gender,0,0.0
income_type,0,0.0


In [6]:
# Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.

print(data[(data.days_employed.isnull())&(data.total_income.isnull())])

# Since this table has 2174 rows, we can conclude that the missing values in both columns belong to the same rows.

       children  days_employed  dob_years            education  education_id  \
12            0            NaN         65  secondary education             1   
26            0            NaN         41  secondary education             1   
29            0            NaN         63  secondary education             1   
41            0            NaN         50  secondary education             1   
55            0            NaN         54  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN         47  Secondary Education             1   
21495         1            NaN         50  secondary education             1   
21497         0            NaN         48    BACHELOR'S DEGREE             0   
21502         1            NaN         42  secondary education             1   
21510         2            NaN         28  secondary education             1   

           family_status  family_status

The missing values are symmetric across the two columns: every row that lacks `days_employed` also lacks `total_income`. Possible explanations:

- These customers may prefer to keep this information confidential, for example because of the nature of their work.
- They may have skipped this part of the form.
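
The claim that both columns are missing in exactly the same rows can also be checked directly, without counting rows by eye. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the bank's data) and compares the two missing-value masks row by row.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "days_employed": [-8437.7, np.nan, -926.2, np.nan],
    "total_income": [40620.1, np.nan, 40922.2, np.nan],
})

# Boolean masks marking the missing entries in each column
missing_days = toy["days_employed"].isna()
missing_income = toy["total_income"].isna()

# True only if every row is missing in both columns or in neither
same_rows = bool((missing_days == missing_income).all())
print(same_rows)
```

On the real table the same comparison could be run with `data` in place of `toy`.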

In [7]:
#**Intermediate conclusion**

#Does the number of rows in the filtered table match the number of missing values? What conclusion can we make from this?


#The number of missing values in total_income and days_employed is the same.

#Based on the filtered table above, we know the missing values in both columns are found in the same rows.


#Calculate the percentage of the missing values compared to the whole dataset. Is it a considerably large piece of data? If so, you may want to fill the missing values. To do that, firstly we should consider whether the missing data could be due to the specific client characteristic, such as employment type or something else. You will need to decide which characteristic *you* think might be the reason. Secondly, we should check whether there's any dependence missing values have on the value of other indicators with the columns with identified specific client characteristic.

data['total_income'].isnull().sum()/len(data['total_income']) 

#Explain your next steps and how they correlate with the conclusions you made so far.
# We will need to investigate the best way to deal with the missing values in these rows

0.10099883855981417

In [8]:
# Let's investigate clients who do not have data on identified characteristic and the column with the missing values

print((data[(data.days_employed.isnull())&(data.total_income.isnull())]).groupby('income_type')['income_type'].count())
print()
print((data[(data.days_employed.isnull())&(data.total_income.isnull())]).groupby('education')['education'].count())
print()
print((data[(data.days_employed.isnull())&(data.total_income.isnull())]).groupby('dob_years')['dob_years'].count())
print()
print((data[(data.days_employed.isnull())&(data.total_income.isnull())]).groupby('children')['children'].count())

income_type
business          508
civil servant     147
employee         1105
entrepreneur        1
retiree           413
Name: income_type, dtype: int64

education
BACHELOR'S DEGREE        23
Bachelor's Degree        25
PRIMARY EDUCATION         1
Primary Education         1
SECONDARY EDUCATION      67
SOME COLLEGE              7
Secondary Education      65
Some College              7
bachelor's degree       496
primary education        19
secondary education    1408
some college             55
Name: education, dtype: int64

dob_years
0     10
19     1
20     5
21    18
22    17
23    36
24    21
25    23
26    35
27    36
28    57
29    50
30    58
31    65
32    37
33    51
34    69
35    64
36    63
37    53
38    54
39    51
40    66
41    59
42    65
43    50
44    44
45    50
46    48
47    59
48    46
49    50
50    51
51    50
52    53
53    44
54    55
55    48
56    54
57    56
58    56
59    34
60    39
61    38
62    38
63    29
64    37
65    20
66    20
67    16
68     9

In [9]:
# Checking distribution

(data[(data.days_employed.isnull())&(data.total_income.isnull())]).describe(include='all')


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
count,2174.0,0.0,2174.0,2174,2174.0,2174,2174.0,2174,2174,2174.0,0.0,2174
unique,,,,12,,5,,2,5,,,38
top,,,,secondary education,,married,,F,employee,,,having a wedding
freq,,,,1408,,1237,,1484,1105,,,92
mean,0.552438,,43.632015,,0.800828,,0.975161,,,0.078197,,
std,1.469356,,12.531481,,0.530157,,1.41822,,,0.268543,,
min,-1.0,,0.0,,0.0,,0.0,,,0.0,,
25%,0.0,,34.0,,0.25,,0.0,,,0.0,,
50%,0.0,,43.0,,1.0,,0.0,,,0.0,,
75%,1.0,,54.0,,1.0,,1.0,,,0.0,,



- Customers with missing values span a wide range of adult ages (median age 43), apart from 10 rows with an age of 0.
- Missing values are mostly concentrated in the 'employee' (1105 rows) and 'business' (508 rows) income types.
- Most of them have secondary education.

**Possible reasons for missing values in data**

Reasons could be:
- Confidentiality related.


In [10]:
# Checking the distribution in the whole dataset

data.describe(include='all')


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
count,21525.0,19351.0,21525.0,21525,21525.0,21525,21525.0,21525,21525,21525.0,19351.0,21525
unique,,,,15,,5,,3,8,,,38
top,,,,secondary education,,married,,F,employee,,,wedding ceremony
freq,,,,13750,,12380,,14236,11119,,,797
mean,0.538908,63046.497661,43.29338,,0.817236,,0.972544,,,0.080883,26787.568355,
std,1.381587,140827.311974,12.574584,,0.548138,,1.420324,,,0.272661,16475.450632,
min,-1.0,-18388.949901,0.0,,0.0,,0.0,,,0.0,3306.762,
25%,0.0,-2747.423625,33.0,,1.0,,0.0,,,0.0,16488.5045,
50%,0.0,-1203.369529,42.0,,1.0,,0.0,,,0.0,23202.87,
75%,1.0,-291.095954,53.0,,1.0,,1.0,,,0.0,32549.611,


**Intermediate conclusion**

- The mean and median of dob_years (customers' age) are very similar in the filtered table with missing values (43.6 and 43) and in the entire dataset (43.3 and 42). This suggests the missing values could be random.

- The fact that the missing values in days_employed and total_income all belong to the same rows may point to a pattern, but the distributions above suggest the missing values could be random. We must therefore investigate further before drawing a conclusion about the reason for the missing values.
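
One way to make this comparison more systematic is to put the category shares of the full dataset next to those of the rows with missing income. In the outputs above, employees make up 1105 of the 2174 rows with missing values (about 51%) and 11119 of all 21525 rows (about 52%), which is the kind of similarity one would expect if the values were missing at random. The sketch below shows the idea on a made-up frame; `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "business",
                    "retiree", "employee", "business"],
    "total_income": [22000.0, np.nan, 31000.0, np.nan, 25000.0, 28000.0],
})

# Rows where income is missing
missing = toy[toy["total_income"].isna()]

# Share of each income type overall and among the rows with missing income
comparison = pd.DataFrame({
    "share_all": toy["income_type"].value_counts(normalize=True),
    "share_missing": missing["income_type"].value_counts(normalize=True),
}).fillna(0)  # a category absent from the missing rows gets a share of 0

print(comparison)
```

Large gaps between the two columns would hint that missingness depends on that characteristic; similar shares are consistent with random gaps.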

In [11]:
# Check for other reasons and patterns that could lead to missing values

print((data[(data.days_employed.isnull())&(data.total_income.isnull())]).groupby('purpose')['purpose'].count())
print()
print((data[(data.days_employed.isnull())&(data.total_income.isnull())]).groupby('gender')['gender'].count())

purpose
building a property                         59
building a real estate                      46
buy commercial real estate                  67
buy real estate                             72
buy residential real estate                 61
buying a second-hand car                    42
buying my own car                           53
buying property for renting out             65
car                                         41
car purchase                                43
cars                                        57
construction of own property                75
education                                   42
getting an education                        50
getting higher education                    36
going to university                         56
having a wedding                            92
housing                                     60
housing renovation                          70
housing transactions                        74
profile education                           47
prope

**Intermediate conclusion**

- The most common profile among rows with missing values is a female employee with secondary education; the mean age is about 43.

- The fact that missing values match on both total_income and days_employed most probably means they are not accidental.


## Data transformation

[Let's go through each column to see what issues we may have in them.]

[Begin with removing duplicates and fixing educational information if required.]

In [13]:
# Let's see all values in education column to check if and what spellings will need to be fixed
print(data['education'].unique())

["bachelor's degree" 'secondary education' 'Secondary Education'
 'SECONDARY EDUCATION' "BACHELOR'S DEGREE" 'some college'
 'primary education' "Bachelor's Degree" 'SOME COLLEGE' 'Some College'
 'PRIMARY EDUCATION' 'Primary Education' 'Graduate Degree'
 'GRADUATE DEGREE' 'graduate degree']


In [14]:
# Fix the registers if required
data['education'] = data['education'].str.lower()

In [15]:
# Checking all the values in the column to make sure we fixed them

print(data['education'].unique())
print()

["bachelor's degree" 'secondary education' 'some college'
 'primary education' 'graduate degree']
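
Since the dataset also carries an `education_id` column, a further sanity check is possible: after lowercasing, each education label should map to exactly one identifier. The following is a minimal sketch on made-up rows (`toy` is an assumption for illustration), not a check that was run on the real data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "education": ["bachelor's degree", "Secondary Education",
                  "SECONDARY EDUCATION", "secondary education"],
    "education_id": [0, 1, 1, 1],
})

# Normalize the letter case, as done for the real column
toy["education"] = toy["education"].str.lower()

# Count the distinct identifiers attached to each normalized label
ids_per_label = toy.groupby("education")["education_id"].nunique()

# True if every label corresponds to a single identifier
one_to_one = bool((ids_per_label == 1).all())
print(ids_per_label)
print(one_to_one)
```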



[Check the data the `children` column]

In [16]:
# Let's see the distribution of values in the `children` column
print(data.groupby('children').size())
print()
print((data.children == -1).sum() / (data.loc[:,'children']).count())
print((data.children == 20).sum() / (data.loc[:,'children']).count())

       

children
-1        47
 0     14149
 1      4818
 2      2055
 3       330
 4        41
 5         9
 20       76
dtype: int64

0.002183507549361208
0.0035307781649245064


[Are there any strange things in the column? If yes, how high is the percentage of problematic data? How could they have occurred? Make a decision on what you will do with this data and explain your reasoning.]

- The share of problematic data is not high, but it could affect our analysis later on and make the categorization less clean.
- 0.2% of customers have -1 children --> will convert these to 1 by taking the absolute value
- 0.4% of customers have 20 children, which is not impossible but highly unlikely --> will delete these rows


In [17]:
# [fix the data based on your decision]



data['children'] = data['children'].abs()

data = data[data['children'] != 20]

In [18]:
# Checking the `children` column again to make sure it's all fixed

print(data.groupby('children').size())

children
0    14149
1     4865
2     2055
3      330
4       41
5        9
dtype: int64


[Check the data in the `days_employed` column. Firstly think about what kind of issues could there be and what you may want to check and how you will do it.]

In [19]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage
data.groupby('days_employed').size()


days_employed
-18388.949901     1
-17615.563266     1
-16593.472817     1
-16264.699501     1
-16119.687737     1
                 ..
 401663.850046    1
 401674.466633    1
 401675.093434    1
 401715.811749    1
 401755.400475    1
Length: 19284, dtype: int64

[If the amount of problematic data is high, it could've been due to some technical issues. We may probably want to propose the most obvious reason why it could've happened and what the correct data might've been, as we can't drop these problematic rows.]


- Since the column has NaNs, its data type is float --> these will need to be filled before the column can be converted to int; for now they stay as NaN
- Some negative numbers --> will convert these to their absolute values
- Some huge numbers (for example, 401755 days, which is more than 1000 years), but I will not address these, since any change would require assumptions that could end up distorting the data.
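
To put a number on "huge", the values can be converted to years and compared with the client's age. The sketch below takes three `days_employed` values from the first rows printed earlier (as absolute values) together with their ages; the 365-day year and the rule "more years employed than years lived" are assumptions chosen for illustration, not a decision made in this report.

```python
import pandas as pd

# Three rows taken from the head of the table, with days_employed as absolute values
toy = pd.DataFrame({
    "days_employed": [8437.673028, 340266.072047, 926.185831],
    "dob_years": [42, 53, 27],
})

# Approximate years of employment, assuming 365 days per year
toy["years_employed"] = toy["days_employed"] / 365

# Flag rows where employment would be longer than the client's whole life
toy["implausible"] = toy["years_employed"] > toy["dob_years"]

print(toy)
```

The second row works out to roughly 932 years of employment for a 53-year-old, which is why these values cannot be read as plain day counts.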

In [20]:
# Address the problematic values, if they exist

# For now I will leave the NaNs and postpone converting the column's type from float to int

# Converting all the column's values to positive
data['days_employed'] = data['days_employed'].abs()

In [21]:
# Check the result - make sure it's fixed
data.info()
data.groupby('days_employed').size()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21449 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21449 non-null  int64  
 1   days_employed     19284 non-null  float64
 2   dob_years         21449 non-null  int64  
 3   education         21449 non-null  object 
 4   education_id      21449 non-null  int64  
 5   family_status     21449 non-null  object 
 6   family_status_id  21449 non-null  int64  
 7   gender            21449 non-null  object 
 8   income_type       21449 non-null  object 
 9   debt              21449 non-null  int64  
 10  total_income      19284 non-null  float64
 11  purpose           21449 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


days_employed
24.141633        1
24.240695        1
30.195337        1
33.520665        1
34.701045        1
                ..
401663.850046    1
401674.466633    1
401675.093434    1
401715.811749    1
401755.400475    1
Length: 19284, dtype: int64

[Let's now look at the client's age and whether there are any issues there. Again, think about what data can be strange in this column, i.e. what cannot be someone's age.]

In [22]:
# Check the `dob_years` for suspicious values and count the percentage

print(data.groupby('dob_years').size())
print()
print((data.dob_years == 0).sum() / (data.loc[:,'dob_years']).count())


dob_years
0     100
19     14
20     51
21    110
22    183
23    253
24    263
25    356
26    407
27    491
28    503
29    543
30    537
31    558
32    508
33    579
34    600
35    615
36    553
37    533
38    597
39    572
40    605
41    605
42    594
43    511
44    545
45    494
46    472
47    480
48    537
49    505
50    511
51    447
52    483
53    458
54    478
55    442
56    482
57    459
58    461
59    442
60    376
61    354
62    351
63    269
64    264
65    194
66    183
67    167
68     99
69     84
70     65
71     58
72     33
73      8
74      6
75      1
dtype: int64

0.004662222015012355


In [23]:
# Address the issues in the `dob_years` column, if they exist
data = data[data['dob_years'] != 0]

In [24]:
# Check the result - make sure it's fixed
print(data.groupby('dob_years').size())

dob_years
19     14
20     51
21    110
22    183
23    253
24    263
25    356
26    407
27    491
28    503
29    543
30    537
31    558
32    508
33    579
34    600
35    615
36    553
37    533
38    597
39    572
40    605
41    605
42    594
43    511
44    545
45    494
46    472
47    480
48    537
49    505
50    511
51    447
52    483
53    458
54    478
55    442
56    482
57    459
58    461
59    442
60    376
61    354
62    351
63    269
64    264
65    194
66    183
67    167
68     99
69     84
70     65
71     58
72     33
73      8
74      6
75      1
dtype: int64


[Now let's check the `family_status` column. See what kind of values there are and what problems you may need to address.]

In [25]:
# Let's see the values for the column

print(data.groupby('family_status').size())
print()
print(data['family_status'].isna().sum())

family_status
civil partnership     4144
divorced              1183
married              12283
unmarried             2788
widow / widower        951
dtype: int64

0


In [26]:
# Address the problematic values in `family_status`, if they exist

# For now I don't see any problematic values here


In [27]:
# Check the result - make sure it's fixed


[Now let's check the `gender` column. See what kind of values there are and what problems you may need to address]

In [28]:
# Let's see the values in the column
print(data.groupby('gender').size())

gender
F      14118
M       7230
XNA        1
dtype: int64


In [29]:
# Address the problematic values, if they exist

# XNA in gender might stand for a third gender or be the result of a data entry error.
# For now I'm not sure whether this will affect the analysis, but since it's only 1 row I will delete it.

data = data[data['gender'] != 'XNA']

In [30]:
# Check the result - make sure it's fixed

print(data.groupby('gender').size())

gender
F    14118
M     7230
dtype: int64


[Now let's check the `income_type` column. See what kind of values there are and what problems you may need to address]

In [31]:
# Let's see the values in the column
print(data.groupby('income_type').size())

income_type
business                        5042
civil servant                   1451
employee                       11022
entrepreneur                       2
paternity / maternity leave        1
retiree                         3827
student                            1
unemployed                         2
dtype: int64


In [32]:
# Address the problematic values, if they exist

# no problematic values here

In [33]:
# Check the result - make sure it's fixed

print(data.groupby('income_type').size())

income_type
business                        5042
civil servant                   1451
employee                       11022
entrepreneur                       2
paternity / maternity leave        1
retiree                         3827
student                            1
unemployed                         2
dtype: int64


In [34]:
# Checking duplicates
print(data.duplicated().sum())
data[data.duplicated()]


71


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,,41,secondary education,1,married,0,F,employee,0,,purchase of the house for my family
3290,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
4182,1,,34,bachelor's degree,0,civil partnership,1,F,employee,0,,wedding ceremony
4851,0,,60,secondary education,1,civil partnership,1,F,retiree,0,,wedding ceremony
5557,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0,,64,secondary education,1,married,0,F,retiree,0,,supplementary education
21032,0,,60,secondary education,1,married,0,F,retiree,0,,to become educated
21132,0,,47,secondary education,1,married,0,F,employee,0,,housing renovation
21281,1,,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


In [35]:
# Address the duplicates, if they exist
 
# It seems all of the duplicates fall into the group of rows with missing values, since they all have total_income=NaN and days_employed=NaN
# We don't have any identifying data such as a UID or names to know whether duplicate rows belong to the same people, but considering the relatively low number of duplicates and the point above, I will delete these.

data = data.drop_duplicates().reset_index(drop=True)

In [36]:
# Last check whether we have any duplicates
print(data.duplicated().sum())

0
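
The observation that the duplicates all sit among the rows with missing income can be tested rather than read off the truncated display. pandas treats NaN values in the same position as equal when looking for duplicates, so rows that differ only by having NaN in both places are flagged. A minimal sketch on made-up rows (`toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "dob_years": [41, 41, 58, 30],
    "income_type": ["employee", "employee", "retiree", "employee"],
    "total_income": [np.nan, np.nan, np.nan, 21000.0],
})

# Mark every row that repeats an earlier row exactly (NaN matches NaN)
dup_mask = toy.duplicated()

# True if all duplicated rows have a missing total_income
dups_missing_income = bool(toy.loc[dup_mask, "total_income"].isna().all())

print(dup_mask.tolist())
print(dups_missing_income)
```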


In [37]:
# Check the size of the dataset that you now have after your first manipulations with it
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21277 entries, 0 to 21276
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21277 non-null  int64  
 1   days_employed     19193 non-null  float64
 2   dob_years         21277 non-null  int64  
 3   education         21277 non-null  object 
 4   education_id      21277 non-null  int64  
 5   family_status     21277 non-null  object 
 6   family_status_id  21277 non-null  int64  
 7   gender            21277 non-null  object 
 8   income_type       21277 non-null  object 
 9   debt              21277 non-null  int64  
 10  total_income      19193 non-null  float64
 11  purpose           21277 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 1.9+ MB


[Describe your new dataset: briefly say what's changed and what's the percentage of the changes, if there were any.]

The data now has 21277 entries, as opposed to the 21525 we started with (a reduction of about 1.2%).
Changes made:

- Deleted 76 rows with children = 20
- Deleted 1 row with gender = XNA
- Deleted 100 rows with dob_years = 0
- Deleted 71 duplicate rows
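
The row counts above can be cross-checked with a few lines of arithmetic. The numbers are taken from the list of changes; the variable names are only for illustration.

```python
# Row counts taken from the cleaning steps listed above
rows_before = 21525
removed = {
    "children == 20": 76,
    "gender == 'XNA'": 1,
    "dob_years == 0": 100,
    "duplicates": 71,
}

# Total number of deleted rows and the size of the remaining dataset
total_removed = sum(removed.values())
rows_after = rows_before - total_removed

# Share of the original dataset that was removed
share_removed = total_removed / rows_before

print(total_removed, rows_after, f"{share_removed:.2%}")
```

The total of 248 removed rows matches the drop from 21525 to 21277 entries, roughly 1.15% of the original data.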

## Working with missing values

[To speed up working with some data, you may want to work with dictionaries for some values, where IDs are provided. Explain why and which dictionaries you will work with.]

In [38]:
# Find the dictionaries

# I will create a dictionary of the medians of total_income based on education, from the table without missing values. 
# I will use this to fill missing values in total_income in the data table.

data_without_missing_income = data.dropna(subset=['total_income'])

education_medians = data_without_missing_income.groupby('education')['total_income'].median()
education_medians = education_medians.astype('int')

dict_education_medians=pd.Series(education_medians).to_dict()


print(dict_education_medians)

{"bachelor's degree": 28065, 'graduate degree': 25161, 'primary education': 18741, 'secondary education': 21835, 'some college': 25664}


In [39]:
# Visualizing missing values in total_income column in the data table

data[data['total_income'].isna()]


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21242,2,,47,secondary education,1,married,0,M,business,0,,purchase of a car
21247,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21249,0,,48,bachelor's degree,0,married,0,F,business,0,,building a property
21254,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


### Restoring missing values in `total_income`

[Briefly state which column(s) have values missing that you need to address. Explain how you will fix them.]

[Start with addressing total income missing values. Create an age category for clients. Create a new column with the age category. This strategy can help with calculating values for the total income.]

Three factors in our data usually have a strong influence on people's income: age, income type and education level. Of these three, I will use education to fill in the missing total_income values.


In [40]:
# Let's write a function that calculates the age category

def age_category(age):
    if age < 30:
        return ('18-29')
    elif age < 40:
        return ('30-39')
    elif age < 50:
        return ('40-49')   
    elif age < 60:
        return ('50-59')
    elif age < 70:
        return ('60-69')
    else:
        return ('70+')

In [41]:
# Test if the function works

print(age_category(18))
print(age_category(75))

18-29
70+
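
As an alternative to a hand-written function, the same age bands can be produced with `pd.cut`. This is a sketch of a different technique, not the approach used in this notebook; the `ages`, `bins` and `labels` names are assumptions for illustration, with bin edges chosen to mirror the thresholds in `age_category`.

```python
import pandas as pd

# A few sample ages, including values on the band boundaries
ages = pd.Series([18, 29, 30, 45, 59, 69, 70, 75])

# Left-closed bins: [0, 30), [30, 40), ..., [70, inf)
bins = [0, 30, 40, 50, 60, 70, float("inf")]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]

# right=False makes each bin include its lower edge and exclude its upper edge
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_groups.tolist())
```

The result is a categorical column with ordered bands, which keeps the groups in age order when they are sorted or grouped.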


In [42]:
# Creating new column based on function

data['age_category'] = data['dob_years'].apply(age_category)

In [43]:
# Checking the values in the new column

data.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,18-29
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,30-39
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49


- Factors that usually affect total_income: education, income_type and age.

In [44]:
# Create a table without missing values and print a few of its rows to make sure it looks fine

data_without_missing_income = data.dropna(subset=['total_income'])
data_without_missing_income.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19193 entries, 0 to 21276
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          19193 non-null  int64  
 1   days_employed     19193 non-null  float64
 2   dob_years         19193 non-null  int64  
 3   education         19193 non-null  object 
 4   education_id      19193 non-null  int64  
 5   family_status     19193 non-null  object 
 6   family_status_id  19193 non-null  int64  
 7   gender            19193 non-null  object 
 8   income_type       19193 non-null  object 
 9   debt              19193 non-null  int64  
 10  total_income      19193 non-null  float64
 11  purpose           19193 non-null  object 
 12  age_category      19193 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


In [45]:
# Look at the mean values for income based on your identified factors
print(data_without_missing_income.groupby('age_category')['total_income'].mean().sort_values(ascending=False))
print()
print(data_without_missing_income.groupby('education')['total_income'].mean().sort_values(ascending=False))
print()
print(data_without_missing_income.groupby('income_type')['total_income'].mean().sort_values(ascending=False))

age_category
40-49    28561.286782
30-39    28310.443866
50-59    25797.812665
18-29    25541.253509
60-69    23251.113057
70+      20125.658331
Name: total_income, dtype: float64

education
bachelor's degree      33186.447659
some college           29060.742951
graduate degree        27960.024667
secondary education    24590.586912
primary education      21144.882211
Name: total_income, dtype: float64

income_type
entrepreneur                   79866.103000
business                       32414.290135
civil servant                  27350.649756
employee                       25819.737239
retiree                        21943.056865
unemployed                     21014.360500
student                        15712.260000
paternity / maternity leave     8612.661000
Name: total_income, dtype: float64


In [46]:
# Look at the median values for income based on your identified factors
print(data_without_missing_income.groupby('age_category')['total_income'].median().sort_values(ascending=False))
print()
print(data_without_missing_income.groupby('education')['total_income'].median().sort_values(ascending=False))
print()
print(data_without_missing_income.groupby('income_type')['total_income'].median().sort_values(ascending=False))

age_category
40-49    24770.8760
30-39    24679.9890
18-29    22749.3960
50-59    22197.5515
60-69    19817.4400
70+      18751.3240
Name: total_income, dtype: float64

education
bachelor's degree      28065.7400
some college           25664.1810
graduate degree        25161.5835
secondary education    21835.2490
primary education      18741.9760
Name: total_income, dtype: float64

income_type
entrepreneur                   79866.1030
business                       27577.2720
civil servant                  24076.1150
employee                       22816.1930
unemployed                     21014.3605
retiree                        18956.9340
student                        15712.2600
paternity / maternity leave     8612.6610
Name: total_income, dtype: float64


- The means and medians give broadly the same ranking for each factor I compared with total_income. Only adjacent groups swap places: 50-59 and 18-29 for age, and retiree and unemployed for income type.
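A compact way to make this kind of comparison is to compute both statistics in one table. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the project data), so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for data_without_missing_income
toy = pd.DataFrame({
    'education': ['bachelor', 'bachelor', 'bachelor',
                  'secondary', 'secondary', 'secondary'],
    'total_income': [20000.0, 30000.0, 100000.0, 15000.0, 20000.0, 25000.0],
})

# Mean and median per group, side by side
summary = toy.groupby('education')['total_income'].agg(['mean', 'median'])

# A mean well above the median hints at a right-skewed group
summary['mean_minus_median'] = summary['mean'] - summary['median']
print(summary)
```

A large positive gap between mean and median in a group is one sign that a few high incomes are pulling the mean up.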

In [1]:
# Checking if total_income has outliers to define whether to use mean or median: 

import plotly.express as px      
fig = px.histogram(data_without_missing_income, x='total_income')
fig.show()

# As we can see from this graph, total_income has quite a few outliers, so I will use the medians grouped by 'education',
# since this is a factor that strongly influences people's income.

NameError: name 'data_without_missing_income' is not defined

In [48]:
# Fill missing total_income values with the median income of the matching education group, then convert to integers

data['total_income'] = data['total_income'].fillna(data.education.map(dict_education_medians))
data['total_income'] = data['total_income'].astype('int')
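The same kind of group-median fill can also be written without a prebuilt dictionary, using `groupby(...).transform('median')`. This is a minimal sketch on made-up data (`toy` is hypothetical), not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing incomes
toy = pd.DataFrame({
    'education': ['bachelor', 'bachelor', 'bachelor',
                  'secondary', 'secondary', 'secondary'],
    'total_income': [30000.0, np.nan, 40000.0, 20000.0, 22000.0, np.nan],
})

# Median of each education group, broadcast back to every row of that group
group_median = toy.groupby('education')['total_income'].transform('median')

# Fill only the missing entries; observed values are left untouched
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

Because `transform` returns a Series aligned with the original index, no separate mapping step is needed.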

In [49]:
# Check if it works

print(data.loc[12])
print()
print(data.loc[21249])

children                              0
days_employed                       NaN
dob_years                            65
education           secondary education
education_id                          1
family_status         civil partnership
family_status_id                      1
gender                                M
income_type                     retiree
debt                                  0
total_income                      21835
purpose               to have a wedding
age_category                      60-69
Name: 12, dtype: object

children                              0
days_employed                       NaN
dob_years                            48
education             bachelor's degree
education_id                          0
family_status                   married
family_status_id                      0
gender                                F
income_type                    business
debt                                  0
total_income                      28065
purpose        

In [53]:
# Checking the number of entries in the columns

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21277 entries, 0 to 21276
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21277 non-null  int64  
 1   days_employed     19193 non-null  float64
 2   dob_years         21277 non-null  int64  
 3   education         21277 non-null  object 
 4   education_id      21277 non-null  int64  
 5   family_status     21277 non-null  object 
 6   family_status_id  21277 non-null  int64  
 7   gender            21277 non-null  object 
 8   income_type       21277 non-null  object 
 9   debt              21277 non-null  int64  
 10  total_income      21277 non-null  int64  
 11  purpose           21277 non-null  object 
 12  age_category      21277 non-null  object 
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ MB


###  Restoring values in `days_employed`

In [54]:
# Distribution of `days_employed` medians based on your identified parameters

data_without_missing_days = data.dropna(subset=['days_employed'])


import plotly.express as px      
fig = px.histogram(data_without_missing_days, x='days_employed')
fig.show()
# The histogram below shows two groups of values for days_employed: a group with a smaller number of days employed
# and a group with an unrealistically high number of days employed.


print(data_without_missing_days.groupby('age_category')['days_employed'].median().sort_values(ascending=False))
print()
print(data_without_missing_days.groupby('education')['days_employed'].median().sort_values(ascending=False))
print()
print(data_without_missing_days.groupby('income_type')['days_employed'].median().sort_values(ascending=False))
print()
data_without_missing_days['days_employed'].describe()

age_category
70+      361336.993449
60-69    354932.869424
50-59      4813.815685
40-49      2113.730191
30-39      1601.919871
18-29       999.095858
Name: days_employed, dtype: float64

education
graduate degree        5660.057032
primary education      3043.933615
secondary education    2394.069195
bachelor's degree      1896.866606
some college           1210.909349
Name: days_employed, dtype: float64

income_type
unemployed                     366413.652744
retiree                        365247.114512
paternity / maternity leave      3296.759962
civil servant                    2672.903939
employee                         1575.323578
business                         1554.621203
student                           578.751554
entrepreneur                      520.848083
Name: days_employed, dtype: float64



count     19193.000000
mean      66997.654188
std      139103.936185
min          24.141633
25%         927.789271
50%        2197.999009
75%        5555.632416
max      401755.400475
Name: days_employed, dtype: float64
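To see why the upper group is unrealistic, it helps to express the values in years. A minimal sketch, using three values copied from the summary above and assuming 365 days per year; the 60-year limit is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Minimum, median and maximum of days_employed, copied from the describe() output
days = pd.Series([24.141633, 2197.999009, 401755.400475],
                 index=['min', '50%', 'max'])

# Approximate conversion to years, ignoring leap years
years = days / 365
print(years.round(1))

# Flag values above a (hypothetical) plausibility limit of 60 working years
print(years > 60)
```

The maximum works out to more than a thousand years, so the upper group cannot be taken at face value.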

In [55]:
data_without_missing_days.head(10)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,40-49
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932,car purchase,30-39
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,30-39
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,30-39
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,50-59
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,purchase of the house,18-29
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484,housing transactions,40-49
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731,education,50-59
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337,having a wedding,30-39
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108,purchase of the house for my family,40-49


In [56]:
# Distribution of `days_employed` means based on your identified parameters

print(data_without_missing_days.groupby('age_category')['days_employed'].mean().sort_values(ascending=False))
print()
print(data_without_missing_days.groupby('education')['days_employed'].mean().sort_values(ascending=False))
print()
print(data_without_missing_days.groupby('income_type')['days_employed'].mean().sort_values(ascending=False))

age_category
70+      320819.151927
60-69    283776.694692
50-59    133107.241081
40-49     12361.679220
30-39      4162.648468
18-29      2084.576750
Name: days_employed, dtype: float64

education
primary education      130340.426349
graduate degree        121323.630206
secondary education     76481.124303
bachelor's degree       42447.427842
some college            20804.838909
Name: days_employed, dtype: float64

income_type
unemployed                     366413.652744
retiree                        365024.240512
civil servant                    3387.514221
paternity / maternity leave      3296.759962
employee                         2327.164891
business                         2117.879986
student                           578.751554
entrepreneur                      520.848083
Name: days_employed, dtype: float64


In [57]:
# Let's write a function that calculates means or medians (depending on your decision) based on your identified parameter

age_category_medians = data_without_missing_days.groupby('age_category')['days_employed'].median()
age_category_medians = age_category_medians.astype('int')
dict_age_category_medians=pd.Series(age_category_medians).to_dict()


data[data['days_employed'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,21835,to have a wedding,60-69
26,0,,41,secondary education,1,married,0,M,civil servant,0,21835,education,40-49
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,21835,building a real estate,60-69
41,0,,50,secondary education,1,married,0,F,civil servant,0,21835,second-hand car purchase,50-59
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,21835,to have a wedding,50-59
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21242,2,,47,secondary education,1,married,0,M,business,0,21835,purchase of a car,40-49
21247,1,,50,secondary education,1,civil partnership,1,F,employee,0,21835,wedding ceremony,50-59
21249,0,,48,bachelor's degree,0,married,0,F,business,0,28065,building a property,40-49
21254,1,,42,secondary education,1,married,0,F,employee,0,21835,building a real estate,40-49


In [58]:
# Fill missing days_employed values with the median of the matching age category

data['days_employed'] = data['days_employed'].fillna(data.age_category.map(dict_age_category_medians))

In [59]:
# Check that the missing values were filled

print(data.loc[12])
print()
print(data.loc[21249])

children                              0
days_employed                  354932.0
dob_years                            65
education           secondary education
education_id                          1
family_status         civil partnership
family_status_id                      1
gender                                M
income_type                     retiree
debt                                  0
total_income                      21835
purpose               to have a wedding
age_category                      60-69
Name: 12, dtype: object

children                              0
days_employed                    2113.0
dob_years                            48
education             bachelor's degree
education_id                          0
family_status                   married
family_status_id                      0
gender                                F
income_type                    business
debt                                  0
total_income                      28065
purpose        

In [60]:
# Convert days_employed to integers and check that no missing values remain

data['days_employed'] = data['days_employed'].astype('int')
data.info()
data.describe(include="all")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21277 entries, 0 to 21276
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21277 non-null  int64 
 1   days_employed     21277 non-null  int64 
 2   dob_years         21277 non-null  int64 
 3   education         21277 non-null  object
 4   education_id      21277 non-null  int64 
 5   family_status     21277 non-null  object
 6   family_status_id  21277 non-null  int64 
 7   gender            21277 non-null  object
 8   income_type       21277 non-null  object
 9   debt              21277 non-null  int64 
 10  total_income      21277 non-null  int64 
 11  purpose           21277 non-null  object
 12  age_category      21277 non-null  object
dtypes: int64(7), object(6)
memory usage: 2.1+ MB


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
count,21277.0,21277.0,21277.0,21277,21277.0,21277,21277.0,21277,21277,21277.0,21277.0,21277,21277
unique,,,,5,,5,,2,8,,,38,6
top,,,,secondary education,,married,,F,employee,,,wedding ceremony,30-39
freq,,,,15049,,12242,,14056,10987,,,785,5640
mean,0.475161,64722.35865,43.480707,,0.817643,,0.973164,,,0.081073,26471.53015,,
std,0.751764,136974.235504,12.24572,,0.549079,,1.421203,,,0.272954,15730.206725,,
min,0.0,24.0,19.0,,0.0,,0.0,,,0.0,3306.0,,
25%,0.0,999.0,33.0,,1.0,,0.0,,,0.0,17219.0,,
50%,0.0,2113.0,43.0,,1.0,,0.0,,,0.0,22586.0,,
75%,1.0,5135.0,53.0,,1.0,,1.0,,,0.0,31320.0,,


In [62]:
# Check the entries in all columns - make sure we fixed all missing values

In [63]:
data.describe(include='all')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
count,21277.0,21277.0,21277.0,21277,21277.0,21277,21277.0,21277,21277,21277.0,21277.0,21277,21277
unique,,,,5,,5,,2,8,,,38,6
top,,,,secondary education,,married,,F,employee,,,wedding ceremony,30-39
freq,,,,15049,,12242,,14056,10987,,,785,5640
mean,0.475161,64722.35865,43.480707,,0.817643,,0.973164,,,0.081073,26471.53015,,
std,0.751764,136974.235504,12.24572,,0.549079,,1.421203,,,0.272954,15730.206725,,
min,0.0,24.0,19.0,,0.0,,0.0,,,0.0,3306.0,,
25%,0.0,999.0,33.0,,1.0,,0.0,,,0.0,17219.0,,
50%,0.0,2113.0,43.0,,1.0,,0.0,,,0.0,22586.0,,
75%,1.0,5135.0,53.0,,1.0,,1.0,,,0.0,31320.0,,
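`describe()` reports counts, but a direct count of missing values per column is often easier to read. A minimal sketch on a made-up frame (`toy` is hypothetical, not the project data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    'days_employed': [1000.0, np.nan, 2500.0],
    'total_income': [21835, 28065, 25664],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only when no column has any missing value
no_missing = bool(toy.notna().all().all())
print(no_missing)
```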


## Categorization of data

[To answer the questions and test the hypotheses, you will want to work with categorized data. Look at the questions that were posed to you and that you should answer. Think about which of the data will need to be categorized to answer these questions. Below you will find a template through which you can work your way when categorizing data. The first step-by-step processing covers the text data; the second one addresses the numerical data that needs to be categorized. You can use both or none of the suggested instructions - it's up to you.]

[Regardless of how you decide to address the categorization, make sure to provide a clear explanation of why you made your decision. Remember: this is your work and you make all decisions in it.]


In [64]:
# Print the values for your selected data for categorization

# There is not much room for categorization other than in total_income and purpose:

 #   children - not a lot of unique values, not worth categorizing
 #   days_employed - not the most trustworthy, since we had to replace the NaNs with medians from data with too many outliers
 #   dob_years - already categorized in age_category
 #   education - not a lot of unique values, not worth categorizing
 #   education_id - not a lot of unique values, not worth categorizing
 #   family_status - not a lot of unique values, not worth categorizing
 #   family_status_id - not a lot of unique values, not worth categorizing
 #   gender - not a lot of unique values, not worth categorizing
 #   income_type - already categorical with only 8 unique values, not worth categorizing
 #   debt - not a lot of unique values, not worth categorizing
 #   total_income - this one is worth categorizing since it has a lot of different values
 #   purpose - this one is worth categorizing since it has a lot of different values
 #   age_category - is already a category
    
print(data['total_income'].value_counts())

import plotly.express as px      
fig = px.histogram(data, x='total_income')
fig.show()

21835    1468
28065     529
25664      69
18741      22
19552       7
         ... 
12160       1
14211       1
6023        1
18317       1
52973       1
Name: total_income, Length: 15293, dtype: int64


In [65]:
# Check the unique values of total_income
print(data['total_income'].unique())

[40620 17932 23341 ... 24618 35966 39054]


[What main groups can you identify based on the unique values?]

- $0-$40000
- $40001-$80000
- $80001+

[Based on these themes, we will probably want to categorize our data.]


In [66]:
# Let's write a function to categorize the data based on common topics

def income_category(x):
    """Return the income bracket label for a total_income value."""
    if x <= 40000:
        return '$0-$40000'
    if 40000 < x <= 80000:
        return '$40001-$80000'
    if x > 80000:
        return '$80000+'

In [67]:
# Create a column with the categories and count the values for them

data['income_category'] = data['total_income'].apply(income_category)
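An alternative to a hand-written function is `pd.cut`, which assigns each value to a bin in one call. The sketch below is an illustration on a few made-up incomes, reusing the same labels; it is not the code used in this notebook. Note one difference: with these bin edges a value of exactly 0 would fall outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical incomes chosen to sit on the bracket boundaries
incomes = pd.Series([3306, 40000, 40001, 80000, 80001])

# right=True (the default) makes each bin include its upper edge: (lower, upper]
income_bins = pd.cut(
    incomes,
    bins=[0, 40000, 80000, float('inf')],
    labels=['$0-$40000', '$40001-$80000', '$80000+'],
)
print(income_bins.tolist())
```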

In [68]:
# Looking through all the numerical data in your selected column for categorization
data.head(20)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,income_category
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,40-49,$40001-$80000
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase,30-39,$0-$40000
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,30-39,$0-$40000
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,30-39,$40001-$80000
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,50-59,$0-$40000
5,0,926,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,purchase of the house,18-29,$40001-$80000
6,0,2879,43,bachelor's degree,0,married,0,F,business,0,38484,housing transactions,40-49,$0-$40000
7,0,152,50,secondary education,1,married,0,M,employee,0,21731,education,50-59,$0-$40000
8,2,6929,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337,having a wedding,30-39,$0-$40000
9,0,2188,41,secondary education,1,married,0,M,employee,0,23108,purchase of the house for my family,40-49,$0-$40000


In [69]:
# Check the unique values of purpose

print(data['purpose'].value_counts())

wedding ceremony                            785
having a wedding                            760
to have a wedding                           756
real estate transactions                    671
buy commercial real estate                  655
buying property for renting out             648
transactions with commercial real estate    644
housing transactions                        642
purchase of the house                       637
housing                                     636
purchase of the house for my family         636
property                                    628
construction of own property                627
transactions with my real estate            626
building a real estate                      620
building a property                         619
purchase of my own house                    619
buy real estate                             615
housing renovation                          603
buy residential real estate                 601
buying my own car                       

In [70]:
# Let's write a function to categorize the data based on common topics

purpose_dict = {
    "wedding":"Wedding", 
    "car":"Car", 
    "cars":"Car",
    "education": "Education", 
    "educated": "Education",
    "university": "Education", 
    "property": "Housing", 
    "house": "Housing", 
    "real estate": "Housing",  
    "housing": "Housing"}  

data['purpose_category'] = (data['purpose'].str.extract(fr"\b({'|'.join(purpose_dict.keys())})\b")[0].map(purpose_dict))
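To check how the keyword extraction behaves, it can be run on a handful of strings. The sketch below reuses the same dictionary and pattern on four purposes that appear in this notebook plus one made-up value (`holiday trip`) that matches no keyword, to show how unmatched rows can be found.

```python
import pandas as pd

purpose_dict = {
    'wedding': 'Wedding',
    'car': 'Car',
    'cars': 'Car',
    'education': 'Education',
    'educated': 'Education',
    'university': 'Education',
    'property': 'Housing',
    'house': 'Housing',
    'real estate': 'Housing',
    'housing': 'Housing',
}

# Four purposes from the dataset and one hypothetical unmatched value
purposes = pd.Series(['to have a wedding', 'buying my own car', 'to become educated',
                      'buy commercial real estate', 'holiday trip'])

# Whole-word match on any keyword; the capture group returns the matched keyword
pattern = fr"\b({'|'.join(purpose_dict.keys())})\b"
keyword = purposes.str.extract(pattern)[0]  # NaN where nothing matched
category = keyword.map(purpose_dict)

print(category.tolist())
# Purposes that were not assigned to any category
print(purposes[category.isna()].tolist())
```

Running `data['purpose_category'].isna().sum()` on the real column would confirm in the same way that every purpose received a category.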

In [71]:
# Look at the last rows to check the new purpose_category column

data.tail(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,income_category,purpose_category
21267,1,467,28,secondary education,1,married,0,F,employee,1,17517,to become educated,18-29,$0-$40000,Education
21268,0,914,42,bachelor's degree,0,married,0,F,business,0,51649,purchase of my own house,40-49,$40001-$80000,Housing
21269,0,404,42,bachelor's degree,0,civil partnership,1,F,business,0,28489,buying my own car,40-49,$0-$40000,Car
21270,0,373995,59,secondary education,1,married,0,F,retiree,0,24618,purchase of a car,50-59,$0-$40000,Car
21271,1,2351,37,graduate degree,4,divorced,3,M,employee,0,18551,buy commercial real estate,30-39,$0-$40000,Housing
21272,1,4529,43,secondary education,1,civil partnership,1,F,business,0,35966,housing transactions,40-49,$0-$40000,Housing
21273,0,343937,67,secondary education,1,married,0,F,retiree,0,24959,purchase of a car,60-69,$0-$40000,Car
21274,1,2113,38,secondary education,1,civil partnership,1,M,employee,1,14347,property,30-39,$0-$40000,Housing
21275,3,3112,38,secondary education,1,married,0,M,employee,1,39054,buying my own car,30-39,$0-$40000,Car
21276,2,1984,40,secondary education,1,married,0,F,employee,0,13127,to buy a car,40-49,$0-$40000,Car


[Decide what ranges you will use for grouping and explain why.]

In [74]:
# Summarize each column to see the distribution of the new categories
data.describe(include='all')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,income_category,purpose_category
count,21277.0,21277.0,21277.0,21277,21277.0,21277,21277.0,21277,21277,21277.0,21277.0,21277,21277,21277,21277
unique,,,,5,,5,,2,8,,,38,6,3,4
top,,,,secondary education,,married,,F,employee,,,wedding ceremony,30-39,$0-$40000,Housing
freq,,,,15049,,12242,,14056,10987,,,785,5640,18489,10727
mean,0.475161,64722.35865,43.480707,,0.817643,,0.973164,,,0.081073,26471.53015,,,,
std,0.751764,136974.235504,12.24572,,0.549079,,1.421203,,,0.272954,15730.206725,,,,
min,0.0,24.0,19.0,,0.0,,0.0,,,0.0,3306.0,,,,
25%,0.0,999.0,33.0,,1.0,,0.0,,,0.0,17219.0,,,,
50%,0.0,2113.0,43.0,,1.0,,0.0,,,0.0,22586.0,,,,
75%,1.0,5135.0,53.0,,1.0,,1.0,,,0.0,31320.0,,,,


## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?** 

Yes. Customers without children have the lowest default rate (7.55%), while customers with one or two children default more often (9.13% and 9.51%). In other words, people with children have a higher chance of not paying on time.

In [75]:
# Check the children data and paying back on time

print(data.groupby(['children', 'debt']).agg({'debt':'count'}))
print()
print(data.groupby('children')['children'].count())
print()

# Calculating default-rate based on the number of children

default_number = data['debt'].mean()
print(f'General default rate: {default_number:.2%}')
print()

total_0_child = data.loc[data['children']==0, 'debt'].count()
total_1_child = data.loc[data['children']==1, 'debt'].count()
total_2_child = data.loc[data['children']==2, 'debt'].count()
total_3_child = data.loc[data['children']==3, 'debt'].count()
total_4_child = data.loc[data['children']==4, 'debt'].count()
total_5_child = data.loc[data['children']==5, 'debt'].count()

default_0_child=((data.debt == 1)&(data.children == 0)).sum()/total_0_child
default_1_child=((data.debt == 1)&(data.children == 1)).sum()/total_1_child
default_2_child=((data.debt == 1)&(data.children == 2)).sum()/total_2_child
default_3_child=((data.debt == 1)&(data.children == 3)).sum()/total_3_child
default_4_child=((data.debt == 1)&(data.children == 4)).sum()/total_4_child
default_5_child=((data.debt == 1)&(data.children == 5)).sum()/total_5_child

print(f'Default rate 0 children: {default_0_child:.2%}')
print(f'Default rate 1 children: {default_1_child:.2%}')
print(f'Default rate 2 children: {default_2_child:.2%}')
print(f'Default rate 3 children: {default_3_child:.2%}')
print(f'Default rate 4 children: {default_4_child:.2%}')
print(f'Default rate 5 children: {default_5_child:.2%}')                                                         

                debt
children debt       
0        0     12963
         1      1058
1        0      4397
         1       442
2        0      1845
         1       194
3        0       301
         1        27
4        0        37
         1         4
5        0         9

children
0    14021
1     4839
2     2039
3      328
4       41
5        9
Name: children, dtype: int64

General default rate: 8.11%

Default rate 0 children: 7.55%
Default rate 1 children: 9.13%
Default rate 2 children: 9.51%
Default rate 3 children: 8.23%
Default rate 4 children: 9.76%
Default rate 5 children: 0.00%


**Conclusion**

[Write your conclusions based on your manipulations and observations.]

- People with 0 children default the least (7.55%).
- People with 1 or 2 children default more often (9.13% and 9.51%). With 3 children the rate is 8.23%, so the increase is not strictly monotonic.
- Data on people with 4 and 5 children is not sufficient to reach conclusions, as only 41 and 9 customers fall into these categories.
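One way to make "not sufficient" concrete is to put an approximate confidence interval around each default rate. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95%); this is my own illustration and not part of the original analysis, and `wilson_interval` is a hypothetical helper name. The counts are copied from the table above.

```python
import math

def wilson_interval(defaults, total, z=1.96):
    """Approximate Wilson score interval for a default rate."""
    p = defaults / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to the valid range of a proportion
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# (defaults, customers) per number of children, copied from the output above
groups = {0: (1058, 14021), 3: (27, 328), 4: (4, 41), 5: (0, 9)}
for children, (defaults, total) in groups.items():
    low, high = wilson_interval(defaults, total)
    print(f'{children} children: {defaults / total:.2%} (roughly {low:.2%} to {high:.2%})')
```

The intervals for the groups with 4 and 5 children come out far wider than the one for customers without children, which is the numerical counterpart of the caution expressed in the last bullet.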



In [76]:
pivot_1 = data.pivot_table(index='children', values='debt', aggfunc=['count', 'sum', 'mean'])

display(pivot_1)

Unnamed: 0_level_0,count,sum,mean
Unnamed: 0_level_1,debt,debt,debt
children,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,14021,1058,0.075458
1,4839,442,0.091341
2,2039,194,0.095145
3,328,27,0.082317
4,41,4,0.097561
5,9,0,0.0


**Is there a correlation between family status and paying back on time?**

In [77]:
# Check the family status data and paying back on time

print(data.groupby(['family_status', 'debt']).agg({'debt':'count'}))
print()
print(data.groupby('family_status')['family_status'].count())
print()

total_civil_partnership = data.loc[data['family_status']=='civil partnership', 'debt'].count()
total_divorced = data.loc[data['family_status']=='divorced', 'debt'].count()
total_married = data.loc[data['family_status']=='married', 'debt'].count()
total_unmarried = data.loc[data['family_status']=='unmarried', 'debt'].count()
total_widow = data.loc[data['family_status']=="widow / widower", 'debt'].count()

# Calculating default-rate based on family status

default_civil_partnership=((data.debt == 1)&(data.family_status == 'civil partnership')).sum()/total_civil_partnership
default_divorced=((data.debt == 1)&(data.family_status == 'divorced')).sum()/total_divorced
default_married=((data.debt == 1)&(data.family_status == 'married')).sum()/total_married
default_unmarried=((data.debt == 1)&(data.family_status == 'unmarried')).sum()/total_unmarried
default_widow=((data.debt == 1)&(data.family_status == 'widow / widower')).sum()/total_widow

print(f'Default rate civil partnership: {default_civil_partnership:.2%}')
print(f'Default rate divorced: {default_divorced:.2%}')
print(f'Default rate married: {default_married:.2%}')
print(f'Default rate unmarried: {default_unmarried:.2%}')
print(f'Default rate widows: {default_widow:.2%}')


                         debt
family_status     debt       
civil partnership 0      3734
                  1       383
divorced          0      1099
                  1        84
married           0     11318
                  1       924
unmarried         0      2513
                  1       272
widow / widower   0       888
                  1        62

family_status
civil partnership     4117
divorced              1183
married              12242
unmarried             2785
widow / widower        950
Name: family_status, dtype: int64

Default rate civil partnership: 9.30%
Default rate divorced: 7.10%
Default rate married: 7.55%
Default rate unmarried: 9.77%
Default rate widows: 6.53%


**Conclusion**

[Write your conclusions based on your manipulations and observations.]

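- People who are unmarried (9.77%) or in civil partnerships (9.30%) have the highest default rates.
- Divorced (7.10%) and married (7.55%) people have lower default rates.
- Widows and widowers have the lowest rate (6.53%), but this is the smallest group (950 customers), so we may not have enough data to reach conclusions.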


**Is there a correlation between income level and paying back on time?**

In [78]:
# Check the income level data and paying back on time

print(data.groupby(['income_category', 'debt']).agg({'debt':'count'}))
print()
print(data.groupby('income_category')['income_category'].count())
print()

total_lowest_tier = data.loc[data['income_category']=='$0-$40000', 'debt'].count()
total_middle_tier = data.loc[data['income_category']=='$40001-$80000', 'debt'].count()
total_highest_tier = data.loc[data['income_category']=='$80000+', 'debt'].count()

# Calculating default-rate based on income level

default_lowest_tier=((data.debt == 1)&(data.income_category == "$0-$40000")).sum()/total_lowest_tier
default_middle_tier=((data.debt == 1)&(data.income_category == '$40001-$80000')).sum()/total_middle_tier
default_highest_tier=((data.debt == 1)&(data.income_category == '$80000+')).sum()/total_highest_tier

print(f'Default rate lowest tier income_level: {default_lowest_tier:.2%}')
print(f'Default rate middle tier income_level: {default_middle_tier:.2%}')
print(f'Default rate highest tier income_level: {default_highest_tier:.2%}')

                       debt
income_category debt       
$0-$40000       0     16957
                1      1532
$40001-$80000   0      2387
                1       179
$80000+         0       208
                1        14

income_category
$0-$40000        18489
$40001-$80000     2566
$80000+            222
Name: income_category, dtype: int64

Default rate lowest tier income_level: 8.29%
Default rate middle tier income_level: 6.98%
Default rate highest tier income_level: 6.31%


**Conclusion**

[Write your conclusions based on your manipulations and observations.]

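- The lower the income, the higher the default rate: 8.29% for the lowest tier, 6.98% for the middle tier and 6.31% for the highest tier.
- For the highest income tier we have only 222 customers, which is not enough to reach firm conclusions, but the result is in line with the conclusion above.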

**How does credit purpose affect the default rate?**

In [79]:
# Check the percentages for default rate for each credit purpose and analyze them

display(data.groupby('purpose_category')['debt'].agg(Count='count', Sum='sum', Mean = 'mean'))
print()

total_Car = data.loc[data['purpose_category']=='Car', 'debt'].count()
total_Education = data.loc[data['purpose_category']=='Education', 'debt'].count()
total_Housing = data.loc[data['purpose_category']=='Housing', 'debt'].count()
total_Wedding = data.loc[data['purpose_category']=='Wedding', 'debt'].count()

# Calculating default-rate based on purpose

default_Car=((data.debt == 1)&(data.purpose_category == "Car")).sum()/total_Car
default_Education=((data.debt == 1)&(data.purpose_category == 'Education')).sum()/total_Education
default_Housing=((data.debt == 1)&(data.purpose_category == 'Housing')).sum()/total_Housing
default_Wedding=((data.debt == 1)&(data.purpose_category == 'Wedding')).sum()/total_Wedding



print(f'Default rate for car purposes: {default_Car:.2%}')
print(f'Default rate for education purposes: {default_Education:.2%}')
print(f'Default rate for housing purposes: {default_Housing:.2%}')
print(f'Default rate for wedding purposes: {default_Wedding:.2%}')

Unnamed: 0_level_0,Count,Sum,Mean
purpose_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Car,4269,398,0.09323
Education,3980,369,0.092714
Housing,10727,777,0.072434
Wedding,2301,181,0.078661



Default rate for car purposes: 9.32%
Default rate for education purposes: 9.27%
Default rate for housing purposes: 7.24%
Default rate for wedding purposes: 7.87%


**Conclusion**

[Write your conclusions based on your manipulations and observations.]

- Loans taken to buy a car have the highest default rate (9.32%).
- They are followed by education (9.27%), wedding (7.87%) and housing (7.24%) loans.
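The four checks in this section repeat the same calculation for different columns. A small helper can produce the counts and default rate for any grouping column in one call. This is a sketch under the assumption that `debt` is coded 0/1 as in this dataset; `default_rate_by` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

def default_rate_by(df, column):
    """Count customers, defaults and the default rate for each value of `column`."""
    table = df.groupby(column)['debt'].agg(
        customers='count', defaults='sum', default_rate='mean'
    )
    return table.sort_values('default_rate', ascending=False)

# Hypothetical toy data; in the notebook the helper would be called on `data`
toy = pd.DataFrame({
    'purpose_category': ['Car', 'Car', 'Car', 'Car',
                         'Housing', 'Housing', 'Housing', 'Housing'],
    'debt': [1, 1, 0, 0, 1, 0, 0, 0],
})
rates = default_rate_by(toy, 'purpose_category')
print(rates)
```

Called as `default_rate_by(data, 'children')` or `default_rate_by(data, 'family_status')`, the same helper would cover the other hypotheses without repeating the per-group variables.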



# General Conclusion 

- I filled in the approximately 10% of values that were missing in the `total_income` and `days_employed` columns, using the median income of customers with the same education level and the median days employed of customers in the same age category.
- About 0.003% of data was duplicate, and though we dealt with this by eliminating those rows, we have no way of knowing whether they belonged to the same people, as there are no UID or name columns here.
- To avoid having so many missing values, it would be good to make the important questions mandatory for those who fill out the forms.
- For the bank, my recommendations are the following:
    - People who are unmarried or in civil partnerships have a higher default rate.
    - Divorced and married people have a lower default rate; these are a safer bet.
    - If possible, approve loans to people with an income of $40001 or higher.
    - I recommend being careful when lending for car purposes, while housing loans are the safest for the bank.

