## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit score** for a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [679]:
import pandas as pd

credit_scoring = pd.read_csv('https://code.s3.yandex.net/datasets/credit_scoring_eng.csv')
credit_scoring.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


#### Take a look at the first 5 rows of the data

In [680]:
credit_scoring.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


#### Use value_counts(), unique(), and describe() to examine the column values

In [681]:
credit_scoring['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [682]:
credit_scoring['days_employed'].value_counts()

-986.927316     1
-7026.359174    1
-4236.274243    1
-6620.396473    1
-1238.560080    1
               ..
-2849.351119    1
-5619.328204    1
-448.829898     1
-1687.038672    1
-582.538413     1
Name: days_employed, Length: 19351, dtype: int64

In [683]:
credit_scoring['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51,  0, 59, 29, 60, 55, 58, 71, 22, 73,
       66, 69, 19, 72, 70, 74, 75])

In [684]:
credit_scoring['family_status'].value_counts()

married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

In [685]:
credit_scoring['family_status_id'].value_counts()

0    12380
1     4177
4     2813
3     1195
2      960
Name: family_status_id, dtype: int64

In [686]:
credit_scoring['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [687]:
credit_scoring['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [688]:
credit_scoring['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

In [689]:
credit_scoring['education_id'].value_counts()

1    15233
0     5260
2      744
3      282
4        6
Name: education_id, dtype: int64

In [690]:
credit_scoring['debt'].value_counts()

0    19784
1     1741
Name: debt, dtype: int64

In [691]:
credit_scoring['total_income'].describe()

count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64

In [692]:
credit_scoring['purpose'].value_counts()

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
purchase of the house                       647
housing                                     647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

#### Use the isna() and sum() methods to find out how many null values are in the dataset

In [693]:
credit_scoring.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64
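
The counts above can also be read as proportions: 2174 missing values out of 21525 rows is roughly 10.1% per affected column. A minimal sketch of that calculation, assuming a small toy frame (`toy`, with made-up values) in place of `credit_scoring`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical values, not the bank's data)
toy = pd.DataFrame({
    'days_employed': [-8437.7, np.nan, -5623.4, 340266.1, np.nan],
    'total_income': [40620.1, np.nan, 23341.8, 25378.6, np.nan],
    'debt': [0, 0, 1, 0, 0],
})

# isna() yields booleans, so the column mean is the fraction of missing values
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data the same expression would be `credit_scoring.isna().mean()`.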

### Conclusion

A quick look at this dataset reveals a number of issues that will need to be fixed before any further analysis:

 1. 'days_employed' and 'total_income' have missing values. The proportion of missing values in each column is slightly over 10% (2174 of 21525 rows), which is not negligible. We will need to look further into the details.

 2. The 'children' column contains 47 values of -1 and 76 values of 20, which are likely to be errors.

 3. The 'dob_years' (age) column contains the value 0, which is likely to be an error.

 4. The 'gender' column contains one 'XNA' value, whose meaning is unclear.

 5. 'days_employed' also has negative values. In the first five rows, all values are negative except one, and there is no indication that these customers are unemployed. This could be a mistake and needs to be further investigated.

 6. In the 'purpose' column, many values express the same purpose in different words and will need to be consolidated.

 7. The 'education' column mixes lower-case and upper-case spellings of the same values. This makes analysis more difficult, so the values need to be converted to lower case.

 8. Some of the column names need to be changed to reflect the column values more accurately: 'children' to 'number_of_children', 'dob_years' to 'age', 'education' to 'education_level', 'family_status' to 'marital_status', 'total_income' to 'monthly_income', 'debt' to 'debt_status', and 'purpose' to 'loan_purpose'.
 
 
In the steps that follow, the above issues will be addressed. 

### Step 2. Data preprocessing

### Processing missing values

**Let's proceed step by step: first fix the likely errors in the 'children', 'dob_years' (age), and 'gender' columns, then address the missing values in the 'days_employed' and 'total_income' columns. To begin, let's look at the rows that have 'children' equal to -1.**


In [694]:
credit_scoring[credit_scoring['children'] == -1]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
291,-1,-4417.703588,46,secondary education,1,civil partnership,1,F,employee,0,16450.615,profile education
705,-1,-902.084528,50,secondary education,1,married,0,F,civil servant,0,22061.264,car purchase
742,-1,-3174.456205,57,secondary education,1,married,0,F,employee,0,10282.887,supplementary education
800,-1,349987.852217,54,secondary education,1,unmarried,4,F,retiree,0,13806.996,supplementary education
941,-1,,57,Secondary Education,1,married,0,F,retiree,0,,buying my own car
1363,-1,-1195.264956,55,SECONDARY EDUCATION,1,married,0,F,business,0,11128.112,profile education
1929,-1,-1461.303336,38,secondary education,1,unmarried,4,M,employee,0,17459.451,purchase of the house
2073,-1,-2539.761232,42,secondary education,1,divorced,3,F,business,0,26022.177,purchase of the house
3814,-1,-3045.290443,26,Secondary Education,1,civil partnership,1,F,civil servant,0,21102.846,having a wedding
4201,-1,-901.101738,41,secondary education,1,married,0,F,civil servant,0,36220.123,transactions with my real estate



**The family status and age values in these rows are mixed, and nothing in particular stands out. Ideally we would check with whoever provided the data to find out more about these errors. Since that is not possible here, let's treat them as entry errors and assume the intended value is 1. We'll make the change and check the column again.**



In [695]:
credit_scoring.loc[credit_scoring['children'] == -1, 'children'] = 1

In [696]:
credit_scoring['children'].value_counts()

0     14149
1      4865
2      2055
3       330
20       76
4        41
5         9
Name: children, dtype: int64

**Now let's look at the rows that have 20 in 'children'.** 

In [697]:
credit_scoring[credit_scoring['children'] == 20]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
606,20,-880.221113,21,secondary education,1,married,0,M,business,0,23253.578,purchase of the house
720,20,-855.595512,44,secondary education,1,married,0,F,business,0,18079.798,buy real estate
1074,20,-3310.411598,56,secondary education,1,married,0,F,employee,1,36722.966,getting an education
2510,20,-2714.161249,59,bachelor's degree,0,widow / widower,2,F,employee,0,42315.974,transactions with commercial real estate
2941,20,-2161.591519,0,secondary education,1,married,0,F,employee,0,31958.391,to buy a car
...,...,...,...,...,...,...,...,...,...,...,...,...
21008,20,-1240.257910,40,secondary education,1,married,0,F,employee,1,21363.842,to own a car
21325,20,-601.174883,37,secondary education,1,married,0,F,business,0,16477.771,profile education
21390,20,,53,secondary education,1,married,0,M,business,0,,buy residential real estate
21404,20,-494.788448,52,secondary education,1,married,0,M,business,0,25060.749,transactions with my real estate


**A value of 20 children is almost certainly an entry error; we assume the intended value is 2. Let's fix that now and check the column values again.**

In [698]:
credit_scoring.loc[credit_scoring['children'] == 20, 'children'] = 2

In [699]:
credit_scoring['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64
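
The two corrections above (-1 to 1 and 20 to 2) could also be written as a single value mapping. A small sketch on a toy Series; `children` and `fixes` are illustrative names, and the mapping encodes the same assumptions made in the text:

```python
import pandas as pd

# Toy column containing both suspicious values
children = pd.Series([0, 1, -1, 20, 2, 3, 20])

# Assumed corrections: -1 is a mistyped 1, 20 is a mistyped 2
fixes = {-1: 1, 20: 2}

# replace() swaps only the listed values and leaves everything else untouched
children_fixed = children.replace(fixes)
print(children_fixed.tolist())
```

Keeping the mapping in one dictionary makes the assumptions easy to review or revise later.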

**Column 'children' looks good now. Next, let's find the rows where 'dob_years' (age) is 0.** 

In [700]:
credit_scoring[credit_scoring['dob_years'] == 0]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
99,0,346541.618895,0,Secondary Education,1,married,0,F,retiree,0,11406.644,car
149,0,-2664.273168,0,secondary education,1,divorced,3,F,employee,0,11228.230,housing transactions
270,3,-1872.663186,0,secondary education,1,married,0,F,employee,0,16346.633,housing renovation
578,0,397856.565013,0,secondary education,1,married,0,F,retiree,0,15619.310,construction of own property
1040,0,-1158.029561,0,bachelor's degree,0,divorced,3,F,business,0,48639.062,to own a car
...,...,...,...,...,...,...,...,...,...,...,...,...
19829,0,,0,secondary education,1,married,0,F,employee,0,,housing
20462,0,338734.868540,0,secondary education,1,married,0,F,retiree,0,41471.027,purchase of my own house
20577,0,331741.271455,0,secondary education,1,unmarried,4,F,retiree,0,20766.202,property
21179,2,-108.967042,0,bachelor's degree,0,married,0,M,business,0,38512.321,building a real estate


**We'll have to decide what to do with the 0 in these 101 rows. Age could be a crucial factor in a person's credit score. Several columns could give some hint about a person's age, such as 'days_employed', 'family_status', and 'income_type'. Among these, 'income_type' separates retirees from everyone else. It is not an ideal predictor of age, but it will do for now.**

**We will find out which 'income_type' groups contain the 0 value in 'dob_years', calculate the mean age for each of those groups, and replace 0 with that mean. The result is converted to an integer to stay consistent with the column's data type.**

In [701]:
credit_scoring[credit_scoring['dob_years'] == 0].groupby('income_type')['income_type'].count()

income_type
business         20
civil servant     6
employee         55
retiree          20
Name: income_type, dtype: int64

In [702]:
# Mean age per income type, truncated to whole years.
# Note: rows with dob_years == 0 are still included in these means.
age_mean_business = int(credit_scoring[credit_scoring['income_type'] ==
                                       'business']['dob_years'].mean())
age_mean_civil = int(credit_scoring[credit_scoring['income_type'] ==
                                    'civil servant']['dob_years'].mean())
age_mean_employee = int(credit_scoring[credit_scoring['income_type'] ==
                                       'employee']['dob_years'].mean())
age_mean_retiree = int(credit_scoring[credit_scoring['income_type'] ==
                                      'retiree']['dob_years'].mean())

**Check these values:**

In [703]:
print(age_mean_business, age_mean_civil, age_mean_employee, age_mean_retiree)

39 40 39 59


**Replace the 0 values in 'dob_years' with the mean age of the corresponding income type.**

In [704]:
credit_scoring.loc[(credit_scoring['dob_years'] == 0) & (
    credit_scoring['income_type'] == 'business'), 'dob_years'] = age_mean_business

credit_scoring.loc[(credit_scoring['dob_years'] == 0) & (
    credit_scoring['income_type'] == 'civil servant'), 'dob_years'] = age_mean_civil

credit_scoring.loc[(credit_scoring['dob_years'] == 0) & (
    credit_scoring['income_type'] == 'employee'), 'dob_years'] = age_mean_employee

credit_scoring.loc[(credit_scoring['dob_years'] == 0) & (
    credit_scoring['income_type'] == 'retiree'), 'dob_years'] = age_mean_retiree
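
The same group-wise replacement can be sketched more compactly with `groupby(...).transform`. This is an alternative, not the method used in the cells above: it runs on a toy frame with made-up ages, and it treats 0 as missing before averaging so that the zeros do not pull the group means down.

```python
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical values)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee',
                    'retiree', 'retiree', 'retiree'],
    'dob_years': [30, 40, 0, 60, 0, 66],
})

# Turn the placeholder 0 into NaN so it is ignored when averaging
age = toy['dob_years'].where(toy['dob_years'] != 0)

# Mean age per income type, broadcast back to every row of that group
group_mean = age.groupby(toy['income_type']).transform('mean')

# Fill the gaps, then truncate to whole years to keep an integer column
toy['dob_years'] = age.fillna(group_mean).astype(int)
print(toy)
```

With roughly 100 zeros among more than 21,000 rows, the difference between including and excluding them is small, but excluding them avoids a systematic downward bias.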

**Now check the 'dob_years' column using unique() again.** 

In [705]:
credit_scoring['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51, 59, 29, 60, 55, 58, 71, 22, 73, 66,
       69, 19, 72, 70, 74, 75])

**All the 0 values are gone. Now, let's fix the 'gender' column.**

In [706]:
credit_scoring['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [707]:
credit_scoring[credit_scoring['gender'] == 'XNA']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,-2358.600502,24,some college,2,civil partnership,1,XNA,business,0,32624.825,buy real estate


**There is only one row, and there is no indication from other columns as to what the gender could be. Therefore, let's remove it and check the gender column again.** 

In [708]:
credit_scoring = credit_scoring.drop(index = 10701).reset_index(drop = True)
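
The cell above drops the row by its index label, which works because the label is known from the previous output. A sketch of an equivalent filter that does not depend on a hard-coded index, shown on a toy frame with made-up values:

```python
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical values)
toy = pd.DataFrame({
    'gender': ['F', 'M', 'XNA', 'F'],
    'debt': [0, 1, 0, 0],
})

# Keep every row whose gender is not the placeholder, then renumber the index
toy = toy[toy['gender'] != 'XNA'].reset_index(drop=True)
print(toy)
```

Filtering by value keeps working if the data is reloaded in a different order or if more placeholder rows appear.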

In [709]:
credit_scoring['gender'].value_counts()

F    14236
M     7288
Name: gender, dtype: int64

**The 'gender' column is fixed. Now let's address the missing values in 'days_employed' and 'total_income'. First, let's look at these rows to see whether they are related or show any underlying pattern.**

In [710]:
credit_scoring[(credit_scoring['days_employed'].isna()) & 
               (credit_scoring['total_income'].isna())].count()

children            2174
days_employed          0
dob_years           2174
education           2174
education_id        2174
family_status       2174
family_status_id    2174
gender              2174
income_type         2174
debt                2174
total_income           0
purpose             2174
dtype: int64
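
The count above shows 2174 rows where both columns are missing, which matches the 2174 missing values found in each column separately, so the gaps must coincide. A more direct way to check this is to compare the two missing-value masks row by row. A minimal sketch on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical values)
toy = pd.DataFrame({
    'days_employed': [-100.5, np.nan, 2000.0, np.nan],
    'total_income': [15000.0, np.nan, 22000.0, np.nan],
})

days_missing = toy['days_employed'].isna()
income_missing = toy['total_income'].isna()

# True only if every row is missing in both columns or in neither
same_rows = bool((days_missing == income_missing).all())
print(same_rows)
```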

**Both columns have missing values in exactly the same rows. Now let's look at how these missing values are distributed across 'income_type'.**

In [711]:
credit_scoring[credit_scoring['total_income'].isna()]['income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64
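
Raw counts mostly reflect group size, so it helps to compare them with the group totals from Step 1: in each of the four large groups (employee, business, retiree, civil servant) roughly 10 to 11 percent of rows are missing. A sketch of how the per-group share could be computed directly, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical values)
toy = pd.DataFrame({
    'income_type': ['employee'] * 4 + ['retiree'] * 2,
    'total_income': [20000.0, np.nan, 25000.0, 30000.0, np.nan, 18000.0],
})

# Fraction of rows with a missing income within each income type
missing_rate = toy['total_income'].isna().groupby(toy['income_type']).mean()
print(missing_rate)
```

Similar rates across groups suggest that the missing values are not concentrated in one income type.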

**Nothing in particular stands out. Before we decide what to do with the missing values, let's look at some of these rows using head().**

In [712]:
credit_scoring[credit_scoring['total_income'].isna()].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding


**Again, there is nothing particularly alarming about these entries. A decision has to be made without any further information. To minimize the impact on the overall pattern of the data, let's replace the missing values with the mean or median values grouped by 'income_type', because total income is most likely to be closely related to income type. The missing values in 'days_employed' are filled in the same way for consistency.**

**First, let's find out the median and mean of 'days_employed' and 'total_income' for each of the income types that have missing values.** 

In [713]:
print(credit_scoring.groupby('income_type')['days_employed'].mean())
print(credit_scoring.groupby('income_type')['days_employed'].median())
print(credit_scoring.groupby('income_type')['total_income'].mean())
print(credit_scoring.groupby('income_type')['total_income'].median())

income_type
business                        -2111.470404
civil servant                   -3399.896902
employee                        -2326.499216
entrepreneur                     -520.848083
paternity / maternity leave     -3296.759962
retiree                        365003.491245
student                          -578.751554
unemployed                     366413.652744
Name: days_employed, dtype: float64
income_type
business                        -1546.333214
civil servant                   -2689.368353
employee                        -1574.202821
entrepreneur                     -520.848083
paternity / maternity leave     -3296.759962
retiree                        365213.306266
student                          -578.751554
unemployed                     366413.652744
Name: days_employed, dtype: float64
income_type
business                       32386.741818
civil servant                  27343.729582
employee                       25820.841683
entrepreneur                   79866.103

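**Within each group the mean and median are of similar magnitude, although 'days_employed' shows some skew (for 'business', a mean of about -2111 against a median of about -1546). Because the median is less sensitive to extreme values, let's use the median values to replace the missing values.**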

In [714]:
days_employed_employee_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'employee']['days_employed'].median()
days_employed_business_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'business']['days_employed'].median()
days_employed_retiree_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'retiree']['days_employed'].median()
days_employed_civilserv_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'civil servant']['days_employed'].median()
days_employed_entrepreneur_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'entrepreneur']['days_employed'].median()

income_employee_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'employee']['total_income'].median()
income_business_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'business']['total_income'].median()
income_retiree_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'retiree']['total_income'].median()
income_civilserv_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'civil servant']['total_income'].median()
income_entrepreneur_median = credit_scoring[credit_scoring['income_type'] 
                                               == 'entrepreneur']['total_income'].median()

In [715]:

# Medians computed in the previous cell, keyed by income type
medians = {
    'employee': (days_employed_employee_median, income_employee_median),
    'business': (days_employed_business_median, income_business_median),
    'retiree': (days_employed_retiree_median, income_retiree_median),
    'civil servant': (days_employed_civilserv_median, income_civilserv_median),
    'entrepreneur': (days_employed_entrepreneur_median, income_entrepreneur_median),
}

for income_type, (days_median, income_median) in medians.items():
    # Rows of this income type whose total_income is missing
    missing = (credit_scoring['total_income'].isna()
               & (credit_scoring['income_type'] == income_type))
    # Fill both columns for those rows with the group medians
    credit_scoring.loc[missing, 'days_employed'] = (
        credit_scoring.loc[missing, 'days_employed'].fillna(days_median))
    credit_scoring.loc[missing, 'total_income'] = (
        credit_scoring.loc[missing, 'total_income'].fillna(income_median))
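
For comparison, the group-wise median fill can also be sketched without naming each income type, using `groupby(...).transform('median')`. This is an alternative formulation shown on a toy frame with made-up values, not the code that produced the results in this notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical values)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee',
                    'retiree', 'retiree', 'retiree'],
    'days_employed': [-1000.0, -3000.0, np.nan, 350000.0, np.nan, 370000.0],
    'total_income': [20000.0, 30000.0, np.nan, 15000.0, np.nan, 19000.0],
})

for column in ['days_employed', 'total_income']:
    # Median of the column within each income type, aligned to every row
    group_median = toy.groupby('income_type')[column].transform('median')
    # Only the missing entries are replaced; existing values are kept
    toy[column] = toy[column].fillna(group_median)

print(toy)
```

One caveat: a group in which every value is missing has no median, so its rows would stay NaN and would need separate handling.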

**Now let's check if there are still any NA values in the data.**

In [716]:
credit_scoring.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

### Conclusion

The missing values in the 'days_employed' and 'total_income' columns have been replaced with the median value of the income type group that each row belongs to. The data now has no missing values, so we can move on to checking the data types.

### Data type replacement

**It has been a while since we had a look at data as a whole. So let's take another look at the general information of the data.**

In [717]:
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21524 entries, 0 to 21523
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21524 non-null  int64  
 1   days_employed     21524 non-null  float64
 2   dob_years         21524 non-null  int64  
 3   education         21524 non-null  object 
 4   education_id      21524 non-null  int64  
 5   family_status     21524 non-null  object 
 6   family_status_id  21524 non-null  int64  
 7   gender            21524 non-null  object 
 8   income_type       21524 non-null  object 
 9   debt              21524 non-null  int64  
 10  total_income      21524 non-null  float64
 11  purpose           21524 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**The two columns: 'days_employed' and 'total_income', contain real number data of the float64 type. For the purpose of this analysis, it is not necessary to have all the digits after the decimal point (some have up to six). It's also highly likely that the negative '-' sign for the 'days_employed' column values is an error. Now let's change the data type of these two columns to integer and remove the '-' sign.** 

In [718]:
credit_scoring['days_employed'] = credit_scoring['days_employed'].astype('int').abs()

In [719]:
credit_scoring['total_income'] = credit_scoring['total_income'].astype('int')
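
Note that `astype('int')` truncates toward zero rather than rounding, so -8437.67 becomes -8437 before `abs()` turns it into 8437. A small sketch using the first few 'days_employed' values shown earlier; `days` and `days_int` are illustrative names:

```python
import pandas as pd

# Values taken from the head() output earlier in the notebook
days = pd.Series([-8437.673028, -4024.803754, 340266.072047])

# Drop the sign, then drop the fractional part (truncation, not rounding)
days_int = days.abs().astype(int)
print(days_int.tolist())
```

Because truncation toward zero is symmetric around zero, applying `abs()` before or after `astype` gives the same result. If rounding to the nearest day were preferred, `.round()` could be applied before the conversion.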

**Let's double check to see if the values have been changed.**

In [720]:
credit_scoring[['days_employed', 'total_income']]

Unnamed: 0,days_employed,total_income
0,8437,40620
1,4024,17932
2,5623,23341
3,4124,42820
4,340266,25378
...,...,...
21519,4529,35966
21520,343937,24959
21521,2113,14347
21522,3112,39054


### Conclusion

Now the two columns contain whole-number values, which will make the follow-up analysis and calculations easier. Next, let's check for duplicates.

### Processing duplicates

**The earlier analysis shows that the 'education' column contains duplicated values caused by a mix of upper-case and lower-case entries, and that the 'purpose' column contains duplicates where the same meaning is expressed in different ways. Let's fix the 'education' column first.**

In [721]:
credit_scoring['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             667
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

**Change all values to lower case using the str.lower() method**

In [722]:
credit_scoring['education'] = credit_scoring['education'].str.lower()

**Use value_counts() to check the values again to make sure the changes have taken effect.** 

In [723]:
credit_scoring['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             743
primary education        282
graduate degree            6
Name: education, dtype: int64
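
After lowercasing, the five 'education' counts line up with the five 'education_id' counts from Step 1, apart from the one row dropped from 'some college'. A sketch of a check that each id now maps to exactly one label, using a toy frame with made-up rows:

```python
import pandas as pd

# Toy stand-in for credit_scoring (hypothetical rows)
toy = pd.DataFrame({
    'education': ["bachelor's degree", 'Secondary Education',
                  'SECONDARY EDUCATION', 'secondary education', 'Some College'],
    'education_id': [0, 1, 1, 1, 2],
})

toy['education'] = toy['education'].str.lower()

# Number of distinct labels per id; 1 everywhere means the mapping is clean
labels_per_id = toy.groupby('education_id')['education'].nunique()
print(labels_per_id)
```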

**Now let's take a look at the 'purpose' column:**

In [724]:
credit_scoring['purpose'].value_counts()

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
purchase of the house                       647
housing                                     647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             623
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

**Given the values in the data, we will use stemming to find values that contain a particular stem and use that as the basis for assigning categories.**

**Each value can be assigned to one of four categories: 'wedding', 'property', 'education', and 'car'.**

**While the other categories are rather straightforward, the property-related values are a little tricky. Some values clearly indicate whether the property is for personal use or for investment, but others do not specify. It would be ideal to separate commercial property from property for personal use. However, an arbitrary criterion such as containing the words 'commercial' or 'renting' is unlikely to be accurate, because values such as 'real estate transactions' and 'buy real estate' could also refer to commercial property. Given these considerations, there will be just one category, 'property', for all property-related purposes.**


**Import the stemmer from the nltk package**

In [725]:
from nltk.stem import SnowballStemmer
english_stemmer = SnowballStemmer('english')                              

**Build the variables used for stemming** 

In [726]:
wedd = english_stemmer.stem('wedding')
hous = english_stemmer.stem('house')
estat = english_stemmer.stem('estate')
prop = english_stemmer.stem('property')
educ = english_stemmer.stem('education')
univers = english_stemmer.stem('university')
car = english_stemmer.stem('car')

commerc = english_stemmer.stem('commercial')
rent = english_stemmer.stem('renting')

**It is a good idea to keep the data as it is in the original column and put the modified value in two new columns: 'loan_purpose' and 'loan_purpose_id', containing the category and id respectively in the form of string and integer.**

wedding = 1

property = 2

car = 3

education = 4

In [727]:
credit_scoring['loan_purpose'] = ''
credit_scoring['loan_purpose_id'] = ''

In [728]:
# Iterate over the actual index instead of a hard-coded row count
for i in credit_scoring.index:
    for word in credit_scoring.loc[i, 'purpose'].split(" "):
        # Stem each word once, then compare it with the category stems
        stemmed = english_stemmer.stem(word)
        if stemmed == wedd:
            credit_scoring.loc[i, 'loan_purpose'] = 'wedding'
            credit_scoring.loc[i, 'loan_purpose_id'] = 1
        elif stemmed in [hous, estat, prop]:
            credit_scoring.loc[i, 'loan_purpose'] = 'property'
            credit_scoring.loc[i, 'loan_purpose_id'] = 2
        elif stemmed == car:
            credit_scoring.loc[i, 'loan_purpose'] = 'car'
            credit_scoring.loc[i, 'loan_purpose_id'] = 3
        elif stemmed in [educ, univers]:
            credit_scoring.loc[i, 'loan_purpose'] = 'education'
            credit_scoring.loc[i, 'loan_purpose_id'] = 4

**Use value_counts() to check the two new columns**

In [729]:
print(credit_scoring['loan_purpose'].value_counts())
print(credit_scoring['loan_purpose_id'].value_counts())

property     10839
car           4315
education     4022
wedding       2348
Name: loan_purpose, dtype: int64
2    10839
3     4315
4     4022
1     2348
Name: loan_purpose_id, dtype: int64
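As a cross-check on the stem-based loop, the same four categories can be approximated without a stemmer by matching word prefixes with regular expressions. This is a simplified stand-in for stemming, not the method used in this project. The `PREFIX_TO_CATEGORY` table, the `categorize_purpose` helper, the `'other'` fallback and some of the sample purposes are illustrative assumptions.

```python
import re

import pandas as pd

# Hypothetical prefix table: (regex of word prefixes, category name, category id).
# Note: a prefix such as 'car' would also match unrelated words like 'career'.
PREFIX_TO_CATEGORY = [
    (r"\bwedd", "wedding", 1),
    (r"\b(?:hous|estat|propert)", "property", 2),
    (r"\bcar", "car", 3),
    (r"\b(?:educ|univers)", "education", 4),
]


def categorize_purpose(purpose):
    """Return (category, id) for the first prefix pattern found in the text."""
    for pattern, category, category_id in PREFIX_TO_CATEGORY:
        if re.search(pattern, purpose.lower()):
            return category, category_id
    return "other", 0  # assumed fallback for purposes that match nothing


# Small sample in the style of the 'purpose' column (partly made up).
sample = pd.DataFrame({"purpose": [
    "purchase of the house",
    "car purchase",
    "housing renovation",
    "buy residential real estate",
    "wedding ceremony",
    "going to university",
]})

results = [categorize_purpose(p) for p in sample["purpose"]]
sample["loan_purpose"] = [category for category, _ in results]
sample["loan_purpose_id"] = [category_id for _, category_id in results]
print(sample)
```

Comparing the `value_counts()` of such a cross-check with the stem-based result is one way to spot purposes that neither approach categorizes.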


### Conclusion

All duplicates in the data are now removed. The next step is to categorize the data and start to look for patterns in light of the questions that we hope to address.  

### Categorizing Data

**First, let's change column names to make them more representative of column values. We'll take a look at the current names.** 

In [730]:
credit_scoring.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose', 'loan_purpose', 'loan_purpose_id'],
      dtype='object')

**Create a new column name list**

In [731]:
colnames = ['number_of_children', 'days_employed', 'age', 'education', 'education_id', 
            'marital_status', 'marital_status_id', 'gender', 'income_type', 'have_debt',
           'monthly_income', 'purpose', 'loan_purpose', 'loan_purpose_id']

**Rename the columns using the list**

In [732]:
credit_scoring.columns = colnames

**Check column names again**

In [733]:
credit_scoring.columns

Index(['number_of_children', 'days_employed', 'age', 'education',
       'education_id', 'marital_status', 'marital_status_id', 'gender',
       'income_type', 'have_debt', 'monthly_income', 'purpose', 'loan_purpose',
       'loan_purpose_id'],
      dtype='object')
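Assigning a whole list to `.columns` relies on the list order matching the existing columns exactly. A possible alternative, sketched here on a tiny made-up frame, is `DataFrame.rename` with an explicit old-to-new mapping, so that each rename is visible on its own line and unlisted columns keep their names.

```python
import pandas as pd

# Tiny made-up frame that uses a few of the original column names.
df = pd.DataFrame({
    "children": [1, 0],
    "dob_years": [42, 33],
    "family_status": ["married", "married"],
    "debt": [0, 0],
    "gender": ["F", "M"],
})

# Explicit old -> new mapping; columns not listed (here 'gender') are unchanged.
new_names = {
    "children": "number_of_children",
    "dob_years": "age",
    "family_status": "marital_status",
    "debt": "have_debt",
}
df = df.rename(columns=new_names)
print(df.columns.tolist())
```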

**Next, the 'age' column contains individual ages in years. So many distinct values are not very helpful and make analysis difficult, so it makes more sense to assign each person to an age group. Let's take a look at the age range first.** 

In [734]:
credit_scoring['age'].describe()

count    21524.000000
mean        43.496144
std         12.229884
min         19.000000
25%         34.000000
50%         43.000000
75%         53.000000
max         75.000000
Name: age, dtype: float64

**The output shows that the mean age is about 43, and 75% of the people in the data are aged 53 or under. We expect age to be an important factor in the risk of a loan because it indicates the stage of working life a person is at. It therefore makes sense to divide ages into four groups: 25 and under, 26 to 45, 46 to 65, and above 65. The reasoning is that people aged 25 and under tend to be still studying or at an early stage of their working life, people between 26 and 45 tend to be starting families and buying homes, people between 46 and 65 tend to be relatively more stable financially, and people tend to retire around the age of 65.** 

**Following these admittedly arbitrary criteria, let's add a new column called age_group_id and fill it with the following ids using if conditions.**

1: if age <= 25

2: if 26 <= age <= 45

3: if 46 <= age <= 65

4: if age >65

In [735]:
credit_scoring['age_group_id'] = ''

In [736]:
# Iterate over the actual index instead of a hard-coded row count
for i in credit_scoring.index:
    age = credit_scoring.loc[i, 'age']
    if age <= 25:
        credit_scoring.loc[i, 'age_group_id'] = 1
    elif 26 <= age <= 45:
        credit_scoring.loc[i, 'age_group_id'] = 2
    elif 46 <= age <= 65:
        credit_scoring.loc[i, 'age_group_id'] = 3
    elif age > 65:
        credit_scoring.loc[i, 'age_group_id'] = 4

**Check the 'age_group_id' column**

In [737]:
credit_scoring['age_group_id'].value_counts()

2    11074
3     8512
1     1233
4      705
Name: age_group_id, dtype: int64
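The same grouping could also be done without a row-by-row loop. The sketch below uses `pd.cut` on a handful of made-up ages chosen to sit on both sides of each boundary. It assumes whole-number ages greater than 0, as in the cleaned column.

```python
import pandas as pd

# Made-up ages that sit on both sides of each group boundary.
ages = pd.Series([19, 25, 26, 45, 46, 65, 66, 75], name="age")

# Bin edges are right-inclusive by default: (0, 25], (25, 45], (45, 65], (65, inf).
age_group_id = pd.cut(
    ages,
    bins=[0, 25, 45, 65, float("inf")],
    labels=[1, 2, 3, 4],
).astype(int)

print(age_group_id.tolist())
```

An age of 0 or below would fall outside the first bin and become missing, so the conversion to `int` would fail. That can serve as an extra check that the age column was cleaned beforehand.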

**Now let's create a dataframe with a reduced set of columns, containing only those that hold numeric values and category ids related to the questions to be addressed.** 

In [738]:
credit_scoring_log = credit_scoring[['have_debt', 'age_group_id','number_of_children', 'marital_status_id', 
                                        'monthly_income', 'loan_purpose_id']]

In [739]:
credit_scoring_log

Unnamed: 0,have_debt,age_group_id,number_of_children,marital_status_id,monthly_income,loan_purpose_id
0,0,2,1,0,40620,2
1,0,2,1,0,17932,3
2,0,2,0,0,23341,2
3,0,2,3,0,42820,4
4,0,3,0,1,25378,1
...,...,...,...,...,...,...
21519,0,2,1,1,35966,2
21520,0,4,0,0,24959,3
21521,1,2,1,1,14347,2
21522,1,2,3,0,39054,3


**Let's also create two lookup tables (small DataFrames) that store each id together with its string label for later use.** 

In [740]:
marital_dict = credit_scoring[['marital_status_id', 'marital_status']]
marital_dict = marital_dict.drop_duplicates().reset_index(drop=True)

In [741]:
loan_purpose_dict = credit_scoring[['loan_purpose_id', 'loan_purpose']]
loan_purpose_dict = loan_purpose_dict.drop_duplicates().reset_index(drop=True)

In [742]:
print(marital_dict)
print(loan_purpose_dict)

   marital_status_id     marital_status
0                  0            married
1                  1  civil partnership
2                  2    widow / widower
3                  3           divorced
4                  4          unmarried
  loan_purpose_id loan_purpose
0               2     property
1               3          car
2               4    education
3               1      wedding
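If a plain Python dictionary is preferred over a lookup DataFrame, one possible approach is to zip the id and label columns together and use `Series.map` to translate ids back into labels. In the sketch below, `marital_lookup` is rebuilt by hand from the printed output above, and `id_to_status` and `ids` are illustrative names.

```python
import pandas as pd

# Lookup table rebuilt by hand from the printed marital_dict output above.
marital_lookup = pd.DataFrame({
    "marital_status_id": [0, 1, 2, 3, 4],
    "marital_status": ["married", "civil partnership", "widow / widower",
                       "divorced", "unmarried"],
})

# Convert the two-column frame into a plain {id: label} dict.
id_to_status = dict(zip(marital_lookup["marital_status_id"],
                        marital_lookup["marital_status"]))

# Translate a made-up column of ids back into labels.
ids = pd.Series([0, 4, 2, 0])
print(ids.map(id_to_status).tolist())
```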


### Conclusion

Now we have a clean data subset that contains mostly numeric values, which helps to keep the file size small and makes analysis faster. Next, we will synthesize the data and address the key questions the bank is concerned with. 

### Step 3. Answer these questions

### Question 1: Is there a relation between having kids and repaying a loan on time?

**To address this question, let's use count() to get the number of borrowers in each group by number of children and sum() to get how many of them have debt, then divide the sum by the count.** 

In [743]:
group_by_children_count = credit_scoring_log.groupby('number_of_children')['have_debt'].count()

In [744]:
group_by_children_debt = credit_scoring_log.groupby('number_of_children')['have_debt'].sum()

In [745]:
debt_ratio = group_by_children_debt / group_by_children_count

In [746]:
debt_ratio

number_of_children
0    0.075134
1    0.091470
2    0.094791
3    0.081818
4    0.097561
5    0.000000
Name: have_debt, dtype: float64
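Because `have_debt` only takes the values 0 and 1, the sum divided by the count is the same as the group mean. The sketch below, on made-up rows, computes all three in one `agg` call so that the group size stays visible next to each ratio. The output column names are illustrative.

```python
import pandas as pd

# Made-up rows: 'have_debt' is 1 for a borrower with debt, 0 otherwise.
df = pd.DataFrame({
    "number_of_children": [0, 0, 0, 0, 1, 1, 2, 2],
    "have_debt":          [0, 0, 0, 1, 0, 1, 1, 1],
})

# sum = borrowers with debt, count = group size, mean = sum / count.
summary = (
    df.groupby("number_of_children")["have_debt"]
      .agg(["sum", "count", "mean"])
      .rename(columns={"sum": "with_debt", "count": "borrowers", "mean": "debt_ratio"})
)
print(summary)
```

Keeping the group size in the same table makes it easier to see when a ratio rests on very few borrowers.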

### Conclusion

Overall, there is a higher debt rate among people who have children (ranging from 8.2% to 9.8%) than among those who don't (7.5%). The one exception is the group with 5 children: none of its 9 families has debt, but a group that small is too little data to draw a conclusion from. 

Therefore, at first glance there does seem to be a relation between having children and repaying a loan on time. Whether the difference is significant or not will need to be further investigated. 

### Question 2: Is there a relation between marital status and repaying a loan on time?

**To address this question, let's use the pivot_table() method.** 

In [747]:
marital_loan_pivot = credit_scoring.pivot_table(index = 'marital_status_id', values = 'have_debt', 
                                                aggfunc = ['sum', 'count'])

In [748]:
marital_loan_pivot

Unnamed: 0_level_0,sum,count
Unnamed: 0_level_1,have_debt,have_debt
marital_status_id,Unnamed: 1_level_2,Unnamed: 2_level_2
0,931,12380
1,388,4176
2,63,960
3,85,1195
4,274,2813


**Create a new column containing the results of sum divided by count, and merge the table with the dictionary that contains the marital_status information.** 

In [749]:
marital_loan_pivot.columns = ['sum_have_debt', 'count']
marital_loan_pivot['debt_ratio'] = marital_loan_pivot['sum_have_debt'] / marital_loan_pivot['count']

In [750]:
marital_loan_pivot

Unnamed: 0_level_0,sum_have_debt,count,debt_ratio
marital_status_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,931,12380,0.075202
1,388,4176,0.092912
2,63,960,0.065625
3,85,1195,0.07113
4,274,2813,0.097405


In [751]:
marital_loan_pivot.merge(marital_dict, on = 'marital_status_id', how = 'left')

Unnamed: 0,marital_status_id,sum_have_debt,count,debt_ratio,marital_status
0,0,931,12380,0.075202,married
1,1,388,4176,0.092912,civil partnership
2,2,63,960,0.065625,widow / widower
3,3,85,1195,0.07113,divorced
4,4,274,2813,0.097405,unmarried


### Conclusion

It's interesting to notice that the widow/widower group has the lowest debt ratio (6.6%), followed by the divorced group (7.1%) and the married group (7.5%). In contrast, those who are in a civil partnership (9.3%) or unmarried (9.7%) have the highest debt ratios. There does seem to be some relationship between one's marital status and repaying a loan on time. 
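Whether differences of this size could arise by chance can be checked with a chi-square test of independence. The sketch below is one possible check, not part of the original analysis: it computes the Pearson chi-square statistic in plain Python from the counts in the pivot table above and compares it with 9.488, the standard 5% critical value for 4 degrees of freedom. It does not control for other factors such as age or income.

```python
# Counts copied from the marital-status pivot table above.
with_debt = [931, 388, 63, 85, 274]
borrowers = [12380, 4176, 960, 1195, 2813]

overall_rate = sum(with_debt) / sum(borrowers)

# Pearson chi-square statistic over the 5 x 2 table (debt / no debt).
chi2 = 0.0
for debt, total in zip(with_debt, borrowers):
    expected_debt = total * overall_rate
    expected_no_debt = total * (1 - overall_rate)
    chi2 += (debt - expected_debt) ** 2 / expected_debt
    chi2 += ((total - debt) - expected_no_debt) ** 2 / expected_no_debt

# Degrees of freedom: (5 groups - 1) * (2 outcomes - 1) = 4.
# 9.488 is the 5% critical value of the chi-square distribution with 4 df.
CRITICAL_VALUE_5PCT_DF4 = 9.488
print(f"chi-square = {chi2:.2f}")
print("exceeds 5% critical value:", chi2 > CRITICAL_VALUE_5PCT_DF4)
```

With these counts the statistic comes out well above the threshold, which is consistent with the observation that marital status and debt are related.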

### Question 3: Is there a relation between income level and repaying a loan on time?

**To address this question, we will use a similar method as above and find the mean, minimum, and maximum monthly income for those who have no debt and those who do.** 

In [752]:
income_loan_pivot = credit_scoring.pivot_table(index = 'have_debt', values = 'monthly_income', 
                                                aggfunc = ['mean', 'min', 'max'])

In [753]:
income_loan_pivot

Unnamed: 0_level_0,mean,min,max
Unnamed: 0_level_1,monthly_income,monthly_income,monthly_income
have_debt,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,26492.383309,3392,362496
1,25784.850661,3306,352136


### Conclusion

From the information above, the average monthly income of those who have debt is only slightly lower than that of those who don't. The minimum and maximum monthly incomes are also similar. The relationship between income level and repaying a loan on time doesn't seem to be strong. 
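Comparing average incomes between the two debt groups looks at the question from one side only. A complementary view, sketched here on made-up numbers, is to split borrowers into income brackets with `pd.qcut` and compare the debt ratio per bracket. The bracket labels and the choice of four quartiles are assumptions for illustration.

```python
import pandas as pd

# Made-up incomes and debt flags, for illustration only.
df = pd.DataFrame({
    "monthly_income": [10, 20, 30, 40, 50, 60, 70, 80],
    "have_debt":      [1, 0, 1, 0, 0, 1, 0, 0],
})

# Split borrowers into four equally sized income brackets (quartiles).
df["income_bracket"] = pd.qcut(
    df["monthly_income"], q=4, labels=["low", "lower-mid", "upper-mid", "high"]
)

# Share of borrowers with debt in each bracket.
debt_by_bracket = df.groupby("income_bracket", observed=True)["have_debt"].mean()
print(debt_by_bracket)
```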

### Question 4: How do different loan purposes affect on-time repayment of the loan?

**Let's use a pivot table one more time to find out the relationship between loan purposes and on-time repayment.** 

In [754]:
purpose_loan_pivot = credit_scoring.pivot_table(index = 'loan_purpose_id', values = 'have_debt', 
                                                aggfunc = ['sum', 'count'])

In [755]:
purpose_loan_pivot

Unnamed: 0_level_0,sum,count
Unnamed: 0_level_1,have_debt,have_debt
loan_purpose_id,Unnamed: 1_level_2,Unnamed: 2_level_2
1,186,2348
2,782,10839
3,403,4315
4,370,4022


In [756]:
purpose_loan_pivot.columns = ['sum_have_debt', 'count']

In [757]:
purpose_loan_pivot['debt_ratio'] = purpose_loan_pivot['sum_have_debt'] / purpose_loan_pivot['count'] 

In [758]:
purpose_loan_pivot

Unnamed: 0_level_0,sum_have_debt,count,debt_ratio
loan_purpose_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,186,2348,0.079216
2,782,10839,0.072147
3,403,4315,0.093395
4,370,4022,0.091994


**Merge the pivot table with loan_purpose_dict**

In [759]:
purpose_loan_pivot.merge(loan_purpose_dict, on = 'loan_purpose_id', how = 'left')

Unnamed: 0,loan_purpose_id,sum_have_debt,count,debt_ratio,loan_purpose
0,1,186,2348,0.079216,wedding
1,2,782,10839,0.072147,property
2,3,403,4315,0.093395,car
3,4,370,4022,0.091994,education
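The sum, count and ratio steps used for marital status and loan purpose follow the same pattern, so they could be wrapped in a small helper. The sketch below is one way to do that. `debt_ratio_by` is a hypothetical name and the sample rows are made up.

```python
import pandas as pd


def debt_ratio_by(df, column, target="have_debt"):
    """Hypothetical helper: debt count, group size and debt ratio per group."""
    table = df.groupby(column)[target].agg(["sum", "count"])
    table.columns = ["sum_have_debt", "count"]
    table["debt_ratio"] = table["sum_have_debt"] / table["count"]
    # Highest-risk groups first.
    return table.sort_values("debt_ratio", ascending=False)


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "loan_purpose_id": [1, 1, 2, 2, 2, 2, 3, 3],
    "have_debt":       [0, 1, 0, 0, 0, 1, 1, 1],
})
result = debt_ratio_by(sample, "loan_purpose_id")
print(result)
```

Sorting by `debt_ratio` puts the riskiest categories at the top, which makes the comparison easier to read once the labels are merged in.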


### Conclusion

The contrast is fairly clear: those who borrow for a wedding (7.9%) or for property (7.2%) have a lower debt ratio than those who borrow for a car (9.3%) or for education (9.2%). 

### Step 4. General conclusion

**The initial synthesis of the data suggests that there is a relationship between the number of children and on-time repayment of the loan: leaving aside those who have 5 children, the percentage of borrowers with debt is lower among people who have no children than among those who do.**

**In addition, married borrowers have a lower debt ratio (7.5%) than those who are unmarried (9.7%) or in a civil partnership (9.3%). However, it is the widow/widower group (6.6%), followed by the divorced group (7.1%), that seems most likely to pay back a loan on time.**

**It's interesting to notice that the average monthly income of those who pay back on time and of those who don't is approximately the same. It might be worth investigating whether there are any gender or age differences there.**

**Finally, it seems that lending money to buy cars is the riskiest (9.3% debt ratio), followed closely by education loans (9.2%). Property loans perform best (7.2%). Having said that, property loans can be further divided into different types. Given more information, it would be interesting to find out how different types of property loans perform in on-time repayment.**


### Project Readiness Checklist


- [x]  file open;
- [x]  file examined;
- [x]  missing values defined;
- [x]  missing values are filled;
- [x]  an explanation of which missing value types were detected;
- [x]  explanation for the possible causes of missing values;
- [x]  an explanation of how the blanks are filled;
- [x]  replaced the real data type with an integer;
- [x]  an explanation of which method is used to change the data type and why;
- [x]  duplicates deleted;
- [x]  an explanation of which method is used to find and remove duplicates;
- [x]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [x]  an explanation of the principle of data categorization;
- [x]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [x]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [x]  conclusions are present on each stage;
- [x]  a general conclusion is made.