# Analyzing borrowers’ risk of defaulting

The project is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

The report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.


## Purpose of the analysis
To build a table for the loan department showing the probability that customers default on a loan, and to determine which factors influence it.
## Hypotheses:
1. Married people are less likely to default on loans than divorced or unmarried people.
2. People who have fewer than 2 children are more likely to default on loans.
3. People with low income are more likely to default than those with a middle or high income.
4. Loan purpose has no influence on the probability of defaulting on a loan.

## Open the data file and have a look at the general information. 

Importing pandas library and reading csv

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np

# Display the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Load the data
data = pd.read_csv('/datasets/credit_scoring_eng.csv')

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

Reading file info and printing the first 10 rows of the table to check for potential issues with the data

In [2]:
# Let's see how many rows and columns our dataset has
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [3]:
# let's print the first 10 rows
print(data.head(10))


   children  days_employed  dob_years            education  education_id  \
0         1   -8437.673028         42    bachelor's degree             0   
1         1   -4024.803754         36  secondary education             1   
2         0   -5623.422610         33  Secondary Education             1   
3         3   -4124.747207         32  secondary education             1   
4         0  340266.072047         53  secondary education             1   
5         0    -926.185831         27    bachelor's degree             0   
6         0   -2879.202052         43    bachelor's degree             0   
7         0    -152.779569         50  SECONDARY EDUCATION             1   
8         2   -6929.865299         35    BACHELOR'S DEGREE             0   
9         0   -2188.756445         41  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1

There are almost no problems with the column names or column types, but `dob_years` will be renamed to `age` so that it describes the content more clearly. There are problems with the content of some columns:
1. There are missing values in the `days_employed` and `total_income` columns.
2. The `days_employed` column has many negative values; we need to investigate this.
3. The `education` column contains obvious duplicates that differ only in letter case; we need to fix this.

The first 10 rows show no missing values, but `data.info()` reports fewer non-null entries in two columns, so we will count the missing values explicitly.

In [4]:
# Get info on data
data.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

We have some missing values in the `days_employed` and `total_income` columns. We will investigate this.

In [5]:
# Let's look at the rows where the first column with missing data is empty
data[data.days_employed.isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


Both `days_employed` and `total_income` have missing values, and the number of missing values is the same in the two columns. That number is large, so we will investigate further.

In [6]:
# Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.
data[(data.days_employed.isna())&(data.total_income.isna())]


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


**Intermediate conclusion**

We filtered the rows where both columns are missing and again got 2174 rows. This means that every customer with a missing value in `days_employed` also has a missing value in `total_income`. We cannot remove these rows because there are too many of them, so we should replace the missing values with carefully chosen median or mean values so that the replacement does not distort the results of our analysis.

We have 2174 rows with missing values out of 21525 rows in total, so about 10% of the rows are affected. That is a considerable share of the data.
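
The two observations above (the gaps fall in the same rows, and they cover a noticeable share of the data) can be checked programmatically. This is a minimal sketch on a toy frame; the frame `toy` and its values are illustrative assumptions, not the bank's data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in the same rows of both columns
toy = pd.DataFrame({
    'days_employed': [-100.0, np.nan, -250.0, np.nan, -30.0],
    'total_income': [20000.0, np.nan, 31000.0, np.nan, 18000.0],
})

# True when every row is missing in both columns or in neither
same_rows = bool((toy['days_employed'].isna() == toy['total_income'].isna()).all())

# Share of rows with a missing income
missing_share = toy['total_income'].isna().mean()

print(same_rows, f"{missing_share:.1%}")
```

Applied to the real `data` frame, the same two expressions would be expected to confirm the counts reported by `data.isna().sum()`.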

So, first we will filter the rows with missing values. Then we'll check whether the missing values depend on income type or loan purpose.

In [7]:
# Let's look at the loan purposes of clients with missing values in both columns
data_filt = data[(data.days_employed.isna())&(data.total_income.isna())]
(data_filt.value_counts(subset = ['purpose'])/data_filt.value_counts(subset = ['purpose']).sum()).map("{:.1%}".format)


purpose                                 
having a wedding                            4.2%
to have a wedding                           3.7%
wedding ceremony                            3.5%
construction of own property                3.4%
housing transactions                        3.4%
buy real estate                             3.3%
transactions with my real estate            3.3%
purchase of the house for my family         3.3%
transactions with commercial real estate    3.2%
housing renovation                          3.2%
buy commercial real estate                  3.1%
buying property for renting out             3.0%
property                                    2.9%
buy residential real estate                 2.8%
real estate transactions                    2.8%
housing                                     2.8%
building a property                         2.7%
cars                                        2.6%
going to university                         2.6%
to become educated          

In [8]:
# Checking distribution
(data_filt.value_counts(subset = ['income_type'])/data_filt.value_counts(subset = ['income_type']).sum()).map("{:.1%}".format)


income_type  
employee         50.8%
business         23.4%
retiree          19.0%
civil servant     6.8%
entrepreneur      0.0%
dtype: object

We looked at the distribution of loan purpose and income type among the rows with missing values.

**Possible reasons for the missing values: the customers had no job at the time of observation, or there were technical problems while the data was being collected.**

We can check whether the missing values are random by computing the same distributions on the whole dataset.

In [9]:
# Checking the distribution in the whole dataset
print((data.value_counts(subset = ['income_type'])/data.value_counts(subset = ['income_type']).sum()).map("{:.1%}".format))
print((data.value_counts(subset = ['purpose'])/data.value_counts(subset = ['purpose']).sum()).map("{:.1%}".format))


income_type                
employee                       51.7%
business                       23.6%
retiree                        17.9%
civil servant                   6.8%
entrepreneur                    0.0%
unemployed                      0.0%
paternity / maternity leave     0.0%
student                         0.0%
dtype: object
purpose                                 
wedding ceremony                            3.7%
having a wedding                            3.6%
to have a wedding                           3.6%
real estate transactions                    3.1%
buy commercial real estate                  3.1%
buying property for renting out             3.0%
housing transactions                        3.0%
transactions with commercial real estate    3.0%
purchase of the house                       3.0%
housing                                     3.0%
purchase of the house for my family         3.0%
construction of own property                3.0%
property                         

**Intermediate conclusion**

The distributions in the original dataset are similar to those in the filtered table, so the missing values do not appear to be tied to a particular income type or loan purpose. It may be that people without a job at the time did not fill in their days employed and income, or there could be another cause; we will investigate this further.

We can also check the distribution of gender and debt before drawing a final conclusion.

In [10]:
# Check for other reasons and patterns that could lead to missing values. Checking gender
print((data_filt.value_counts(subset = ['gender'])/data_filt.value_counts(subset = ['gender']).sum()).map("{:.1%}".format))
print((data.value_counts(subset = ['gender'])/data.value_counts(subset = ['gender']).sum()).map("{:.1%}".format))


gender
F         68.3%
M         31.7%
dtype: object
gender
F         66.1%
M         33.9%
XNA        0.0%
dtype: object


**Intermediate conclusion**

So there are no patterns; the missing values appear to be accidental. Let's also check the `debt` column.

In [11]:
# Checking for other patterns - checking debt
print((data_filt.value_counts(subset = ['debt'])/data_filt.value_counts(subset = ['debt']).sum()).map("{:.1%}".format))
print((data.value_counts(subset = ['debt'])/data.value_counts(subset = ['debt']).sum()).map("{:.1%}".format))

debt
0       92.2%
1        7.8%
dtype: object
debt
0       91.9%
1        8.1%
dtype: object


**Conclusions**

There are no patterns showing that the missing values depend on anything.
So we'll replace the missing values with mean or median values in order not to lose a large part of our data and to be able to draw conclusions about the whole dataset.
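
The comparisons above were made by printing two distributions one after another. As a hedged alternative, the shares can be placed side by side in one table. The helper `compare_shares` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): income is missing for some rows
toy = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'total_income': [1.0, np.nan, 2.0, 3.0, np.nan, 4.0, 5.0, np.nan],
})

def compare_shares(df, column, mask):
    """Return category shares for all rows and for the rows selected by `mask`."""
    full = df[column].value_counts(normalize=True).rename('all_rows')
    subset = df.loc[mask, column].value_counts(normalize=True).rename('missing_rows')
    # Categories absent from the subset get a share of 0
    return pd.concat([full, subset], axis=1).fillna(0)

shares = compare_shares(toy, 'gender', toy['total_income'].isna())
print(shares)
```

Here `value_counts(normalize=True)` returns shares directly, which avoids dividing the counts by their sum by hand.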



Now it is time to transform the data. We'll check all string columns for duplicates, then replace the missing values, and we will also rename the column `dob_years` to `age`.

## Data transformation

Let's fix the implicit duplicates in the columns.

In [12]:
# Let's see all values in education column to check if and what spellings will need to be fixed
data['education'].unique()


array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [13]:
# Fix letter-case variants such as 'PRIMARY EDUCATION' and 'Primary Education'

# A function that replaces duplicate spellings in any column.
# `dic` maps each correct value to the wrong value (or list of wrong values) it replaces.
def replace_wrong_names(dic, column):
    for key, value in dic.items():
        data[column] = data[column].replace(value, key)

# Dictionary for the column `education`
names_dic = {"bachelor's degree": ["BACHELOR'S DEGREE", "Bachelor's Degree"],
         'secondary education': ['Secondary Education', 'SECONDARY EDUCATION'],
         'some college': ['SOME COLLEGE', 'Some College'],
         'primary education': ['PRIMARY EDUCATION', 'Primary Education'],
         'graduate degree': ['Graduate Degree', 'GRADUATE DEGREE']
}

# Replacing the duplicates
replace_wrong_names(names_dic, 'education')

In [14]:
# Checking all the values in the column to make sure we fixed them
data['education'].unique()


array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)
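
When the duplicates differ only by letter case, as they do in `education`, lowercasing the column is a shorter alternative to listing every variant in a dictionary. A minimal sketch, assuming a small sample of the spellings seen above:

```python
import pandas as pd

# Toy column (a hypothetical sample of the spellings seen above)
education = pd.Series(["bachelor's degree", 'Secondary Education',
                       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'Some College'])

# Lowercasing and trimming whitespace collapses the case variants
normalized = education.str.lower().str.strip()

unique_values = sorted(normalized.unique())
print(unique_values)
```

The dictionary approach remains useful for columns such as `purpose`, where the duplicates are different phrases rather than case variants.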

Check the data in the `children` column.


In [15]:
# Let's see the distribution of values in the `children` column
data.value_counts(subset = ['children'])

children
 0          14149
 1           4818
 2           2055
 3            330
 20            76
-1             47
 4             41
 5              9
dtype: int64

We have -1 children for 47 people and 20 children for 76 people, which is implausible. Together these values make up about 0.6% of the data, so not much at all. We assume that the minus sign was entered by mistake, so we will change -1 to 1 child, and that 20 was most likely meant to be 2.

In [16]:
# Fix the data using the function created above (correct value: wrong value)

child_dic = {1:-1, 2:20}
replace_wrong_names(child_dic, 'children')

In [17]:
# Checking the `children` column again to make sure it's all fixed
data['children'].unique()


array([1, 0, 3, 2, 4, 5])
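
The same correction can be sketched with `Series.replace` and a mapping written in the more common direction, from wrong value to corrected value. The toy series is hypothetical, and the corrections rest on the assumption stated above that -1 and 20 are typos for 1 and 2.

```python
import pandas as pd

# Toy values (hypothetical) including the two suspicious codes
children = pd.Series([0, 1, -1, 20, 2, 3])

# Assumed corrections: -1 is a sign typo for 1, 20 is a typo for 2
fixed = children.replace({-1: 1, 20: 2})

print(fixed.tolist())
```

Note that this mapping reads wrong value to correct value, the opposite of `child_dic`, which is keyed by the correct value.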

Check the data in the `days_employed` column. 

In [18]:
# Find problematic data in `days_employed`, and their distribution
data['days_employed'].min()
data['days_employed'].max()
data[data['days_employed'] <0]['days_employed'].count()
data[data['days_employed'] >18400]['days_employed'].count()

-18388.949900568383

401755.40047533

15906

3445

The `days_employed` column contains very dubious values. Most of them are negative (15906 of the 19351 non-missing values), and the maximum is about 401000 days, which is more than 1000 years. The number of values above 18400 days (about 50 years) is smaller: 3445, or about 18% of the non-missing values. So we can first replace the negative values with their absolute values, and then replace the values above 18400 and the missing values with mean or median values.
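
The cleaning plan described above can be sketched in three steps. The toy values are hypothetical, and filling with a single overall median is a simplifying assumption; a group-wise median or mean would follow the same pattern.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical) mimicking the problems described above
days = pd.Series([-8437.67, -4024.80, 340266.07, np.nan, -152.78])

MAX_PLAUSIBLE_DAYS = 18400  # about 50 years, the threshold used in the text

# Step 1: take absolute values to drop the spurious minus sign
cleaned = days.abs()

# Step 2: mark implausibly large values as missing
cleaned = cleaned.where(cleaned <= MAX_PLAUSIBLE_DAYS)

# Step 3: fill every gap with the median of the plausible values
cleaned = cleaned.fillna(cleaned.median())

print(cleaned.tolist())
```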

In [19]:
# Let's see the distinct values in the `purpose` column
data['purpose'].unique()


array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

We see duplicates there too. Let's fix this using our function.

In [20]:
#fix the data using the created function
purpose_dic = {'house purchase':['purchase of the house', 'purchase of the house for my family', 
                                 'buy real estate', 'purchase of my own house',],
               'car purchase':['buying a second-hand car', 'buying my own car', 'cars', 'second-hand car purchase', 'car',
                               'to own a car', 'purchase of a car', 'to buy a car'],
               'education':['supplementary education', 'to become educated', 'getting an education', 
                            'to get a supplementary education', 'getting higher education', 'profile education',
                           'university education', 'going to university'],
               'wedding':['to have a wedding', 'having a wedding', 'wedding ceremony'],
               
               'housing transactions':['construction of own property', 'property', 'building a property', 
                                       'transactions with commercial real estate', 'building a real estate', 'housing',
                                      'transactions with my real estate', 'real estate transactions', 'housing renovation'],
               # 'buy real estate' is already mapped to 'house purchase' above
               'commercial house purchase':['buy commercial real estate', 'buy residential real estate',
                                           'buying property for renting out']   
}

replace_wrong_names(purpose_dic, 'purpose')

In [21]:
# Check the result - make sure it's fixed
data['purpose'].unique()

array(['house purchase', 'car purchase', 'education', 'wedding',
       'housing transactions', 'commercial house purchase'], dtype=object)
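
Listing every phrase by hand is explicit but easy to get wrong when a phrase appears in two lists. As a hedged alternative, the purposes can be grouped by keyword. The helper `purpose_category` is hypothetical, and it deliberately merges the three property-related groups used above into a single 'real estate' category, which is a coarser split than the one in this notebook.

```python
def purpose_category(purpose):
    """Map a free-text loan purpose to a coarse category by keyword (hypothetical helper)."""
    text = purpose.lower()
    if 'wedding' in text:
        return 'wedding'
    if 'educat' in text or 'university' in text:
        return 'education'
    if 'car' in text:
        return 'car purchase'
    # Every other purpose listed above concerns housing or property
    return 'real estate'

samples = ['to have a wedding', 'going to university', 'buying a second-hand car',
           'transactions with commercial real estate', 'to become educated']
categories = [purpose_category(s) for s in samples]
print(categories)
```

On a DataFrame, such a function could be applied with `data['purpose'].apply(purpose_category)`.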

Now we look at the client's age

In [22]:
#Renaming the `dob_years` with the `age`
data.rename(columns = {'dob_years':'age'}, inplace = True)

# Check the `age` for suspicious values and count the percentage
print(data['age'].unique())
"{:.1%}".format(data[data['age']==0]['age'].count()/data['age'].count())


[42 36 33 32 53 27 43 50 35 41 40 65 54 56 26 48 24 21 57 67 28 63 62 47
 34 68 25 31 30 20 49 37 45 61 64 44 52 46 23 38 39 51  0 59 29 60 55 58
 71 22 73 66 69 19 72 70 74 75]


'0.5%'

0.5% of customers have 0 in the age column. This is a small share, so we can remove these rows in order to perform our analysis properly and check our hypotheses.

In [23]:
# Removing the rows with zero age in the `age` column
data.drop(data[data['age']==0].index, inplace = True)


In [24]:
# Check the result - make sure it's fixed
data[data['age']==0]

Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


Let's check the `family_status` column.

In [25]:
# Let's see the values for the column
data['family_status'].unique()


array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

The only problem is the very long name 'widow / widower', so we will change the value to 'widow(er)'.

In [26]:
# Replacing 'widow / widower' with 'widow(er)'
status_dic = {'widow(er)': 'widow / widower'}
replace_wrong_names(status_dic, 'family_status')

In [27]:
# Check the result - make sure it's fixed
data['family_status'].unique()

array(['married', 'civil partnership', 'widow(er)', 'divorced',
       'unmarried'], dtype=object)

Now we will check the `gender` column

In [28]:
# Let's see the values in the `gender` column
data.value_counts(subset = ['gender'])

gender
F         14164
M          7259
XNA           1
dtype: int64

We have a strange value, XNA, but only for one person, so we can delete this row from the data.

In [29]:
# Address the problematic value
data.drop(data[data['gender']=='XNA'].index, inplace = True)

In [30]:
# Check the result - make sure it's fixed
data.value_counts(subset = ['gender'])

gender
F         14164
M          7259
dtype: int64

Now we will check the `income_type` column

In [31]:
# Let's see the values in the column
data.value_counts(subset = ['income_type'])

income_type                
employee                       11064
business                        5064
retiree                         3836
civil servant                   1453
entrepreneur                       2
unemployed                         2
paternity / maternity leave        1
student                            1
dtype: int64

We have 2 people with the entrepreneur status, which is effectively the same as business, and we treat people on paternity / maternity leave and students as unemployed. There are only 6 such values in total, so simplifying the data with our `replace_wrong_names` function won't be a problem. We will not merge `civil servant` into `employee`, because the number of civil servants is rather large, so merging could influence the results.

In [32]:
# Address the problematic values, if they exist
income_dic = {'business':'entrepreneur',
              'unemployed':['paternity / maternity leave', 'student']
}
replace_wrong_names(income_dic, 'income_type')

In [33]:
# Check the result - make sure it's fixed
data.value_counts(subset = ['income_type'])


income_type  
employee         11064
business          5066
retiree           3836
civil servant     1453
unemployed           4
dtype: int64

There are only 4 unemployed people, which is acceptable: most unemployed people probably do not even apply for a loan.


Let's see if we have any duplicates in our data

In [34]:
# Checking duplicates
data.duplicated().sum()


285

In [35]:
# Address the duplicates
data.drop_duplicates(inplace = True)

In [36]:
# Last check whether we have any duplicates
data.duplicated().sum()

0

In [37]:
# Check the size of the dataset that you now have after your first manipulations with it
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21138 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21138 non-null  int64  
 1   days_employed     19259 non-null  float64
 2   age               21138 non-null  int64  
 3   education         21138 non-null  object 
 4   education_id      21138 non-null  int64  
 5   family_status     21138 non-null  object 
 6   family_status_id  21138 non-null  int64  
 7   gender            21138 non-null  object 
 8   income_type       21138 non-null  object 
 9   debt              21138 non-null  int64  
 10  total_income      19259 non-null  float64
 11  purpose           21138 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


We fixed issues in the columns `children`, `education`, `family_status`, `income_type`, and `purpose`, where we cleaned the data by replacing duplicate values, and we renamed the column `dob_years` to `age`. These changes did not lose any data.

We removed the rows of people whose age or gender was not defined. There were few such rows. After that we removed the duplicates. Now we have 21138 rows, or 98.2% of the original data, so our results will remain representative.
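
The row counts reported above can be reconciled with simple arithmetic. All numbers in this sketch are taken from the outputs earlier in this notebook; the number of zero-age rows is derived from the gender counts printed after those rows were dropped.

```python
rows_before = 21525                    # RangeIndex size in the first data.info() output
rows_after_age_fix = 14164 + 7259 + 1  # gender counts printed after dropping age == 0

zero_age_rows = rows_before - rows_after_age_fix
xna_rows = 1           # the single row with gender 'XNA'
duplicate_rows = 285   # reported by data.duplicated().sum()

rows_after = rows_before - zero_age_rows - xna_rows - duplicate_rows
share_kept = rows_after / rows_before

print(zero_age_rows, rows_after, f"{share_kept:.1%}")
```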

##  Working with missing values

We have the `education` and `family_status` columns with their corresponding id columns, so we can create dictionaries to speed up working with this data.

In [38]:
# Build the dictionaries
#Education dictionary
education_dic = data.set_index('education_id')['education'].to_dict()

#Family status dictionary
family_dic = data.set_index('family_status_id')['family_status'].to_dict()
education_dic
family_dic


{0: "bachelor's degree",
 1: 'secondary education',
 2: 'some college',
 3: 'primary education',
 4: 'graduate degree'}

{0: 'married',
 1: 'civil partnership',
 2: 'widow(er)',
 3: 'divorced',
 4: 'unmarried'}
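
Building a dictionary straight from the full columns silently keeps the last label seen for each id. A slightly more defensive sketch first reduces the data to unique (id, label) pairs and checks that each id has exactly one label. The toy frame and the name `education_lookup` are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) with repeated id/label pairs
toy = pd.DataFrame({
    'education_id': [0, 1, 1, 0, 2],
    'education': ["bachelor's degree", 'secondary education', 'secondary education',
                  "bachelor's degree", 'some college'],
})

# Keep one row per (id, label) pair before building the lookup
pairs = toy[['education_id', 'education']].drop_duplicates()

# Each id should map to exactly one label; otherwise the dictionary would hide a conflict
assert pairs['education_id'].is_unique

education_lookup = dict(zip(pairs['education_id'].tolist(), pairs['education'].tolist()))
print(education_lookup)
```

The uniqueness check only passes once case variants have been normalized, which is why it belongs after the duplicate fixes above.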

### Restoring missing values in `total_income`

We have missing values in the `days_employed` and `total_income` columns. We need to replace them with median or mean values for specific groups of people. For example, income can depend on age, so we need the mean values for each specific group to get relevant replacements.

Let's start with addressing total income missing values


In [39]:
# Let's write a function that calculates the age category
def age_cat(age):
    if age < 10:
        return '0-9'
    elif age < 20:
        return '10-19'
    elif age < 30:
        return '20-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    else:
        return '70+'
    

In [40]:
# Test if the function works
print (age_cat(18))
print (age_cat(45))
print (age_cat(65))

10-19
40-49
60-69


In [41]:
# Creating new column based on function
data['age_group'] = data['age'].apply(age_cat)


In [42]:
# Checking the values in the new column
data['age_group'].value_counts()

30-39    5629
40-49    5294
50-59    4591
20-29    3149
60-69    2292
70+       169
10-19      14
Name: age_group, dtype: int64
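
The same age groups can also be produced without a hand-written chain of conditions, using `pd.cut`. This is a sketch of an alternative, not the method used in this notebook; the bin edges and labels mirror `age_cat`, and the sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample ages (hypothetical) chosen to hit the bin boundaries
ages = pd.Series([18, 19, 20, 45, 69, 70, 75])

# Left-closed bins: [0, 10), [10, 20), ..., [60, 70), [70, inf)
bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

group_list = age_group.astype(str).tolist()
print(group_list)
```

The result of `pd.cut` is an ordered categorical, so sorting or grouping by it follows the age order rather than alphabetical order.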

Income can depend on age and type of employment, so we will look at the distribution of income by those factors.

We will make a table without missing values to see the distribution and to restore the missing values.

In [43]:
# Create a table without missing values and print a few of its rows to make sure it looks fine
data_clean = data.dropna()
data_clean.head(10)


Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,house purchase,40-49
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39
2,0,-5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,house purchase,30-39
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,education,30-39
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,wedding,50-59
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,house purchase,20-29
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49
7,0,-152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59
8,2,-6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,wedding,30-39
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,house purchase,40-49


In [44]:
# Look at the mean and median values for income based on each other column (age group, income type, family status, gender, 
# education, debt)
data_clean.groupby('age_group').agg({'total_income':['mean', 'median']}).sort_values(by=('total_income','median'))
data_clean.groupby('income_type').agg({'total_income':['mean', 'median']}).sort_values(by=('total_income','median'))
data_clean.groupby('family_status').agg({'total_income':['mean', 'median']}).sort_values(by=('total_income','median'))
data_clean.groupby('gender').agg({'total_income':['mean', 'median']}).sort_values(by=('total_income','median'))
data_clean.groupby('education').agg({'total_income':['mean', 'median']}).sort_values(by=('total_income','median'))
data_clean.groupby('debt').agg({'total_income':['mean', 'median']}).sort_values(by=('total_income','median'))

age_group,total_income (mean),total_income (median)
10-19,16993.942462,14934.901
70+,20125.658331,18751.324
60-69,23242.812818,19817.44
50-59,25811.700327,22203.0745
20-29,25570.172966,22798.665
30-39,28312.479963,24667.528
40-49,28551.375635,24764.229


income_type,total_income (mean),total_income (median)
unemployed,16588.4105,12652.6895
retiree,21939.310393,18969.149
employee,25824.679592,22815.1035
civil servant,27361.316126,24083.5065
business,32407.719326,27564.893


family_status,total_income (mean),total_income (median)
widow(er),23006.808776,20523.267
unmarried,26943.601742,23139.404
civil partnership,26702.249322,23195.636
married,27045.38353,23377.708
divorced,27202.683563,23584.9695


gender,total_income (mean),total_income (median)
F,24664.752169,21469.0015
M,30905.772981,26819.567


education,total_income (mean),total_income (median)
primary education,21144.882211,18741.976
secondary education,24600.353617,21839.4075
graduate degree,27960.024667,25161.5835
some college,29035.057865,25608.7945
bachelor's degree,33172.428387,28054.531


debt,total_income (mean),total_income (median)
1,26105.101486,22928.48
0,26854.991871,23224.6865


As we see, business owners, people aged 30 to 49, and people with a bachelor's degree have the highest incomes. It seems that `income_type`, age, and education are the factors that influence total income. We also see that the medians are lower than the means, so the distribution is skewed to the right: a relatively small number of people have relatively high incomes. Let's group our data by the three factors we have identified and see what we get.
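
The link between a mean that exceeds the median and a long right tail can be illustrated with a few numbers. The incomes below are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical monthly incomes: most are similar, one is much larger
incomes = [18000, 20000, 21000, 22000, 23000, 24000, 95000]

avg = mean(incomes)
mid = median(incomes)

# A mean well above the median points to a long right tail
print(round(avg, 1), mid, avg > mid)
```

A single large value pulls the mean up while the median stays at the middle observation, which is why the median is often the safer choice for filling gaps in skewed data.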

In [45]:
# Let's group the data by the 'age_group', 'income_type', and 'education' columns and calculate the mean and median for them
data_clean.groupby(['age_group', 'income_type', 'education']).agg({
    'total_income':['mean', 'median']}).sort_values(by=('total_income','median'), ascending = False)

age_group,income_type,education,total_income (mean),total_income (median)
40-49,civil servant,primary education,78410.774000,78410.7740
70+,civil servant,bachelor's degree,57508.032000,57508.0320
50-59,employee,graduate degree,42945.794000,42945.7940
60-69,business,some college,32607.246000,37146.5350
70+,business,bachelor's degree,35343.050000,36808.9680
...,...,...,...,...
50-59,civil servant,primary education,12709.275500,12709.2755
10-19,civil servant,some college,12125.986000,12125.9860
20-29,retiree,secondary education,11637.739167,11619.3835
30-39,unemployed,secondary education,9102.890000,9102.8900


So, after grouping by the three factors we get a different picture. Civil servants now have the highest income, and the means are almost equal to the medians or even higher. That means we can use the means to replace the missing values. Let's do this.

In [46]:
# Write a function that we will use for filling in missing values
def replace_mean(column_replace, column_sorting1, column_sorting2, column_sorting3):
    data[column_replace] = data[column_replace].fillna(data.groupby([
        column_sorting1, column_sorting2, column_sorting3])[column_replace].transform('mean'))

In [47]:
# Applying the function and checking that it works
replace_mean('total_income', 'age_group', 'income_type', 'education')
data.loc[12]

children                              0
days_employed                       NaN
age                                  65
education           secondary education
education_id                          1
family_status         civil partnership
family_status_id                      1
gender                                M
income_type                     retiree
debt                                  0
total_income               20482.478461
purpose                         wedding
age_group                         60-69
Name: 12, dtype: object

In [48]:
# Check if we got any errors
data[data.total_income.isna()]

Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
1303,1,,70,primary education,3,civil partnership,1,F,employee,0,,housing transactions,70+
8142,0,,64,primary education,3,civil partnership,1,F,civil servant,0,,wedding,60-69


We still have 2 rows with missing values. Let's fix them manually. First we'll see which values we should use as replacements, and then we'll replace them.

In [49]:
# Check the data for rows with 'primary education' in the `education` column
data_clean.query("education=='primary education'").groupby(['age_group', 'income_type', 'education']).agg({
    'total_income':['mean']}).sort_values(by=('income_type'))

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_income
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean
age_group,income_type,education,Unnamed: 3_level_2
20-29,business,primary education,30460.981833
50-59,business,primary education,25171.059
30-39,business,primary education,25068.715417
40-49,business,primary education,25941.415857
50-59,civil servant,primary education,12709.2755
40-49,civil servant,primary education,78410.774
30-39,civil servant,primary education,21150.696
20-29,civil servant,primary education,30563.383
40-49,employee,primary education,22164.718342
20-29,employee,primary education,26614.028556


So, as we can see, there are no people who match all 3 parameters, which is why these rows could not be filled. The combination that is missing is the age group. Let's replace the rest of the missing values with the means calculated without the age group.

In [50]:
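# Replacing the remaining missing values with the two-factor mean (`income_type` and `education`).
# `income_type` is passed twice because replace_mean expects three grouping columns.
replace_mean('total_income', 'income_type', 'income_type', 'education')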

#Checking for mistakes
data[data.total_income.isna()]

Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
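
The two-step fill above (three grouping columns first, then two) can also be written as a single helper that tries each set of grouping keys in order. This is only a sketch on made-up toy data: the name `fill_with_group_mean` and the `key_sets` parameter are assumptions for illustration, not part of the notebook. Unlike the cells above, each fallback mean here is computed from the originally observed values only.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): the last row has no peers
# with the same age_group, income_type and education.
toy = pd.DataFrame({
    'age_group':    ['30-39', '30-39', '40-49', '40-49', '70+'],
    'income_type':  ['employee'] * 5,
    'education':    ['primary education'] * 5,
    'total_income': [20000.0, np.nan, 30000.0, 40000.0, np.nan],
})

def fill_with_group_mean(df, column, key_sets):
    """Fill NaNs in `column`, trying each list of grouping keys in order."""
    filled = df[column].copy()
    for keys in key_sets:
        # Group means are always computed from the original (unfilled) column.
        group_mean = df.groupby(keys)[column].transform('mean')
        filled = filled.fillna(group_mean)
    return filled

toy['total_income'] = fill_with_group_mean(
    toy,
    'total_income',
    key_sets=[
        ['age_group', 'income_type', 'education'],  # most specific first
        ['income_type', 'education'],               # fallback without age_group
    ],
)
print(toy)
```

In the toy data the row in the `70+` group gets its value from the fallback keys, because its three-key group contains no observed income.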


Well, we've finished with `total_income`, let's check that the total number of values in this column matches the number of values in other ones.

In [51]:
# Checking the number of entries in the columns
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 21138 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21138 non-null  int64  
 1   days_employed     19259 non-null  float64
 2   age               21138 non-null  int64  
 3   education         21138 non-null  object 
 4   education_id      21138 non-null  int64  
 5   family_status     21138 non-null  object 
 6   family_status_id  21138 non-null  int64  
 7   gender            21138 non-null  object 
 8   income_type       21138 non-null  object 
 9   debt              21138 non-null  int64  
 10  total_income      21138 non-null  float64
 11  purpose           21138 non-null  object 
 12  age_group         21138 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.8+ MB


###  Restoring values in `days_employed`

We have already noticed that there are very ambiguous values in the `days_employed` column: many negative numbers and some very big numbers. Let's restore them. First we will replace all values greater than 18400 with missing values, then we will replace the negative values with their absolute values. After that we will see which values we have depending on the age group and fill in the missing ones.

In [52]:
# Replacing the huge values
data.loc[data['days_employed']>18400, 'days_employed'] = np.nan

# Replacing the negative values with their absolute values
data['days_employed'] = data['days_employed'].abs()

# Check if we have negative values or values greater than 18400
data[(data['days_employed'] <0) |(data['days_employed'] >18400)]

Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
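
To make the cleaning rule easier to reason about, here is a small sketch of the same two steps on made-up values, with the result converted to years. The constant name `MAX_DAYS` and the sample numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

MAX_DAYS = 18400  # about 50 years of work: 18400 / 365.25 ≈ 50.4

# Made-up values: two negatives, one implausibly large, one missing.
days = pd.Series([-8437.7, -152.8, 340266.1, np.nan, 926.2])

cleaned = days.where(days <= MAX_DAYS)  # values above the cap become NaN
cleaned = cleaned.abs()                 # negative values become positive

years = cleaned / 365.25
print(pd.DataFrame({'raw': days, 'cleaned': cleaned, 'years': years.round(1)}))
```

Expressing the cap in years is a quick sanity check that the threshold corresponds to a plausible working life.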


In [53]:
# Let's drop the rows with missing values and check the means and medians depending on the age group
data_clean = data.dropna()
data_clean.head(10)
data_clean.groupby('age_group').agg({'days_employed':['mean', 'median']}).sort_values(by=('days_employed','median'))


Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,house purchase,40-49
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,house purchase,30-39
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,education,30-39
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,house purchase,20-29
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,wedding,30-39
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,house purchase,40-49
10,2,4171.483647,36,bachelor's degree,0,married,0,M,business,0,18230.959,house purchase,30-39


Unnamed: 0_level_0,days_employed,days_employed
Unnamed: 0_level_1,mean,median
age_group,Unnamed: 1_level_2,Unnamed: 2_level_2
10-19,633.678086,724.49261
20-29,1212.143628,999.405139
30-39,2026.411457,1589.781401
40-49,2733.227228,2020.095893
50-59,3260.836218,2261.863018
60-69,3839.177181,2669.073965
70+,4226.808923,2680.232791


As we can see, `days_employed` grows steadily with the age group. In almost every group the mean is noticeably higher than the median, which suggests that a few large values pull the mean up, so the median is the safer choice for filling the gaps. The only exception is the 10-19 group, where the median is higher than the mean, but there are only 14 people in that group, so replacing all the missing values with the median will not distort the data.
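
A minimal sketch of why the median is less sensitive to a few large values than the mean. The numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up sample: most values are small, one is very large.
sample = pd.Series([900, 1000, 1100, 1200, 9000])

print(sample.mean())    # 2640.0, pulled upward by the single large value
print(sample.median())  # 1100.0, stays at the middle observation
```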

In [54]:
# Writing a function for replacing missing values with the group median
def replace_median(column_replace, column_sorting):
    data[column_replace] = data[column_replace].fillna(data.groupby(column_sorting)[column_replace].transform('median'))

# Applying the function
replace_median('days_employed', 'age_group')

#Check if it works
data.loc[12]

children                              0
days_employed               2669.073965
age                                  65
education           secondary education
education_id                          1
family_status         civil partnership
family_status_id                      1
gender                                M
income_type                     retiree
debt                                  0
total_income               20482.478461
purpose                         wedding
age_group                         60-69
Name: 12, dtype: object

Ok, now our data should be clean. Let's check this.

In [55]:
# Checking the  data info
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21138 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21138 non-null  int64  
 1   days_employed     21138 non-null  float64
 2   age               21138 non-null  int64  
 3   education         21138 non-null  object 
 4   education_id      21138 non-null  int64  
 5   family_status     21138 non-null  object 
 6   family_status_id  21138 non-null  int64  
 7   gender            21138 non-null  object 
 8   income_type       21138 non-null  object 
 9   debt              21138 non-null  int64  
 10  total_income      21138 non-null  float64
 11  purpose           21138 non-null  object 
 12  age_group         21138 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.8+ MB


As we can see, the number of non-null values is the same in all columns. We can move on to categorizing the data.

## Categorization of data

Now we need to categorize our data. According to our hypotheses, we need to group people by number of children, marital status, income level and loan purpose. As we need to build a credit rating, we will filter the data and look at the people who have already defaulted. We start with the text data, then we'll address the numerical data.


In [56]:
# Select the customers who have already defaulted (copy to avoid SettingWithCopyWarning later)
data_default = data[data['debt']==1].copy()
data_default.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1732 entries, 14 to 21523
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          1732 non-null   int64  
 1   days_employed     1732 non-null   float64
 2   age               1732 non-null   int64  
 3   education         1732 non-null   object 
 4   education_id      1732 non-null   int64  
 5   family_status     1732 non-null   object 
 6   family_status_id  1732 non-null   int64  
 7   gender            1732 non-null   object 
 8   income_type       1732 non-null   object 
 9   debt              1732 non-null   int64  
 10  total_income      1732 non-null   float64
 11  purpose           1732 non-null   object 
 12  age_group         1732 non-null   object 
dtypes: float64(2), int64(5), object(6)
memory usage: 189.4+ KB


Let's check unique values of the `children` column

In [57]:
# Check the unique values of children
data_default['children'].unique()

array([0, 1, 2, 3, 4])

According to our hypotheses, let's group people by number of children: 'No', '1-2', '3 and more'

In [58]:
# Let's write a function to categorize the data by number of children
def child_cat(num):
    if num == 0:
        return 'No' 
    if num <3:
        return '1-2'
    return '3 and more' 


In [77]:
# Create a column with the categories and count the values for them
data_default.loc[:,'child_cat'] = data_default['children'].apply(child_cat)
data_default['child_cat'].value_counts()



No            1057
1-2            644
3 and more      31
Name: child_cat, dtype: int64

Let's check unique values of the `purpose` column

In [60]:
data_default['purpose'].unique()

array(['commercial house purchase', 'wedding', 'education',
       'housing transactions', 'car purchase', 'house purchase'],
      dtype=object)

Let's make the groups for purposes: 'real estate', 'car', 'wedding', 'education'

In [61]:
# Let's write a function to categorize the loan purposes into the four groups
def purpose_cat(goal):
    if 'hous' in goal:
        return 'real estate'
    elif goal == 'wedding':
        return 'wedding'
    elif 'car' in goal:
        return 'car'
    else:
        return 'education'
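
Note that `purpose_cat` labels every purpose that matches none of the keywords as 'education'. That is fine for the six purposes listed above, but a stricter variant would make unexpected values visible. The sketch below is an assumption for illustration: the names `PURPOSE_KEYWORDS` and `purpose_cat_strict` and the 'other' category are not part of the notebook.

```python
# Keyword -> category pairs, checked in order.
PURPOSE_KEYWORDS = [
    ('hous', 'real estate'),
    ('wedding', 'wedding'),
    ('car', 'car'),
    ('educ', 'education'),
]

def purpose_cat_strict(goal):
    """Return the first matching category, or 'other' if nothing matches."""
    for keyword, category in PURPOSE_KEYWORDS:
        if keyword in goal:
            return category
    return 'other'

for goal in ['commercial house purchase', 'car purchase', 'education', 'vacation']:
    print(goal, '->', purpose_cat_strict(goal))
```

With this variant, a new or misspelled purpose would show up as 'other' in `value_counts()` instead of silently inflating the education group.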


In [62]:
# Create a column with the categories and count the values for them
data_default.loc[:,'purpose_cat'] = data_default['purpose'].apply(purpose_cat)
data_default['purpose_cat'].value_counts()

real estate    779
car            399
education      370
wedding        184
Name: purpose_cat, dtype: int64

Let's check unique values of the `family_status` column

In [63]:
data_default['family_status'].unique()

array(['civil partnership', 'unmarried', 'married', 'widow(er)',
       'divorced'], dtype=object)

According to our hypotheses, let's make 2 groups for family status: 'couple', 'single'

In [64]:
def family_cat(status):
    if status in ('civil partnership', 'married'):
        return 'couple'
    return 'single'

In [65]:
# Create a column with the categories and count the values for them
data_default.loc[:, 'family_cat'] = data_default['family_status'].apply(family_cat)
data_default['family_cat'].value_counts()

couple    1312
single     420
Name: family_cat, dtype: int64

Now let's look at the `total_income` column and examine the values we have there

In [66]:
# Looking through the numerical data in `total income` column for categorization
data_default['total_income'].max()
data_default['total_income'].min()
data_default['total_income'].mean()
data_default['total_income'].median()

352136.354

3306.762

26043.44589493624

23498.097999999998

In [67]:
# Counting the rows and the number of people with income above 50000 and below 10000
data_default['total_income'].count()
data_default.loc[data_default['total_income']>50000].shape[0]
data_default.loc[data_default['total_income']<10000].shape[0]

1732

92

58

The minimum income is about 3000, the maximum is about 350000, the median does not differ much from the mean, and the total number of people is about 1700. The majority of people have an income between 10000 and 50000: only 92 people earn more than 50000 and only 58 earn less than 10000. That means most of the incomes are concentrated in the middle range, and we can split the `total_income` column into 3 groups: less than 10000 is low income, from 10000 to 50000 is medium income, and 50000 or more is high income.

In [68]:
# Creating function for categorizing into different numerical groups based on ranges for `total_income` column
def income_cat(num):
    if num < 10000:
        return 'low'
    elif num <50000:
        return 'medium'
    else:
        return 'high'
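
The same binning can be sketched with `pd.cut`, which avoids calling a Python function for every row. The sample incomes below are only illustrative (the first, third and last values are the minimum, median and maximum shown above), and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([3306.762, 10000.0, 23498.098, 50000.0, 352136.354])

# right=False makes each bin include its left edge and exclude its right edge,
# which matches the `<` comparisons in income_cat.
income_bins = pd.cut(
    incomes,
    bins=[-np.inf, 10000, 50000, np.inf],
    labels=['low', 'medium', 'high'],
    right=False,
)
print(income_bins.tolist())
```

The boundary values 10000 and 50000 fall into 'medium' and 'high' respectively, the same as in `income_cat`.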


In [78]:
# Creating column with categories
data_default.loc[:, 'income_cat'] = data_default['total_income'].apply(income_cat)


In [70]:
# Count each category's share to see the distribution
data_default['income_cat'].value_counts(normalize=True).map("{:.1%}".format)

medium    91.3%
high       5.3%
low        3.3%
Name: income_cat, dtype: object

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**  <br> To check this, for each children category we will divide the number of people who have defaulted by the number of people who have not defaulted.

In [71]:
# Check the children data and paying back on time

#Creating `child_cat` column in the whole dataFrame
data['child_cat']= data['children'].apply(child_cat)

# Creating a pivot table and finding the ratio of defaulted to non-defaulted people in the same children category
pivot_child = data.pivot_table(
    index = 'child_cat',
    columns = 'debt',
    values = 'days_employed',
    aggfunc = 'count'
)
pivot_child['ratio'] = (pivot_child[1] / pivot_child[0])
pivot_child

#Creating a dictionary of the default rate
child_cat_dict = pd.Series(pivot_child['ratio']).to_dict()

# Calculating default-rate based on the number of children
data['child_def_r'] = data['child_cat'].map(child_cat_dict)
data[['children','child_def_r']].head(10)

debt,0,1,ratio
child_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1-2,6272,644,0.102679
3 and more,346,31,0.089595
No,12788,1057,0.082656


Unnamed: 0,children,child_def_r
0,1,0.102679
1,1,0.102679
2,0,0.082656
3,3,0.089595
4,0,0.082656
5,0,0.082656
6,0,0.082656
7,0,0.082656
8,2,0.102679
9,0,0.082656


**Conclusion**

We have calculated the ratio of defaulters to non-defaulters for each children category and stored it in a new column. People with 1 or 2 children have the highest ratio (0.103), so they are the most likely not to pay on time. People with 3 or more children come next (0.090), although that group has only 377 customers, and people without children are the most reliable (0.083). So our hypothesis is not supported.
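
The `ratio` column divides defaulters by non-defaulters, which is not the same as the share of defaulters among all customers in a group. The sketch below uses the counts from the pivot table above to show both numbers side by side. The column names `repaid`, `defaulted` and `default_rate` are assumptions for illustration.

```python
import pandas as pd

# Counts copied from the pivot table above.
counts = pd.DataFrame(
    {'repaid': [6272, 346, 12788], 'defaulted': [644, 31, 1057]},
    index=['1-2', '3 and more', 'No'],
)

# Defaulters per non-defaulter (what the `ratio` column holds).
counts['ratio'] = counts['defaulted'] / counts['repaid']

# Defaulters as a share of all customers in the group.
counts['default_rate'] = counts['defaulted'] / (counts['repaid'] + counts['defaulted'])

print(counts.round(4))
```

The share of defaulters is always a little lower than the ratio, but the ordering of the groups is the same, so the comparison between categories does not change.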


**Is there a correlation between family status and paying back on time?**

In [72]:
# Check the family and paying back on time

#Creating `family_cat` column in the whole dataFrame
data['family_cat']= data['family_status'].apply(family_cat)

# Creating a pivot table and finding the ratio of defaulted to non-defaulted people with the same marital status
pivot_family = data.pivot_table(
    index = 'family_cat',
    columns = 'debt',
    values = 'days_employed',
    aggfunc = 'count'
)
pivot_family['ratio'] = (pivot_family[1] / pivot_family[0])
pivot_family

#Creating a dictionary of the default rate
family_cat_dict = pd.Series(pivot_family['ratio']).to_dict()

# Calculating default-rate based on the marital status
data['family_def_r'] = data['family_cat'].map(family_cat_dict)
data[['family_status','family_def_r']].head(10)

debt,0,1,ratio
family_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
couple,14923,1312,0.087918
single,4483,420,0.093687


Unnamed: 0,family_status,family_def_r
0,married,0.087918
1,married,0.087918
2,married,0.087918
3,married,0.087918
4,civil partnership,0.087918
5,civil partnership,0.087918
6,married,0.087918
7,married,0.087918
8,civil partnership,0.087918
9,married,0.087918


**Conclusion**

Our hypothesis is supported: couples (married or in a civil partnership) are less likely to default. The ratio of defaulters to non-defaulters is 0.088 for couples and 0.094 for single people, a difference of about 0.6 percentage points.

**Is there a correlation between income level and paying back on time?**

In [73]:
# Check the income level and paying back on time

#Creating `income_cat` column in the whole dataFrame
data['income_cat']= data['total_income'].apply(income_cat)

# Creating a pivot table and finding the ratio of defaulted to non-defaulted people in the same income category
pivot_income = data.pivot_table(
    index = 'income_cat',
    columns = 'debt',
    values = 'days_employed',
    aggfunc = 'count'
)
pivot_income['ratio'] = (pivot_income[1] / pivot_income[0])
pivot_income

#Creating a dictionary of the default rate
income_cat_dict = pd.Series(pivot_income['ratio']).to_dict()

# Calculating default-rate based on the total income
data['income_def_r'] = data['income_cat'].map(income_cat_dict)
data[['total_income','income_def_r']].head(10)

debt,0,1,ratio
income_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
high,1226,92,0.075041
low,863,58,0.067207
medium,17317,1582,0.091355


Unnamed: 0,total_income,income_def_r
0,40620.102,0.091355
1,17932.802,0.091355
2,23341.752,0.091355
3,42820.568,0.091355
4,25378.572,0.091355
5,40922.17,0.091355
6,38484.156,0.091355
7,21731.829,0.091355
8,15337.093,0.091355
9,23108.15,0.091355


**Conclusion**

Our hypothesis is not supported. People with low income are the least likely to default (ratio 0.067), and people with medium income are the riskiest category (0.091). The difference between these two groups is about 2.4 percentage points.

**How does credit purpose affect the default rate?**

In [74]:
# Check the percentages for default rate for each credit purpose and analyze them

#Creating `purpose_cat` column in the whole dataFrame
data['purpose_cat']= data['purpose'].apply(purpose_cat)

# Creating a pivot table and finding the ratio of defaulted to non-defaulted people in the same credit purpose category
pivot_purpose = data.pivot_table(
    index = 'purpose_cat',
    columns = 'debt',
    values = 'days_employed',
    aggfunc = 'count'
)
pivot_purpose['ratio'] = (pivot_purpose[1] / pivot_purpose[0])
pivot_purpose

#Creating a dictionary of the default rate
purpose_cat_dict = pd.Series(pivot_purpose['ratio']).to_dict()

# Calculating default-rate based on the credit purpose
data['purpose_def_r'] = data['purpose_cat'].map(purpose_cat_dict)
data[['purpose','purpose_def_r']].head(10)


debt,0,1,ratio
purpose_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
car,3851,399,0.103609
education,3576,370,0.103468
real estate,9871,779,0.078918
wedding,2108,184,0.087287


Unnamed: 0,purpose,purpose_def_r
0,house purchase,0.078918
1,car purchase,0.103609
2,house purchase,0.078918
3,education,0.103468
4,wedding,0.087287
5,house purchase,0.078918
6,housing transactions,0.078918
7,education,0.103468
8,wedding,0.087287
9,house purchase,0.078918


**Conclusion**

This hypothesis is not supported: loan purpose does matter. People who borrow for real estate (0.079) or for a wedding (0.087) are less likely to default. Those who want to pay for education or buy a car (both about 0.103) are at a higher risk of default.
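
The four checks above repeat the same steps: count customers by category and `debt`, divide the defaulters by the non-defaulters, and map the ratio back onto the rows. A helper like the one below could remove the repetition. It is a sketch on made-up rows: the function name `default_ratio_table` is an assumption, and it uses `pd.crosstab` instead of `pivot_table`.

```python
import pandas as pd

def default_ratio_table(df, category_col, target_col='debt'):
    """Count customers by category and target, then add defaulters / non-defaulters."""
    table = pd.crosstab(df[category_col], df[target_col])
    table['ratio'] = table[1] / table[0]
    return table

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'family_cat': ['couple', 'couple', 'couple', 'single', 'single', 'single'],
    'debt':       [0, 0, 1, 0, 1, 1],
})
table = default_ratio_table(toy, 'family_cat')
print(table)

# The same mapping step used in the cells above.
toy['family_def_r'] = toy['family_cat'].map(table['ratio'])
```

`pd.crosstab` counts rows directly, so there is no need to pick an arbitrary `values` column such as `days_employed` just to count.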


## Default rating

Now we can calculate the average rating and build the pivot table. We should not sum the ratings across the different categories, because each rating describes the same people; instead we will calculate the average of the four ratios for each customer. The ratios calculated in the previous steps range from 6.7% to 10.4%. Now we will look at the averages and categorize them according to this range. We will also place people who have already defaulted in a separate 'extremely high' category.

In [75]:
# Creating the function which calculates the risk category
def risk_cat(number):
    if number < 0.08:
        return 'low'
    elif number < 0.09:
        return 'medium'
    else:
        return 'high'

# Check if it works
risk_cat(0.06)
risk_cat(0.088)
risk_cat(0.1)

# Creating the `risk_avg` column
data = data.assign(risk_avg = data[['purpose_def_r','family_def_r', 'child_def_r', 'income_def_r']].mean(axis = 1))

# Applying the function and creating the `risk_cat` column
data['risk_cat'] = data['risk_avg'].apply(risk_cat)

# Marking people who have already defaulted as 'extremely high' and checking that it works
data.loc[data['debt']==1,'risk_cat'] = 'extremely high'
data['risk_cat'].iloc[55]


'low'

'medium'

'high'

'extremely high'
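
As a worked example of the averaging step, the sketch below takes the four ratios shown for row 0 in the tables above (house purchase, married, 1 child, medium income) and applies the same thresholds as `risk_cat`. The dictionary `row_ratios` is an assumption for illustration only.

```python
# Ratios for row 0 of the data, copied from the tables above:
# house purchase, married, 1 child, medium income.
row_ratios = {
    'purpose_def_r': 0.078918,
    'family_def_r': 0.087918,
    'child_def_r': 0.102679,
    'income_def_r': 0.091355,
}

risk_avg = sum(row_ratios.values()) / len(row_ratios)

# Same thresholds as risk_cat above.
if risk_avg < 0.08:
    category = 'low'
elif risk_avg < 0.09:
    category = 'medium'
else:
    category = 'high'

print(round(risk_avg, 4), category)  # 0.0902 high
```

The average sits just above the 0.09 threshold, which shows how sensitive the 'medium' and 'high' labels are to the chosen cut-off values.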

Now we are ready to build the pivot table with the default risk for the whole dataset. We count the number of people in each credit risk category depending on their age group.

In [76]:
#Creating the pivot table
data.pivot_table(
    values = 'risk_avg',
    index = 'age_group',
    columns = 'risk_cat',
    aggfunc = 'count',
    fill_value = 0,
    margins = True
    
)

risk_cat,extremely high,high,low,medium,All
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
10-19,1,3,0,10,14
20-29,348,1620,18,1163,3149
30-39,551,3392,18,1668,5629
40-49,404,2659,50,2181,5294
50-59,305,1815,74,2397,4591
60-69,117,842,66,1267,2292
70+,6,56,7,100,169
All,1732,10387,233,8786,21138
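
To read the totals as shares, the `All` row of the table above can be divided by its sum. This is a small sketch with the counts copied from that row; the variable names are assumptions.

```python
import pandas as pd

# The `All` row of the pivot table above.
totals = pd.Series(
    {'extremely high': 1732, 'high': 10387, 'low': 233, 'medium': 8786}
)

shares = totals / totals.sum()
print(shares.map('{:.1%}'.format))
```

About 8% of the customers have already defaulted and about 49% fall into the 'high' category under the thresholds chosen in `risk_cat`.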


# General Conclusion 

Our project was to prepare a report for a bank’s loan division. We needed to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The purpose of the analysis was to build a credit rating table for the Loan Department, showing the probability of customers defaulting on a loan, and to determine the factors which influence it. <br>
We received data consisting of 21525 rows. There were problems with missing values and duplicates, not only in the quantitative columns: some categories were the same but spelled differently. We fixed these issues by unifying the spelling and replacing some duplicates, and with these changes we did not lose any of our data. We removed the rows of people who did not specify their age and gender. There were not many of them, and after that we removed 285 duplicates. After the preprocessing stage our data consisted of 21138 rows, or 98.2% of the original data. We restored the `days_employed` column with the median values depending on age. We also corrected the negative values. <br>

Then we categorized our data. According to our hypotheses, we took the part of the data where people have already defaulted and grouped people by number of children, marital status, income level and loan purpose. Then we calculated the ratio of defaulted to non-defaulted people for the different factors in our hypotheses and checked them. <br>

We checked all the 4 hypotheses:
1. Married people are less likely to default on loans than divorced or unmarried <br>
   **This hypothesis is supported: married people are less likely to default.** Among single people there are about 9.4 defaulters per 100 non-defaulters. <br>
1. People who have less than 2 children are more likely to default on loans <br>
   **This hypothesis is not supported. People who have 1 or 2 children are the most likely to default.** People without children are the most likely to pay on time.
1. People with low income are more likely to default than those who have middle or high level of income. <br>
   **This hypothesis is not supported either. People with low income are the least likely to default.** The difference is about 2 percentage points compared with people with medium income, who are the riskiest category.
1. Loan purpose has no influence on the probability to default on a loan <br>
   **The final hypothesis is not supported. People who borrow for a wedding or for real estate are less likely to default.** Those who want to pay for education or buy a car are at a higher risk of default. <br>
So, most of our assumptions were not confirmed, but our conclusions are based on the data. It was a bit surprising, but this is what the data shows.


In the end we calculated the average risk and categorized it into low, medium, high and extremely high categories. We made a table showing the distribution of the risk. We can see that about 8% of the people have an extremely high risk of default because they have defaulted already, and about 49% of the people asking for a loan fall into the high risk category. This is a significant share, and the credit department should examine our report carefully in order to make the right decision. The average default ratio is stated for each person in our data.
