# Investigating borrower reliability

The customer is the bank's credit department. The task is to determine whether a client's marital status and number of children affect whether the loan is repaid on time. The input data is the bank's statistics on customer solvency.

The results of the study will be taken into account when building a model of **credit scoring** — a special system that evaluates the ability of a potential borrower to repay a loan to the bank.

## Data Preparation and Exploration

In [38]:
from collections import Counter

import pandas as pd
from IPython.display import display
from pymystem3 import Mystem

m = Mystem()  # one lemmatizer instance, reused in the lemmatization cells below

In [39]:
data = pd.read_csv('/Users/vintera/Git/my_projects/dataset/project_01/data.csv')

In [40]:
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,167422.3
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,102971.6
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,20667.26
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,103053.2
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,145017.9
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,203435.1
max,20.0,401755.400475,75.0,4.0,4.0,1.0,2265604.0


In [41]:
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу


In [42]:
data.tail()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
21520,1,-4529.316663,43,среднее,1,гражданский брак,1,F,компаньон,0,224791.862382,операции с жильем
21521,0,343937.404131,67,среднее,1,женат / замужем,0,F,пенсионер,0,155999.806512,сделка с автомобилем
21522,1,-2113.346888,38,среднее,1,гражданский брак,1,M,сотрудник,1,89672.561153,недвижимость
21523,3,-3112.481705,38,среднее,1,женат / замужем,0,M,сотрудник,1,244093.0505,на покупку своего автомобиля
21524,2,-1984.507589,40,среднее,1,женат / замужем,0,F,сотрудник,0,82047.418899,на покупку автомобиля


In [43]:
data.sample(frac=0.1, random_state=1)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
1383,0,353802.811675,37,среднее,1,вдовец / вдова,2,F,пенсионер,0,216452.226085,строительство недвижимости
300,1,-359.193975,33,СРЕДНЕЕ,1,гражданский брак,1,M,сотрудник,0,223001.623994,на проведение свадьбы
6565,2,-1064.854333,35,среднее,1,гражданский брак,1,F,компаньон,0,163591.209323,свадьба
17027,0,,48,высшее,0,гражданский брак,1,F,сотрудник,0,,операции с жильем
4077,0,-7059.100220,45,высшее,0,гражданский брак,1,F,компаньон,1,194820.185757,сыграть свадьбу
...,...,...,...,...,...,...,...,...,...,...,...,...
3871,0,-3161.018082,30,среднее,1,женат / замужем,0,F,сотрудник,0,195225.109850,высшее образование
12523,0,-311.045592,45,среднее,1,женат / замужем,0,M,сотрудник,0,173858.235033,жилье
7530,2,-3425.051465,49,высшее,0,женат / замужем,0,M,компаньон,0,147165.593376,операции с недвижимостью
17091,0,-3757.852014,44,среднее,1,гражданский брак,1,F,компаньон,0,102418.840829,жилье


In [44]:
data.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose'],
      dtype='object')

#### Comment
It is convenient to work with the specified column names. Corrections are unnecessary here.

In [45]:
data['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5])

#### Comment
20 children? Theoretically possible. Negative values, however, should be eliminated.

In [46]:
data.loc[data['days_employed'] >= 0].sort_values(by='days_employed')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
20444,0,328728.720605,72,среднее,1,вдовец / вдова,2,F,пенсионер,0,96519.339647,покупка жилья для семьи
9328,2,328734.923996,41,высшее,0,женат / замужем,0,M,пенсионер,0,126997.497760,операции со своей недвижимостью
17782,0,328771.341387,56,среднее,1,женат / замужем,0,F,пенсионер,0,68648.047062,операции с коммерческой недвижимостью
14783,0,328795.726728,62,высшее,0,женат / замужем,0,F,пенсионер,0,79940.196752,на покупку своего автомобиля
7229,1,328827.345667,32,среднее,1,гражданский брак,1,F,пенсионер,0,122162.965695,сыграть свадьбу
...,...,...,...,...,...,...,...,...,...,...,...,...
7794,0,401663.850046,61,среднее,1,гражданский брак,1,F,пенсионер,0,48286.441362,свадьба
2156,0,401674.466633,60,среднее,1,женат / замужем,0,M,пенсионер,0,325395.724541,автомобили
7664,1,401675.093434,61,среднее,1,женат / замужем,0,F,пенсионер,0,126214.519212,операции с жильем
10006,0,401715.811749,69,высшее,0,Не женат / не замужем,4,F,пенсионер,0,57390.256908,получение образования


#### Comment
We already know that the employment-length column has missing and negative values. The positive values are no better: the shortest one corresponds to about 900 years if read as days (328728 / 365). If we assume that the data collection system made an error and recorded the length in hours, then 328827 / 24 / 365 ≈ 38 years of employment for a 32-year-old borrower is still implausible.
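The arithmetic behind this comment can be checked with a few lines of plain Python. This is a minimal sketch that assumes 365 days per year; the two input values are copied from rows 20444 and 7229 of the output above, and the helper names are illustrative only.

```python
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24


def years_if_days(value):
    """Interpret the raw value as a number of days."""
    return value / DAYS_PER_YEAR


def years_if_hours(value):
    """Interpret the raw value as a number of hours."""
    return value / HOURS_PER_DAY / DAYS_PER_YEAR


shortest_positive = 328728.720605  # row 20444, smallest positive days_employed
young_pensioner = 328827.345667    # row 7229, dob_years == 32

as_days = years_if_days(shortest_positive)
as_hours = years_if_hours(young_pensioner)
print(f"{as_days:.0f} years if days, {as_hours:.1f} years if hours")
```

Neither reading gives a believable employment length, which is why the positive values are treated as anomalies later on.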

In [47]:
data.loc[data['dob_years'] < 18].sort_values(by='dob_years')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
99,0,346541.618895,0,Среднее,1,женат / замужем,0,F,пенсионер,0,71291.522491,автомобиль
13968,1,-1018.525283,0,среднее,1,женат / замужем,0,F,сотрудник,1,155341.706429,свой автомобиль
13741,0,,0,среднее,1,гражданский брак,1,F,сотрудник,0,,на проведение свадьбы
13521,0,-681.907359,0,высшее,0,Не женат / не замужем,4,M,сотрудник,0,115165.323707,строительство жилой недвижимости
13439,0,-1036.644001,0,среднее,1,женат / замужем,0,M,сотрудник,1,271371.522623,операции с жильем
...,...,...,...,...,...,...,...,...,...,...,...,...
6778,0,-1478.092467,0,высшее,0,Не женат / не замужем,4,F,сотрудник,0,157362.970952,получение высшего образования
6670,0,,0,Высшее,0,в разводе,3,F,пенсионер,0,,покупка жилой недвижимости
6411,0,,0,высшее,0,гражданский брак,1,F,пенсионер,0,,свадьба
7344,0,-401.461262,0,среднее,1,женат / замужем,0,M,сотрудник,0,158913.767700,операции с жильем


#### Comment
101 borrowers aged 0 years.

In [48]:
data['education'].value_counts()

среднее                13750
высшее                  4718
СРЕДНЕЕ                  772
Среднее                  711
неоконченное высшее      668
ВЫСШЕЕ                   274
Высшее                   268
начальное                250
Неоконченное высшее       47
НЕОКОНЧЕННОЕ ВЫСШЕЕ       29
НАЧАЛЬНОЕ                 17
Начальное                 15
ученая степень             4
Ученая степень             1
УЧЕНАЯ СТЕПЕНЬ             1
Name: education, dtype: int64

#### Comment
The 5 education levels appear as 15 distinct values because of differences in letter case.

In [49]:
data['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

#### Comment
A gender of XNA? There is only one such row, so it will not affect the results of the study.

### Summary
The provided table has 21525 rows and 12 columns of various data types. The "days_employed" and "total_income" columns each have 19351 filled cells, so 2174 values (about 10%) are missing in each. The missing values may have been caused by errors in the data collection system.
When preparing the data, attention should also be paid to the negative and implausibly large values in the "days_employed" column, to borrowers with an age of 0 years, and to the inconsistent letter case in the education data.
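The equal non-null counts suggest that both columns are empty in the same rows, but they do not prove it. A minimal sketch of how the share of gaps and their overlap could be checked is shown below on a small hypothetical frame (`toy` and its values are made up for illustration; on the real table the same two expressions would be applied to `data`).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the borrower table.
toy = pd.DataFrame({
    "days_employed": [-8437.7, np.nan, -5623.4, np.nan, 340266.1],
    "total_income": [253875.6, np.nan, 145885.9, np.nan, 158616.1],
})

# Share of missing values per column (the mean of a boolean mask).
missing_share = toy[["days_employed", "total_income"]].isna().mean()

# True only if every row is missing in both columns or in neither.
same_rows = bool((toy["days_employed"].isna() == toy["total_income"].isna()).all())

print(missing_share)
print("gaps coincide:", same_rows)
```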

## Data preprocessing and research

### Processing of missing values

In [50]:
# We find median income values for various types of employment.
income_by_income_type = data.groupby('income_type').agg({'total_income' : ['median']})
income_by_income_type.columns = ['median_total_income']
data = data.merge(income_by_income_type, on= ['income_type'])
data[['income_type', 'total_income', 'median_total_income']][data['total_income'].isna()]

Unnamed: 0,income_type,total_income,median_total_income
46,сотрудник,,142594.396847
47,сотрудник,,142594.396847
49,сотрудник,,142594.396847
54,сотрудник,,142594.396847
55,сотрудник,,142594.396847
...,...,...,...
21457,госслужащий,,150447.935283
21489,госслужащий,,150447.935283
21511,госслужащий,,150447.935283
21513,госслужащий,,150447.935283


In [51]:
# Replace the missing values in income with median values, taking into account the type of employment.
data.loc[data['total_income'].isna(), 'total_income'] = data.loc[data['total_income'].isna(), 'median_total_income']
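An alternative to the merge-based fill is `groupby(...).transform("median")`, which returns a series aligned with the original rows, so no helper column is needed and the row order is untouched. The sketch below uses a hypothetical six-row frame, not the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two employment types, one gap in each.
toy = pd.DataFrame({
    "income_type": ["сотрудник", "сотрудник", "сотрудник",
                    "пенсионер", "пенсионер", "пенсионер"],
    "total_income": [100.0, 300.0, np.nan, 50.0, np.nan, 70.0],
})

# Median income of each row's own employment type (NaN values are skipped).
group_median = toy.groupby("income_type")["total_income"].transform("median")

# Fill only the gaps; existing incomes stay as they are.
toy["total_income"] = toy["total_income"].fillna(group_median)
print(toy)
```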

In [52]:
# Median of days_employed over the negative values only, which look the most plausible.
median_days_employed = data.loc[data['days_employed'] < 0, 'days_employed'].median()
median_days_employed

-1630.0193809778218

In [53]:
# Replace omissions and very large values in the column with experience with median values.
data['days_employed'] = data['days_employed'].fillna(median_days_employed)
data.loc[data['days_employed'] > 0, 'days_employed'] = median_days_employed

#### Comment
Most of the employment-length values are negative, and their absolute values are plausible when converted to years, so it can be assumed that the length of service was recorded with the wrong sign.
The children column also contains a negligible number of rows (47) with a negative value (-1).
We convert negative values to positive ones by taking the absolute value.
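The cleaning of the employment-length column described here can be collected into one function. This is a sketch of the same steps (median of the negative values, replacement of gaps and positive values, absolute value); the function name `clean_days_employed` and the sample series are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_days_employed(days: pd.Series) -> pd.Series:
    """Replace gaps and positive values with the median of the negative
    values, then return absolute values."""
    negative_median = days[days < 0].median()
    implausible = days.isna() | (days > 0)
    return days.mask(implausible, negative_median).abs()


# Hypothetical sample: three plausible values, one gap, one anomaly.
toy_days = pd.Series([-1000.0, -3000.0, np.nan, 340266.0, -2000.0])
cleaned = clean_days_employed(toy_days)
print(cleaned.tolist())
```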

In [54]:
# Turn negative values into positive ones by returning the modulus of the number.
data['days_employed'] = data['days_employed'].abs()
data['children'] = data['children'].abs()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21525 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   children             21525 non-null  int64  
 1   days_employed        21525 non-null  float64
 2   dob_years            21525 non-null  int64  
 3   education            21525 non-null  object 
 4   education_id         21525 non-null  int64  
 5   family_status        21525 non-null  object 
 6   family_status_id     21525 non-null  int64  
 7   gender               21525 non-null  object 
 8   income_type          21525 non-null  object 
 9   debt                 21525 non-null  int64  
 10  total_income         21525 non-null  float64
 11  purpose              21525 non-null  object 
 12  median_total_income  21525 non-null  float64
dtypes: float64(3), int64(5), object(5)
memory usage: 2.3+ MB


In [55]:
# Get rid of borrowers with an age below acceptable and reset the indexes.
data = data[data['dob_years'] >= 18].reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21424 entries, 0 to 21423
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   children             21424 non-null  int64  
 1   days_employed        21424 non-null  float64
 2   dob_years            21424 non-null  int64  
 3   education            21424 non-null  object 
 4   education_id         21424 non-null  int64  
 5   family_status        21424 non-null  object 
 6   family_status_id     21424 non-null  int64  
 7   gender               21424 non-null  object 
 8   income_type          21424 non-null  object 
 9   debt                 21424 non-null  int64  
 10  total_income         21424 non-null  float64
 11  purpose              21424 non-null  object 
 12  median_total_income  21424 non-null  float64
dtypes: float64(3), int64(5), object(5)
memory usage: 2.1+ MB


In [56]:
# Rewriting education data in lowercase.
data['education'] = data['education'].str.lower()
data['education'].unique()

array(['высшее', 'среднее', 'неоконченное высшее', 'начальное',
       'ученая степень'], dtype=object)

In [57]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21424 entries, 0 to 21423
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   children             21424 non-null  int64  
 1   days_employed        21424 non-null  float64
 2   dob_years            21424 non-null  int64  
 3   education            21424 non-null  object 
 4   education_id         21424 non-null  int64  
 5   family_status        21424 non-null  object 
 6   family_status_id     21424 non-null  int64  
 7   gender               21424 non-null  object 
 8   income_type          21424 non-null  object 
 9   debt                 21424 non-null  int64  
 10  total_income         21424 non-null  float64
 11  purpose              21424 non-null  object 
 12  median_total_income  21424 non-null  float64
dtypes: float64(3), int64(5), object(5)
memory usage: 2.1+ MB


#### Summary

After correcting the errors identified in the first chapter, we get a table of 21424 rows, about 0.5% fewer than the original. At the cost of these rows, the table now has no missing values, no anomalous employment lengths and no zero ages, and is ready for analysis.

### Data type conversion

In [58]:
data['days_employed'] = data['days_employed'].astype('int')

In [59]:
data['total_income'] = data['total_income'].astype('int')

In [60]:
data['median_total_income'] = data['median_total_income'].astype('int')

In [61]:
data.dtypes

children                int64
days_employed           int64
dob_years               int64
education              object
education_id            int64
family_status          object
family_status_id        int64
gender                 object
income_type            object
debt                    int64
total_income            int64
purpose                object
median_total_income     int64
dtype: object

#### Summary

To make the table easier to work with, the values in the "days_employed", "total_income" and "median_total_income" columns were converted to integers.

### Processing duplicates

In [62]:
data.duplicated().sum()

71

In [63]:
data = data.drop_duplicates().reset_index(drop=True)

#### Summary

Since the data has no unique identifiers, duplicates are searched for by a complete match of all columns. The probability that different borrowers coincide in every parameter is extremely low, so the duplicates most likely come from a failure in the data collection system. We therefore delete the 71 duplicate rows.
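The order of the steps matters for a full-row match: in this notebook the education values are lowercased before duplicates are counted, and rows that differ only in letter case would otherwise not be detected. A small sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: the first three differ only in the case of "education".
toy = pd.DataFrame({
    "education": ["среднее", "Среднее", "СРЕДНЕЕ", "высшее"],
    "dob_years": [42, 42, 42, 36],
    "total_income": [145017, 145017, 145017, 253875],
})

before = int(toy.duplicated().sum())   # case variants count as different rows

toy["education"] = toy["education"].str.lower()
after = int(toy.duplicated().sum())    # now they are exact duplicates

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(before, after, len(deduplicated))
```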

### Lemmatization

In [64]:
# For clarity
data['purpose'].value_counts()

свадьба                                   786
на проведение свадьбы                     764
сыграть свадьбу                           760
операции с недвижимостью                  672
покупка коммерческой недвижимости         658
покупка жилья для сдачи                   649
операции с коммерческой недвижимостью     648
операции с жильем                         646
жилье                                     640
покупка жилья                             640
покупка жилья для семьи                   637
строительство собственной недвижимости    633
недвижимость                              629
операции со своей недвижимостью           627
строительство жилой недвижимости          621
покупка своего жилья                      619
строительство недвижимости                619
покупка недвижимости                      618
ремонт жилью                              605
покупка жилой недвижимости                603
на покупку своего автомобиля              502
заняться высшим образованием      

In [65]:
# Get information about words and the number of their uses in the "purpose" column.
purpose_list = data['purpose'].unique()
lemmas = m.lemmatize(' '.join(purpose_list))
Counter(lemmas).most_common()

[(' ', 96),
 ('покупка', 10),
 ('недвижимость', 10),
 ('автомобиль', 9),
 ('образование', 9),
 ('жилье', 7),
 ('с', 5),
 ('на', 4),
 ('свой', 4),
 ('операция', 4),
 ('свадьба', 3),
 ('строительство', 3),
 ('получение', 3),
 ('высокий', 3),
 ('дополнительный', 2),
 ('для', 2),
 ('коммерческий', 2),
 ('подержать', 2),
 ('заниматься', 2),
 ('сделка', 2),
 ('жилой', 2),
 ('приобретение', 1),
 ('проведение', 1),
 ('семья', 1),
 ('собственный', 1),
 ('сыграть', 1),
 ('со', 1),
 ('профильный', 1),
 ('сдача', 1),
 ('ремонт', 1),
 ('\n', 1)]

#### Summary

After analyzing the information received, we can identify the main groups of purposes for obtaining a loan: "Real Estate", "Motor Transport", "Education" and, oddly enough, "Wedding".
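The lemma counts are easier to read once whitespace tokens and function words are dropped. The sketch below works on a hand-copied subset of the counts shown above, so it needs no lemmatizer; the stop list `STOP_LEMMAS` is a hypothetical choice made for this illustration. Note that the counts are taken over unique purpose strings, so they reflect the variety of wordings rather than the number of loans.

```python
from collections import Counter

# Subset of the (lemma, count) pairs from the output above.
lemma_counts = Counter({
    " ": 96, "покупка": 10, "недвижимость": 10, "автомобиль": 9,
    "образование": 9, "жилье": 7, "с": 5, "на": 4, "свой": 4,
    "операция": 4, "свадьба": 3, "\n": 1,
})

# Hypothetical stop list: prepositions and a pronoun that carry no topic.
STOP_LEMMAS = {"с", "на", "со", "для", "свой"}

content_lemmas = Counter({
    lemma: count
    for lemma, count in lemma_counts.items()
    if lemma.strip() and lemma not in STOP_LEMMAS
})

for lemma, count in content_lemmas.most_common():
    print(lemma, count)
```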

### Categorization

In [66]:
# We categorize the data on the loan objectives into four main groups using the results of lemmatization.
def purpose_grouping(row_values):
    lemm = m.lemmatize(row_values['purpose'])
    if 'автомобиль' in lemm:
        return 'автокредит'
    
    if 'образование' in lemm:
        return 'кредит на образование'
    
    if 'свадьба' in lemm:
        return 'кредит на свадьбу'
    
    if 'жилье' in lemm or 'недвижимость' in lemm:
        return 'операции с недвижимостью'
    
    return 'иные цели'

data['purpose_group'] = data.apply(purpose_grouping, axis=1)
data['purpose_group'].value_counts()

операции с недвижимостью    10764
автокредит                   4284
кредит на образование        3995
кредит на свадьбу            2310
Name: purpose_group, dtype: int64
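One detail of the last branch deserves care: a membership test has to be written out for each keyword. An expression of the form `'a' or 'b' in x` is parsed as `'a' or ('b' in x)`, and a non-empty string is always truthy, so such a condition can never be false. The sketch below shows the difference on a hypothetical lemma list that contains neither housing keyword; `has_any` is an illustrative helper, not part of the notebook.

```python
def has_any(lemmas, keywords):
    """Return True if at least one keyword occurs among the lemmas."""
    return any(keyword in lemmas for keyword in keywords)


# Hypothetical lemmatized purpose without "жилье" or "недвижимость".
lemmas = ["ремонт", " ", "дача", "\n"]

always_true = bool("жилье" or "недвижимость" in lemmas)
explicit = "жилье" in lemmas or "недвижимость" in lemmas
with_helper = has_any(lemmas, ("жилье", "недвижимость"))

print(always_true, explicit, with_helper)
```

With the explicit form, a purpose that matches none of the keywords falls through to the 'иные цели' branch instead of being counted as a real estate loan.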

In [67]:
# The column containing information about the income of borrowers, for the convenience of drawing conclusions, we will divide into 4 categories, which will be approximately equal in their values.
def total_income_grouping(row):
    income = row['total_income']
    
    if income <= 110000:
        return 'до 110 тысяч'
    
    if 110000 < income <= 145000:
        return 'от 110 до 145 тысяч'
    
    if 145000 < income <= 200000:
        return 'от 145 до 200 тысяч'
    
    return 'свыше 200 тысяч'

data['income_group'] = data.apply(total_income_grouping, axis=1)
data['income_group'].value_counts()

до 110 тысяч           5611
от 110 до 145 тысяч    5466
от 145 до 200 тысяч    5235
свыше 200 тысяч        5041
Name: income_group, dtype: int64
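The same binning can be expressed with `pandas.cut`, which keeps the thresholds and labels in one place. This is a sketch on a hypothetical income series; by default the intervals are closed on the right, which matches the `<=` comparisons in the function above. If groups of exactly equal size were wanted, `pandas.qcut` would be the usual tool, though its boundaries would no longer be round numbers.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 110_000, 145_000, 200_000, np.inf]
labels = [
    "до 110 тысяч",
    "от 110 до 145 тысяч",
    "от 145 до 200 тысяч",
    "свыше 200 тысяч",
]

# Hypothetical incomes, including values exactly on the boundaries.
toy_income = pd.Series([90_000, 110_000, 110_001, 145_000, 200_000, 253_875])

# right=True (the default) gives intervals of the form (low, high].
income_group = pd.cut(toy_income, bins=bins, labels=labels)
print(income_group.tolist())
```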

In [68]:
# Information about the number of borrowers' children will be divided into three groups.
def children_amount_grouping(row):
    children = row['children']
    
    if children == 0:
        return 'бездетные'
    
    if children <= 2:
        return '1 - 2 ребенка'
    
    return 'многодетные'

data['children_group'] = data.apply(children_amount_grouping, axis=1)
data['children_group'].value_counts()

бездетные        14022
1 - 2 ребенка     6878
многодетные        453
Name: children_group, dtype: int64

#### Summary

For the convenience of preparing answers to the questions put forward by the customer, in this section we have categorized the data of interest about borrowers.

## Exploratory Data Analysis

### The relationship between having children and repayment of the loan on time

In [69]:
# A table with the data of interest.
required_data = data[['children_group', 'family_status', 'income_group', 'purpose_group', 'debt']]
required_data

Unnamed: 0,children_group,family_status,income_group,purpose_group,debt
0,1 - 2 ребенка,женат / замужем,свыше 200 тысяч,операции с недвижимостью,0
1,1 - 2 ребенка,женат / замужем,от 110 до 145 тысяч,автокредит,0
2,бездетные,женат / замужем,от 145 до 200 тысяч,операции с недвижимостью,0
3,многодетные,женат / замужем,свыше 200 тысяч,кредит на образование,0
4,бездетные,женат / замужем,от 110 до 145 тысяч,кредит на образование,0
...,...,...,...,...,...
21348,бездетные,гражданский брак,свыше 200 тысяч,операции с недвижимостью,0
21349,бездетные,женат / замужем,свыше 200 тысяч,операции с недвижимостью,0
21350,бездетные,гражданский брак,свыше 200 тысяч,кредит на свадьбу,0
21351,бездетные,Не женат / не замужем,до 110 тысяч,операции с недвижимостью,0


In [70]:
answer_1 = required_data.groupby('children_group').agg({'debt' : ['count', 'sum', 'mean']})
answer_1.columns = ['общее кол-во заёмщиков', 'из них, имевших долги', '%невозврата']
answer_1 = answer_1.sort_values(by='%невозврата', ascending=False)
answer_1 = answer_1.style.format({'%невозврата':'{:.2%}'})
answer_1

children_group,общее кол-во заёмщиков,"из них, имевших долги",%невозврата
1 - 2 ребенка,6878,636,9.25%
многодетные,453,39,8.61%
бездетные,14022,1058,7.55%


#### Summary

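Childless borrowers are about twice as numerous as borrowers with 1-2 children (14022 versus 6878) and account for more overdue loans in absolute terms. Even so, borrowers with 1-2 children miss repayment deadlines more often: 9.25% versus 7.55%, a gap of about 1.7 percentage points. The group with three or more children (8.61%) contains only 453 borrowers, so its rate should be read with caution.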
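A gap between two rates can be read either as a difference in percentage points or as a relative increase, and the two numbers are very different. The sketch below computes both from the counts in the table above; the helper name is illustrative.

```python
def default_rate(defaults, total):
    """Share of borrowers with overdue debt."""
    return defaults / total


with_children = default_rate(636, 6878)   # group "1 - 2 ребенка"
childless = default_rate(1058, 14022)     # group "бездетные"

# Absolute gap in percentage points.
gap_pp = (with_children - childless) * 100

# Relative increase of the rate, in percent of the childless rate.
relative_increase = (with_children / childless - 1) * 100

print(f"gap: {gap_pp:.2f} pp, relative increase: {relative_increase:.1f}%")
```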

### The relationship between marital status and repayment of the loan on time

In [71]:
answer_2 = required_data.groupby('family_status').agg({'debt' : ['count', 'sum', 'mean']})
answer_2.columns = ['общее кол-во заёмщиков', 'из них, имевших долги', '%невозврата']
answer_2 = answer_2.sort_values(by='%невозврата', ascending=False)
answer_2 = answer_2.style.format({'%невозврата':'{:.2%}'})
answer_2

family_status,общее кол-во заёмщиков,"из них, имевших долги",%невозврата
Не женат / не замужем,2794,273,9.77%
гражданский брак,4130,386,9.35%
женат / замужем,12290,927,7.54%
в разводе,1185,85,7.17%
вдовец / вдова,954,62,6.50%


#### Summary

Married borrowers form the largest group, outnumbering all other groups combined in both the number of loans and the number of overdue debts. Nevertheless, borrowers who have never married or who live in a civil marriage are about 2 percentage points more likely to miss repayment deadlines than married ones (9.77% and 9.35% versus 7.54%). Widowed borrowers are the most punctual payers in this data (6.50%).

### The relationship between the income level and the repayment of the loan on time

In [72]:
answer_3 = required_data.groupby('income_group').agg({'debt' : ['count', 'sum', 'mean']})
answer_3.columns = ['общее кол-во заёмщиков', 'из них, имевших долги', '%невозврата']
answer_3 = answer_3.sort_values(by='%невозврата', ascending=False)
answer_3 = answer_3.style.format({'%невозврата':'{:.2%}'})
answer_3

income_group,общее кол-во заёмщиков,"из них, имевших долги",%невозврата
от 110 до 145 тысяч,5466,480,8.78%
от 145 до 200 тысяч,5235,443,8.46%
до 110 тысяч,5611,453,8.07%
свыше 200 тысяч,5041,357,7.08%


#### Summary

The relationship between income level and on-time repayment is weak. The three groups earning up to 200 thousand have similar non-repayment rates, from 8.07% to 8.78%. Borrowers earning over 200 thousand are the most conscientious payers, with a non-repayment rate of 7.08%.

### The relationship between the purpose of the loan and the repayment of the loan on time

In [73]:
answer_4 = required_data.groupby('purpose_group').agg({'debt' : ['count', 'sum', 'mean']})
answer_4.columns = ['общее кол-во заёмщиков', 'из них, имевших долги', '%невозврата']
answer_4 = answer_4.sort_values(by='%невозврата', ascending=False)
answer_4 = answer_4.style.format({'%невозврата':'{:.2%}'})
answer_4

purpose_group,общее кол-во заёмщиков,"из них, имевших долги",%невозврата
автокредит,4284,400,9.34%
кредит на образование,3995,370,9.26%
кредит на свадьбу,2310,184,7.97%
операции с недвижимостью,10764,779,7.24%


#### Summary

The group of borrowers who took a loan for real estate transactions is about as large as all the other purpose groups combined (10764 versus 10589). It also has the lowest non-repayment rate, 7.24%, followed by wedding loans at 7.97%. Non-repayment on car loans and education loans is about 2 percentage points higher than for real estate, at 9.34% and 9.26% respectively.
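The four summary tables in this chapter are built with the same steps, so they could come from one helper. The sketch below is one possible shape for it, shown on hypothetical data; the function name and the English column names are assumptions, not the notebook's own. It returns a plain DataFrame with numeric rates, so the result can still be sorted or compared, and percentage formatting can be applied only when the table is displayed.

```python
import pandas as pd


def default_rate_table(frame, by, target="debt"):
    """Count borrowers, overdue debts and the non-repayment rate per group."""
    return (
        frame.groupby(by)[target]
        .agg(borrowers="count", defaults="sum", default_rate="mean")
        .sort_values("default_rate", ascending=False)
    )


# Hypothetical miniature of the prepared table.
toy = pd.DataFrame({
    "purpose_group": ["автокредит", "автокредит", "кредит на свадьбу",
                      "кредит на свадьбу", "кредит на свадьбу", "кредит на свадьбу"],
    "debt": [1, 0, 0, 0, 1, 0],
})

summary = default_rate_table(toy, "purpose_group")
print(summary)
```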

## General Conclusion and Recommendations

From the borrower data examined, we can conclude that:

1. The non-repayment rate in every group studied is relatively low and does not exceed 10%.
2. The most conscientious payers are borrowers with incomes over 200 thousand, widowed borrowers, childless borrowers, and those borrowing for real estate transactions.
3. The highest non-repayment rates are found among car loan borrowers, borrowers who are not officially married, borrowers with an income of 110 to 145 thousand, and borrowers with 1-2 children.
4. The differences in non-repayment rates between the studied groups are small: within each feature they span roughly 1.7 to 3.3 percentage points.
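The size of the differences can be read directly from the four tables of the previous chapter. The sketch below copies the reported rates (in percent) into a dictionary and computes, for each feature, the gap between its highest and lowest group; the variable names are illustrative.

```python
# Non-repayment rates in percent, copied from the tables above.
rates = {
    "children_group": {"1 - 2 ребенка": 9.25, "многодетные": 8.61, "бездетные": 7.55},
    "family_status": {
        "Не женат / не замужем": 9.77, "гражданский брак": 9.35,
        "женат / замужем": 7.54, "в разводе": 7.17, "вдовец / вдова": 6.50,
    },
    "income_group": {
        "от 110 до 145 тысяч": 8.78, "от 145 до 200 тысяч": 8.46,
        "до 110 тысяч": 8.07, "свыше 200 тысяч": 7.08,
    },
    "purpose_group": {
        "автокредит": 9.34, "кредит на образование": 9.26,
        "кредит на свадьбу": 7.97, "операции с недвижимостью": 7.24,
    },
}

# Gap between the worst and the best group of each feature, in percentage points.
spread_pp = {
    feature: round(max(groups.values()) - min(groups.values()), 2)
    for feature, groups in rates.items()
}

for feature, spread in spread_pp.items():
    print(f"{feature}: {spread} pp")
```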