## Research of Borrowers' Reliability _(part 2)_

## 3. Opening the pre-processed dataset

In [2]:
import pandas as pd
data_cleaned = pd.read_csv('/Users/yuliabezginova/PycharmProjects/project-1_bank-credit-scoring/data_cleaned.csv')
display(data_cleaned.head(10))

Unnamed: 0.1,Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,total_income_category,purpose_category
0,0,1,8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875,покупка жилья,B,операции с недвижимостью
1,1,1,4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080,приобретение автомобиля,C,операции с автомобилем
2,2,0,5623.42261,33,среднее,1,женат / замужем,0,M,сотрудник,0,145885,покупка жилья,C,операции с недвижимостью
3,3,3,4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628,дополнительное образование,B,получение образования
4,4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616,сыграть свадьбу,C,проведение свадьбы
5,5,0,926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763,покупка жилья,B,операции с недвижимостью
6,6,0,2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525,операции с жильем,B,операции с недвижимостью
7,7,0,152.779569,50,среднее,1,женат / замужем,0,M,сотрудник,0,135823,образование,C,получение образования
8,8,2,6929.865299,35,высшее,0,гражданский брак,1,F,сотрудник,0,95856,на проведение свадьбы,C,проведение свадьбы
9,9,0,2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425,покупка жилья для семьи,C,операции с недвижимостью
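
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like row indices written to the CSV on earlier saves; this is an assumption based on their names and values, not something the project states. A minimal sketch of dropping them after loading, using a small in-memory CSV (illustrative rows) in place of the real file:

```python
import io
import pandas as pd

# Tiny stand-in for data_cleaned.csv (illustrative rows, not the real file).
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,children,debt\n"
    "0,0,1,0\n"
    "1,1,1,0\n"
    "2,2,0,1\n"
)

sample = pd.read_csv(io.StringIO(csv_text))

# Drop every column whose name starts with 'Unnamed' (leftover saved indices).
unnamed_cols = [col for col in sample.columns if col.startswith("Unnamed")]
sample = sample.drop(columns=unnamed_cols)

print(list(sample.columns))
```

Saving with `to_csv(..., index=False)` avoids creating such columns in the first place.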


## 4. Exploring the pre-processed data and investigating the research questions (RQ)

### RQ 1 - Is there a relationship between _the number of children_ in the family and loan payment on time?

The Pearson correlation is not informative in this case (or in any of the four RQs below), since one of the variables, 'debt', is categorical and takes only the values 0 and 1. Instead, we compare the share of borrowers with debt across groups.
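
Because `debt` takes only the values 0 and 1, one alternative to a correlation coefficient is a contingency table of within-group shares. The sketch below uses hypothetical toy data (not the project dataset) and `pd.crosstab` with `normalize="index"`; it is one possible way to look at the relationship, not necessarily the approach used in the original project.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use data_cleaned.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 1, 1],
    "debt":     [0, 0, 0, 1, 0, 0, 1, 1],
})

# Each row of the table sums to 1: the shares of debt=0 and debt=1 within a group.
shares = pd.crosstab(toy["children"], toy["debt"], normalize="index")
print(shares)
```

The column labelled `1` holds the share of borrowers with debt in each group, which is the same quantity as `sum / count` for a 0/1 column.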

**1) Let's check which values the variable _'children'_ takes in the dataset.**

In [5]:
print(data_cleaned['children'].unique())

[1 0 3 2 4 5]


**2) Let's investigate the relationship between the number of children in the family and on-time loan payment.**

In [6]:
children_grouped = data_cleaned.groupby('children').agg({'debt': ['count', 'sum']})
children_debt_quality = children_grouped['debt']['sum'] / children_grouped['debt']['count']
children_debt_quality

children
0    0.075438
1    0.092346
2    0.094542
3    0.081818
4    0.097561
5    0.000000
dtype: float64

### Conclusion:
- the most _unreliable_ borrowers, with the **highest** debt share (0.098), are large families with 4 children;
- the most _reliable_ borrowers, with the **lowest** debt share (0.075), are borrowers _without children_.

Interestingly, for families with 5 children the debt share is zero.

However, given that families with 4 children have the highest share of payment delays, it would be wrong to conclude that families with 5 children are more reliable borrowers than those with 4. Most likely the reason lies in the structure of the dataset: it contains borrowers with 5 children, but none of them has a recorded debt, probably because this group is too small to be representative.
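
One way to check whether a zero share is simply a small-sample effect is to show the group size next to the debt share. The sketch below uses hypothetical toy data, and `MIN_GROUP_SIZE` is an arbitrary illustrative threshold rather than a value from the project.

```python
import pandas as pd

# Hypothetical toy data: the group with 5 children has only two borrowers.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 5, 5],
    "debt":     [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

summary = toy.groupby("children")["debt"].agg(
    borrowers="count",   # group size
    debtors="sum",       # borrowers with debt == 1
    debt_share="mean",   # mean of a 0/1 column equals sum / count
)

# MIN_GROUP_SIZE is an assumed cut-off for illustration only.
MIN_GROUP_SIZE = 4
summary["too_small"] = summary["borrowers"] < MIN_GROUP_SIZE
print(summary)
```

Running the same aggregation on `data_cleaned` would show how many borrowers stand behind each share before any conclusion is drawn from it.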

### RQ 2 - Is there a relationship between _marital status_ and loan repayment on time?

**1) Let's check which values the variable _'family_status'_ takes in the dataset.**

In [8]:
print(data_cleaned['family_status'].unique())

['женат / замужем' 'гражданский брак' 'вдовец / вдова' 'в разводе'
 'Не женат / не замужем']


**2) Let's investigate the relationship between _marital status_ and on-time loan payment.**

In [9]:
family_grouped = data_cleaned.groupby('family_status').agg({'debt': ['count', 'sum']})
family_debt_quality = family_grouped['debt']['sum'] / family_grouped['debt']['count']
family_debt_quality

family_status
Не женат / не замужем    0.097639
в разводе                0.070648
вдовец / вдова           0.066246
гражданский брак         0.093130
женат / замужем          0.075606
dtype: float64

### Conclusion:

- Single borrowers (_family_status = 'Не женат / не замужем'_, i.e. single / unmarried) have the highest share of overdue loans (0.098). This is unexpected, because the conclusion above was that borrowers WITHOUT children repay their loans most regularly. The conclusion can now be extended: from a bank scoring point of view, not having children is not enough to be a reliable borrower; one must also not be single (i.e. be married);
- the lowest share of overdue loans (0.066) was found among widowed borrowers (_family_status = 'вдовец / вдова'_, i.e. widower / widow).

### RQ 3 - Is there any relationship between income level and loan repayment on time?

**The clean data contains the following income level categories (in RUB per month):**
- 0–30000: 'E';
- 30001–50000: 'D';
- 50001–200000: 'C';
- 200001–1000000: 'B';
- 1000001 and more: 'A'.
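
The dataset already contains a `total_income_category` column, and this part of the project does not show how it was built. As a hedged sketch, the thresholds listed above could be applied with `pd.cut`; the `bins`, `labels` and `incomes` names are illustrative.

```python
import numpy as np
import pandas as pd

# Bin edges follow the category list above; intervals are closed on the right,
# so 30000 falls into 'E' and 30001 into 'D'.
bins = [0, 30000, 50000, 200000, 1000000, np.inf]
labels = ["E", "D", "C", "B", "A"]

# Boundary values chosen to exercise every threshold.
incomes = pd.Series([0, 30000, 30001, 50000, 50001, 200000, 200001, 1000000, 1000001])

# include_lowest=True keeps an income of exactly 0 inside the first bin.
categories = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)
print(categories.tolist())
```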

***Below we determine the share of payment delays for each income category:***

In [11]:
total_income_grouped = data_cleaned.groupby('total_income_category').agg({'debt':['sum','count']})
total_income_quality = total_income_grouped['debt']['sum']/total_income_grouped['debt']['count']
total_income_quality

total_income_category
A    0.080000
B    0.070602
C    0.084982
D    0.060172
E    0.090909
dtype: float64

**Conclusions:**

- the highest share of payment delays (0.091) is among borrowers from category 'E' with an income of 0–30,000 rubles;
- next come borrowers from category 'C' with an income of 50,001–200,000 rubles (share of delays = 0.085) and category 'A' with an income above 1,000,000 rubles (share of delays = 0.08);
- finally, the most reliable borrowers fall into category 'D' with an income of 30,001–50,000 rubles (share of delays = 0.06), followed by category 'B' with an income of 200,001–1,000,000 rubles (0.071).

It turns out that borrowers with an income of 30,001–50,000 rubles repay their loans most regularly. Given that Rosstat reported an average salary in Russia of about 54,000 rubles for Q4 2021, the most reliable borrowers are those who earn below the national average. This conclusion is difficult to generalize to the whole of Russia; a separate study for each region of the Russian Federation is recommended.

### RQ 4 - How do different purposes of a loan affect its repayment on time?

In [12]:
purpose_grouped = data_cleaned.groupby('purpose_category').agg({'debt':['sum','count']})
purpose_debt = purpose_grouped['debt']['sum']/purpose_grouped['debt']['count']
print(purpose_debt.sort_values(ascending=False))

purpose_category
операции с автомобилем      0.093480
получение образования       0.092528
проведение свадьбы          0.079118
операции с недвижимостью    0.072551
dtype: float64


**Conclusions:**

- the most unreliable borrowers are those who take loans for car transactions (delay rate = 0.0935) and for education (delay rate = 0.0925);
- next come borrowers who take loans for weddings (delay rate = 0.079);
- the most reliable borrowers are those who take out loans for real estate transactions (delay rate = 0.073).

### _Notes_

_**Possible reasons for missing values in the dataset are:** manual data entry errors; bias when customers or bank employees fill out the loan application form; a technical error._

_**Why is filling missing values with the median the best solution for quantitative variables?** The median is robust to outliers: if an observation with an abnormal value appears, the median barely changes, so the filled-in values do not distort the results._
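
A small illustration of this point, using hypothetical income values with one abnormal observation and one missing value (none of these numbers come from the dataset):

```python
import pandas as pd

# Hypothetical incomes: four typical values, one outlier, one missing value.
income = pd.Series([40000, 45000, 50000, 55000, 5000000, None])

mean_value = income.mean()      # pulled far up by the outlier
median_value = income.median()  # stays at a typical value

# Fill the gap with the median, as described in the note above.
filled = income.fillna(median_value)
print(mean_value, median_value)
```

Here the mean is 1,038,000 while the median is 50,000, so a gap filled with the mean would look like a very high income, whereas the median keeps it in the typical range.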

## 5. General conclusions

### 5.1  The most reliable borrowers (with the lowest rate of deferrals in loan repayment) for banks are:
- those who take a loan for real estate transactions (rather than for buying a car, getting an education or celebrating a wedding);
- people without children who are or have been married, preferably widowed;
- with an income level below the average in Russia (30,001–50,000 RUB per month).

### 5.2  The most unreliable borrowers (with the highest rate of deferrals in loan repayment) for banks are:
- those who want to take a loan for purchasing a car or getting an education;
- single, unmarried people;
- families with 4 children; note that families with 2 children and with 1 child also have delays and rank second and third by share of delays (0.095 and 0.092), right after families with 4 children;
- with an income level below 30,000 rubles (income category 'E').

_Questions or comments regarding the project and the data analysis are more than welcome by email at **ybezginova2021@gmail.com** or on Telegram at ***@ybezginova***_

_Best wishes,_
_Yulia_