## CREDIT SCORING PROJECT (Borrower Reliability Analysis)

**Research Objective**

To test two main hypotheses:

1. The client's family status affects the likelihood of repaying a loan on time.
2. The number of children in the client's family affects the likelihood of repaying a loan on time.

**Dataset Description (from documentation):**

children — number of children in the family  
days_employed — total work experience in days  
dob_years — client's age in years  
education — client's education level  
education_id — education level identifier  
family_status — family status  
family_status_id — family status identifier   
gender — client's gender  
income_type — type of employment  
debt — had any loan repayment delinquency  
total_income — monthly income  
purpose — purpose of the loan  


# Research Plan: Analysis of Factors Affecting On-Time Loan Repayment
(Credit Scoring Task / Delinquency Risk Classification)

1. Data Loading and Initial Exploration  


Open the data table
Study general information about the data (size, variable types, describe() statistics, info())
Visualize basic distributions (histograms, boxplots for numerical features)
Identify the target variable (fact of on-time loan repayment / delinquency)


2. Data Preprocessing  
2.1 Handling / Removing Missing Values  

Identify the proportion and location of missing values (isnull().sum(), heatmap)  
Remove rows/columns with critically high proportion of missing values (if >70–80%)  
Handle missing values in numerical variables (median / mean / KNN-imputer)  
Handle missing values in categorical variables (mode / separate category "unknown")  

2.2 Handling Outliers / Anomalous Values  

Detect outliers (boxplot, IQR method, z-score)  
Analyze the nature of anomalies (real extreme values or input errors?)  
Decision: removal / replacement with boundary values / winsorization  

2.3 Duplicate Detection and Handling  
  
Search for full row duplicates (duplicated())  
Search for duplicates by key fields (if client ID exists)  
Removal / retention with explanation  

2.4 Data Type Conversion  

Type casting (int → category for small sets of values, object → datetime)  
Correct encoding of binary and ordinal features  

2.5 Categorization / Binning of Features  

Transform continuous variables into categories (age, income, loan amount)
Create new features (age groups, income categories, debt burden, etc.)


3. Exploratory Data Analysis (EDA) and Answers to Key Questions  
3.1 Investigation of Relationships with the Target Variable (on-time loan repayment)  
3.1.1 Is there a relationship between the number of children and on-time loan repayment?  
3.1.2 Is there a relationship between family status and on-time loan repayment?  
3.1.3 Is there a relationship between income level and on-time loan repayment?  
3.1.4 How do different loan purposes affect on-time loan repayment?  
3.1.5 Provide possible reasons for the appearance of missing values in the original data  
3.1.6 Justify why filling missing values with the median is the best solution for numerical variables in this task  

(For each item: groupby aggregations, pivot_table / crosstab, visualizations — barplot, countplot, boxplot by groups, statistical tests if necessary)  

4. Overall Conclusion of the Study
Key factors influencing loan repayment
Strengths and weaknesses of the data

In [1]:
import pandas as pd

## 1. Data Loading and Initial Exploration

Open the data table
Study general information about the data (size, variable types, describe() statistics, info())
Visualize basic distributions (histograms, boxplot for numerical features)
Identify the target variable (fact of on-time loan repayment / delinquency)

In [2]:
import pandas as pd

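try:
    # Prefer the local copy of the dataset when it exists
    data = pd.read_csv('/datasets/data.csv')
except FileNotFoundError:
    # Fall back to the hosted copy
    data = pd.read_csv('https://code.s3.yandex.net/datasets/data.csv')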

In [3]:
# display the first 20 rows of the dataframe data on the screen
data.head(20)

,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


In [4]:
# display the main information about the dataframe using the info() method
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**Conclusion:**
21,525 records (RangeIndex: 0–21524)  
12 columns    

Data types:
int64: 5 columns (children, dob_years, education_id, family_status_id, debt)  
float64: 2 columns (days_employed, total_income)   
object: 5 columns (education, family_status, gender, income_type, purpose)  

Missing values are observed in only two columns:
days_employed: 19351 non-null, so 2174 missing (~10.1%)  
total_income: 19351 non-null, so 2174 missing (~10.1%, in the same rows)  
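
The share of missing values and the claim that both columns are missing in the same rows can be checked directly. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not the real dataset); the same two expressions could be applied to `data`.

```python
import pandas as pd

# Toy frame standing in for `data` (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    'days_employed': [-100.0, None, -250.0, None],
    'total_income': [50000.0, None, 72000.0, None],
    'debt': [0, 1, 0, 0],
})

# Share of missing values per column, in percent
missing_share = toy.isna().mean() * 100

# True when the two columns are missing in exactly the same rows
same_rows = toy['days_employed'].isna().equals(toy['total_income'].isna())

print(missing_share)
print(same_rows)
```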

## 2. Data Preprocessing

### 2.1 Handling / Removing Missing Values  

Identify the proportion and location of missing values (isnull().sum(), heatmap)  
Remove rows/columns with critically high proportion of missing values (if >70–80%)  
Handle missing values in numerical variables (median / mean / KNN-imputer)  
Handle missing values in categorical variables (mode / separate category "unknown")  

In [5]:
# display the number of missing values for each column. Use a combination of two methods
data.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

There are missing values in two columns. One of them is days_employed. The missing values in this column will be handled at the next stage. The other column with missing values is total_income — it stores income data. The amount of income is most strongly influenced by the type of employment, so the missing values in this column should be filled with the median value for each type from the income_type column. For example, for a person with employment type "сотрудник" (employee), the missing value in the total_income column should be filled with the median income among all records with the same type.

In [6]:
# the missing value in the total_income column should be filled with the median income among all records with the same type
for t in data['income_type'].unique():
    data.loc[(data['income_type'] == t) & (data['total_income'].isna()), 'total_income'] = \
    data.loc[(data['income_type'] == t), 'total_income'].median()
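
The loop above could also be written without explicit iteration, using `groupby(...).transform('median')` together with `fillna`. This is a sketch on a made-up frame (`toy` and its values are assumptions), not a replacement for the cell above.

```python
import pandas as pd

# Toy data: two employment types, one missing income in each
toy = pd.DataFrame({
    'income_type': ['сотрудник', 'сотрудник', 'сотрудник', 'пенсионер', 'пенсионер'],
    'total_income': [100.0, 300.0, None, 50.0, None],
})

# Median income of each row's own employment type (NaN values are skipped)
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill only the missing entries with their group median
toy['total_income'] = toy['total_income'].fillna(group_median)

print(toy)
```

One difference worth keeping in mind: if an employment type has no known incomes at all, its median is NaN and the gap stays unfilled in both versions.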

### 2.2 Handling Outliers / Anomalous Values

Detection of outliers (boxplot, IQR method, z-score)
Analysis of the nature of anomalies (real extreme values or input errors?)
Decision making: removal / replacement with boundary values / winsorization

Artifacts (anomalies) may occur in the data: values that do not reflect reality and appeared due to some error. One such artifact is a negative number of work days in the days_employed column. Artifacts like this are common in real-world data. We will process the values in this column by replacing all negative values with positive ones using the abs() method.

In [7]:
data['days_employed'] = data['days_employed'].abs()

In [8]:
# For each employment type, display the median work experience (days_employed) in days
data.groupby('income_type')['days_employed'].agg('median')

income_type
безработный        366413.652744
в декрете            3296.759962
госслужащий          2689.368353
компаньон            1547.382223
пенсионер          365213.306266
предприниматель       520.848083
сотрудник            1574.202821
студент               578.751554
Name: days_employed, dtype: float64

For two types (unemployed and retirees), the medians are anomalously large: about 365,000 days, which is roughly 1,000 years. It is difficult to correct such values, so we leave them as is. Moreover, this column is not needed for the study.

In [9]:
# Display the list of unique values in the children column
data['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5], dtype=int64)

There are two anomalous values in the children column: -1 and 20. We will remove the rows containing these values from the dataframe data.

In [10]:
data = data[(data['children'] != -1) & (data['children'] != 20)]

In [11]:
# Display the list of unique values in the children column to make sure the artifacts have been removed
data['children'].unique()

array([1, 0, 3, 2, 4, 5], dtype=int64)
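
Before dropping rows it can be useful to know how many are affected. A minimal sketch using `isin` on a made-up frame (`toy`, `anomalous`, and `n_removed` are hypothetical names):

```python
import pandas as pd

# Toy column containing the two anomalous values seen above
toy = pd.DataFrame({'children': [0, 1, -1, 2, 20, 0]})

anomalous = [-1, 20]

# Number of rows that will be dropped
n_removed = int(toy['children'].isin(anomalous).sum())

# Keep everything except the anomalous rows and rebuild the index
toy = toy[~toy['children'].isin(anomalous)].reset_index(drop=True)

print(n_removed, toy['children'].tolist())
```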

### 2.3 Removing / Handling Duplicates

- Search for full row duplicates (duplicated())  
- Search for duplicates by key fields (if client ID exists)  
- Removal / retention with explanation  


In [12]:
# Fill the missing values in the days_employed column with median values for each employment type income_type
for t in data['income_type'].unique():
    data.loc[(data['income_type'] == t) & (data['days_employed'].isna()), 'days_employed'] = \
    data.loc[(data['income_type'] == t), 'days_employed'].median()

In [13]:
# Make sure that all missing values have been filled.
# Once again, display the number of missing values for each column using two methods
data.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

### 2.4 Data Type Conversion

- Type casting (int → category for small sets of values, object → datetime)
- Correct encoding of binary and ordinal features

In [14]:
# Replace the float data type in the total_income column with an integer type using the astype() method
data['total_income'] = data['total_income'].astype(int)

### 2.5 Categorization / Binning of Features

- Transformation of continuous variables into categories (age, income, loan amount)
- Creation of new features (age groups, income categories, debt burden, etc.)

We will handle implicit duplicates in the education column.   
In this column, there are the same values but recorded differently: using uppercase and lowercase letters.  
We will convert them to lowercase. We will check the other columns

In [15]:
data['education'] = data['education'].str.lower()

In [16]:
# Display the number of duplicate rows in the data on the screen. If such rows are present, remove them
data.duplicated().sum()

71

In [17]:
data = data.drop_duplicates()
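
The reason for lowercasing before counting duplicates can be seen on a small example: rows that differ only in letter case are not reported by `duplicated()` until the text is normalized. The frame below is made up for illustration.

```python
import pandas as pd

# Two rows that differ only in the case of the education value
toy = pd.DataFrame({
    'education': ['высшее', 'ВЫСШЕЕ', 'среднее'],
    'dob_years': [42, 42, 36],
})

before = int(toy.duplicated().sum())   # case differences hide the duplicate

toy['education'] = toy['education'].str.lower()
after = int(toy.duplicated().sum())    # the duplicate is now visible

# Drop duplicates and rebuild a contiguous index
toy = toy.drop_duplicates().reset_index(drop=True)

print(before, after, len(toy))
```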

#### 2.5.1 Data Categorization
Based on the ranges specified below, we will create a column total_income_category in the dataframe data with the following categories:

0–30000 — 'E';  
30001–50000 — 'D';  
50001–200000 — 'C';  
200001–1000000 — 'B';  
1000001 and above — 'A'.  
For example, a borrower with an income of 25000 should be assigned category 'E', and a client receiving 235000 should be assigned category 'B'.  
We will use a custom function named categorize_income() and the apply() method.

In [18]:
def categorize_income(income):
    try:
        if 0 <= income <= 30000:
            return 'E'
        elif 30001 <= income <= 50000:
            return 'D'
        elif 50001 <= income <= 200000:
            return 'C'
        elif 200001 <= income <= 1000000:
            return 'B'
        elif income >= 1000001:
            return 'A'
    except TypeError:
        # Non-numeric input (e.g. a string): leave the category undefined
        return None

In [19]:
data['total_income_category'] = data['total_income'].apply(categorize_income)
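
As an alternative to a custom function, the same ranges can be expressed with `pd.cut`. This is a sketch that assumes integer incomes (as produced by the `astype(int)` step); `incomes`, `bins`, and `labels` are illustrative names.

```python
import pandas as pd

# A few sample incomes, including values on the category boundaries
incomes = pd.Series([25000, 30000, 30001, 235000, 1500000])

# Right-closed intervals: [0, 30000], (30000, 50000], ..., (1000000, inf)
bins = [0, 30000, 50000, 200000, 1000000, float('inf')]
labels = ['E', 'D', 'C', 'B', 'A']

categories = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)

print(categories.tolist())
```

With `pd.cut` the result is an ordered categorical column, and values outside the bins (for example, negative incomes) become NaN.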

In [34]:
# Display on the screen the list of unique loan purposes from the purpose column
data['purpose'].unique()

array(['покупка жилья', 'приобретение автомобиля',
       'дополнительное образование', 'сыграть свадьбу',
       'операции с жильем', 'образование', 'на проведение свадьбы',
       'покупка жилья для семьи', 'покупка недвижимости',
       'покупка коммерческой недвижимости', 'покупка жилой недвижимости',
       'строительство собственной недвижимости', 'недвижимость',
       'строительство недвижимости', 'на покупку подержанного автомобиля',
       'на покупку своего автомобиля',
       'операции с коммерческой недвижимостью',
       'строительство жилой недвижимости', 'жилье',
       'операции со своей недвижимостью', 'автомобили',
       'заняться образованием', 'сделка с подержанным автомобилем',
       'получение образования', 'автомобиль', 'свадьба',
       'получение дополнительного образования', 'покупка своего жилья',
       'операции с недвижимостью', 'получение высшего образования',
       'свой автомобиль', 'сделка с автомобилем',
       'профильное образование', 'высшее об

We will create a function that, based on the data from the purpose column, forms a new column purpose_category with the following categories:  
'операции с автомобилем' (car operations),  
'операции с недвижимостью' (real estate operations),  
'проведение свадьбы' (wedding ceremony),  
'получение образования' (obtaining education).  
For example, if the purpose column contains the substring 'на покупку автомобиля' (to buy a car), then the purpose_category column should contain the string 'операции с автомобилем'.  
We will use a custom function named categorize_purpose() and the apply() method. We will study the data in the purpose column and determine which substrings help identify the category correctly.

In [21]:
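def categorize_purpose(row):
    # `row` is a single value from the purpose column
    try:
        if 'автом' in row:
            return 'операции с автомобилем'
        elif 'жил' in row or 'недвиж' in row:
            return 'операции с недвижимостью'
        elif 'свад' in row:
            return 'проведение свадьбы'
        elif 'образов' in row:
            return 'получение образования'
    except TypeError:
        # The value is not a string (e.g. NaN)
        pass
    # No keyword matched, or the value could not be searched
    return 'нет категории'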

In [22]:
data['purpose_category'] = data['purpose'].apply(categorize_purpose)
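
The substring rules can also be kept in a dictionary, which makes them easier to extend. The sketch below is one possible variant; `PURPOSE_KEYWORDS` and `categorize_by_keywords` are hypothetical names, and the purpose 'ремонт' is a made-up value used only to show the fallback.

```python
# Word stems mapped to the category they indicate (checked in insertion order)
PURPOSE_KEYWORDS = {
    'автом': 'операции с автомобилем',
    'жил': 'операции с недвижимостью',
    'недвиж': 'операции с недвижимостью',
    'свад': 'проведение свадьбы',
    'образов': 'получение образования',
}


def categorize_by_keywords(purpose, keywords=PURPOSE_KEYWORDS, default='нет категории'):
    """Return the category of the first stem found in `purpose`, else `default`."""
    if not isinstance(purpose, str):
        return default
    for stem, category in keywords.items():
        if stem in purpose:
            return category
    return default


examples = ['покупка жилья', 'свой автомобиль', 'сыграть свадьбу',
            'получение высшего образования', 'ремонт']
result = [categorize_by_keywords(p) for p in examples]
print(result)
```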

In [35]:
# How many loans are there in each purpose category
print("Loan Purpose Distribution:")
display(data['purpose_category'].value_counts())

# In percentages, % (convenient for conclusions)
print("\nIn percentages, %:")
display(round(data['purpose_category'].value_counts(normalize=True) * 100, 2))

Loan Purpose Distribution:


purpose_category
операции с недвижимостью    10751
операции с автомобилем       4279
получение образования        3988
проведение свадьбы           2313
Name: count, dtype: int64


In percentages, %:


purpose_category
операции с недвижимостью    50.40
операции с автомобилем      20.06
получение образования       18.70
проведение свадьбы          10.84
Name: proportion, dtype: float64

## 3. Exploratory Data Analysis (EDA) and Answers to Key Questions

### 3.1 Investigation of Relationships with the Target Variable (on-time loan repayment)

- 3.1.1 Is there a relationship between the number of children and on-time loan repayment?
- 3.1.2 Is there a relationship between family status and on-time loan repayment?
- 3.1.3 Is there a relationship between income level and on-time loan repayment?
- 3.1.4 How do different loan purposes affect on-time loan repayment?
- 3.1.5 Provide possible reasons for the appearance of missing values in the original data
- 3.1.6 Justify why filling missing values with the median is the best solution for numerical variables in this task
(For each item: groupby aggregations, pivot_table / crosstab, visualizations — barplot, countplot, boxplot by groups, statistical tests if necessary)


#### 3.1.1 Is there a relationship between the number of children and on-time loan repayment?

In [24]:
# Categorize borrowers by the number of children
def categorize_children(children):
    if children == 0:
        return 'без детей'
    elif children == 1:
        return '1 ребенок'
    elif children == 2:
        return '2 детей'
    else:
        return 'многодетные'

In [25]:
data['children_category'] = data['children'].apply(categorize_children)

In [26]:
# Calculate the delinquency rate (share of clients with debt == 1, in %) for each children category
grouped_children = data.groupby('children_category')['debt'].mean() * 100
grouped_children

children_category
1 ребенок      9.234609
2 детей        9.454191
без детей      7.543822
многодетные    8.157895
Name: debt, dtype: float64

In [36]:
# Creating a pivot table
pivot_table_children = data.pivot_table(index='children_category', values='debt', aggfunc=['count', 'sum'])

# Rename the columns
pivot_table_children.columns = ['total_clients', 'debtors']

# Calculate the delinquency percentage
pivot_table_children['debt_percentage'] = (pivot_table_children['debtors'] / pivot_table_children['total_clients']) * 100
pivot_table_children

children_category,total_clients,debtors,debt_percentage
1 ребенок,4808,444,9.234609
2 детей,2052,194,9.454191
без детей,14091,1063,7.543822
многодетные,380,31,8.157895
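
The count / sum / percentage steps above can be wrapped in a small helper, which may be convenient when the same table is needed for other features. This is a sketch using named aggregation; `debt_summary` and the `toy` frame are hypothetical.

```python
import pandas as pd


def debt_summary(df, column):
    """Count clients and debtors per group and add the delinquency rate in %."""
    summary = df.groupby(column)['debt'].agg(total_clients='count', debtors='sum')
    summary['debt_percentage'] = summary['debtors'] / summary['total_clients'] * 100
    # Highest delinquency rate first
    return summary.sort_values('debt_percentage', ascending=False)


# Toy data: group 'a' has 1 debtor out of 4, group 'b' has 1 out of 2
toy = pd.DataFrame({
    'children_category': ['a', 'a', 'a', 'a', 'b', 'b'],
    'debt': [1, 0, 0, 0, 1, 0],
})

summary = debt_summary(toy, 'children_category')
print(summary)
```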


**Conclusion:**   
The delinquency rate for borrowers without children (7.54%) is lower than for borrowers with one child (9.23%) or two children (9.45%).
Borrowers with three or more children (8.16%) fall in between: they are delinquent more often than childless borrowers but less often than borrowers with one or two children, and this group contains only 380 clients. Overall, the rates differ by about two percentage points, and no strongly pronounced dependence of loan repayment on the number of children can be identified.  

After displaying the pivot table, it can be noted that the samples are not balanced: the number of borrowers with children (7,240) is significantly smaller than the number of childless borrowers (14,091).  

Taking this imbalance in the samples into account, the results should be interpreted carefully. Because the two largest groups contain thousands of clients, even a small difference in percentages may be statistically significant, while the estimate for the smallest group is much less certain.  

Thus, it can be preliminarily stated that having children is associated with a somewhat higher delinquency rate, but for more accurate and reliable conclusions, a deeper statistical analysis is necessary, possibly taking into account other variables, to confirm or refute this assumption.
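
One simple form such a statistical check could take is a two-proportion z-test on the counts from the pivot table, comparing all borrowers with children against childless borrowers. This is a sketch using only the standard library; the choice of test and the function name `two_proportion_z_test` are assumptions, not part of the original analysis.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the pivot table above:
# with children (1, 2, and 3+ combined) vs. without children
debtors_with = 444 + 194 + 31
clients_with = 4808 + 2052 + 380
debtors_without, clients_without = 1063, 14091

z, p_value = two_proportion_z_test(debtors_with, clients_with,
                                   debtors_without, clients_without)
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

A small p-value here would only say that the difference is unlikely to be due to chance; it would not show that children cause delinquency, since other variables are not controlled for.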

#### 3.1.2 Is there a relationship between family status and on-time loan repayment?

In [28]:
# Calculate the delinquency rate (in %) for each family status category
grouped_family_status = data.groupby('family_status')['debt'].mean() * 100
grouped_family_status 

family_status
Не женат / не замужем    9.763948
в разводе                7.064760
вдовец / вдова           6.624606
гражданский брак         9.313014
женат / замужем          7.560558
Name: debt, dtype: float64

In [29]:
# Creating a pivot table
pivot_table_family_status = data.pivot_table(index='family_status', values='debt', aggfunc=['count', 'sum'])

# Rename the columns 
pivot_table_family_status.columns = ['total_clients', 'debtors']

# Calculate the delinquency percentage
pivot_table_family_status['debt_percentage'] = (pivot_table_family_status['debtors'] / pivot_table_family_status['total_clients']) * 100
pivot_table_family_status

family_status,total_clients,debtors,debt_percentage
Не женат / не замужем,2796,273,9.763948
в разводе,1189,84,7.06476
вдовец / вдова,951,63,6.624606
гражданский брак,4134,385,9.313014
женат / замужем,12261,927,7.560558


**Conclusion:**  
The result shows the delinquency rate for each family status category. If a category's rate is noticeably higher or lower than the dataset average (about 8.1%), this may indicate a relationship between family status and on-time loan repayment. In the 'вдовец / вдова' (widowed) category the delinquency rate is the lowest (6.62%), almost 1.5 times lower than in the 'Не женат / не замужем' (single, 9.76%) and 'гражданский брак' (civil marriage, 9.31%) categories, where it is the highest. It can be concluded that there is some relationship between family status and on-time loan repayment.  

After displaying the pivot table, it can be noted that the samples are not balanced: the total number of borrowers who are not in an official marriage, widowed, and divorced is smaller than the number of borrowers in the 'married' category.  

Taking this imbalance in the samples into account, the results should be interpreted carefully. Despite the small difference in percentages, it may have statistical significance due to the large overall number of borrowers who are in an official marriage.   

Thus, it can be preliminarily stated that borrowers who are or have been married appear more reliable.

#### 3.1.3 Is there a relationship between income level and on-time loan repayment?

In [37]:
# Income levels are already categorized in section 2.5
# Let's calculate the delinquency rate (in %) within each category
grouped_income = data.groupby('total_income_category')['debt'].mean() * 100
grouped_income

total_income_category
A    8.000000
B    7.060231
C    8.498210
D    6.017192
E    9.090909
Name: debt, dtype: float64

In [31]:
# Creating a pivot table
pivot_table_income = data.pivot_table(index='total_income_category', values='debt', aggfunc=['count', 'sum'])
# Rename the columns
pivot_table_income.columns = ['total_clients', 'debtors']
# Adding a column with the delinquency rate
pivot_table_income['debt_percentage'] = (pivot_table_income['debtors'] / pivot_table_income['total_clients']) * 100
# Sorting the table by delinquency rate in descending order
pivot_table_income = pivot_table_income.sort_values(by='debt_percentage', ascending=False)
pivot_table_income

total_income_category,total_clients,debtors,debt_percentage
E,22,2,9.090909
C,15921,1353,8.49821
A,25,2,8.0
B,5014,354,7.060231
D,349,21,6.017192


**Conclusion:**
Based on the delinquency rates by income level (the share of clients in each category who had a delinquency), the following conclusions can be drawn:

- Category 'A' (income 1,000,001 and above): the delinquency rate is 8.00%, but the category contains only 25 clients, so the estimate is unreliable.

- Category 'B' (income from 200,001 to 1,000,000): the delinquency rate is 7.06% across 5,014 clients. This is the lower of the two rates among the large categories, which may indicate that people with high incomes are generally quite responsible when it comes to repaying loans.

- Category 'C' (income from 50,001 to 200,000): the delinquency rate is 8.50% across 15,921 clients. This is the largest category, and its rate is slightly above the overall average of about 8.1%.

- Category 'D' (income from 30,001 to 50,000): the delinquency rate is 6.02%. This is the lowest rate among all categories, meaning these borrowers were delinquent least often. The category contains only 349 clients, so the result should be treated with caution.

- Category 'E' (income up to 30,000): the delinquency rate is 9.09%. This is the highest rate, but it is based on only 22 clients, 2 of whom were delinquent.  

Thus, we can conclude that there is some relationship between income level and on-time loan repayment. Comparing the two categories with enough data, borrowers in category B are delinquent less often than those in category C (7.06% vs 8.50%). The rates do not change monotonically with income, and categories A, D, and E are too small to support firm conclusions.
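
How much the small categories limit the conclusions can be illustrated with a confidence interval for each rate. The sketch below computes a 95% Wilson score interval from the counts in the pivot table; the choice of the Wilson interval and the function name `wilson_interval` are assumptions made for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts from the pivot table: category E (22 clients) and category C (15,921 clients)
low_e, high_e = wilson_interval(2, 22)
low_c, high_c = wilson_interval(1353, 15921)

print(f"E: {low_e:.1%} to {high_e:.1%}")
print(f"C: {low_c:.1%} to {high_c:.1%}")
```

The interval for category E spans tens of percentage points, while the one for category C is under one percentage point wide, which is why the small categories should not drive the conclusion.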

#### 3.1.4 How do different loan purposes affect on-time loan repayment?

In [32]:
# Let's calculate the delinquency rate (in %) for each loan purpose category
purpose_debt_percentage = data.groupby('purpose_category')['debt'].mean() * 100
purpose_debt_percentage

purpose_category
операции с автомобилем      9.347978
операции с недвижимостью    7.255139
получение образования       9.252758
проведение свадьбы          7.911803
Name: debt, dtype: float64

In [33]:
# Creating a pivot table
pivot_table_purpose_category = data.pivot_table(index='purpose_category', values='debt', aggfunc=['count', 'sum'])

# Rename the columns 
pivot_table_purpose_category.columns = ['total_clients', 'debtors']

# Calculate the delinquency percentage
pivot_table_purpose_category['debt_percentage'] = (pivot_table_purpose_category['debtors'] / pivot_table_purpose_category['total_clients']) * 100
pivot_table_purpose_category

purpose_category,total_clients,debtors,debt_percentage
операции с автомобилем,4279,400,9.347978
операции с недвижимостью,10751,780,7.255139
получение образования,3988,369,9.252758
проведение свадьбы,2313,183,7.911803


**Conclusion:**

- Car-related transactions: the delinquency rate for clients taking out a loan for car-related purposes is 9.35%. This means that approximately 90.65% of clients repay their loans on time. This is the highest delinquency rate among the four purposes, making this category the least reliable.

- Real estate transactions: clients taking out loans for real estate purposes have a delinquency rate of 7.26%. This means that 92.74% of clients repay their loans on time, the highest share among the four categories.

- Education: the delinquency rate for clients taking out loans for education is 9.25%. The majority of clients (90.75%) repay their loans on time in this category.

- Wedding expenses: clients taking out loans for wedding expenses have a delinquency rate of 7.91%. This means that 92.09% of clients repay their loans on time.  

Based on these data, it can be said that clients taking out loans for real estate transactions and wedding expenses tend to be more reliable borrowers than those borrowing for education or car-related purposes. However, the difference between loan purpose categories is small, about two percentage points.

#### 3.1.5 Provide possible reasons for the appearance of missing values in the original data

Reasons for missing values in the data:

- Human factor: data entry errors when information is entered into the system.

- Lack of information: some questionnaire data may be missing because the client did not respond or because the field was not mandatory (for example, a client may not want to disclose their income or marital status due to personal beliefs or privacy concerns).

- Technical issues: system failures or errors during data extraction from the database.

- Data transmission errors: errors may occur when transferring data between systems or during manual data entry, leading to missing values.

#### 3.1.6 Justify why filling missing values with the median is the best solution for numerical variables in this task

Filling missing values with the median is one of the common methods for handling missing data in quantitative variables. This approach is often used for several reasons:

1. Preserving central tendency: the median is a measure of central tendency that is not sensitive to outliers. If the data contain extreme values (outliers), the mean may be distorted, whereas the median is more robust.

2. Suitable for skewed distributions: if the data distribution is skewed, the median can be a more representative measure of central tendency than the mean.

3. Simplicity and efficiency: replacing missing values with the median is easy to implement and does not require complex calculations. It is also an effective way to handle large datasets.
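
The robustness argument in point 1 can be seen on a small made-up sample: a single extreme income shifts the mean far away from the typical value, while the median stays put. The numbers below are illustrative only.

```python
import statistics

# Four typical incomes and one extreme value
incomes = [100_000, 110_000, 120_000, 130_000, 5_000_000]

mean_income = statistics.mean(incomes)      # pulled up by the outlier
median_income = statistics.median(incomes)  # stays at the typical level

print(mean_income, median_income)
```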


## 4. Overall Conclusion

Nothing was known in advance about the quality of the data, so a data overview was carried out before testing the hypotheses.

The examination of the general information about the dataset provided by the bank showed that the number of values differs across columns, and there are missing values in the total_income and days_employed columns. Since these are quantitative variables, the missing values were filled with the median. This decision makes it possible to retain all rows in the dataset.

Anomalous values that could affect the analysis were processed, such as implausible numbers of children (-1 and 20) and negative employment duration. Implicit duplicates were handled, and the list of unique loan purposes was categorized.

This preliminary data preparation resulted in a cleaner and more structured dataset for analysis.

With the prepared data, a more detailed analysis can be conducted and various hypotheses can be tested to identify important relationships and patterns in the data.

As a result of the analysis of the provided data on clients’ creditworthiness, the following conclusions can be drawn:

- Children and loan repayment:
Borrowers with children have a slightly higher delinquency rate (about 9.2% versus 7.5% for childless borrowers). However, this difference is small and does not allow for definitive conclusions about the relationship between loan repayment and having children.

- Marital status and loan repayment:
Borrowers who are or have been officially married (married, divorced, or widowed) are delinquent less often (6.6% to 7.6%) than single borrowers or those in a civil partnership (9.3% to 9.8%).

- Income level and loan repayment:
Among the two large income categories, borrowers with incomes from 200,001 to 1,000,000 are delinquent less often (7.06%) than those with incomes from 50,001 to 200,000 (8.50%). The group with incomes between 30,001 and 50,000 shows the lowest delinquency rate (6.02%), but it contains only 349 clients, so this result is less certain.

- Loan purpose and repayment:
People taking out loans for real estate or weddings are delinquent less often (7.3% and 7.9%) than those borrowing for education or car-related purposes (about 9.3% each).  

Based on these findings, the bank can adjust its lending strategies. For example, it may pay more attention to clients with children, offer more flexible terms to borrowers with certain income levels, and take loan purpose into account when determining interest rates and repayment periods. It is also important to continue monitoring and analyzing the data to identify new patterns and trends in borrower behavior and to draw employees’ attention to the presence of missing values. Missing data can introduce bias into the analysis. Whenever possible, missing values should be replaced with the most accurate information available to ensure more precise analysis.