In [None]:
# Problem statement: predict whether a client will repay a loan or not

Home Credit Group, a leading consumer finance provider, is facing a significant challenge in assessing the creditworthiness of loan applicants. The company currently relies on traditional credit scoring methods, which have limitations in predicting the default risk of individuals with limited or no credit history. This results in a high number of applications being rejected, even for individuals who may be capable of repaying the loan.

To address this challenge, Home Credit Group is seeking innovative solutions to predict the default risk of loan applicants with greater accuracy. By developing a more robust prediction model, Home Credit Group can improve its risk assessment process, reduce the number of rejected applications, and expand its lending opportunities.

Data Description

The Home Credit Default Risk dataset consists of a comprehensive collection of information on over 300,000 loan applicants. The dataset includes a wide range of features, including:

Demographic information: Age, gender, education level, marital status, dependents, employment status, and income level
Credit history: Credit bureau inquiries, credit card balances, loan balances, and payment history
Application details: Loan amount, loan purpose, and interest rate
Behavioral information: Online activity, mobile usage, and social media engagement

The target variable in the dataset is whether or not the loan applicant defaulted on the loan. This information can be used to train machine learning models to predict the default risk of new applicants.

In [1]:
import numpy as np
import pandas as pd

In [10]:
df=pd.read_csv(r"/content/application_train.csv")

In [11]:
df

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0.0,0.0,0.0,0.0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15518,118111,0,Cash loans,F,Y,Y,2,54000.0,62554.5,7551.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
15519,118112,0,Cash loans,F,N,Y,1,202500.0,1546020.0,45333.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
15520,118113,0,Cash loans,F,N,Y,1,360000.0,229230.0,18238.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
15521,118114,0,Cash loans,F,N,Y,0,135000.0,590337.0,25141.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


AMT_REQ_CREDIT_BUREAU_HOUR: Inquiries made within the hour before the loan application
AMT_REQ_CREDIT_BUREAU_DAY: Inquiries made on the day of the loan application, excluding the last hour
AMT_REQ_CREDIT_BUREAU_WEEK: Inquiries made within the week before the loan application, excluding the last day
AMT_REQ_CREDIT_BUREAU_MON: Inquiries made within the month before the loan application, excluding the last week
AMT_REQ_CREDIT_BUREAU_QRT: Inquiries made within the quarter before the loan application, excluding the last month

In [12]:
df['AMT_REQ_CREDIT_BUREAU_HOUR'].value_counts()
# The column is a count of inquiries:
# 0: no inquiries to the Credit Bureau within the hour before the application
# 1: one inquiry to the Credit Bureau within the hour before the application
# 2: two inquiries to the Credit Bureau within the hour before the application

0.0    13345
1.0       90
2.0        3
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: int64

A higher number of recent inquiries about a particular client may indicate greater financial instability.
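One way to check whether inquiry counts relate to the target is to compare the mean of TARGET across inquiry counts. The following is a minimal sketch on a small hypothetical frame (the values are made up for illustration, not taken from the dataset); in the notebook the same groupby could be run on df.

```python
import pandas as pd

# Hypothetical toy data; in the notebook these two columns come from df.
toy = pd.DataFrame({
    "AMT_REQ_CREDIT_BUREAU_HOUR": [0.0, 0.0, 1.0, 1.0, 2.0, None],
    "TARGET": [0, 0, 0, 1, 1, 0],
})

# Mean of TARGET for each inquiry count.
# groupby drops rows whose key is missing (NaN) by default.
rate_by_inquiries = toy.groupby("AMT_REQ_CREDIT_BUREAU_HOUR")["TARGET"].mean()
print(rate_by_inquiries)
```

With only a handful of rows at counts of 1 and 2 in the real data, such rates would be noisy, so they are better read as a hint than as evidence.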

In [20]:
df['FLAG_OWN_REALTY'].value_counts()  # Y: client owns property, N: client does not own property

Y    10827
N     4696
Name: FLAG_OWN_REALTY, dtype: int64

In [13]:
df.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)

In [15]:
df['TARGET']  # 1: client had payment difficulties (late payments), 0: all other cases

0        1
1        0
2        0
3        0
4        0
        ..
15518    0
15519    0
15520    0
15521    0
15522    0
Name: TARGET, Length: 15523, dtype: int64
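Because TARGET is a 0/1 label, its class balance is worth checking before any modelling. A minimal sketch, using a short hypothetical series in place of df['TARGET']:

```python
import pandas as pd

# Hypothetical stand-in for df['TARGET']; the values are made up for illustration.
target = pd.Series([1, 0, 0, 0, 0, 0, 0, 1, 0, 0], name="TARGET")

# Share of each class; normalize=True returns fractions instead of counts.
class_share = target.value_counts(normalize=True)

# For a 0/1 column, the mean equals the share of rows with TARGET == 1.
positive_rate = target.mean()

print(class_share)
print(f"Share of TARGET == 1: {positive_rate:.2f}")
```

If one class turns out to be rare, plain accuracy becomes a misleading metric, which is one reason to look at this share early.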

FLAG_DOCUMENT_*: Each flag records whether the client provided a particular document (1) or not (0). A client who did not provide a document may pose a higher risk for the lender.

In [19]:
df['FLAG_DOCUMENT_20'].value_counts()  # 1: client provided document 20, 0: client did not provide it

0.0    15514
1.0        8
Name: FLAG_DOCUMENT_20, dtype: int64
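Since each individual document flag is almost always 0, it may be more informative to count how many documents a client provided in total. A minimal sketch, assuming the flag columns share the FLAG_DOCUMENT_ prefix as in the column listing; the toy rows are hypothetical:

```python
import pandas as pd

# Hypothetical toy rows; only three of the document flags are included.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "FLAG_DOCUMENT_18": [0.0, 1.0, 0.0],
    "FLAG_DOCUMENT_19": [0.0, 0.0, 0.0],
    "FLAG_DOCUMENT_20": [0.0, 1.0, None],
})

# Select every document flag by its name prefix.
doc_cols = [c for c in toy.columns if c.startswith("FLAG_DOCUMENT_")]

# Row-wise sum of the flags; missing values are skipped by default.
docs_provided = toy[doc_cols].sum(axis=1)
print(docs_provided)
```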

AMT_INCOME_TOTAL: The income of the loan applicant, as reported in the application.

AMT_CREDIT: The credit amount of the loan that the applicant is applying for.

AMT_ANNUITY: The loan annuity, that is, the regular repayment amount for the loan that the applicant is applying for. This value depends on the loan amount, interest rate, and loan term.
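These three amounts are easier to compare across applicants when expressed relative to income. A minimal sketch using the first three rows of the df preview above; the derived column names CREDIT_TO_INCOME and ANNUITY_TO_INCOME are hypothetical and not part of the dataset:

```python
import pandas as pd

# First three rows of the df preview (rows 0, 1, 2).
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0],
    "AMT_ANNUITY": [24700.5, 35698.5, 6750.0],
})

# Hypothetical ratio features: credit and annuity relative to income.
toy["CREDIT_TO_INCOME"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]
toy["ANNUITY_TO_INCOME"] = toy["AMT_ANNUITY"] / toy["AMT_INCOME_TOTAL"]

print(toy[["CREDIT_TO_INCOME", "ANNUITY_TO_INCOME"]])
```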

In [22]:
list(df.columns)

['SK_ID_CURR',
 'TARGET',
 'NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'OCCUPATION_TYPE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'WEEKDAY_APPR_PROCESS_START',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'ORGANIZATION_TYPE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_A

In [23]:
df["AMT_GOODS_PRICE"]  # For consumer loans, the price of the goods for which the loan is given.
# Lower amount: less money needed
# Higher amount: more money needed, for example for an expensive purchase

0         351000.0
1        1129500.0
2         135000.0
3         297000.0
4         513000.0
           ...    
15518      54000.0
15519    1350000.0
15520     202500.0
15521     477000.0
15522     270000.0
Name: AMT_GOODS_PRICE, Length: 15523, dtype: float64

In [26]:
df["NAME_TYPE_SUITE"].value_counts()  # Who accompanied the client when applying; Unaccompanied means the client applied alone

Unaccompanied      12559
Family              2032
Spouse, partner      564
Children             173
Other_B               79
Other_A               40
Group of people       13
Name: NAME_TYPE_SUITE, dtype: int64

In [28]:
df['NAME_INCOME_TYPE'].value_counts()

Working                 8133
Commercial associate    3593
Pensioner               2752
State servant           1042
Unemployed                 2
Student                    1
Name: NAME_INCOME_TYPE, dtype: int64

In [30]:
df['NAME_HOUSING_TYPE'].value_counts()

House / apartment      13762
With parents             746
Municipal apartment      577
Rented apartment         246
Office apartment         126
Co-op apartment           66
Name: NAME_HOUSING_TYPE, dtype: int64

In [32]:
df['REGION_POPULATION_RELATIVE']
# Normalized population of the region where the applicant lives.
# Higher values: the applicant lives in a more densely populated region, possibly a major city or urban area.
# Lower values: the applicant lives in a less densely populated region, possibly a rural area or small town.
# The values in this column are all far below 1, so fixed cut-offs such as 0.5 or 1 do not apply.

0        0.018801
1        0.003541
2        0.010032
3        0.008019
4        0.028663
           ...   
15518    0.007120
15519    0.016612
15520    0.072508
15521    0.010556
15522    0.015221
Name: REGION_POPULATION_RELATIVE, Length: 15523, dtype: float64
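The values printed above are all far below 1, so quantile-based bins may be easier to work with than fixed thresholds. A minimal sketch using the ten values shown in the output; the three bin labels are hypothetical:

```python
import pandas as pd

# The ten values visible in the output above.
pop = pd.Series(
    [0.018801, 0.003541, 0.010032, 0.008019, 0.028663,
     0.007120, 0.016612, 0.072508, 0.010556, 0.015221],
    name="REGION_POPULATION_RELATIVE",
)

# Check the actual range before choosing any thresholds.
print(pop.min(), pop.max())

# Split into three roughly equal-sized groups by rank (terciles).
pop_bin = pd.qcut(pop, q=3, labels=["low", "mid", "high"])
print(pop_bin.value_counts())
```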

In [33]:
df['CNT_FAM_MEMBERS'].value_counts() #This column represents the number of family members in the loan applicant's household.

2.0     8110
1.0     3261
3.0     2703
4.0     1222
5.0      199
6.0       20
7.0        4
9.0        1
8.0        1
10.0       1
Name: CNT_FAM_MEMBERS, dtype: int64

In [35]:
df['WEEKDAY_APPR_PROCESS_START'].value_counts()

TUESDAY      2670
FRIDAY       2600
MONDAY       2582
WEDNESDAY    2577
THURSDAY     2558
SATURDAY     1717
SUNDAY        818
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64

CNT_FAM_MEMBERS: This column represents the number of family members in the loan applicant's household. This information can be used to assess the applicant's financial obligations and potential need for the loan.

REGION_RATING_CLIENT: This column represents the lender's own rating of the region where the loan applicant lives (1, 2, or 3). This information can be used to assess the overall economic conditions and financial stability of the applicant's region.

REGION_RATING_CLIENT_W_CITY: This column represents the lender's own rating of the region where the loan applicant lives, taking the city into account (1, 2, or 3). This information provides a more specific assessment of the economic conditions and financial stability of the applicant's location.

WEEKDAY_APPR_PROCESS_START: This column indicates the day of the week on which the client applied for the loan. This information may be used to identify potential patterns or trends based on the day of the week.

HOUR_APPR_PROCESS_START: This column indicates the approximate hour of the day at which the client applied for the loan. This information may be used to identify potential patterns or trends based on the time of day.
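A quick way to look for day-of-week patterns is to group by weekday and compare the row count and the mean of TARGET. A minimal sketch on hypothetical rows; the ordered categorical only serves to keep the days in calendar order:

```python
import pandas as pd

# Hypothetical toy rows; in the notebook these columns come from df.
toy = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": ["MONDAY", "MONDAY", "TUESDAY",
                                   "SUNDAY", "SUNDAY", "SUNDAY"],
    "TARGET": [0, 1, 0, 0, 0, 1],
})

# Ordered categorical so that groups sort by calendar order, not alphabetically.
day_order = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
             "FRIDAY", "SATURDAY", "SUNDAY"]
toy["WEEKDAY_APPR_PROCESS_START"] = pd.Categorical(
    toy["WEEKDAY_APPR_PROCESS_START"], categories=day_order, ordered=True
)

# observed=True keeps only the weekdays that actually occur in the data.
summary = (
    toy.groupby("WEEKDAY_APPR_PROCESS_START", observed=True)["TARGET"]
       .agg(["count", "mean"])
)
print(summary)
```

The same pattern should work for HOUR_APPR_PROCESS_START, which is already numeric and needs no explicit ordering.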

REG_REGION_NOT_LIVE_REGION: This column indicates whether the applicant's registered region is different from their current living region. A value of 1 indicates a difference, while a value of 0 indicates that the regions are the same.

REG_REGION_NOT_WORK_REGION: This column indicates whether the applicant's registered region is different from their work region. A value of 1 indicates a difference, while a value of 0 indicates that the regions are the same.

LIVE_REGION_NOT_WORK_REGION: This column indicates whether the applicant's current living region is different from their work region. A value of 1 indicates a difference, while a value of 0 indicates that the regions are the same.

REG_CITY_NOT_LIVE_CITY: This column indicates whether the applicant's registered city is different from their current living city. A value of 1 indicates a difference, while a value of 0 indicates that the cities are the same.

REG_CITY_NOT_WORK_CITY: This column indicates whether the applicant's registered city is different from their work city. A value of 1 indicates a difference, while a value of 0 indicates that the cities are the same.

LIVE_CITY_NOT_WORK_CITY: This column indicates whether the applicant's current living city is different from their work city. A value of 1 indicates a difference, while a value of 0 indicates that the cities are the same.
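The six mismatch flags can also be summarised per applicant by adding them up. A minimal sketch on hypothetical rows; the column name ADDRESS_MISMATCH_COUNT is an assumption made for illustration and does not exist in the dataset:

```python
import pandas as pd

flag_cols = [
    "REG_REGION_NOT_LIVE_REGION", "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION", "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY", "LIVE_CITY_NOT_WORK_CITY",
]

# Hypothetical toy rows: no mismatch, city-level mismatches, region and city mismatches.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 0],
     [1, 1, 0, 1, 1, 0]],
    columns=flag_cols,
)

# Hypothetical feature: number of address mismatches per applicant (0 to 6).
toy["ADDRESS_MISMATCH_COUNT"] = toy[flag_cols].sum(axis=1)
print(toy["ADDRESS_MISMATCH_COUNT"])
```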

EXT_SOURCE_1: This feature is a normalized score from an external data source about the loan applicant. The specific source is not provided, but it could be related to the applicant's creditworthiness, income, or employment.

EXT_SOURCE_2: This feature is a normalized score from a second external data source. As with EXT_SOURCE_1, the specific source is not provided, but it could provide additional insight into the applicant's financial situation.

EXT_SOURCE_3: This feature is a normalized score from a third external data source. Again, the specific source is not specified, but it could be used to further assess the applicant's creditworthiness.
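With three external scores available, one simple option is to combine them into a single row-wise average. A minimal sketch on hypothetical values, assuming that some scores can be missing for some applicants; the names ext_mean and ext_count are illustrative only:

```python
import pandas as pd

ext_cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]

# Hypothetical toy rows, including missing scores (an assumption about the data).
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.2, None, None],
    "EXT_SOURCE_2": [0.4, 0.6, None],
    "EXT_SOURCE_3": [0.6, None, None],
})

# Row-wise mean over the available scores; NaN when all three are missing.
ext_mean = toy[ext_cols].mean(axis=1)

# Number of scores actually present for each applicant.
ext_count = toy[ext_cols].notna().sum(axis=1)

print(ext_mean)
print(ext_count)
```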

The following features hold normalized information about the building where the loan applicant lives. The suffix indicates the statistic: _AVG is the average and _MODE is the mode (most common value).

APARTMENTS_AVG: This feature represents the normalized average for the apartments in the building where the applicant lives. This information could be used as a rough indicator of the size of the building.

BASEMENTAREA_AVG: This feature represents the normalized average basement area of the building. This information could be used to assess the size and type of the property.

YEARS_BEGINEXPLUATATION_AVG: This feature represents the normalized average for the year the building was first put into use. This information could be used to assess the age and potential condition of the property.

YEARS_BUILD_AVG: This feature represents the normalized average for the year the building was constructed. This information is similar to YEARS_BEGINEXPLUATATION_AVG but relates to the construction date rather than the start of use.

COMMONAREA_AVG: This feature represents the normalized average common area of the building. This information could be used to assess the availability of shared spaces and amenities.

ELEVATORS_AVG: This feature represents the normalized average number of elevators in the building. This information could be used to assess the accessibility and convenience of the building.

ENTRANCES_AVG: This feature represents the normalized average number of entrances in the building. This information could be used to assess the size and layout of the building.

FLOORSMAX_AVG: This feature represents the normalized average of the maximum number of floors in the building. This information could be used to assess the height and type of the building.

FLOORSMIN_AVG: This feature represents the normalized average of the minimum number of floors in the building. This information could be used alongside FLOORSMAX_AVG to assess the building type and size.

LANDAREA_AVG: This feature represents the normalized average land area of the building. This information could be used to assess the size of the property and the potential for outdoor space.

LIVINGAPARTMENTS_AVG: This feature represents the normalized average for the living apartments in the building. This information could be used to assess how much of the building is residential.

LIVINGAREA_AVG: This feature represents the normalized average living area in the building. This information could be used to assess the size and spaciousness of the apartments.

NONLIVINGAPARTMENTS_AVG: This feature represents the normalized average for the non-living apartments (such as commercial or office spaces) in the building. This information could be used to assess the mix of residential and non-residential uses.

NONLIVINGAREA_AVG: This feature represents the normalized average non-living area (such as commercial or office space) in the building. This information could be used to assess the extent of non-residential usage.

APARTMENTS_MODE: This feature represents the normalized mode for the apartments in the building where the applicant lives. It is the counterpart of APARTMENTS_AVG, using the most common value instead of the average.

BASEMENTAREA_MODE: This feature represents the normalized mode of the basement area of the building. It is the counterpart of BASEMENTAREA_AVG, using the most common value instead of the average.
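Because these features follow a suffix convention (_AVG, _MODE), they can be selected as groups by name rather than listed one by one. A minimal sketch on an empty hypothetical frame that carries only a few of the column names:

```python
import pandas as pd

# Hypothetical frame with a handful of the dataset's column names and no rows.
toy = pd.DataFrame(columns=[
    "AMT_CREDIT", "APARTMENTS_AVG", "BASEMENTAREA_AVG",
    "APARTMENTS_MODE", "BASEMENTAREA_MODE",
])

# DataFrame.filter with a regex keeps the columns whose name matches the pattern.
avg_cols = toy.filter(regex=r"_AVG$").columns.tolist()
mode_cols = toy.filter(regex=r"_MODE$").columns.tolist()

print(avg_cols)
print(mode_cols)
```

In the notebook, the same filter calls could be applied to df to inspect or summarise each group of building features at once.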