# Ignore Warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Importing required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 130)
pd.set_option('display.max_columns', 130)
pd.set_option('display.float_format', '{:.2f}'.format)

# Importing the dataset

In [None]:
df = pd.read_csv("../input/credit-eda-case-study/application_data.csv")

In [None]:
# Checking few records from the dataframe
df.head()

# Check structure of the data

In [None]:
# `null_counts` was deprecated in pandas 1.2 in favour of `show_counts`
df.info(verbose = True, show_counts = True)

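The non-null counts are lower than the row count for several columns, so the data contains missing values (these are handled in the missing values section below)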

In [None]:
df.shape

There are ~307k rows and 122 columns

## Get statistical summary for numerical variables

In [None]:
df.describe()

# Analyzing categorical variables

In [None]:
df.select_dtypes(include = "object").columns

In [None]:
# Checking number of categorical variables
len(df.select_dtypes(include = "object").columns)

There are 16 `categorical` variables

# Analyzing numerical variables

In [None]:
df.select_dtypes(include=["int64","float64"]).columns

In [None]:
# Checking number of numerical variables
len(df.select_dtypes(include=["int64","float64"]).columns)

There are 106 `numerical` variables

In [None]:
df.select_dtypes(include=["int64","float64"])

# Dealing with incorrect data types

Check if we have any column with incorrect data type

In [None]:
df.dtypes

In [None]:
df.head()

Looking at the data and their corresponding data types, we can conclude that **No Data Type changes are required**

# Dealing with missing values

Check if we have any null values in the dataset

In [None]:
df.isnull().values.any()

Get Total number of null values in the dataset

In [None]:
df.isnull().values.sum()

Getting the list of column(s) which have null values

In [None]:
df.columns[df.isnull().any()]

In [None]:
len(df.columns[df.isnull().any()])

In total, `67` columns have one or more NULL values

## Computing count and percentage of missing values

In [None]:
null_count = df.isnull().sum()
null_percentage = round((df.isnull().sum()/df.shape[0])*100, 2)

In [None]:
null_df = pd.DataFrame({'column_name' : df.columns,'null_count' : null_count,'null_percentage': null_percentage})
null_df.reset_index(drop = True, inplace = True)

In [None]:
null_df.sort_values(by = 'null_percentage', ascending = False)
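The count and percentage computation above can be wrapped in a small reusable helper. This is a minimal sketch on toy data; the function name `missing_summary` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, worst first."""
    summary = pd.DataFrame({
        "null_count": frame.isnull().sum(),
        # mean of a boolean mask is the fraction of nulls in each column
        "null_percentage": (frame.isnull().mean() * 100).round(2),
    })
    summary = summary.rename_axis("column_name").reset_index()
    return summary.sort_values("null_percentage", ascending=False, ignore_index=True)

# Toy data standing in for application_data.csv (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real dataframe should give the same information as `null_df`, already sorted.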

## Removing columns with NULL values > 40%

Collect the columns with more than 40% NULL values into a list. These columns will be removed from the dataframe because too many of their values are missing.

In [None]:
columns_to_be_deleted = null_df[null_df['null_percentage'] > 40].column_name.to_list()

In [None]:
columns_to_be_deleted

In [None]:
len(columns_to_be_deleted)

In total, `49` columns are to be removed. Deleting them from the main dataframe **`df`**

In [None]:
df.drop(columns = columns_to_be_deleted, inplace = True)

Checking column count post removal. Only `73` columns should be left

In [None]:
df.shape
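The threshold-based removal can also be expressed directly from the null fraction of each column, without building an intermediate summary dataframe. The sketch below uses a toy frame; the column names and the variable `threshold` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): one column is 80% null, one 20%, one complete
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],
    "complete": [1, 2, 3, 4, 5],
})

threshold = 0.40                       # same 40% cut-off as in the text
null_fraction = toy.isnull().mean()    # fraction of nulls per column
to_drop = null_fraction[null_fraction > threshold].index.tolist()

# Non-inplace drop keeps the original frame available for comparison
trimmed = toy.drop(columns=to_drop)
print(to_drop, trimmed.shape)
```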

## Checking columns with NULL values < 40%

Creating dataframe `null_df_under40` with missing column percentages under 40%

In [None]:
null_df_under40 = null_df[null_df['null_percentage'] < 40]

In [None]:
null_df_under40.sort_values(by = 'null_percentage', ascending = False)

### Analysis of `OCCUPATION_TYPE` column

-  nullable values = 31.35%

In [None]:
df['OCCUPATION_TYPE'].value_counts()

Impute NULL values with `Unknown` category

In [None]:
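# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('Unknown')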

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(data = df, x = "OCCUPATION_TYPE")
plt.xticks(rotation = 90)
plt.show()

**Observations**
-  Looking at the plot, `Laborers` has the highest number of loan applicants
-  Since 31.35% of the values are missing, it is better to keep them as a separate `Unknown` category than to impute a statistic such as the mode, which could bias later computations

### Analysis of `EXT_SOURCE_3` column

-  nullable values = 19.83%

In [None]:
df.EXT_SOURCE_3.value_counts().head()

In [None]:
sns.boxplot(x = df['EXT_SOURCE_3'])
plt.show()

Getting percentile values for `EXT_SOURCE_3`

In [None]:
df.EXT_SOURCE_3.quantile(q = [0.25,0.5,0.75,1])

Most recurring value in `EXT_SOURCE_3`

In [None]:
df.EXT_SOURCE_3.mode()[0]

Checking the average value of `EXT_SOURCE_3`

In [None]:
df.EXT_SOURCE_3.mean()

**Observations**
-  Looking at the boxplot, median is 0.535276
-  Most recurring value is 0.746300213050371
-  Mean value is 0.5108529061800121
-  Although the mean and median are close and could be used for imputation, the missing percentage is high (19.83%), so it is better to leave the data as it is and not impute

### Analysis of six columns with 13.5% missing values
-  `AMT_REQ_CREDIT_BUREAU_YEAR`
-  `AMT_REQ_CREDIT_BUREAU_QRT`
-  `AMT_REQ_CREDIT_BUREAU_MON`
-  `AMT_REQ_CREDIT_BUREAU_WEEK`
-  `AMT_REQ_CREDIT_BUREAU_DAY`
-  `AMT_REQ_CREDIT_BUREAU_HOUR`

> nullable values = 13.50%

Looking at summary statistics for the columns

In [None]:
df[['AMT_REQ_CREDIT_BUREAU_YEAR',
    'AMT_REQ_CREDIT_BUREAU_QRT',
    'AMT_REQ_CREDIT_BUREAU_MON',
    'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_HOUR']].describe()

Most recurring value for the columns

In [None]:
df[['AMT_REQ_CREDIT_BUREAU_YEAR',
    'AMT_REQ_CREDIT_BUREAU_QRT',
    'AMT_REQ_CREDIT_BUREAU_MON',
    'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_HOUR']].mode()

**Observations for each column**
-  `AMT_REQ_CREDIT_BUREAU_YEAR`
   -  Mean is 1.899974 and median is 1.000000
   -  Mode is 0
   -  Don't impute: the share of null values is high (13.50%) and imputation might introduce bias
-  `AMT_REQ_CREDIT_BUREAU_QRT`
   -  Mean is 0.265474 and median is 0
   -  Mode is 0
   -  Don't impute: the share of null values is high (13.50%) and imputation might introduce bias
-  `AMT_REQ_CREDIT_BUREAU_MON`
   -  Mean is 0.267395 and median is 0
   -  Mode is 0
   -  Don't impute: the share of null values is high (13.50%) and imputation might introduce bias
-  `AMT_REQ_CREDIT_BUREAU_WEEK`
   -  Mean is 0.034362 and median is 0
   -  Mode is 0
   -  Don't impute: the share of null values is high (13.50%) and imputation might introduce bias
-  `AMT_REQ_CREDIT_BUREAU_DAY`
   -  Mean is 0.007000 and median is 0
   -  Mode is 0
   -  Don't impute: the share of null values is high (13.50%) and imputation might introduce bias
-  `AMT_REQ_CREDIT_BUREAU_HOUR`
   -  Mean is 0.006402 and median is 0
   -  Mode is 0
   -  Don't impute: the share of null values is high (13.50%) and imputation might introduce bias

## Checking columns with NULL values > 0% and < 1%

Creating dataframe `null_df_under1` with missing column percentages values > 0% and < 1%

In [None]:
null_df_under1 = null_df[(null_df['null_percentage'] > 0) & (null_df['null_percentage'] < 1)]

In [None]:
null_df_under1.sort_values(by = 'null_percentage', ascending = False)

### Analysis of `NAME_TYPE_SUITE` column

-  nullable values = 0.42%

In [None]:
df['NAME_TYPE_SUITE'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(data = df, x = "NAME_TYPE_SUITE")
plt.xticks(rotation = 90)
plt.show()

**Observations**
-  Looking at the plot, the `Unaccompanied` category has the highest number of loan applicants, so most applicants came alone when applying for the loan
-  We can go ahead and impute `Unaccompanied` in the dataframe
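The notebook does not show the imputation itself, so here is a minimal sketch of mode imputation for a categorical column. The toy Series only mimics `NAME_TYPE_SUITE`; its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df['NAME_TYPE_SUITE'] with two missing entries
suite = pd.Series(
    ["Unaccompanied", "Family", None, "Unaccompanied", "Spouse, partner", None],
    name="NAME_TYPE_SUITE",
)

# mode() ignores missing values by default; [0] picks the first mode
most_common = suite.mode()[0]

# fillna returns a new Series, so assign it back when applying to df
imputed = suite.fillna(most_common)
print(imputed.value_counts())
```

On the real data the equivalent assignment would be `df['NAME_TYPE_SUITE'] = df['NAME_TYPE_SUITE'].fillna(df['NAME_TYPE_SUITE'].mode()[0])`.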

### Analysis of `OBS_30_CNT_SOCIAL_CIRCLE` column

-  nullable values = 0.33%

In [None]:
df.OBS_30_CNT_SOCIAL_CIRCLE.value_counts().head()

In [None]:
sns.boxplot(x = df['OBS_30_CNT_SOCIAL_CIRCLE'])
plt.show()

Getting percentile values for `OBS_30_CNT_SOCIAL_CIRCLE`

In [None]:
df.OBS_30_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,1])

Most recurring value in `OBS_30_CNT_SOCIAL_CIRCLE`

In [None]:
df.OBS_30_CNT_SOCIAL_CIRCLE.mode()[0]

Checking the average value of `OBS_30_CNT_SOCIAL_CIRCLE`

In [None]:
df.OBS_30_CNT_SOCIAL_CIRCLE.mean()

**Observations**
-  Looking at the boxplot, median is 0.0
-  Most recurring value is 0.0
-  Mean value is 1.4222454239942575
-  There are two outlier values at 50 and 350.
-  Median and mode are both 0.0, so 0 can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

### Analysis of `DEF_30_CNT_SOCIAL_CIRCLE` column

-  nullable values = 0.33%

In [None]:
df.DEF_30_CNT_SOCIAL_CIRCLE.value_counts().head()

In [None]:
sns.boxplot(x = df['DEF_30_CNT_SOCIAL_CIRCLE'])
plt.show()

Getting percentile values for `DEF_30_CNT_SOCIAL_CIRCLE`

In [None]:
df.DEF_30_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])

Most recurring value in `DEF_30_CNT_SOCIAL_CIRCLE`

In [None]:
df.DEF_30_CNT_SOCIAL_CIRCLE.mode()[0]

Checking the average value of `DEF_30_CNT_SOCIAL_CIRCLE`

In [None]:
df.DEF_30_CNT_SOCIAL_CIRCLE.mean()

**Observations**
-  Looking at the boxplot, median is 0.0
-  Most recurring value is 0.0
-  Mean value is 0.1434206662533851
-  Even 99th percentile value is 2. There are ~7 outliers the largest of which is ~33.
-  Mean and median are close and either can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

### Analysis of `OBS_60_CNT_SOCIAL_CIRCLE` column

-  nullable values = 0.33%

In [None]:
df.OBS_60_CNT_SOCIAL_CIRCLE.value_counts().head()

In [None]:
sns.boxplot(x = df['OBS_60_CNT_SOCIAL_CIRCLE'])
plt.show()

Getting percentile values for `OBS_60_CNT_SOCIAL_CIRCLE`

In [None]:
df.OBS_60_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])

Most recurring value in `OBS_60_CNT_SOCIAL_CIRCLE`

In [None]:
df.OBS_60_CNT_SOCIAL_CIRCLE.mode()[0]

Checking the average value of `OBS_60_CNT_SOCIAL_CIRCLE`

In [None]:
df.OBS_60_CNT_SOCIAL_CIRCLE.mean()

**Observations**
-  Looking at the boxplot, median is 0.0
-  Most recurring value is 0.0
-  Mean value is 1.4052921791901856
-  Even the 99th percentile value is 10. There are prominent outliers at approximately 50 and 350.
-  Median and mode are both 0.0, so 0 can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

### Analysis of `DEF_60_CNT_SOCIAL_CIRCLE` column

-  nullable values = 0.33%

In [None]:
df.DEF_60_CNT_SOCIAL_CIRCLE.value_counts().head()

In [None]:
sns.boxplot(x = df['DEF_60_CNT_SOCIAL_CIRCLE'])
plt.show()

Getting percentile values for `DEF_60_CNT_SOCIAL_CIRCLE`

In [None]:
df.DEF_60_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])

Most recurring value in `DEF_60_CNT_SOCIAL_CIRCLE`

In [None]:
df.DEF_60_CNT_SOCIAL_CIRCLE.mode()[0]

Checking the average value of `DEF_60_CNT_SOCIAL_CIRCLE`

In [None]:
df.DEF_60_CNT_SOCIAL_CIRCLE.mean()

**Observations**
-  Looking at the boxplot, median is 0.0
-  Most recurring value is 0.0
-  Mean value is 0.10004894123788705
-  Even the 99th percentile value is 2. There are ~7 outliers, the largest of which is ~24.
-  Mean and median are close and either can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

### Analysis of `EXT_SOURCE_2` column

-  nullable values = 0.21%

In [None]:
df.EXT_SOURCE_2.value_counts().head()

In [None]:
sns.boxplot(x = df['EXT_SOURCE_2'])
plt.show()

Getting percentile values for `EXT_SOURCE_2`

In [None]:
df.EXT_SOURCE_2.quantile(q = [0.25,0.5,0.75,0.99,1])

Most recurring value in `EXT_SOURCE_2`

In [None]:
df.EXT_SOURCE_2.mode()[0]

Checking the average value of `EXT_SOURCE_2`

In [None]:
df.EXT_SOURCE_2.mean()

**Observations**
-  Looking at the boxplot, median is 0.565961
-  Most recurring value is 0.2858978721410488
-  Mean value is 0.5143926741308463
-  There is no outlier in the dataset
-  Mean and median are close and either can be used for imputation. It will not introduce bias as the missing value percentage is small (0.21%)

### Analysis of `AMT_GOODS_PRICE` column

-  nullable values = 0.09%

In [None]:
df.AMT_GOODS_PRICE.value_counts().head()

In [None]:
sns.boxplot(x = df['AMT_GOODS_PRICE'])
plt.show()

Getting percentile values for `AMT_GOODS_PRICE`

In [None]:
df.AMT_GOODS_PRICE.quantile(q = [0.25,0.5,0.75,0.99,1])

Most recurring value in `AMT_GOODS_PRICE`

In [None]:
df.AMT_GOODS_PRICE.mode()[0]

Checking the average value of `AMT_GOODS_PRICE`

In [None]:
df.AMT_GOODS_PRICE.mean()

**Observations**
-  Looking at the boxplot, median is 450000.0
-  Most recurring value is 450000.0. So, median and mode are the same
-  Mean value is 538396.2074288895
-  Though there are values above 2500000 they cannot be treated as outliers as it could be a valid goods price
-  Median and mode are exactly the same (450000.0) and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.09%)
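For the low-missing numerical columns discussed in this section, median imputation can be applied to several columns in one step. The following is a minimal sketch on toy data; the values and the list name `low_missing_cols` are assumptions, and whether to actually impute remains an analysis decision.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the columns with under 1% missing values
toy = pd.DataFrame({
    "AMT_GOODS_PRICE": [450000.0, 225000.0, np.nan, 675000.0, 450000.0],
    "OBS_30_CNT_SOCIAL_CIRCLE": [0.0, 2.0, 0.0, np.nan, 0.0],
})

low_missing_cols = ["AMT_GOODS_PRICE", "OBS_30_CNT_SOCIAL_CIRCLE"]

# median() skips NaN; the result is a Series indexed by column name
medians = toy[low_missing_cols].median()

# fillna with a Series fills each column with its own median
filled = toy.copy()
filled[low_missing_cols] = toy[low_missing_cols].fillna(medians)
print(filled)
```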

# Dealing with incorrect/unknown data values

### Analysis of `CODE_GENDER` column

Checking range of values

In [None]:
df['CODE_GENDER'].value_counts()

Gender should only be Male or Female. `XNA` value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application

In [None]:
df[df['CODE_GENDER'] == 'XNA']

As data looks valid, we will go ahead and check for an imputation method.
-  `Female` applicants are twice the number of `Male` applicants
-  And so, we will go ahead and impute `CODE_GENDER` with 'F'

In [None]:
df['CODE_GENDER'] = df['CODE_GENDER'].apply(lambda x: 'F' if x == 'XNA' else x)

Checking if `XNA` is removed

In [None]:
df['CODE_GENDER'].value_counts()
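The same substitution can be written with `Series.replace`, taking the replacement value from the data instead of hard-coding it. This is a sketch on a toy Series; the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df['CODE_GENDER'] containing one placeholder value
gender = pd.Series(["F", "M", "F", "XNA", "F", "M"], name="CODE_GENDER")

# Most frequent valid category, ignoring the placeholder itself
most_common = gender[gender != "XNA"].mode()[0]

# replace swaps only exact matches of 'XNA' and leaves other values untouched
cleaned = gender.replace("XNA", most_common)
print(cleaned.value_counts())
```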

### Analysis of `DAYS_BIRTH` column

In [None]:
df['DAYS_BIRTH'].value_counts().head()

There are ~17K unique values, all of which appear to be negative

In [None]:
df['DAYS_BIRTH'].unique()

In [None]:
df['DAYS_BIRTH'].nunique()

Converting `DAYS_BIRTH` to positive days

In [None]:
df['DAYS_BIRTH'] = df['DAYS_BIRTH'].abs()

In [None]:
df['DAYS_BIRTH'].value_counts()

All Days in `DAYS_BIRTH` have positive values

#### Creating a new column `YEARS_BIRTH` for ease of analysis

In [None]:
df['YEARS_BIRTH'] = df['DAYS_BIRTH'].apply(lambda x: round(x/365))
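The day-to-year conversion can also be done without a row-wise `apply`. The sketch below is a vectorized variant on toy values (the numbers are illustrative); it assumes the column has no missing values, since `astype(int)` fails on NaN.

```python
import pandas as pd

# Toy stand-in for the raw (negative) DAYS_BIRTH values
days_birth = pd.Series([-9461, -16765, -19046], name="DAYS_BIRTH")

# abs() makes the days positive, division and round() operate on the whole column
years_birth = (days_birth.abs() / 365).round().astype(int)
print(years_birth.tolist())
```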

### Analysis of `NAME_FAMILY_STATUS` column

Checking range of values

In [None]:
df['NAME_FAMILY_STATUS'].value_counts()

Family status should be one of the known categories. An `Unknown` value may indicate that the value was not provided by the loan applicant or was missed by the loan officer verifying the application

In [None]:
df[df['NAME_FAMILY_STATUS'] == 'Unknown']

In [None]:
df['NAME_FAMILY_STATUS'].value_counts(normalize = True) * 100

As data looks valid, we will go ahead and check for an imputation method.
-  `Married` applicants make up more than 63% of applicants.
-  Hence, we will go ahead and impute `NAME_FAMILY_STATUS` with 'Married'

In [None]:
df['NAME_FAMILY_STATUS'] = df['NAME_FAMILY_STATUS'].apply(lambda x: 'Married' if x == 'Unknown' else x)

Checking if `Unknown` is removed

In [None]:
df['NAME_FAMILY_STATUS'].value_counts()

### Analysis of `DAYS_EMPLOYED` column

In [None]:
df['DAYS_EMPLOYED'].value_counts().head()

In [None]:
df['DAYS_EMPLOYED'].value_counts(normalize = True) * 100

In [None]:
len(df[df['DAYS_EMPLOYED'] < 365243])

In [None]:
df[df['DAYS_EMPLOYED'] < 365243].DAYS_EMPLOYED.value_counts()

In [None]:
df['DAYS_EMPLOYED'].unique()

In [None]:
df['DAYS_EMPLOYED'].nunique()

**Observations**
-  There are ~55K records for which `DAYS_EMPLOYED` is 365243 days
-  The remaining ~252K records have negative values for days
-  There are 12,574 unique values for `DAYS_EMPLOYED`

-  The `DAYS_EMPLOYED` column indicates how many days before the application the person started their current employment, so the values are recorded as negative numbers (days counted back from the application date)
 -  We will convert the negative values in `DAYS_EMPLOYED` to positive days so that they are consistent in later calculations

In [None]:
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].abs()

In [None]:
df['DAYS_EMPLOYED'].value_counts().head()

We can see that all days in `DAYS_EMPLOYED` have positive values

**For ~55K+ records for which `DAYS_EMPLOYED` is 365243 days**
- Converting this to years gives roughly 1000 years, which is impossible as a length of employment
- It is present in ~18% of the data, so it is unlikely to be a random anomaly and looks like a placeholder value
- These applicants could be either `Pensioners` or `Unemployed`; the `NAME_INCOME_TYPE` breakdown below supports this

> There are two ways to handle this:
>> 1) Leave the data as it is and account for this during analysis, OR <br>
>> 2) Calculate the average days employed excluding this category and impute that value in place of 365243 days for Pensioners. <br>
>> For Unemployed, days employed can be 0

*Note*
 - During calculations with this column, we need to account for this scenario as otherwise it will skew our results

In [None]:
df[df['DAYS_EMPLOYED'] == 365243].NAME_INCOME_TYPE.value_counts()
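The two handling options described above can be sketched as follows. This is an illustration on toy data, not the notebook's chosen approach; the constant `SENTINEL` and the column `DAYS_EMPLOYED_CLEAN` are hypothetical names, and the days are assumed to be already converted to positive values.

```python
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

toy = pd.DataFrame({
    "DAYS_EMPLOYED": [637, 1188, SENTINEL, 3039, SENTINEL],
    "NAME_INCOME_TYPE": ["Working", "State servant", "Pensioner", "Working", "Unemployed"],
})

# Option 1: mask the placeholder so statistics ignore it (NaN is skipped by mean())
employed_days = toy["DAYS_EMPLOYED"].where(toy["DAYS_EMPLOYED"] != SENTINEL)
mean_real = employed_days.mean()

# Option 2: impute the mean for pensioners and 0 for unemployed applicants
toy["DAYS_EMPLOYED_CLEAN"] = employed_days
masked = employed_days.isnull()
toy.loc[masked & (toy["NAME_INCOME_TYPE"] == "Pensioner"), "DAYS_EMPLOYED_CLEAN"] = mean_real
toy.loc[masked & (toy["NAME_INCOME_TYPE"] == "Unemployed"), "DAYS_EMPLOYED_CLEAN"] = 0
print(toy)
```

Keeping the cleaned values in a separate column leaves the original data intact, so either option can still be chosen later in the analysis.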

#### Creating a new column `YEARS_EMPLOYED` for ease of analysis

In [None]:
df['YEARS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: round(x/365))

### Analysis of `DAYS_REGISTRATION` column

In [None]:
df['DAYS_REGISTRATION'].value_counts().head()

In [None]:
df['DAYS_REGISTRATION'].value_counts(normalize = True).head()

In [None]:
df['DAYS_REGISTRATION'].unique()

In [None]:
df['DAYS_REGISTRATION'].nunique()

Converting `DAYS_REGISTRATION` to positive days

In [None]:
df['DAYS_REGISTRATION'] = df['DAYS_REGISTRATION'].abs()

In [None]:
df['DAYS_REGISTRATION'].value_counts().head()

All Days in `DAYS_REGISTRATION` have positive values

#### Creating a new column `YEARS_REGISTRATION` for ease of analysis

In [None]:
df['YEARS_REGISTRATION'] = df['DAYS_REGISTRATION'].apply(lambda x: round(x/365))

### Analysis of `DAYS_ID_PUBLISH` column

In [None]:
df['DAYS_ID_PUBLISH'].value_counts().head()

In [None]:
df['DAYS_ID_PUBLISH'].value_counts(normalize = True).head()

In [None]:
df['DAYS_ID_PUBLISH'].unique()

In [None]:
df['DAYS_ID_PUBLISH'].nunique()

Converting `DAYS_ID_PUBLISH` to positive days

In [None]:
df['DAYS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].abs()

In [None]:
df['DAYS_ID_PUBLISH'].value_counts().head()

All Days in `DAYS_ID_PUBLISH` have positive values

#### Creating a new column `YEARS_ID_PUBLISH` for ease of analysis

In [None]:
df['YEARS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].apply(lambda x: round(x/365))

### Analysis of `DAYS_LAST_PHONE_CHANGE` column

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts().head()

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts(normalize = True).head()

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].unique()

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].nunique()

Converting `DAYS_LAST_PHONE_CHANGE` to positive days

In [None]:
df['DAYS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].abs()

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts().head()

All Days in `DAYS_LAST_PHONE_CHANGE` have positive values

#### Creating a new column `YEARS_LAST_PHONE_CHANGE` for ease of analysis

In [None]:
df['YEARS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].apply(lambda x: round(x/365,0))

# Automation functions

### Outlier Analysis (Distplot + BoxPlot) - numerical columns

In [None]:
def fn_dist_box(dataset,column):
    """Plot the distribution and a boxplot of `column` side by side."""
    plt.subplots(1,2 ,figsize = (20,8))

    # Left panel: distribution of the column
    # (sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is the suggested replacement)
    plt.subplot(121)
    sns.distplot(dataset[column], color = 'purple')
    plt.ticklabel_format(style='plain', axis='x')
    plt.title('Distplot of ' + column)

    # Right panel: boxplot with outliers drawn as red diamonds
    plt.subplot(122)
    red_diamond = dict(markerfacecolor='r', marker='D')
    sns.boxplot(y = column, data = dataset, flierprops = red_diamond)
    plt.title('Boxplot of ' + column)

    plt.tight_layout(pad = 4)
    plt.show()

### Creating a function `age_cat` to categorize `YEARS_BIRTH`

In [None]:
def age_cat(years):
    if years <= 20:
        return '0-20'
    elif years > 20 and years <= 30:
        return '20-30'
    elif years > 30 and years <= 40:
        return '30-40'  
    elif years > 40 and years <= 50:
        return '40-50'
    elif years > 50 and years <= 60:
        return '50-60'
    elif years > 60 and years <= 70:
        return '60-70'
    elif years > 70:
        return '70+'
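As an alternative to the `if`/`elif` chain, the same bands can be produced for a whole column with `pd.cut`. This is a sketch under the assumption that ages are non-negative; the toy values and the names `bins`, `labels`, and `age_group` are illustrative.

```python
import pandas as pd

# Toy ages chosen to hit the band edges
years = pd.Series([19, 20, 21, 30, 45, 60, 61, 75])

# Right-closed bins reproduce the `<=` comparisons of age_cat:
# [0, 20], (20, 30], (30, 40], ..., (70, inf)
bins = [0, 20, 30, 40, 50, 60, 70, float("inf")]
labels = ["0-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70+"]

# include_lowest=True keeps an age of exactly 0 in the first band
age_group = pd.cut(years, bins=bins, labels=labels, include_lowest=True)
print(age_group.tolist())
```

Unlike `age_cat`, negative inputs fall outside every bin here and become NaN, which can be a useful signal that the sign conversion was skipped.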

### Univariate Analysis (Countplot) - categorical columns

In [None]:
def fn_uni_countplot(column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")

    plt.subplot(1,2,1)
    pltname = column + ' of clients with payment difficulties'
    plt.title(pltname)
    sns.countplot(x = column, data = df1, order = sorted(df1[column].unique(), reverse = True), palette = palt)
    plt.xticks(rotation = 90)
    
    plt.subplot(1,2,2)
    pltname = column + ' of clients with on-time payments'
    plt.title(pltname)
    sns.countplot(x = column, data = df0, order = sorted(df0[column].unique(), reverse = True), palette = palt)
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

### Univariate Analysis (Piechart) - categorical columns

In [None]:
def fn_uni_piechart(column):
    plt.figure(figsize = [20,12])

    plt.subplot(1,2,1)
    pltname = column + ' of clients with payment difficulties'
    plt.title(pltname)
    df1[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['green','yellow','purple','orange','red'], labeldistance=None)
    plt.legend()
    
    plt.subplot(1,2,2)
    pltname = column + ' of clients with on-time payments'
    plt.title(pltname)
    df0[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['green','yellow','purple','orange','red'], labeldistance=None)
    
    plt.legend()
    plt.tight_layout(pad = 4)
    plt.show()

### Univariate Analysis (Barplot) - categorical columns

In [None]:
def fn_uni_barplot(column):
    plt.figure(figsize = [20,8])

    plt.subplot(1,2,1)
    (df1[column].value_counts(normalize=True)*100).plot.bar(title = column + " Payment difficulties", color=['black', 'red', 'green', 'blue', 'cyan'])
    plt.xticks(rotation=90)

    plt.subplot(1,2,2)
    (df0[column].value_counts(normalize=True)*100).plot.bar(title = column + " On-Time Payments", color=['black', 'red', 'green', 'blue', 'cyan'])
    plt.xticks(rotation=90)

    plt.tight_layout(pad = 4)
    plt.show()

### Calculating min and max outlier range for numerical columns

In [None]:
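def outlier_range(dataset,column):
    """Return the upper IQR fence (Q3 + 1.5 * IQR) for `column`.

    The lower fence (Q1 - 1.5 * IQR) is computed for reference but not returned.
    """
    Q1 = dataset[column].quantile(0.25)
    Q3 = dataset[column].quantile(0.75)
    IQR = Q3 - Q1
    Min_value = (Q1 - 1.5 * IQR)
    Max_value = (Q3 + 1.5 * IQR)
    return Max_value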
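When both fences are needed, a variant that returns the lower and upper bound together may be convenient. The sketch below is not the notebook's function; `iqr_bounds` and its parameter `k` are hypothetical names, and the data is a toy Series.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obvious extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(values)

# Values outside the fences are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```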

### Bivariate Analysis (boxplot) - categorical V/S continuous variables

In [None]:
def fn_bi_boxplot(categorical,continuous,max_continuous1,max_continuous0,hue_column):
    plt.figure(figsize = [20,12])

    plt.subplot(1,2,1)
    plt.title('Payment Difficulties')
    red_diamond = dict(markerfacecolor='r', marker='D')
    sns.boxplot(x = categorical, 
                y = df1[df1[continuous] < max_continuous1][continuous], 
                data = df1, 
                flierprops = red_diamond, 
                order = sorted(df1[categorical].unique(), reverse = True),
                hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
    plt.ticklabel_format(style='plain', axis='y')
    plt.xticks(rotation=90)

    plt.subplot(1,2,2)
    plt.title('On-Time Payments')
    sns.boxplot(x = categorical, 
                y = df0[df0[continuous] < max_continuous0][continuous], 
                data = df0, 
                flierprops = red_diamond, 
                order = sorted(df0[categorical].unique(), reverse = True),
                hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
    plt.ticklabel_format(style='plain', axis='y')
    plt.xticks(rotation=90)

    plt.tight_layout(pad = 4)
    plt.show()

### Bivariate Analysis (Countplot) - categorical V/S categorical columns

In [None]:
def fn_bi_countplot(column,hue_column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")

    plt.subplot(1,2,1)
    pltname = 'Clients with payment difficulties'
    plt.title(pltname)
    sns.countplot(x = column, data = df1, 
                  order = sorted(df1[column].unique(), reverse = True), palette = palt,
                  hue = hue_column, hue_order = sorted(df1[hue_column].unique(), reverse = True))
    plt.xticks(rotation = 90)
    
    plt.subplot(1,2,2)
    pltname = 'Clients with on-time payments'
    plt.title(pltname)
    sns.countplot(x = column, data = df0, 
                  order = sorted(df0[column].unique(), reverse = True), palette = palt,
                  hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

### Univariate Analysis (Countplot) - categorical columns - Merged dataset

In [None]:
def fn_uni_countplot_merge(column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")
    
    pltname = 'Analysis of ' + column
    plt.title(pltname)
    sns.countplot(x = column, data = df_merge, order = sorted(df_merge[column].unique(), reverse = True), palette = palt)
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

### Univariate Analysis (Piechart) - categorical columns - Merged dataset

In [None]:
def fn_uni_piechart_merge(column):
    plt.figure(figsize = [10,6])

    pltname = 'Analysis of ' + column
    plt.title(pltname)
    df_merge[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['purple','orange','red','green','yellow','pink'], labeldistance=None)
    
    plt.legend()
    plt.tight_layout(pad = 4)
    plt.show()

### Bivariate Analysis (boxplot) - categorical V/S continuous variables - Merged dataset

In [None]:
def fn_bi_boxplot_merge(categorical,continuous,max_continuous,hue_column):
    plt.figure(figsize = [20,12])
    red_diamond = dict(markerfacecolor='r', marker='D')

    sns.boxplot(x = categorical, 
                y = df_merge[df_merge[continuous] < max_continuous][continuous], 
                data = df_merge, 
                flierprops = red_diamond, 
                order = sorted(df_merge[categorical].unique(), reverse = True),
                hue = hue_column, hue_order = sorted(df_merge[hue_column].unique(), reverse = True))
    plt.ticklabel_format(style='plain', axis='y')
    plt.xticks(rotation=90)

    plt.tight_layout(pad = 4)
    plt.show()

### Bivariate Analysis (Countplot) - categorical V/S categorical columns - Merged dataset

In [None]:
def fn_bi_countplot_merge(column,hue_column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")

    sns.countplot(x = column, data = df_merge, 
                  order = sorted(df_merge[column].unique()), palette = palt,
                  hue = hue_column, hue_order = sorted(df_merge[hue_column].unique(), reverse = True))
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

# Dealing with outliers for numerical columns

### Analysis of `CNT_CHILDREN` column

In [None]:
df['CNT_CHILDREN'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['CNT_CHILDREN'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
fn_dist_box(df,'CNT_CHILDREN')

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df['CNT_CHILDREN'].quantile(0.25)
Q3 = df['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

**Observations**
-  Looking at the data, the number of applicants with more than 7 children is very small (2 or 3 in each category)
-  The two applicants with 10 children are only 31 and 41 years old respectively. This looks like a one-off scenario and can be treated as an outlier
-  Both the distplot and the boxplot clearly show values above 2.5 as outliers

**Conclusion**
-  Applicants with 3 or more children are outlier cases

### Analysis of `AMT_INCOME_TOTAL` column

In [None]:
df['AMT_INCOME_TOTAL'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['AMT_INCOME_TOTAL'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df['AMT_INCOME_TOTAL'].describe(percentiles = [0.75,0.99,0.999])

Plotting for `AMT_INCOME_TOTAL`

In [None]:
fn_dist_box(df,'AMT_INCOME_TOTAL')

-  The resulting charts are compressed into a very thin band, and an outlier is visible at around 120 million
-  Let's replot using only incomes below the 99.9th percentile value, which is 900K

In [None]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['AMT_INCOME_TOTAL'] < 900000].AMT_INCOME_TOTAL)
pltname = 'Distplot of ' + 'AMT_INCOME_TOTAL'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(x = df[df['AMT_INCOME_TOTAL'] < 900000].AMT_INCOME_TOTAL)
pltname = 'Boxplot of ' + 'AMT_INCOME_TOTAL'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()

Now, we are able to clearly make out the distribution and data range in both plots.
 - This means that values above 900K income are clearly outliers

In [None]:
df[df['AMT_INCOME_TOTAL'] > 900000].head()

**Observations**
-  Looking at the data, we can see that Income above 900K (99.9% value) are outliers
-  Both distplots and boxplots clearly show the same trend

**Conclusion**
-  Applicants with Income above 900K (99.9% value) are outliers
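Instead of hard-coding 900000, the cut-off can be read from the data with `quantile`. This is a minimal sketch on a toy income Series (the values are made up, with one extreme entry similar in scale to the outlier seen above); `cutoff` and `trimmed` are illustrative names.

```python
import pandas as pd

# Toy incomes: mostly ordinary values plus one extreme entry
income = pd.Series([90000.0] * 500 + [180000.0] * 498 + [900000.0, 120000000.0])

# 99.9th percentile of the column, interpolated between neighbouring values
cutoff = income.quantile(0.999)

# Keep only rows below the cut-off for plotting
trimmed = income[income < cutoff]
print(cutoff, len(income), len(trimmed), trimmed.max())
```

On the real dataframe the same idea would read `df[df['AMT_INCOME_TOTAL'] < df['AMT_INCOME_TOTAL'].quantile(0.999)]`, which keeps the filter in sync if the data changes.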

### Analysis of `CNT_FAM_MEMBERS` column

In [None]:
df['CNT_FAM_MEMBERS'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['CNT_FAM_MEMBERS'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df['CNT_FAM_MEMBERS'].describe(percentiles = [0.75,0.99,0.9999])

Plotting for `CNT_FAM_MEMBERS`

In [None]:
fn_dist_box(df,'CNT_FAM_MEMBERS')

-  For family sizes of 5 and above the data is sparse, and there is a clear outlier at around 20
-  Only 2 or 3 applicants have more than 10 family members
-  Let's replot using only family sizes up to the 99.9% value, which is 8

In [None]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['CNT_FAM_MEMBERS'] <= 8].CNT_FAM_MEMBERS)
pltname = 'Distplot of ' + 'CNT_FAM_MEMBERS'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(x = df[df['CNT_FAM_MEMBERS'] <= 8].CNT_FAM_MEMBERS)
pltname = 'Boxplot of ' + 'CNT_FAM_MEMBERS'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()

Now, we are able to clearly make out the distribution and data range in both plots.
 - This means that applicants with 5 or more family members are clearly outliers

### Analysis of `AMT_ANNUITY` column

In [None]:
df['AMT_ANNUITY'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['AMT_ANNUITY'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df['AMT_ANNUITY'].describe(percentiles = [0.75,0.99,0.9999])

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df['AMT_ANNUITY'].quantile(0.25)
Q3 = df['AMT_ANNUITY'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Plotting for `AMT_ANNUITY`

In [None]:
fn_dist_box(df,'AMT_ANNUITY')

-  As observed from the distplot and boxplot, the outliers tend to lie above 61704.0 (this is the `Max_value` derived from the IQR formula)
-  We can verify this by reproducing the same plots restricted to values up to that limit, as shown below

In [None]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['AMT_ANNUITY'] <= 61704.0].AMT_ANNUITY)
pltname = 'Distplot of ' + 'AMT_ANNUITY'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(x = df[df['AMT_ANNUITY'] <= 61704.0].AMT_ANNUITY)
pltname = 'Boxplot of ' + 'AMT_ANNUITY'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()

Now, we are able to clearly make out the distribution and data range in both plots
 - This means that `AMT_ANNUITY` values above 61704.0 are clearly outliers

**Observations**
-  As observed from the distplot and boxplot, outliers tend to appear above 61704 (the `Max_value` derived from the IQR formula)

**Conclusion**
-  Applicants with `AMT_ANNUITY` above 61704 (calculated using IQR) are outliers

### Analysis of `AMT_CREDIT` column

In [None]:
df['AMT_CREDIT'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['AMT_CREDIT'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df['AMT_CREDIT'].describe(percentiles = [0.75,0.99,0.9999])

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df['AMT_CREDIT'].quantile(0.25)
Q3 = df['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Plotting for `AMT_CREDIT`

In [None]:
fn_dist_box(df,'AMT_CREDIT')

-  As observed from the distplot and boxplot, outliers tend to appear above 1616625.0 (this is the `Max_value` derived from the IQR formula)
-  We can verify this by reproducing the same plots using only values at or below that threshold, as shown below

In [None]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['AMT_CREDIT'] <= 1616625.0].AMT_CREDIT)
pltname = 'Distplot of ' + 'AMT_CREDIT'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(x = df[df['AMT_CREDIT'] <= 1616625.0].AMT_CREDIT)
pltname = 'Boxplot of ' + 'AMT_CREDIT'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()

Now, we are able to clearly make out the distribution and data range in both plots
 - This means that `AMT_CREDIT` values above 1616625.0 are clearly outliers

**Observations**
-  As observed from the distplot and boxplot, outliers tend to appear above 1616625.0 (the `Max_value` derived from the IQR formula)

**Conclusion**
-  Applicants with `AMT_CREDIT` above 1616625.0 (calculated using IQR) are outliers

### Analysis of `DAYS_LAST_PHONE_CHANGE` column

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['DAYS_LAST_PHONE_CHANGE'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df['DAYS_LAST_PHONE_CHANGE'].describe(percentiles = [0.75,0.99,0.9999])

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)
Q3 = df['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Plotting for `DAYS_LAST_PHONE_CHANGE`

In [None]:
fn_dist_box(df,'DAYS_LAST_PHONE_CHANGE')

-  As observed from the distplot and boxplot, outliers tend to appear above 3514.0 (this is the `Max_value` derived from the IQR formula)
-  We can verify this by reproducing the same plots using only values at or below that threshold, as shown below

In [None]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['DAYS_LAST_PHONE_CHANGE'] <= 3514.0].DAYS_LAST_PHONE_CHANGE)
pltname = 'Distplot of ' + 'DAYS_LAST_PHONE_CHANGE'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(x = df[df['DAYS_LAST_PHONE_CHANGE'] <= 3514.0].DAYS_LAST_PHONE_CHANGE)
pltname = 'Boxplot of ' + 'DAYS_LAST_PHONE_CHANGE'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()

Now, we are able to clearly make out the distribution and data range in both plots
 - This means that `DAYS_LAST_PHONE_CHANGE` values above 3514.0 are clearly outliers

**Observations**
-  As observed from the distplot and boxplot, outliers tend to appear above 3514.0 (the `Max_value` derived from the IQR formula)

**Conclusion**
-  Applicants with `DAYS_LAST_PHONE_CHANGE` above 3514.0 (calculated using IQR) are outliers

# Binning of continuous columns for analysis

### Categorizing `AMT_GOODS_PRICE` column

In [None]:
df['AMT_GOODS_PRICE'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['AMT_GOODS_PRICE'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

Getting Statistical summary for `AMT_GOODS_PRICE`

In [None]:
df['AMT_GOODS_PRICE'].describe(percentiles = [0.25,0.75,0.99,0.9999])

Categorize the values in `AMT_GOODS_PRICE` into 5 bins and create a new column `AMT_GOODS_PRICE_CATEGORY`

In [None]:
df['AMT_GOODS_PRICE_CATEGORY'] = pd.cut(df['AMT_GOODS_PRICE'], bins = 5, labels = ['very low','low','medium','high','very high'])

Checking that the values are populated as per expectation

In [None]:
df['AMT_GOODS_PRICE_CATEGORY'].value_counts()
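
Note that `pd.cut` with `bins = 5` produces equal-width bins, so a right-skewed column can end up with most rows in the lowest category. The sketch below contrasts this with `pd.qcut`, which produces roughly equal-sized groups. The prices are synthetic and only illustrate the difference:

```python
import pandas as pd

# Synthetic right-skewed prices; illustrative only
prices = pd.Series([50, 60, 70, 80, 90, 100, 120, 150, 200, 1000])
labels = ['very low', 'low', 'medium', 'high', 'very high']

# Equal-width bins: each bin spans the same range of values
equal_width = pd.cut(prices, bins=5, labels=labels)

# Quantile bins: each bin holds roughly the same number of rows
equal_count = pd.qcut(prices, q=5, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Which of the two is preferable depends on whether the categories should reflect price ranges or population shares.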

### Categorizing `YEARS_BIRTH` column

We will be categorizing 'YEARS_BIRTH' instead of 'DAYS_BIRTH' as years are easier to interpret than days

In [None]:
df['YEARS_BIRTH'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['YEARS_BIRTH'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

Getting Statistical summary for `YEARS_BIRTH`

In [None]:
df['YEARS_BIRTH'].describe(percentiles = [0.25,0.75,0.99,0.9999])

Categorize the values from `YEARS_BIRTH` into a new column `YEARS_BIRTH_CATEGORY`

In [None]:
df['YEARS_BIRTH_CATEGORY'] = df['YEARS_BIRTH'].apply(age_cat)

Checking that the values are populated as per expectation

In [None]:
df['YEARS_BIRTH_CATEGORY'].value_counts().sort_values(ascending = False)
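
The cell above relies on the `age_cat` helper. A vectorized alternative is `pd.cut` with explicit edges. In the sketch below, the band edges and labels are hypothetical and are not necessarily the ones `age_cat` uses:

```python
import pandas as pd

# Synthetic ages standing in for df['YEARS_BIRTH']
ages = pd.Series([23, 31, 45, 58, 67], name="YEARS_BIRTH")

# Hypothetical band edges and labels, chosen only for illustration
edges = [0, 30, 40, 50, 60, 120]
names = ['<30', '30-40', '40-50', '50-60', '60+']

# right=False makes each band include its lower edge, e.g. [30, 40)
age_band = pd.cut(ages, bins=edges, labels=names, right=False)
print(age_band.tolist())
```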

### Categorizing `YEARS_REGISTRATION` column

We will be categorizing 'YEARS_REGISTRATION' instead of 'DAYS_REGISTRATION' as years are easier to interpret than days

In [None]:
df['YEARS_REGISTRATION'].value_counts().sort_values(ascending = False).head()

In [None]:
(df['YEARS_REGISTRATION'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

Getting Statistical summary for `YEARS_REGISTRATION`

In [None]:
df['YEARS_REGISTRATION'].describe(percentiles = [0.25,0.75,0.99,0.9999])

Categorize the values from `YEARS_REGISTRATION` into a new column `YEARS_REGISTRATION_CATEGORY`

In [None]:
df['YEARS_REGISTRATION_CATEGORY'] = df['YEARS_REGISTRATION'].apply(age_cat)

Checking that the values are populated as per expectation

In [None]:
df['YEARS_REGISTRATION_CATEGORY'].value_counts().sort_values(ascending = False)

# Checking Imbalance for target column `TARGET`

### Analyzing `TARGET` column

In [None]:
df['TARGET'].value_counts().sort_values(ascending = False)

In [None]:
df['TARGET'].value_counts(normalize = True).sort_values(ascending = False) * 100

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x = df['TARGET'], data = df)
plt.title('Checking imbalance ratio of TARGET variable')
plt.show()

In [None]:
plt.figure(figsize = [20,8])
(df['TARGET'].value_counts(normalize=True)*100).plot.bar(title = "Checking imbalance percentage for TARGET variable", color=['black', 'red', 'green', 'blue', 'cyan'])
plt.xticks(rotation=0)
plt.show()

**Observations**

-  The `TARGET` variable is imbalanced, based on the percentage of observations in each class
 - `TARGET` value 1 represents clients with payment difficulties (late payment of more than X days on at least one of the first Y installments of the loan). This is only 8.07% of the data
 - `TARGET` value 0 represents all other cases. This is 91.93% of the data
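
The imbalance can also be summarized as a single ratio of majority to minority rows. A minimal sketch, using a synthetic target with roughly the proportions reported above (92 to 8 instead of the exact 91.93 to 8.07):

```python
import pandas as pd

# Synthetic target; proportions rounded for illustration
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()

# Number of majority-class rows for every minority-class row
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(round(imbalance_ratio, 2))
```

On the real data the same two lines would run against `df['TARGET']`.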

# Data split based on `TARGET`

### Create new dataframe with `TARGET` value 1
- `TARGET` value 1 represents clients with payment difficulties (late payment of more than X days on at least one of the first Y installments of the loan). This is only 8.07% of the data

In [None]:
df1 = df[df['TARGET'] == 1]

In [None]:
df1.TARGET.value_counts()

### Create new dataframe with `TARGET` value 0
- `TARGET` value 0 represents all other cases. This is 91.93% of the data

In [None]:
df0 = df[df['TARGET'] == 0]

In [None]:
df0.TARGET.value_counts()

# Univariate analysis of categorical variables

### Analysis of `NAME_CONTRACT_TYPE`

`NAME_CONTRACT_TYPE` with payment difficulties

In [None]:
df1['NAME_CONTRACT_TYPE'].value_counts().sort_values(ascending = False)

`NAME_CONTRACT_TYPE` with on-time payments

In [None]:
df0['NAME_CONTRACT_TYPE'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot('NAME_CONTRACT_TYPE')

In [None]:
fn_uni_piechart('NAME_CONTRACT_TYPE')

**Observations**

- Looking at the count plot, we don't see significant differences in `NAME_CONTRACT_TYPE` between clients with payment difficulties and clients with on-time payments
- The pie chart shows the same picture: there are no significant differences in `NAME_CONTRACT_TYPE` between the two groups

**Conclusion**
- `NAME_CONTRACT_TYPE` does not clearly distinguish clients with payment difficulties from clients with on-time payments

### Analysis of `CODE_GENDER`

`CODE_GENDER` with payment difficulties

In [None]:
df1['CODE_GENDER'].value_counts().sort_values(ascending = False)

`CODE_GENDER` with on-time payments

In [None]:
df0['CODE_GENDER'].value_counts().sort_values(ascending = False)

In [None]:
sorted(df['CODE_GENDER'].unique(), reverse = True)

In [None]:
fn_uni_countplot('CODE_GENDER')

In [None]:
fn_uni_piechart('CODE_GENDER')

**Observations**

- Looking at the count plot and pie chart, the share of "Male" values in `CODE_GENDER` is about 9.5 percentage points lower among clients with on-time payments than among clients with payment difficulties. This is a weak indication that male clients have more payment difficulties.

**Conclusion**
- `CODE_GENDER` column provides a weak inference that "Male" clients have more payment difficulties
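
The group shares read off the pie charts can also be computed as numbers with a normalized cross-tabulation. A minimal sketch on a synthetic sample (the rows below are made up and do not reproduce the dataset's percentages):

```python
import pandas as pd

# Synthetic sample; the notebook would pass df['CODE_GENDER'] and df['TARGET']
sample = pd.DataFrame({
    "CODE_GENDER": ["F", "F", "F", "M", "M", "F", "M", "M", "F", "M"],
    "TARGET":      [0,   0,   0,   0,   1,   1,   1,   0,   0,   0],
})

# normalize='columns' gives the gender mix within each TARGET group, in percent
share = pd.crosstab(sample["CODE_GENDER"], sample["TARGET"], normalize="columns") * 100
print(share.round(1))

# Difference in the male share between the two groups, in percentage points
male_gap = share.loc["M", 1] - share.loc["M", 0]
print(round(male_gap, 1))
```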

### Analysis of `FLAG_OWN_CAR`

`FLAG_OWN_CAR` with payment difficulties

In [None]:
df1['FLAG_OWN_CAR'].value_counts().sort_values(ascending = False)

`FLAG_OWN_CAR` with on-time payments

In [None]:
df0['FLAG_OWN_CAR'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot('FLAG_OWN_CAR')

In [None]:
fn_uni_piechart('FLAG_OWN_CAR')

**Observations**

- Looking at the count plot, we don't see significant differences in `FLAG_OWN_CAR` between clients with payment difficulties and clients with on-time payments
- The pie chart shows the same picture: there are no significant differences in `FLAG_OWN_CAR` between the two groups

**Conclusion**
- `FLAG_OWN_CAR` does not clearly distinguish clients with payment difficulties from clients with on-time payments

### Analysis of `NAME_INCOME_TYPE`

`NAME_INCOME_TYPE` with payment difficulties

In [None]:
df1['NAME_INCOME_TYPE'].value_counts().sort_values(ascending = False)

`NAME_INCOME_TYPE` with on-time payments

In [None]:
df0['NAME_INCOME_TYPE'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot('NAME_INCOME_TYPE')

In [None]:
fn_uni_barplot('NAME_INCOME_TYPE')

**Observations**

- Pensioners have a better record of on-time payments
- Students don't have payment difficulties
- Businessmen don't have payment difficulties

**Conclusion**

- Pensioners have a better record of on-time payments. This is a weak correlation.
- Students don't have payment difficulties. However, there are only 18 student observations in total, so this should be treated as a weak correlation
- Businessmen don't have payment difficulties. However, there are only 10 businessman observations, so this should be treated as a weak correlation
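
Because some categories have very few rows, it helps to show the row count next to the rate for each category. A minimal sketch with synthetic rows (the counts and rates below are illustrative only):

```python
import pandas as pd

# Synthetic sample; category sizes are deliberately uneven
sample = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Working"] * 6 + ["Pensioner"] * 3 + ["Student"],
    "TARGET":           [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

# n is the number of rows per category; default_rate is the mean of TARGET
summary = (sample.groupby("NAME_INCOME_TYPE")["TARGET"]
                 .agg(n="count", default_rate="mean")
                 .sort_values("n", ascending=False))
print(summary)
```

A rate of 0 based on a single row, as for "Student" in this toy sample, carries very little evidence.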

### Analysis of `NAME_EDUCATION_TYPE`

`NAME_EDUCATION_TYPE` with payment difficulties

In [None]:
df1['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending = False)

`NAME_EDUCATION_TYPE` with on-time payments

In [None]:
df0['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot('NAME_EDUCATION_TYPE')

In [None]:
fn_uni_barplot('NAME_EDUCATION_TYPE')

In [None]:
fn_uni_piechart('NAME_EDUCATION_TYPE')

**Observations**

- Clients with 'Higher education' make up a larger share of the on-time payments group than of the payment difficulties group
- Remaining categories don't provide any conclusive results

**Conclusion**

- Clients with 'Higher education' have fewer payment difficulties. However, this is a weak correlation

### Analysis of `NAME_FAMILY_STATUS`

`NAME_FAMILY_STATUS` with payment difficulties

In [None]:
df1['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending = False)

`NAME_FAMILY_STATUS` with on-time payments

In [None]:
df0['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot('NAME_FAMILY_STATUS')

In [None]:
fn_uni_barplot('NAME_FAMILY_STATUS')

In [None]:
fn_uni_piechart('NAME_FAMILY_STATUS')

**Observations**

- 'Married' clients make up 59.8% of the payment difficulties group and 64.2% of the on-time payments group
- 'Widow' clients make up 3.8% of the payment difficulties group and 5.4% of the on-time payments group
- 'Single/not married' clients make up 18.0% of the payment difficulties group and 14.5% of the on-time payments group
- Remaining categories don't provide any conclusive results

**Conclusion**

- Clients who are 'Married' OR 'Widow' are comparatively better at paying on time. However, this is a weak correlation.
- Clients who are 'Single/not married' have comparatively more difficulty paying on time. However, this is a weak correlation.

# Correlation analysis of numerical variables

### Plotting correlation matrix for Payment Difficulties

In [None]:
df1.select_dtypes(include=["int64","float64"]).shape

There are 66 numerical columns. Creating a correlation matrix `corr_df1` to view the results better

In [None]:
corr_df1 = df1.select_dtypes(include=["int64","float64"]).corr()

In [None]:
corr_df1.head()

Creating a heatmap to view the correlations from 0.8 up to (but not including) 0.9999; the upper bound excludes the self-correlations on the diagonal

In [None]:
plt.figure(figsize = (25,25))
sns.heatmap(data = corr_df1[(corr_df1 >= 0.8) & (corr_df1 < 0.9999)], annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()

### Getting top 10 correlations for Payment Difficulties

In [None]:
corr_df1[corr_df1 <= 0.99].unstack().sort_values(ascending = False).head(22)

As each pair appears twice in the output above (once in each order), we remove the duplicates and get the top 10 correlations as:

- AMT_GOODS_PRICE              AMT_CREDIT                    0.98
- REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY   0.96
- CNT_FAM_MEMBERS              CNT_CHILDREN                  0.89
- DEF_60_CNT_SOCIAL_CIRCLE     DEF_30_CNT_SOCIAL_CIRCLE      0.87
- REG_REGION_NOT_WORK_REGION   LIVE_REGION_NOT_WORK_REGION   0.85
- LIVE_CITY_NOT_WORK_CITY      REG_CITY_NOT_WORK_CITY        0.78
- AMT_ANNUITY                  AMT_GOODS_PRICE               0.75
- AMT_ANNUITY                  AMT_CREDIT                    0.75
- DAYS_EMPLOYED                FLAG_DOCUMENT_6               0.62
- DAYS_BIRTH                   DAYS_EMPLOYED                 0.58
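
The duplicate pairs can also be dropped in code instead of by eye, by keeping only the strict upper triangle of the correlation matrix before unstacking. A minimal sketch on a small synthetic frame (the column names `a`, `b`, `c` are placeholders):

```python
import numpy as np
import pandas as pd

# Small synthetic frame; in the notebook this would be the numeric part of df1
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                      # independent noise
})

corr = demo.corr()

# Keep the strict upper triangle: each pair appears once, diagonal removed
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(top_pairs.head(10))
```

With this approach the `<= 0.99` filter is no longer needed to hide the diagonal, and `head(10)` returns ten distinct pairs.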

### Plotting correlation matrix for On-Time payments

In [None]:
df0.select_dtypes(include=["int64","float64"]).shape

There are 66 numerical columns. Creating a correlation matrix `corr_df0` to view the results better

In [None]:
corr_df0 = df0.select_dtypes(include=["int64","float64"]).corr()

In [None]:
corr_df0.head()

Creating a heatmap to view the correlations from 0.8 up to (but not including) 0.9999; the upper bound excludes the self-correlations on the diagonal

In [None]:
plt.figure(figsize = (25,25))
sns.heatmap(data = corr_df0[(corr_df0 >= 0.8) & (corr_df0 < 0.9999)], annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()

### Getting top 10 correlations for On-Time Payments

In [None]:
corr_df0[corr_df0 <= 0.99].unstack().sort_values(ascending = False).head(28)

As each pair appears twice in the output above (once in each order), we remove the duplicates and get the top 10 correlations as:

- AMT_GOODS_PRICE              AMT_CREDIT                    0.99
- REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY   0.95
- CNT_FAM_MEMBERS              CNT_CHILDREN                  0.88
- REG_REGION_NOT_WORK_REGION   LIVE_REGION_NOT_WORK_REGION   0.86
- DEF_30_CNT_SOCIAL_CIRCLE     DEF_60_CNT_SOCIAL_CIRCLE      0.86
- LIVE_CITY_NOT_WORK_CITY      REG_CITY_NOT_WORK_CITY        0.83
- AMT_ANNUITY                  AMT_GOODS_PRICE               0.78
- AMT_ANNUITY                  AMT_CREDIT                    0.77
- DAYS_BIRTH                   DAYS_EMPLOYED                 0.63
- DAYS_EMPLOYED                FLAG_DOCUMENT_6               0.60

### Comparing top 10 correlations b/w Payment difficulties and On-Time Payments

**Observations**

- The same ten column pairs appear in the top 10 for both Payment difficulties and On-Time Payments, with only minor differences in the correlation coefficients and their order
- The highest correlation is for the combination of `AMT_GOODS_PRICE` and `AMT_CREDIT`
- For the Payment difficulties dataset, the correlation between `AMT_GOODS_PRICE` and `AMT_CREDIT` is 0.98
- For the On-Time Payments dataset, the correlation between `AMT_GOODS_PRICE` and `AMT_CREDIT` is 0.99

# Univariate analysis of numerical variables

### Analysis of `AMT_CREDIT`

#### Outlier identification of `AMT_CREDIT` with Payment difficulties

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df1['AMT_CREDIT'].quantile(0.25)
Q3 = df1['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))

#### Outlier identification of `AMT_CREDIT` with On-Time Payments

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df0['AMT_CREDIT'].quantile(0.25)
Q3 = df0['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
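
The two IQR calculations above can also be done in one pass by grouping on `TARGET`. A minimal sketch on synthetic data (the amounts below are illustrative only):

```python
import pandas as pd

# Synthetic sample with both TARGET groups
sample = pd.DataFrame({
    "TARGET":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "AMT_CREDIT": [100, 200, 300, 400, 500, 150, 250, 350, 450, 550],
})

# One row per TARGET value, one column per quartile (0.25 and 0.75)
q = sample.groupby("TARGET")["AMT_CREDIT"].quantile([0.25, 0.75]).unstack()
iqr = q[0.75] - q[0.25]

# Lower and upper fences per group
fences = pd.DataFrame({
    "lower": q[0.25] - 1.5 * iqr,
    "upper": q[0.75] + 1.5 * iqr,
})
print(fences)
```

The `upper` column then plays the role of `Max_value0` and `Max_value1`.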

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_CREDIT'] <= Max_value1].AMT_CREDIT,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_CREDIT'] <= Max_value0].AMT_CREDIT,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

**Observations**

- For `AMT_CREDIT` between 250000 and approximately 650000, the density is higher for clients with Payment difficulties
- For `AMT_CREDIT` > 750000, the density is higher for clients with On-Time Payments

### Analysis of `YEARS_BIRTH`

#### Outlier identification of `YEARS_BIRTH` with Payment difficulties

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df1['YEARS_BIRTH'].quantile(0.25)
Q3 = df1['YEARS_BIRTH'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))

#### Outlier identification of `YEARS_BIRTH` with On-Time Payments

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df0['YEARS_BIRTH'].quantile(0.25)
Q3 = df0['YEARS_BIRTH'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['YEARS_BIRTH'] <= Max_value1].YEARS_BIRTH,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['YEARS_BIRTH'] <= Max_value0].YEARS_BIRTH,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

**Observations**

- For `YEARS_BIRTH` between 20 and 40, the density is higher for clients with Payment difficulties
- Conversely, for `YEARS_BIRTH` > 40, the density is higher for clients with On-Time Payments

### Analysis of `AMT_GOODS_PRICE`

#### Outlier identification of `AMT_GOODS_PRICE` with Payment difficulties

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df1['AMT_GOODS_PRICE'].quantile(0.25)
Q3 = df1['AMT_GOODS_PRICE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))

#### Outlier identification of `AMT_GOODS_PRICE` with On-Time Payments

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df0['AMT_GOODS_PRICE'].quantile(0.25)
Q3 = df0['AMT_GOODS_PRICE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_GOODS_PRICE'] <= Max_value1].AMT_GOODS_PRICE,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_GOODS_PRICE'] <= Max_value0].AMT_GOODS_PRICE,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

**Observations**

- For `AMT_GOODS_PRICE` between ~250000 and ~550000, the density is higher for clients with Payment difficulties
- Elsewhere there are intermittent spikes, but they do not support any conclusive observations

### Analysis of `DAYS_EMPLOYED`

#### Outlier identification of `DAYS_EMPLOYED` with Payment difficulties

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df1['DAYS_EMPLOYED'].quantile(0.25)
Q3 = df1['DAYS_EMPLOYED'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))

#### Outlier identification of `DAYS_EMPLOYED` with On-Time Payments

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df0['DAYS_EMPLOYED'].quantile(0.25)
Q3 = df0['DAYS_EMPLOYED'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['DAYS_EMPLOYED'] <= Max_value1].DAYS_EMPLOYED,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['DAYS_EMPLOYED'] <= Max_value0].DAYS_EMPLOYED,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

**Observations**

- For `DAYS_EMPLOYED` less than 2000, the density is higher for clients with Payment difficulties
- Conversely, for `DAYS_EMPLOYED` > 2000, the density is higher for clients with On-Time Payments
- This suggests that clients who have been employed longer have a better chance of repaying the loan

### Analysis of `CNT_CHILDREN`

#### Outlier identification of `CNT_CHILDREN` with Payment difficulties

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df1['CNT_CHILDREN'].quantile(0.25)
Q3 = df1['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))

#### Outlier identification of `CNT_CHILDREN` with On-Time Payments

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df0['CNT_CHILDREN'].quantile(0.25)
Q3 = df0['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['CNT_CHILDREN'] <= Max_value1].CNT_CHILDREN,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['CNT_CHILDREN'] <= Max_value0].CNT_CHILDREN,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

**Observations**

- For `CNT_CHILDREN` = 0 (clients with no children), clients with On-Time Payments are clearly more concentrated
- For `CNT_CHILDREN` = 1 or 2 (clients with 1 or 2 children), clients with On-Time Payments are only slightly more concentrated
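
`CNT_CHILDREN` takes only a few integer values, so a per-group proportion table is a useful numeric complement to the smoothed density curves. A minimal sketch on synthetic rows (the counts below are illustrative only):

```python
import pandas as pd

# Synthetic sample with both TARGET groups
sample = pd.DataFrame({
    "TARGET":       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "CNT_CHILDREN": [0, 0, 0, 1, 1, 2, 0, 0, 1, 2],
})

# Share of each child count within each TARGET group, in percent
share = (sample.groupby("TARGET")["CNT_CHILDREN"]
               .value_counts(normalize=True)
               .unstack(fill_value=0) * 100)
print(share.round(1))
```

Each row of `share` sums to 100, so the two groups can be compared directly despite their different sizes.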

### Analysis of `AMT_INCOME_TOTAL`

#### Outlier identification of `AMT_INCOME_TOTAL` with Payment difficulties

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df1['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = df1['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))

#### Outlier identification of `AMT_INCOME_TOTAL` with On-Time Payments

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df0['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = df0['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are treated as outliers

In [None]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_INCOME_TOTAL'] <= Max_value1].AMT_INCOME_TOTAL,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_INCOME_TOTAL'] <= Max_value0].AMT_INCOME_TOTAL,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

**Observations**

- For clients with Payment difficulties, the distribution of `AMT_INCOME_TOTAL` approximately resembles a normal distribution
- For clients with On-Time Payments, there are erratic spikes in the distribution, which do not support any clear observation
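
A visual impression of normality can be backed by a simple numeric check such as sample skewness, which is close to 0 for a symmetric distribution. A minimal sketch comparing two synthetic series (the parameters are arbitrary and are not fitted to the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic, roughly symmetric incomes
symmetric = pd.Series(rng.normal(loc=150000, scale=30000, size=5000))

# Synthetic, right-skewed incomes
skewed = pd.Series(rng.lognormal(mean=12, sigma=0.6, size=5000))

# Skewness near 0 suggests symmetry; large positive values suggest a long right tail
print(round(symmetric.skew(), 2), round(skewed.skew(), 2))
```

On the real data the same check would be `Series.skew()` on the outlier-trimmed `AMT_INCOME_TOTAL` of each group.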

# Bivariate/Multivariate analysis

## Continuous V/S Continuous variables

### Analysis of `AMT_GOODS_PRICE` V/S `AMT_CREDIT`

**Outlier identification of `AMT_GOODS_PRICE` with Payment difficulties**

In [None]:
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE

**Outlier identification of `AMT_CREDIT` with Payment difficulties**

In [None]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT

**Outlier identification of `AMT_GOODS_PRICE` with On-Time Payments**

In [None]:
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE

**Outlier identification of `AMT_CREDIT` with On-Time Payments**

In [None]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT

Plotting scatterplot for comparison with outliers removed

In [None]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
# Filter once on both thresholds so that x and y come from the same rows
sns.scatterplot(x = 'AMT_GOODS_PRICE', y = 'AMT_CREDIT',
                data = df1[(df1['AMT_GOODS_PRICE'] < max_value1_AMT_GOODS_PRICE) & (df1['AMT_CREDIT'] < max_value1_AMT_CREDIT)])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
# Filter once on both thresholds so that x and y come from the same rows
sns.scatterplot(x = 'AMT_GOODS_PRICE', y = 'AMT_CREDIT',
                data = df0[(df0['AMT_GOODS_PRICE'] < max_value0_AMT_GOODS_PRICE) & (df0['AMT_CREDIT'] < max_value0_AMT_CREDIT)])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- `AMT_GOODS_PRICE` and `AMT_CREDIT` have a strong positive correlation. This means that as the goods price increases, so does the credit amount
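
The strength of a relationship seen in a scatterplot can be quantified with `Series.corr`, which computes the Pearson coefficient by default. A minimal sketch on synthetic data built to be strongly related (the slope and noise level are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic goods prices and a credit amount that closely follows them
goods = rng.uniform(50000, 1000000, size=500)
demo = pd.DataFrame({
    "AMT_GOODS_PRICE": goods,
    "AMT_CREDIT": goods * 1.1 + rng.normal(scale=20000, size=500),
})

# Pearson correlation between the two columns
r = demo["AMT_GOODS_PRICE"].corr(demo["AMT_CREDIT"])
print(round(r, 3))
```

On the real data the same call would be applied to the rows that remain after the outlier filter.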

### Analysis of `AMT_ANNUITY` V/S `AMT_CREDIT`

**Outlier identification of `AMT_ANNUITY` with Payment difficulties**

In [None]:
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY

**Outlier identification of `AMT_CREDIT` with Payment difficulties**

In [None]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT

**Outlier identification of `AMT_ANNUITY` with On-Time Payments**

In [None]:
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY

**Outlier identification of `AMT_CREDIT` with On-Time Payments**

In [None]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT

Plotting scatterplot for comparison with outliers removed

In [None]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
# Filter once on both thresholds so that x and y come from the same rows
sns.scatterplot(x = 'AMT_ANNUITY', y = 'AMT_CREDIT',
                data = df1[(df1['AMT_ANNUITY'] < max_value1_AMT_ANNUITY) & (df1['AMT_CREDIT'] < max_value1_AMT_CREDIT)])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
# Filter once on both thresholds so that x and y come from the same rows
sns.scatterplot(x = 'AMT_ANNUITY', y = 'AMT_CREDIT',
                data = df0[(df0['AMT_ANNUITY'] < max_value0_AMT_ANNUITY) & (df0['AMT_CREDIT'] < max_value0_AMT_CREDIT)])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- `AMT_ANNUITY` and `AMT_CREDIT` have a strong positive correlation. This means that as the annuity amount increases, so does the credit amount

### Analysis of `DAYS_EMPLOYED` V/S `AMT_INCOME_TOTAL`

**Outlier identification of `DAYS_EMPLOYED` with Payment difficulties**

In [None]:
max_value1_DAYS_EMPLOYED = outlier_range(df1,'DAYS_EMPLOYED')
max_value1_DAYS_EMPLOYED

**Outlier identification of `AMT_INCOME_TOTAL` with Payment difficulties**

In [None]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL

**Outlier identification of `DAYS_EMPLOYED` with On-Time Payments**

In [None]:
max_value0_DAYS_EMPLOYED = outlier_range(df0,'DAYS_EMPLOYED')
max_value0_DAYS_EMPLOYED

**Outlier identification of `AMT_INCOME_TOTAL` with On-Time Payments**

In [None]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL

Plotting scatterplot for comparison with outliers removed

In [None]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
# Filter once on both thresholds so that x and y come from the same rows
sns.scatterplot(x = 'DAYS_EMPLOYED', y = 'AMT_INCOME_TOTAL',
                data = df1[(df1['DAYS_EMPLOYED'] < max_value1_DAYS_EMPLOYED) & (df1['AMT_INCOME_TOTAL'] < max_value1_AMT_INCOME_TOTAL)])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
# Filter once on both thresholds so that x and y come from the same rows
sns.scatterplot(x = 'DAYS_EMPLOYED', y = 'AMT_INCOME_TOTAL',
                data = df0[(df0['DAYS_EMPLOYED'] < max_value0_DAYS_EMPLOYED) & (df0['AMT_INCOME_TOTAL'] < max_value0_AMT_INCOME_TOTAL)])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- Clients who have been employed for a long time (more than 7000 days) make their payments on time; this category of clients does not appear in the Payment difficulties group
- Even looking at Payment difficulties group, clients with more than 4000 days of employment are sparse

### Analysis of `AMT_CREDIT` V/S `DAYS_BIRTH`

**Outlier identification of `AMT_CREDIT` with Payment difficulties**

In [None]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT

**Outlier identification of `DAYS_BIRTH` with Payment difficulties**

In [None]:
max_value1_DAYS_BIRTH = outlier_range(df1,'DAYS_BIRTH')
max_value1_DAYS_BIRTH

**Outlier identification of `AMT_CREDIT` with On-Time Payments**

In [None]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT

**Outlier identification of `DAYS_BIRTH` with On-Time Payments**

In [None]:
max_value0_DAYS_BIRTH = outlier_range(df0,'DAYS_BIRTH')
max_value0_DAYS_BIRTH

Plotting scatterplot for comparison with outliers removed

In [None]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
# Filter both columns on the same rows so each point keeps its own x/y pair
mask1 = (df1['AMT_CREDIT'] < max_value1_AMT_CREDIT) & (df1['DAYS_BIRTH'] < max_value1_DAYS_BIRTH)
sns.scatterplot(x = 'AMT_CREDIT', y = 'DAYS_BIRTH', data = df1[mask1])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
# Filter both columns on the same rows so each point keeps its own x/y pair
mask0 = (df0['AMT_CREDIT'] < max_value0_AMT_CREDIT) & (df0['DAYS_BIRTH'] < max_value0_DAYS_BIRTH)
sns.scatterplot(x = 'AMT_CREDIT', y = 'DAYS_BIRTH', data = df0[mask0])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- There is no observable correlation between Days of Birth and Amount of Credit

### Analysis of `AMT_ANNUITY` V/S `AMT_GOODS_PRICE`

**Outlier identification of `AMT_ANNUITY` with Payment difficulties**

In [None]:
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY

**Outlier identification of `AMT_GOODS_PRICE` with Payment difficulties**

In [None]:
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE

**Outlier identification of `AMT_ANNUITY` with On-Time Payments**

In [None]:
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY

**Outlier identification of `AMT_GOODS_PRICE` with On-Time Payments**

In [None]:
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE

Plotting scatterplot for comparison with outliers removed

In [None]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
# Filter both columns on the same rows so each point keeps its own x/y pair
mask1 = (df1['AMT_ANNUITY'] < max_value1_AMT_ANNUITY) & (df1['AMT_GOODS_PRICE'] < max_value1_AMT_GOODS_PRICE)
sns.scatterplot(x = 'AMT_ANNUITY', y = 'AMT_GOODS_PRICE', data = df1[mask1])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
# Filter both columns on the same rows so each point keeps its own x/y pair
mask0 = (df0['AMT_ANNUITY'] < max_value0_AMT_ANNUITY) & (df0['AMT_GOODS_PRICE'] < max_value0_AMT_GOODS_PRICE)
sns.scatterplot(x = 'AMT_ANNUITY', y = 'AMT_GOODS_PRICE', data = df0[mask0])
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- `AMT_ANNUITY` and `AMT_GOODS_PRICE` have strong positive correlation. This means that as Annuity increases, so does Goods Price

## Continuous V/S Categorical variables

### Analysis of `NAME_EDUCATION_TYPE` V/S `AMT_CREDIT` V/S `CODE_GENDER`

**Outlier identification of `AMT_CREDIT` with Payment difficulties**

In [None]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT

**Outlier identification of `AMT_CREDIT` with On-Time Payments**

In [None]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['NAME_EDUCATION_TYPE','CODE_GENDER']).AMT_CREDIT.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['NAME_EDUCATION_TYPE','CODE_GENDER']).AMT_CREDIT.describe().head()

In [None]:
fn_bi_boxplot('NAME_EDUCATION_TYPE','AMT_CREDIT',max_value1_AMT_CREDIT,max_value0_AMT_CREDIT,'CODE_GENDER')

**Observations**
- Clients with `Academic Degree` have a wide range of credit amounts in the On-Time Payments group, whereas the range is much narrower in the Payment difficulties group
- Looking at the summary statistics, clients with `Academic Degree` and Payment difficulties have a much higher mean and median credit amount than On-Time Payment clients
- In this data, `Male` clients with `Academic Degree` appear only in the On-Time Payments group

### Analysis of `NAME_FAMILY_STATUS` V/S `AMT_INCOME_TOTAL` V/S `CODE_GENDER`

**Outlier identification of `AMT_INCOME_TOTAL` with Payment difficulties**

In [None]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL

**Outlier identification of `AMT_INCOME_TOTAL` with On-Time Payments**

In [None]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['NAME_FAMILY_STATUS','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['NAME_FAMILY_STATUS','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()

In [None]:
fn_bi_boxplot('NAME_FAMILY_STATUS','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')

**Observations**
- `Married` clients in the On-Time Payments group have a slightly higher mean/median income than those in the Payment difficulties group

### Analysis of `YEARS_BIRTH_CATEGORY` V/S `AMT_INCOME_TOTAL` V/S `NAME_HOUSING_TYPE`

**Outlier identification of `AMT_INCOME_TOTAL` with Payment difficulties**

In [None]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL

**Outlier identification of `AMT_INCOME_TOTAL` with On-Time Payments**

In [None]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['YEARS_BIRTH_CATEGORY','NAME_HOUSING_TYPE']).AMT_INCOME_TOTAL.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['YEARS_BIRTH_CATEGORY','NAME_HOUSING_TYPE']).AMT_INCOME_TOTAL.describe().head()

In [None]:
fn_bi_boxplot('YEARS_BIRTH_CATEGORY','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'NAME_HOUSING_TYPE')

**Observations**
- Clients aged `60-70` and living in a `Co-op apartment` have a much wider income range in the Payment difficulties category than in On-Time Payments
- Clients aged `20-30` and living in an `Office apartment` have a much higher median income in On-Time Payments than in the Payment difficulties category

### Analysis of `FLAG_OWN_CAR` V/S `AMT_ANNUITY` V/S `CODE_GENDER`

**Outlier identification of `AMT_ANNUITY` with Payment difficulties**

In [None]:
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY

**Outlier identification of `AMT_ANNUITY` with On-Time Payments**

In [None]:
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['FLAG_OWN_CAR','CODE_GENDER']).AMT_ANNUITY.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['FLAG_OWN_CAR','CODE_GENDER']).AMT_ANNUITY.describe().head()

In [None]:
fn_bi_boxplot('FLAG_OWN_CAR','AMT_ANNUITY',max_value1_AMT_ANNUITY,max_value0_AMT_ANNUITY,'CODE_GENDER')

**Observations**
- We don't find any significant observations

### Analysis of `NAME_INCOME_TYPE` V/S `AMT_GOODS_PRICE` V/S `CODE_GENDER`

**Outlier identification of `AMT_GOODS_PRICE` with Payment difficulties**

In [None]:
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE

**Outlier identification of `AMT_GOODS_PRICE` with On-Time Payments**

In [None]:
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_GOODS_PRICE.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_GOODS_PRICE.describe().head()

In [None]:
fn_bi_boxplot('NAME_INCOME_TYPE','AMT_GOODS_PRICE',max_value1_AMT_GOODS_PRICE,max_value0_AMT_GOODS_PRICE,'CODE_GENDER')

**Observations**
- Clients who are `Unemployed` and `Male` have a much higher goods price in On-Time Payments than in Payment difficulties
- `Student` clients, whether `Male` or `Female`, make their payments on time. They are completely missing from the Payment difficulties category, so `Student` seems to be an attractive category to give loans to.
- `Businessman` clients, whether `Male` or `Female`, make their payments on time. They are completely missing from the Payment difficulties category, so `Businessman` seems to be an attractive category to give loans to.

### Analysis of `NAME_INCOME_TYPE` V/S `AMT_INCOME_TOTAL` V/S `CODE_GENDER`

**Outlier identification of `AMT_INCOME_TOTAL` with Payment difficulties**

In [None]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL

**Outlier identification of `AMT_INCOME_TOTAL` with On-Time Payments**

In [None]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()

In [None]:
fn_bi_boxplot('NAME_INCOME_TYPE','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')

**Observations**
- Clients who are `Unemployed` and `Male` have a much higher income in On-Time Payments than in Payment difficulties
- `Student` clients, whether `Male` or `Female`, make their payments on time. They are completely missing from the Payment difficulties category, so `Student` seems to be an attractive category to give loans to.
- `Businessman` clients, whether `Male` or `Female`, make their payments on time. They are completely missing from the Payment difficulties category, so `Businessman` seems to be an attractive category to give loans to.
- Clients who are on `Maternity Leave` and `Female` have a much higher income in On-Time Payments than in Payment difficulties

### Analysis of `OCCUPATION_TYPE` V/S `AMT_INCOME_TOTAL` V/S `CODE_GENDER`

**Outlier identification of `AMT_INCOME_TOTAL` with Payment difficulties**

In [None]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL

**Outlier identification of `AMT_INCOME_TOTAL` with On-Time Payments**

In [None]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL

**Client with Payment difficulties**

In [None]:
df1.groupby(by = ['OCCUPATION_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()

**Client with On-Time Payments**

In [None]:
df0.groupby(by = ['OCCUPATION_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()

In [None]:
fn_bi_boxplot('OCCUPATION_TYPE','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')

**Observations**
- Clients who are `Waiters/barmen staff` and `Female` have a lower median income in On-Time Payments than in Payment difficulties
- Clients who are `Cleaning staff` and `Female` have a higher median income in On-Time Payments than in Payment difficulties
- Clients who are `HR Staff` and `Male` have a higher median income in Payment difficulties than in On-Time Payments
- Clients who are `Managers` and `Male` have a higher median income in On-Time Payments than in Payment difficulties

## Categorical V/S Categorical variables

### Analysis of `NAME_INCOME_TYPE` V/S `CODE_GENDER`

In [None]:
fn_bi_countplot('NAME_INCOME_TYPE','CODE_GENDER')

**Observations**
- Clients who are `Working` and `Male` have more Payment difficulties compared to On-Time Payments
- Clients who are `Pensioner` and `Female` have more Payment difficulties compared to On-Time Payments
- Clients who are `Businessman` and `Students` do their payments On-Time though their record count is low
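
Comparisons like these are easier to check when the share of clients with payment difficulties is computed per group, together with the group size. The following is a rough sketch on a made-up frame; it assumes a `TARGET` column where 1 means payment difficulties and 0 means on-time payments, and the rows are illustrative only.

```python
import pandas as pd

# Toy stand-in for the application data; TARGET = 1 is assumed to mean
# payment difficulties and TARGET = 0 on-time payments
toy = pd.DataFrame({
    'NAME_INCOME_TYPE': ['Working', 'Working', 'Working', 'Working',
                         'Pensioner', 'Pensioner', 'Student', 'Student'],
    'CODE_GENDER':      ['M', 'M', 'F', 'F', 'F', 'F', 'M', 'F'],
    'TARGET':           [1, 0, 0, 0, 1, 0, 0, 0],
})

# Share of clients with payment difficulties and group size per combination
summary = (toy.groupby(['NAME_INCOME_TYPE', 'CODE_GENDER'])['TARGET']
              .agg(difficulty_rate='mean', clients='count')
              .reset_index())
print(summary)
```

Reporting `clients` next to `difficulty_rate` makes it visible when a rate of 0 rests on only a handful of records, as is the case for the low-count categories mentioned above.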

### Analysis of `NAME_EDUCATION_TYPE` V/S `CODE_GENDER`

In [None]:
fn_bi_countplot('NAME_EDUCATION_TYPE','CODE_GENDER')

**Observations**
- Clients who have `Secondary/Secondary special` education and `Male` have more Payment difficulties compared to On-Time Payments
- Clients who have `Higher education` and `Female` have more On-Time Payments compared to Payment difficulties

### Analysis of `NAME_FAMILY_STATUS` V/S `OCCUPATION_TYPE`

In [None]:
fn_bi_countplot('OCCUPATION_TYPE','NAME_FAMILY_STATUS')

**Observations**
- Clients who are `Single/not married`, `Married` & `Civil marriage` and are `Waiters/barmen staff` have more Payment difficulties compared to On-Time Payments
- Clients who are `Single/not married` & `Married` and are `Laborers` have more Payment difficulties compared to On-Time Payments
- Clients who are `Married` and are `Drivers` have more Payment difficulties compared to On-Time Payments
- `Married` and `Accountants` have better On-Time Payments

### Analysis of `ORGANIZATION_TYPE` V/S `FLAG_OWN_CAR`

In [None]:
fn_bi_countplot('ORGANIZATION_TYPE','FLAG_OWN_CAR')

**Observations**
- Clients who are `Self-employed` and don't own a car have more Payment difficulties compared to On-Time Payments

### Analysis of `OCCUPATION_TYPE` V/S `NAME_CONTRACT_TYPE`

In [None]:
fn_bi_countplot('OCCUPATION_TYPE','NAME_CONTRACT_TYPE')

**Observations**
- Clients who are `Sales staff`,`Laborers`,`Drivers` and have `Cash loans` have more Payment difficulties compared to On-Time Payments

***

# Previous Application Data

***

# Importing the dataset

In [None]:
df_prev = pd.read_csv("../input/credit-eda-case-study/previous_application.csv")

In [None]:
# Checking few records from the dataframe
df_prev.head()

# Check structure of the data

In [None]:
# `null_counts` was deprecated in pandas 1.2 and later removed; `show_counts` replaces it
df_prev.info(verbose = True, show_counts = True)

The non-null counts show that several columns contain missing values; these are examined in the missing values section below

In [None]:
df_prev.shape

There are ~1.67 million rows and 37 columns

## Get statistical summary for numerical variables

In [None]:
df_prev.describe()

# Analyzing categorical variables

In [None]:
df_prev.select_dtypes(include = "object").columns

In [None]:
# Checking number of categorical variables
len(df_prev.select_dtypes(include = "object").columns)

There are 16 `categorical` variables

# Analyzing numerical variables

In [None]:
df_prev.select_dtypes(include=["int64","float64"]).columns

In [None]:
# Checking number of numerical variables
len(df_prev.select_dtypes(include=["int64","float64"]).columns)

There are 21 `numerical` variables

In [None]:
df_prev.select_dtypes(include=["int64","float64"]).head()

# Dealing with incorrect data types

Check if we have any column with incorrect data type

In [None]:
df_prev.dtypes

In [None]:
df_prev.head()

Looking at the data and their corresponding data types, we can conclude that **No Data Type changes are required**

# Dealing with missing values

Check if we have any null values in the dataset

In [None]:
df_prev.isnull().values.any()

Get Total number of null values in the dataset

In [None]:
df_prev.isnull().values.sum()

Getting the list of column(s) which have null values

In [None]:
df_prev.columns[df_prev.isnull().any()]

In [None]:
len(df_prev.columns[df_prev.isnull().any()])

In total, `16` columns have one or more NULL values in the data

## Computing count and percentage of missing values

In [None]:
null_count = df_prev.isnull().sum()
null_percentage = round((df_prev.isnull().sum()/df_prev.shape[0])*100, 2)

In [None]:
null_df = pd.DataFrame({'column_name' : df_prev.columns,'null_count' : null_count,'null_percentage': null_percentage})
null_df.reset_index(drop = True, inplace = True)

In [None]:
null_df.sort_values(by = 'null_percentage', ascending = False)

## Removing columns with NULL values > 40%

Getting list of columns with NULL values > 40% into a list. We will be removing these columns from the dataframe as there are too many missing values.

In [None]:
columns_to_be_deleted = null_df[null_df['null_percentage'] > 40].column_name.to_list()

In [None]:
columns_to_be_deleted

In [None]:
len(columns_to_be_deleted)

In total, `11` columns are to be removed. Deleting them from the previous application dataframe **`df_prev`**

In [None]:
df_prev.drop(columns = columns_to_be_deleted, inplace = True)

Checking column count post removal. Only `26` columns should be left

In [None]:
df_prev.shape
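
The threshold-based removal above can also be wrapped in a small reusable function. This is only a sketch: `drop_sparse_columns` and the toy frame are hypothetical names introduced here, and the 40% default simply mirrors the cut-off used in this notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold_pct=40.0):
    """Return a copy of `frame` without columns whose null share exceeds
    `threshold_pct`, plus the list of dropped column names."""
    null_pct = frame.isnull().mean() * 100          # percentage of nulls per column
    to_drop = null_pct[null_pct > threshold_pct].index.tolist()
    return frame.drop(columns=to_drop), to_drop

# Toy data, not taken from previous_application.csv
toy = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],            # 0% missing
    'B': [1.0, np.nan, 3.0, np.nan, np.nan],   # 60% missing
    'C': [np.nan, 2.0, 3.0, 4.0, 5.0],         # 20% missing
})
trimmed, dropped = drop_sparse_columns(toy)
print(dropped, trimmed.shape)
```

Returning the dropped names alongside the new frame keeps a record of what was removed, which helps when the same rule is applied to more than one dataset.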

## Checking columns with NULL values < 40%

Creating dataframe `null_df_under40` with missing column percentages under 40%

In [None]:
null_df_under40 = null_df[null_df['null_percentage'] < 40]

In [None]:
null_df_under40.sort_values(by = 'null_percentage', ascending = False)

### Analysis of `AMT_GOODS_PRICE` column

-  Missing values = 23.08%

In [None]:
df_prev.AMT_GOODS_PRICE.value_counts().head()

In [None]:
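# Pass the series by keyword; recent seaborn versions treat the first positional argument as `data`
sns.boxplot(x = df_prev.AMT_GOODS_PRICE)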
plt.show()

Getting percentile values for `AMT_GOODS_PRICE`

In [None]:
df_prev.AMT_GOODS_PRICE.quantile(q = [0.25,0.5,0.75,1])

Most recurring value in `AMT_GOODS_PRICE`

In [None]:
df_prev.AMT_GOODS_PRICE.mode()[0]

Checking the average value of `AMT_GOODS_PRICE`

In [None]:
df_prev.AMT_GOODS_PRICE.mean()

**Observations**
-  The median (50th percentile) is 112320.00
-  Most recurring value is 45000.0
-  Mean value is 227847.27928334344
-  Since the missing percentage is high (23.08%), it would be better to leave the data as it is and not perform imputations

### Analysis of `AMT_ANNUITY` column

-  Missing values = 22.29%

In [None]:
df_prev.AMT_ANNUITY.value_counts().head()

In [None]:
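# Pass the series by keyword; recent seaborn versions treat the first positional argument as `data`
sns.boxplot(x = df_prev.AMT_ANNUITY)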
plt.show()

Getting percentile values for `AMT_ANNUITY`

In [None]:
df_prev.AMT_ANNUITY.quantile(q = [0.25,0.5,0.75,1])

Most recurring value in `AMT_ANNUITY`

In [None]:
df_prev.AMT_ANNUITY.mode()[0]

Checking the average value of `AMT_ANNUITY`

In [None]:
df_prev.AMT_ANNUITY.mean()

**Observations**
-  The median (50th percentile) is 11250.00
-  Most recurring value is 2250.0
-  Mean value is 15955.120659450406
-  Since the missing percentage is high (22.29%), it would be better to leave the data as it is and not perform imputations

### Analysis of `CNT_PAYMENT` column

-  Missing values = 22.29%

In [None]:
df_prev.CNT_PAYMENT.value_counts().head()

In [None]:
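# Pass the series by keyword; recent seaborn versions treat the first positional argument as `data`
sns.boxplot(x = df_prev.CNT_PAYMENT)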
plt.show()

Getting percentile values for `CNT_PAYMENT`

In [None]:
df_prev.CNT_PAYMENT.quantile(q = [0.25,0.5,0.75,1])

Most recurring value in `CNT_PAYMENT`

In [None]:
df_prev.CNT_PAYMENT.mode()[0]

Checking the average value of `CNT_PAYMENT`

In [None]:
df_prev.CNT_PAYMENT.mean()

**Observations**
-  The median (50th percentile) is 12.00
-  Most recurring value is 12.0
-  Mean value is 16.0540815603274
-  Though the median and mode are the same, the missing percentage is high (22.29%), so it would be better to leave the data as it is and not perform imputations

## Checking columns with NULL values > 0% and < 1%

Creating dataframe `null_df_under1` with missing column percentages values > 0% and < 1%

In [None]:
null_df_under1 = null_df[(null_df['null_percentage'] > 0) & (null_df['null_percentage'] < 1)]

In [None]:
null_df_under1.sort_values(by = 'null_percentage', ascending = False)

### Analysis of `PRODUCT_COMBINATION` column

-  Missing values = 0.02%

In [None]:
df_prev['PRODUCT_COMBINATION'].value_counts().head()

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(data = df_prev, x = "PRODUCT_COMBINATION")
plt.xticks(rotation = 90)
plt.show()

**Observations**
-  Looking at the plot, `Cash` Product_Combination has the highest number of loan applicants
-  We can go ahead and impute `Cash` in the dataframe
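
The imputation itself is not shown in a cell, so here is one way it could look. This is a sketch on a made-up column, assuming the most frequent value is found with `mode()` and filled in with `fillna`; the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df_prev['PRODUCT_COMBINATION']
toy = pd.DataFrame({
    'PRODUCT_COMBINATION': ['Cash', 'Cash', 'POS household with interest',
                            np.nan, 'Cash', np.nan]
})

# mode() ignores NaN and returns the most frequent value(s); take the first
most_frequent = toy['PRODUCT_COMBINATION'].mode()[0]

# Fill only the missing entries with that value
toy['PRODUCT_COMBINATION'] = toy['PRODUCT_COMBINATION'].fillna(most_frequent)
print(toy['PRODUCT_COMBINATION'].value_counts())
```

With only 0.02% of values missing, filling with the mode barely changes the distribution of the column.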

# Dealing with incorrect/unknown data values

### Analysis of `NAME_CASH_LOAN_PURPOSE` column

Checking range of values

In [None]:
df_prev['NAME_CASH_LOAN_PURPOSE'].value_counts(normalize = True).head()

**Observations**
- Though `XAP` and `XNA` don't describe the loan purpose, they most likely stand for Not Applicable and Not Available respectively
- Since these two values make up 99% of the data, we will leave them as is

### Analysis of `DAYS_DECISION` column

In [None]:
df_prev['DAYS_DECISION'].value_counts()

There are 2922 unique values, all of which appear to be negative

In [None]:
df_prev['DAYS_DECISION'].unique()

In [None]:
df_prev['DAYS_DECISION'].nunique()

Converting `DAYS_DECISION` to positive days

In [None]:
df_prev['DAYS_DECISION'] = df_prev['DAYS_DECISION'].abs()

In [None]:
df_prev['DAYS_DECISION'].value_counts()

All Days in `DAYS_DECISION` have positive values

### Analysis of `NAME_PAYMENT_TYPE` column

Checking range of values

In [None]:
df_prev['NAME_PAYMENT_TYPE'].value_counts(normalize = True)

**Observations**
- Though `XNA` doesn't describe the Payment Type, it most likely stands for Not Available
- Since this makes up 38% of the data, we will leave it as is

### Analysis of `CODE_REJECT_REASON` column

Checking range of values

In [None]:
df_prev['CODE_REJECT_REASON'].value_counts(normalize = True)

**Observations**
- Though `XAP` doesn't describe the Reject Reason, it most likely stands for Not Applicable
- Since this makes up 81% of the data, we will leave it as is

### Analysis of `NAME_CLIENT_TYPE` column

Checking range of values

In [None]:
df_prev['NAME_CLIENT_TYPE'].value_counts(normalize=True)

`XNA` value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application

In [None]:
df_prev[df_prev['NAME_CLIENT_TYPE'] == 'XNA'].head()

As data looks valid, we will go ahead and check for an imputation method.
-  `Repeater` applicants make up 74% of applicants
-  And so, we will go ahead and impute `NAME_CLIENT_TYPE` with 'Repeater'

In [None]:
df_prev['NAME_CLIENT_TYPE'] = df_prev['NAME_CLIENT_TYPE'].replace('XNA', 'Repeater')

Checking if `XNA` is removed

In [None]:
df_prev['NAME_CLIENT_TYPE'].value_counts(normalize=True)

There are other columns with `XNA` but we will leave them as is
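
To see how much of each column consists of these placeholders before deciding to leave them, their share can be computed per column. A minimal sketch on made-up rows follows; the column names are taken from this notebook, while the values and the `placeholders` list are assumptions for illustration.

```python
import pandas as pd

# Toy rows; real shares must be computed on df_prev itself
toy = pd.DataFrame({
    'NAME_PAYMENT_TYPE':  ['XNA', 'Cash through the bank', 'XNA', 'Cash through the bank'],
    'CODE_REJECT_REASON': ['XAP', 'XAP', 'HC', 'XAP'],
    'AMT_CREDIT':         [1000.0, 2000.0, 3000.0, 4000.0],
})

placeholders = ['XNA', 'XAP']
text_cols = ['NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON']

# isin() marks placeholder cells; the column mean of booleans is their share
placeholder_pct = (toy[text_cols].isin(placeholders).mean() * 100).sort_values(ascending=False)
print(placeholder_pct)
```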

# Dealing with outliers for numerical columns

### Analysis of `AMT_ANNUITY` column

In [None]:
df_prev['AMT_ANNUITY'].value_counts().sort_values(ascending = False).head()

In [None]:
(df_prev['AMT_ANNUITY'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df_prev.AMT_ANNUITY.quantile(q = [0.25,0.5,0.75,0.99,1])

In [None]:
fn_dist_box(df_prev,'AMT_ANNUITY')

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_prev['AMT_ANNUITY'].quantile(0.25)
Q3 = df_prev['AMT_ANNUITY'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

**Observations**
-  `AMT_ANNUITY` values above 42163.38 are outliers
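
The same IQR steps are repeated for every column below, so they can be collected in one helper. This is a sketch only: `iqr_bounds` is a hypothetical name introduced here (it is not the `outlier_range` function used earlier), and the series is toy data.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values with one obvious extreme
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 500.0])
lower, upper = iqr_bounds(toy)

# Anything outside the fences is flagged as an outlier
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper, outliers.tolist())
```

Amounts cannot be negative, so for columns such as `AMT_ANNUITY` a negative lower fence flags nothing and only the upper fence matters in practice.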

### Analysis of `AMT_APPLICATION` column

In [None]:
df_prev['AMT_APPLICATION'].value_counts().sort_values(ascending = False).head()

In [None]:
(df_prev['AMT_APPLICATION'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df_prev.AMT_APPLICATION.quantile(q = [0.25,0.5,0.75,0.99,1])

In [None]:
fn_dist_box(df_prev,'AMT_APPLICATION')

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_prev['AMT_APPLICATION'].quantile(0.25)
Q3 = df_prev['AMT_APPLICATION'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

**Observations**
-  `AMT_APPLICATION` values above 422820.0 are outliers

### Analysis of `AMT_CREDIT` column

In [None]:
df_prev['AMT_CREDIT'].value_counts().sort_values(ascending = False).head()

In [None]:
(df_prev['AMT_CREDIT'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()

In [None]:
df_prev.AMT_CREDIT.quantile(q = [0.25,0.5,0.75,0.99,1])

In [None]:
fn_dist_box(df_prev,'AMT_CREDIT')

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_prev['AMT_CREDIT'].quantile(0.25)
Q3 = df_prev['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

**Observations**
-  `AMT_CREDIT` values above 504805.5 are outliers

# Merge datasets `df` and `df_prev` into `df_merge`

In [None]:
df_merge = df.merge(df_prev, on='SK_ID_CURR', how='inner')

In [None]:
df.shape

In [None]:
df_prev.shape

In [None]:
df_merge.shape
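
Because one current application can match several previous applications, the merged frame has more rows than `df`, and columns present in both frames receive the default `_x` (current) and `_y` (previous) suffixes. The sketch below shows this on two made-up frames; passing `validate='one_to_many'` is an optional extra check added here, not something the merge above uses.

```python
import pandas as pd

# Toy frames: `current` plays the role of df, `previous` the role of df_prev
current = pd.DataFrame({'SK_ID_CURR': [1, 2, 3],
                        'AMT_CREDIT': [100.0, 200.0, 300.0]})
previous = pd.DataFrame({'SK_ID_CURR': [1, 1, 2, 4],
                         'AMT_CREDIT': [50.0, 60.0, 70.0, 80.0]})

# validate raises MergeError if SK_ID_CURR is not unique on the left side
merged = current.merge(previous, on='SK_ID_CURR', how='inner',
                       validate='one_to_many')
print(merged)
```

Client 1 appears twice (two previous applications), while clients 3 and 4 drop out of the inner join because they exist in only one frame.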

# Check structure of the data

In [None]:
# `null_counts` was deprecated in pandas 1.2 and later removed; `show_counts` replaces it
df_merge.info(verbose = True, show_counts = True)

Columns that were left un-imputed in `df_prev` (for example `AMT_GOODS_PRICE_y`) still contain missing values after the merge

In [None]:
df_merge.shape

There are ~1.41 million rows and 106 columns

In [None]:
df_merge[df_merge['SK_ID_CURR'] == 265681][['AMT_CREDIT_x','AMT_CREDIT_y']].head()

In [None]:
df[df['SK_ID_CURR'] == 265681][['AMT_CREDIT']].head()

## Get statistical summary for numerical variables

In [None]:
df_merge.describe()

# Analyzing categorical variables

In [None]:
df_merge.select_dtypes(include = "object").columns

In [None]:
# Checking number of categorical variables
len(df_merge.select_dtypes(include = "object").columns)

There are 29 `categorical` variables

# Analyzing numerical variables

In [None]:
df_merge.select_dtypes(include=["int64","float64"]).columns

In [None]:
# Checking number of numerical variables
len(df_merge.select_dtypes(include=["int64","float64"]).columns)

There are 76 `numerical` variables

# Univariate analysis of categorical variables

### Analysis of `NAME_CONTRACT_STATUS`

In [None]:
df_merge['NAME_CONTRACT_STATUS'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot_merge('NAME_CONTRACT_STATUS')

In [None]:
fn_uni_piechart_merge('NAME_CONTRACT_STATUS')

**Observations**

- `Approved` loan status is the highest among all loan applications
- `Canceled` loan status is the second highest among all loan applications

### Analysis of `NAME_CLIENT_TYPE`

In [None]:
df_merge['NAME_CLIENT_TYPE'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot_merge('NAME_CLIENT_TYPE')

In [None]:
fn_uni_piechart_merge('NAME_CLIENT_TYPE')

**Observations**

- `Repeater` client type is the highest among all loan applications
- `New` client type is the second highest among all loan applications

### Analysis of `CHANNEL_TYPE`

In [None]:
df_merge['CHANNEL_TYPE'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot_merge('CHANNEL_TYPE')

In [None]:
fn_uni_piechart_merge('CHANNEL_TYPE')

**Observations**

- `Country-wide` Channel type is the highest among all loan applications
- `Credit and cash offices` is the second highest Channel Type among all loan applications

### Analysis of `NAME_YIELD_GROUP`

In [None]:
df_merge['NAME_YIELD_GROUP'].value_counts().sort_values(ascending = False)

In [None]:
fn_uni_countplot_merge('NAME_YIELD_GROUP')

In [None]:
fn_uni_piechart_merge('NAME_YIELD_GROUP')

**Observations**

- `XNA` is the most common yield (interest rate) group among all loan applications
- `middle` and `high` are the second and third most common yield groups among all loan applications

### Analysis of `NAME_GOODS_CATEGORY`

In [None]:
df_merge['NAME_GOODS_CATEGORY'].value_counts().sort_values(ascending = False).head()

In [None]:
fn_uni_countplot_merge('NAME_GOODS_CATEGORY')

In [None]:
fn_uni_piechart_merge('NAME_GOODS_CATEGORY')

**Observations**

- `XNA` goods category is the highest among all loan applications
- `mobile` goods category is the second highest among all loan applications

# Univariate analysis of numerical variables

### Analysis of `AMT_APPLICATION`

#### Outlier identification of `AMT_APPLICATION`

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_merge['AMT_APPLICATION'].quantile(0.25)
Q3 = df_merge['AMT_APPLICATION'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_APPLICATION'] <= Max_value].AMT_APPLICATION, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()

**Observations**

- Most of the loan amount applied by the clients initially seems to be very small as can be seen from the huge spike at the beginning of the distribution

### Analysis of `AMT_ANNUITY_y`

#### Outlier identification of `AMT_ANNUITY_y`

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_merge['AMT_ANNUITY_y'].quantile(0.25)
Q3 = df_merge['AMT_ANNUITY_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_ANNUITY_y'] <= Max_value].AMT_ANNUITY_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()

**Observations**

- Most previous loans have an annuity below 10,000, which is where the distribution peaks
- As the previous loan's annuity increases, the number of clients decreases

### Analysis of `AMT_CREDIT_y`

#### Outlier identification of `AMT_CREDIT_y`

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_merge['AMT_CREDIT_y'].quantile(0.25)
Q3 = df_merge['AMT_CREDIT_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Removing outliers and plotting distplot

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_CREDIT_y'] <= Max_value].AMT_CREDIT_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()

**Observations**

- This distribution very closely resembles that of AMT_APPLICATION. This means that most people received the loan amount that they applied for
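
Similar-looking distributions do not by themselves show that individual clients received what they asked for, so a row-level check is a useful complement. The sketch below uses made-up rows and an arbitrary 5% tolerance chosen for illustration; both are assumptions, not results from `df_merge`.

```python
import pandas as pd

# Toy rows standing in for df_merge[['AMT_APPLICATION', 'AMT_CREDIT_y']]
toy = pd.DataFrame({
    'AMT_APPLICATION': [45000.0, 90000.0, 135000.0, 0.0, 225000.0],
    'AMT_CREDIT_y':    [45000.0, 95000.0, 135000.0, 0.0, 200000.0],
})

# Skip rows with a zero application amount to avoid dividing by zero
valid = toy[toy['AMT_APPLICATION'] > 0]
ratio = valid['AMT_CREDIT_y'] / valid['AMT_APPLICATION']

# Share of applications where the credit is within 5% of the requested amount
share_close = ratio.between(0.95, 1.05).mean()
print(share_close)
```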

### Analysis of `AMT_GOODS_PRICE_y`

#### Outlier identification of `AMT_GOODS_PRICE_y`

Calculating IQR (Inter Quartile range)

In [None]:
Q1 = df_merge['AMT_GOODS_PRICE_y'].quantile(0.25)
Q3 = df_merge['AMT_GOODS_PRICE_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR) are outliers

In [None]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))

Removing values above the upper bound and plotting the distribution

In [None]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_GOODS_PRICE_y'] <= Max_value].AMT_GOODS_PRICE_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()

**Observations**

- In most previous applications, the goods price requested by the client is below 100K

# Correlation analysis of numerical variables

### Plotting correlation matrix

In [None]:
corr_df = df_merge[['AMT_ANNUITY_x', 'AMT_APPLICATION','AMT_CREDIT_x', 'AMT_GOODS_PRICE_x',
                    'AMT_ANNUITY_y', 'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y', 'CNT_PAYMENT']].corr()

In [None]:
corr_df.head()

Creating a HeatMap

In [None]:
plt.figure(figsize = (20,8))
sns.heatmap(data = corr_df, annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()

**Observations**
- `AMT_APPLICATION` has a high correlation with `AMT_ANNUITY_y`, `AMT_CREDIT_y` and `AMT_GOODS_PRICE_y`, and a moderate correlation with `CNT_PAYMENT`
- `AMT_GOODS_PRICE_y` has a high correlation with `AMT_ANNUITY_y`, `AMT_CREDIT_y` and `AMT_APPLICATION`, and a moderate correlation with `CNT_PAYMENT`
- `AMT_CREDIT_y` has a high correlation with `AMT_GOODS_PRICE_y` and a moderate correlation with `CNT_PAYMENT`
- `AMT_ANNUITY_y` has a high correlation with `AMT_GOODS_PRICE_y` and `AMT_CREDIT_y`
- `AMT_ANNUITY_x` has a high correlation with `AMT_GOODS_PRICE_x` and `AMT_CREDIT_x`
- `AMT_CREDIT_x` has a high correlation with `AMT_GOODS_PRICE_x`
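
Reading the strongest pairs off a heatmap by eye is error-prone, so it can help to rank them numerically. The sketch below works on a small made-up frame rather than `df_merge`; `top_pairs` is a hypothetical helper name. It keeps only the upper triangle of the correlation matrix so that each pair appears once, then sorts by absolute value.

```python
import numpy as np
import pandas as pd

def top_pairs(corr, n=3):
    """Return the n variable pairs with the largest absolute correlation."""
    # Mask the diagonal and lower triangle so each pair is listed only once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy frame (made-up values): b is exactly 2 * a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
strongest = top_pairs(toy.corr(), n=1)
print(strongest)
```

Applied to `corr_df`, the same function would return the pairs in descending order of strength, which makes the observations above easier to verify.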

# Bivariate/Multivariate analysis

## Continuous V/S Continuous variables

### Analysis of `AMT_GOODS_PRICE_y` V/S `AMT_CREDIT_y` V/S `NAME_CONTRACT_STATUS`

**Outlier identification of `AMT_GOODS_PRICE_y`**

In [None]:
max_value1_AMT_GOODS_PRICE_y = outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value1_AMT_GOODS_PRICE_y

**Outlier identification of `AMT_CREDIT_y`**

In [None]:
max_value1_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y

Plotting scatterplot

In [None]:
plt.figure(figsize = [20,8])

# Keep only rows where both variables are below their upper outlier bounds
df_plot = df_merge[(df_merge['AMT_GOODS_PRICE_y'] < max_value1_AMT_GOODS_PRICE_y) &
                   (df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y)]
sns.scatterplot(x = 'AMT_GOODS_PRICE_y', y = 'AMT_CREDIT_y',
                data = df_plot, hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- Previous applications with a goods price below 200K and a credit amount above 300K appear more likely to be refused. However, this is a weak pattern, as only a few data points support it
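
One way to check whether this pattern is more than a visual impression is to compare the refusal share inside the region with the share outside it. The sketch below uses a handful of made-up rows with the notebook's column names; the thresholds are taken from the observation, and the resulting numbers say nothing about the real data.

```python
import pandas as pd

# Toy rows (made-up values) using the notebook's column names
toy = pd.DataFrame({
    "AMT_GOODS_PRICE_y": [100_000, 150_000, 180_000, 250_000, 400_000, 120_000],
    "AMT_CREDIT_y":      [350_000, 320_000, 150_000, 260_000, 410_000, 310_000],
    "NAME_CONTRACT_STATUS": ["Refused", "Refused", "Approved",
                             "Approved", "Approved", "Approved"],
})

# Region described in the observation: low goods price, high credit
in_region = (toy["AMT_GOODS_PRICE_y"] < 200_000) & (toy["AMT_CREDIT_y"] > 300_000)
is_refused = toy["NAME_CONTRACT_STATUS"].eq("Refused")

# Refusal share (mean of a boolean) and row count, inside vs. outside the region
summary = is_refused.groupby(in_region).agg(["mean", "count"])
summary.index = summary.index.map({True: "in_region", False: "outside"})
print(summary)
```

The `count` column matters as much as the share: a high refusal share computed from very few rows is weak evidence.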

### Analysis of `AMT_ANNUITY_y` V/S `AMT_CREDIT_y` V/S `NAME_CONTRACT_STATUS`

**Outlier identification of `AMT_ANNUITY_y`**

In [None]:
max_value1_AMT_ANNUITY_y = outlier_range(df_merge,'AMT_ANNUITY_y')
max_value1_AMT_ANNUITY_y

**Outlier identification of `AMT_CREDIT_y`**

In [None]:
max_value1_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y

Plotting scatterplot

In [None]:
plt.figure(figsize = [20,8])

# Keep only rows where both variables are below their upper outlier bounds
df_plot = df_merge[(df_merge['AMT_ANNUITY_y'] < max_value1_AMT_ANNUITY_y) &
                   (df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y)]
sns.scatterplot(x = 'AMT_ANNUITY_y', y = 'AMT_CREDIT_y',
                data = df_plot, hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- Many refusals occur where the annuity is below 10,000 and the credit amount is above ~250K. This might be because a higher credit amount requires a correspondingly higher annuity to repay it

### Analysis of `AMT_APPLICATION` V/S `AMT_GOODS_PRICE_y` V/S `NAME_CONTRACT_STATUS`

**Outlier identification of `AMT_APPLICATION`**

In [None]:
max_value1_AMT_APPLICATION = outlier_range(df_merge,'AMT_APPLICATION')
max_value1_AMT_APPLICATION

**Outlier identification of `AMT_GOODS_PRICE_y`**

In [None]:
max_value1_AMT_GOODS_PRICE_y = outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value1_AMT_GOODS_PRICE_y

Plotting scatterplot

In [None]:
plt.figure(figsize = [20,8])

# Keep only rows where both variables are below their upper outlier bounds
df_plot = df_merge[(df_merge['AMT_APPLICATION'] < max_value1_AMT_APPLICATION) &
                   (df_merge['AMT_GOODS_PRICE_y'] < max_value1_AMT_GOODS_PRICE_y)]
sns.scatterplot(x = 'AMT_APPLICATION', y = 'AMT_GOODS_PRICE_y',
                data = df_plot, hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- The application amount has a strong positive correlation with the goods price

### Analysis of `AMT_APPLICATION` V/S `AMT_CREDIT_y` V/S `NAME_CONTRACT_STATUS`

**Outlier identification of `AMT_APPLICATION`**

In [None]:
max_value1_AMT_APPLICATION = outlier_range(df_merge,'AMT_APPLICATION')
max_value1_AMT_APPLICATION

**Outlier identification of `AMT_CREDIT_y`**

In [None]:
max_value1_AMT_CREDIT_y= outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y

Plotting scatterplot

In [None]:
plt.figure(figsize = [20,8])

# Keep only rows where both variables are below their upper outlier bounds
df_plot = df_merge[(df_merge['AMT_APPLICATION'] < max_value1_AMT_APPLICATION) &
                   (df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y)]
sns.scatterplot(x = 'AMT_APPLICATION', y = 'AMT_CREDIT_y',
                data = df_plot, hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()

**Observations**
- The application amount has a strong positive correlation with the credit amount

## Continuous V/S Categorical variables

### Analysis of `NAME_CONTRACT_STATUS` V/S `AMT_CREDIT_y` V/S `CODE_GENDER`

**Outlier identification of `AMT_CREDIT_y`**

In [None]:
max_value_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value_AMT_CREDIT_y

**Summary statistics by contract status and gender**

In [None]:
df_merge.groupby(by = ['NAME_CONTRACT_STATUS','CODE_GENDER']).AMT_CREDIT_y.describe().head()

In [None]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_CREDIT_y',max_value_AMT_CREDIT_y,'CODE_GENDER')

**Observations**
- Among `Refused` applications, `Female` clients have a higher median credit amount than `Male` clients
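
A comparison of medians across two categorical variables is easier to read when one of them is spread across columns. The sketch below uses made-up rows with the notebook's column names and is only meant to show the shape of the result. Unlike the box plot, it does not remove outliers first.

```python
import pandas as pd

# Toy rows (made-up values) using the notebook's column names
toy = pd.DataFrame({
    "NAME_CONTRACT_STATUS": ["Refused", "Refused", "Refused", "Refused",
                             "Approved", "Approved", "Approved", "Approved"],
    "CODE_GENDER": ["F", "F", "M", "M", "F", "F", "M", "M"],
    "AMT_CREDIT_y": [300_000, 500_000, 200_000, 400_000,
                     100_000, 200_000, 150_000, 250_000],
})

# One row per status, one column per gender, each cell = median credit
medians = (toy.groupby(["NAME_CONTRACT_STATUS", "CODE_GENDER"])["AMT_CREDIT_y"]
              .median()
              .unstack("CODE_GENDER"))

# A positive gap means the female median is higher than the male median
medians["F_minus_M"] = medians["F"] - medians["M"]
print(medians)
```

The same pattern applies to any other amount column or grouping pair in this section by swapping the column names.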

### Analysis of `NAME_CONTRACT_STATUS` V/S `AMT_ANNUITY_y` V/S `CODE_GENDER`

**Outlier identification of `AMT_ANNUITY_y`**

In [None]:
max_value_AMT_ANNUITY_y= outlier_range(df_merge,'AMT_ANNUITY_y')
max_value_AMT_ANNUITY_y

**Summary statistics by contract status and gender**

In [None]:
df_merge.groupby(by = ['NAME_CONTRACT_STATUS','CODE_GENDER']).AMT_ANNUITY_y.describe().head()

In [None]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_ANNUITY_y',max_value_AMT_ANNUITY_y,'CODE_GENDER')

**Observations**
- Among `Canceled` applications, `Male` clients have a higher median annuity than `Female` clients
- Among `Refused` applications, `Female` clients have a higher median annuity than `Male` clients

### Analysis of `NAME_CLIENT_TYPE` V/S `AMT_GOODS_PRICE_y` V/S `NAME_CONTRACT_STATUS`

**Outlier identification of `AMT_GOODS_PRICE_y`**

In [None]:
max_value_AMT_GOODS_PRICE_y= outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value_AMT_GOODS_PRICE_y

**Summary statistics by client type and contract status**

In [None]:
df_merge.groupby(by = ['NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS']).AMT_GOODS_PRICE_y.describe().head()

In [None]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_GOODS_PRICE_y',max_value_AMT_GOODS_PRICE_y,'NAME_CLIENT_TYPE')

**Observations**
- Among `Canceled` applications, `New` clients have a lower median goods price than `Repeater` and `Refreshed` clients
- Among `Approved` applications, `New` clients have a lower median goods price than `Repeater` and `Refreshed` clients

### Analysis of `NAME_CONTRACT_STATUS` V/S `AMT_CREDIT_y` V/S `NAME_PORTFOLIO`

**Outlier identification of `AMT_CREDIT_y`**

In [None]:
max_value_AMT_CREDIT_y= outlier_range(df_merge,'AMT_CREDIT_y')
max_value_AMT_CREDIT_y

**Summary statistics by portfolio and contract status**

In [None]:
df_merge.groupby(by = ['NAME_PORTFOLIO','NAME_CONTRACT_STATUS']).AMT_CREDIT_y.describe().head()

In [None]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_CREDIT_y',max_value_AMT_CREDIT_y,'NAME_PORTFOLIO')

**Observations**
- Among `Unused offer` applications, the median credit is highest in the `POS` portfolio
- Among `Refused` applications, the median credit is highest in the `Cash` portfolio
- Among `Approved` applications, the median credit is highest in the `Cars` portfolio

## Categorical V/S Categorical variables

### Analysis of `YEARS_BIRTH_CATEGORY` V/S `NAME_CONTRACT_STATUS`

In [None]:
fn_bi_countplot_merge('YEARS_BIRTH_CATEGORY','NAME_CONTRACT_STATUS')

**Observations**
- Clients in the 30-40 age range receive the most approvals, followed by clients in the 40-50 range
- Clients in the 60-70 age range receive the fewest refusals, followed by clients in the 20-30 range
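
Raw counts partly reflect how many clients fall into each age band, so the band with the most approvals does not necessarily have the highest approval rate. A row-normalized cross-tabulation separates the two effects. The following sketch uses made-up rows with the notebook's column names.

```python
import pandas as pd

# Toy rows (made-up values): a large band with mixed outcomes, a small band with approvals only
toy = pd.DataFrame({
    "YEARS_BIRTH_CATEGORY": ["30-40"] * 6 + ["60-70"] * 2,
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Approved", "Refused",
                             "Refused", "Refused", "Approved", "Approved"],
})

# Raw counts: what a count plot shows
counts = pd.crosstab(toy["YEARS_BIRTH_CATEGORY"], toy["NAME_CONTRACT_STATUS"])

# Row-normalized shares: each age band sums to 1
rates = pd.crosstab(toy["YEARS_BIRTH_CATEGORY"], toy["NAME_CONTRACT_STATUS"],
                    normalize="index")
print(counts)
print(rates)
```

In this toy example the 30-40 band has the most approvals in absolute terms, yet the 60-70 band has the higher approval share.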

### Analysis of `NAME_FAMILY_STATUS` V/S `NAME_CONTRACT_STATUS`

In [None]:
fn_bi_countplot_merge('NAME_FAMILY_STATUS','NAME_CONTRACT_STATUS')

**Observations**
- Clients who are `Married` receive the most approvals

### Analysis of `NAME_EDUCATION_TYPE` V/S `NAME_CONTRACT_STATUS`

In [None]:
fn_bi_countplot_merge('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')

**Observations**
- Clients with `Secondary/secondary special` education receive the most approvals

### Analysis of `NAME_CLIENT_TYPE` V/S `NAME_CONTRACT_STATUS`

In [None]:
fn_bi_countplot_merge('NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS')

**Observations**
- `Repeater` clients receive the most approvals, followed by `New` clients

***

**Conclusion: Client categories to be targeted for providing loans**
- Clients who have been employed for more than 19 years
- Clients in the 30-40 and 40-50 age ranges
- Clients who are married
- Male clients with an academic degree
- Students and businessmen
- Repeater clients