This notebook explores the BPI dataset to understand the loan application process: how offers are accepted or declined, and the cases related to that.

Analysis:

We discuss the throughput time of every step in the process, from the submission of an application to the cancellation or refusal of an offer. We also look at the factors that delay the loan process and at how often loan applications are incomplete.

Our target is to:

predict the time from application submitted to offer sent.

-----------------------------------------------------------------

Most of our hypotheses and visualizations are intended to explore this target.

-----------------------------------------------------------------

Column descriptions

Action : 
concept:name: the event that occurred for a case.
EventOrigin: one of three general categories (Workflow, Application, Offer) that every concept:name event belongs to.
Loan Goal: the reason for the loan application.
Application Type: whether the application is for New credit or a Limit raise.
Credit Score: the credit score of the applicant. A credit score represents how trustworthy a person is in terms of repaying a bill or loan.
Requested Amount: the loan amount requested by the applicant.
FirstWithdrawalAmount: the first withdrawal of loan money made by the customer.

In [None]:
import numpy as np 
import pandas as pd
import pandas_profiling
import warnings
warnings.filterwarnings('ignore')
import datetime
from datetime import date

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import xgboost as xg
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
bpi = pd.read_csv('../input/bpi-process-dataset/final_bpi.csv',parse_dates=['Timestamp'])
bpi.rename(columns = {'case:RequestedAmount':'RequestedAmount'}, inplace = True)

In [None]:
bpi

In [None]:
bpi.describe(include='all')

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10, 8)
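# Correlate only numeric and boolean columns; recent pandas versions raise an error when
# corr() is called on a frame that still contains text or datetime columns.
sns.heatmap(bpi.select_dtypes(include=['number', 'bool']).corr(), annot=True, cmap='Blues', fmt='g')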

Next we analyze the factors that delay the loan process, how often loan applications are incomplete, and how other attributes and customer behaviour relate to the time taken.

In [None]:
apptime=bpi.groupby('concept:name')['Timestamp'].count().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=apptime.index, y=apptime.values)
ax.set(ylabel="Timestamps", xlabel="Application event")
ax.set_title("Number of timestamps per application event")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
# Note: the bars count events (timestamps) per event type. We use the count as a rough proxy
# for time spent; it is not a measured duration.
# Workflow events appear far more often than Offer or Application events, which suggests that
# they account for a significant share of the time.
# W_Validate application occurs most often, followed by W_Call after offers. This makes sense:
# validating an application involves many steps, and after calling a customer the bank has to
# wait some time for the response.

# This also touches on our question about incomplete loan applications: W_Call incomplete files
# occurs 168529 times, which makes it the third most frequent event in the loan process,
# and A_Incomplete occurs 23055 times out of 1202267 events.
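
The counts above say how often an event occurs, not how long it lasts. A minimal sketch of one way to estimate elapsed time per event type, assuming the gap between an event and the next event of the same case is a fair stand-in for its duration. The toy log and the column name `to_next` are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy event log using the notebook's column names (all values are made up).
log = pd.DataFrame({
    'case:concept:name': ['A1', 'A1', 'A1', 'A2', 'A2'],
    'concept:name': ['A_Submitted', 'W_Validate application', 'O_Sent (mail and online)',
                     'A_Submitted', 'W_Validate application'],
    'Timestamp': pd.to_datetime(['2016-01-01 09:00', '2016-01-01 10:00', '2016-01-03 10:00',
                                 '2016-01-02 08:00', '2016-01-02 12:00']),
})

log = log.sort_values(['case:concept:name', 'Timestamp'])
# Time until the next event of the same case; the last event of each case gets NaT.
next_ts = log.groupby('case:concept:name')['Timestamp'].shift(-1)
log['to_next'] = next_ts - log['Timestamp']

# Total elapsed time attributed to each event type (NaT gaps are skipped).
time_per_event = log.groupby('concept:name')['to_next'].sum().sort_values(ascending=False)
print(time_per_event)
```

On the real data the same steps would run on `bpi` directly; whether the gap to the next event is the right notion of duration depends on how the log records start and completion of an activity.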

In [None]:
bpi.groupby('EventOrigin').Timestamp.agg(['count']).sort_values(by=['count'],ascending=False)
# Workflow events are by far the most frequent, so they are the most likely to take a significant
# amount of time and may be a cause of delay in the loan process.
# EventOrigin is a generalized categorization of concept:name: each detailed event belongs to
# one of these three origins.

In [None]:
bpi.Selected.value_counts()

In [None]:
loanam=bpi.groupby('case:LoanGoal').RequestedAmount.sum().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=loanam.index, y=loanam.values)
ax.set(ylabel="Requested Amount", xlabel = "Loan goal")
ax.set_title("Requested loan amount for every loan goal")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
# We were partially right about our hypothesis: these are the same top three loan goals that
# have taken most of the time.
# Large sums of money were requested for these three loan goals, which suggests that a large
# RequestedAmount means a longer processing time and may delay the loan process for other loan
# goals as well. Note that the sum runs over event rows, so goals with more events also
# accumulate a larger total.

In [None]:
# Let's check whether accepted or unaccepted offers take more time.
bpi.groupby('Accepted').Timestamp.agg(['count']).sort_values(by='count',ascending=False)
# Events belonging to accepted offers outnumber those of unaccepted offers. Taking the event count
# as a proxy for time, accepted offers took longer, which suggests that banks give priority to
# applications that are more likely to be accepted, calling those customers more often about
# their offers and then waiting for a response.

In [None]:
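# Now we look at whether acceptance has anything to do with credit score.
bpi.groupby('Accepted').CreditScore.agg(['sum', 'mean']).sort_values(by='sum',ascending=False)
# The summed credit score is higher for accepted offers, which suggests that applicants with a
# higher credit score were more likely to get their loan application accepted. A sum also grows
# with the number of rows in a group, so the mean is shown as well for a fairer comparison.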
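
A short sketch of why a summed score and a mean score can point in different directions when the groups differ in size. The numbers below are invented for illustration and say nothing about the BPI data.

```python
import pandas as pd

# Made-up offers: four accepted rows with moderate scores, two declined rows with high scores.
offers = pd.DataFrame({
    'Accepted': [True, True, True, True, False, False],
    'CreditScore': [600, 650, 700, 650, 800, 900],
})

# count shows the group size, sum mixes size and score, mean isolates the typical score.
stats = offers.groupby('Accepted')['CreditScore'].agg(['count', 'sum', 'mean'])
print(stats)
```

In this toy frame the accepted group has the larger sum only because it has more rows; its mean is lower. Comparing means (or medians) per group avoids that effect.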

In [None]:
# According to the description of the dataset above, this application has the largest number of
# events. We will try to find out why.
larg=bpi[bpi['case:concept:name']=='Application_1219772874']
larg.head(50)

In [None]:
# numeric_only=True keeps the sum to numeric columns (text and timestamp columns cannot be summed meaningfully).
summ=larg.groupby('concept:name').sum(numeric_only=True).sort_values(by='RequestedAmount',ascending=False)
summ
# This table tells us several things about the selected application.
# It shows the amount requested by the applicant across the calls made to the customer about
# the incomplete files.
# The amount offered at offer creation was far less than the requested amount, so the customer
# may have decided not to cooperate any longer.

In [None]:
larg['concept:name'].value_counts()
# One reason for the large number of events must be W_Call incomplete files: the loan
# application files submitted by this client were mostly incomplete, so the bank had to call
# about incomplete files 100 times.
# The resulting observation is that clients who submit incomplete application files go through
# more rounds, which delays their own loan process and that of others as well.

The section above ends here. Next we analyze the throughput time from application submitted to offer
sent, and we produce our label feature "days" (from application submitted to offer sent).

In [None]:
# First we isolate the 'A_Submitted' and 'O_Sent (mail and online)' events.
a_submitted = bpi[bpi['concept:name'] == 'A_Submitted']
offer_sent = bpi[bpi['concept:name'] == 'O_Sent (mail and online)']
offer_sent

In [None]:
offer_submit=a_submitted['concept:name'].agg(['count'])
offer_submit
# number of applications submitted

In [None]:
offer_cent=offer_sent['concept:name'].agg(['count'])
offer_cent
# number of offers sent
# There are about twice as many offers as submitted applications. This is quite understandable,
# because a bank can send several offers to a potential applicant to win them as a customer.

In [None]:
sub=a_submitted.groupby('case:LoanGoal')['concept:name'].count().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=sub.index, y=sub.values)
ax.set(ylabel="Applications submitted", xlabel = "Loan goals")
ax.set_title("Number of applications submitted for each loan goal")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

In [None]:
ofs=offer_sent.groupby('case:LoanGoal')['concept:name'].count().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=ofs.index, y=ofs.values)
ax.set(ylabel="Offers Sent", xlabel = "Loan Goals")
ax.set_title("Number of offers sent for each loan goal")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

Feature engineering for label column "days"

In [None]:
# Calendar-day difference between each 'A_Submitted' event and an 'O_Sent' event.
# NOTE: zip() pairs the i-th submission row with the i-th offer row by position, not by case ID,
# and stops at the shorter of the two sequences, so a pair may mix two different applications.
subdays=[]
for d0, d1 in zip(a_submitted['Timestamp'].dt.date, offer_sent['Timestamp'].dt.date):
    subdays.append(abs((d1 - d0).days))
days=pd.DataFrame(subdays,columns=['Days'])
days
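
Because a case can receive several offers, pairing rows by position can match a submission with an offer from another application. A hedged sketch of an alternative that pairs by case ID, assuming the first `O_Sent (mail and online)` event of a case is the offer of interest. The helper `first_ts` and the toy log are hypothetical.

```python
import pandas as pd

# Toy event log: case A1 gets two offers, case A3 never gets one (all values are made up).
log = pd.DataFrame({
    'case:concept:name': ['A1', 'A1', 'A1', 'A2', 'A2', 'A3'],
    'concept:name': ['A_Submitted', 'O_Sent (mail and online)', 'O_Sent (mail and online)',
                     'A_Submitted', 'O_Sent (mail and online)', 'A_Submitted'],
    'Timestamp': pd.to_datetime(['2016-01-01 09:00', '2016-01-04 15:00', '2016-01-10 11:00',
                                 '2016-01-02 08:00', '2016-01-03 07:00', '2016-01-05 10:00']),
})

def first_ts(event_name):
    """Earliest timestamp of `event_name` for each case (hypothetical helper)."""
    rows = log[log['concept:name'] == event_name]
    return rows.groupby('case:concept:name')['Timestamp'].min()

submitted = first_ts('A_Submitted')
first_offer = first_ts('O_Sent (mail and online)')

# Subtraction aligns the two series on case ID; cases without an offer become NaN and are dropped.
# normalize() truncates to midnight so the result counts calendar days, as in the cell above.
elapsed = (first_offer.dt.normalize() - submitted.dt.normalize()).dt.days
days_per_case = elapsed.dropna().astype(int).rename('Days')
print(days_per_case)
```

The result is indexed by case ID, so it can be joined back onto case-level features instead of being assigned to the first rows of the event table.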

In [None]:
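pred_df=bpi   # note: this is an alias, not a copy, so bpi gains the new columns too
n_labeled=len(days)      # 20423 in the original run
n_rows=len(pred_df)      # 1202267 in the original run

# Rows beyond the labeled ones have no computed duration. They are filled with random
# placeholder values (80 to 149 days) only so that the column can be cast to int; these
# values are not observations.
r_days=np.random.randint(80,150,size=n_rows-n_labeled)
s_days=pd.Series(r_days,index=range(n_labeled,n_rows))

pred_df['Days']=days['Days']
# Assign the result back instead of calling fillna(inplace=True) on a selected column.
pred_df['Days']=pred_df['Days'].fillna(s_days).astype(int)
pred_df['Days'][n_labeled:]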

In [None]:
pred_df['Days'].isna().sum()

In [None]:
pred_df['week'] = pred_df['Timestamp'].dt.strftime("%G_WK%V")
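
The format string combines `%G` (ISO 8601 year) with `%V` (ISO 8601 week number). A small sketch on made-up dates showing why the ISO year matters around New Year; it assumes a platform whose `strftime` supports these two directives.

```python
import pandas as pd

# Three illustrative dates around year boundaries.
ts = pd.Series(pd.to_datetime(['2016-01-01', '2016-01-04', '2016-12-31']))

labels = ts.dt.strftime('%G_WK%V')
print(labels.tolist())   # ['2015_WK53', '2016_WK01', '2016_WK52']

# Cross-check against Python's own ISO calendar.
iso = [t.isocalendar() for t in ts]
expected = [f'{y}_WK{w:02d}' for y, w, _ in iso]
```

1 January 2016 is a Friday and still belongs to ISO week 53 of 2015, so using `%Y` instead of `%G` would label it as week 53 of 2016.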

In [None]:
pred_df

In [None]:
# One row per loan goal, so no slicing is needed on the grouped result.
loantime=bpi.groupby('case:LoanGoal').Timestamp.count().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=loantime.index, y=loantime.values)
ax.set(ylabel="Timestamps", xlabel = "loan goals")
ax.set_title("Number of timestamps per loan goal")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
# The top three loan goals seem to take most of the time, but why? Does it have something to do
# with the amount of money requested? Let's find out.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
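# Only the first 20423 rows carry a computed 'Days' value; correlate numeric and boolean columns only.
sns.heatmap(pred_df[0:20423].select_dtypes(include=['number', 'bool']).corr(),annot=True,cmap='Blues', fmt='g')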

In [None]:
# Use a new name so the 'days' frame built earlier is not overwritten.
days_by_event=bpi.groupby('concept:name').Days.sum().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=days_by_event.index, y=days_by_event.values)
ax.set(ylabel="Days", xlabel = "Application Process")
ax.set_title("Number of days spent on each process")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

In [None]:
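# Note: count() counts the event rows that carry a week label, not distinct weeks;
# nunique() would give the number of distinct weeks in which each event occurs.
weeks=bpi.groupby('concept:name').week.count().sort_values(ascending=False)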
plt.figure(figsize=(16,8))
ax = sns.barplot(x=weeks.index, y=weeks.values)
ax.set(ylabel="Weeks", xlabel = "Application Process")
ax.set_title("Number of weeks spent on each process")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

In [None]:
# How many applications are created per week?
app=bpi[bpi['concept:name']=='A_Create Application']
week_apps=app.groupby('week')['case:concept:name'].count()

In [None]:
plt.figure(figsize=(16,8))
ax = sns.barplot(x=week_apps.index, y=week_apps.values)
ax.set(ylabel="Number of applications", xlabel = "Week")
ax.set_title("Number of applications per week")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

In [None]:
# What is the total number of successful applications per week?
succ=bpi[bpi['concept:name']=='A_Accepted']
week_succ=succ.groupby('week')['case:concept:name'].count()

In [None]:
plt.figure(figsize=(16,8))
ax = sns.barplot(x=week_succ.index, y=week_succ.values)
ax.set(ylabel="Number of applications accepted", xlabel = "Week")
ax.set_title("Number of applications accepted per week")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

This section ends here. Now we analyze the offers that have been accepted and those that have not.

In [None]:
accepted = bpi[bpi['Accepted'] == True]
unaccepted = bpi[bpi['Accepted'] == False]
# split the events into those of accepted and those of unaccepted offers

In [None]:
accepted.head(5)

In [None]:
accepted[accepted['RequestedAmount']==accepted['OfferedAmount']]['Accepted'].value_counts()
# there is a large number of accepted cases when the requested amount equals the offered amount

In [None]:
unaccepted[unaccepted['RequestedAmount']==unaccepted['OfferedAmount']]['Accepted'].value_counts()
# when the requested amount equals the offered amount, there are fewer unaccepted cases than the
# accepted figure above

In [None]:
# We need to find the number of accepted cases when RequestedAmount is greater than OfferedAmount,
# and vice versa, and compare the figures.
accepted[accepted['RequestedAmount']>accepted['OfferedAmount']]['Accepted'].value_counts()
# we have 2630 accepted cases 

In [None]:
# We need to find the number of accepted cases when RequestedAmount is less than OfferedAmount.
accepted[accepted['RequestedAmount']<accepted['OfferedAmount']]['Accepted'].value_counts()
# we have 6567 cases 
# It is logical that the bank prefers someone who requests less than the amount the bank
# offers.

In [None]:
# We need to find the number of unaccepted cases when RequestedAmount is greater than
# OfferedAmount, and vice versa, and compare the figures.
unaccepted[unaccepted['RequestedAmount']>unaccepted['OfferedAmount']]['Accepted'].value_counts()

In [None]:
unaccepted[unaccepted['RequestedAmount']<unaccepted['OfferedAmount']]['Accepted'].value_counts()
# There is a small difference between these two figures for the unaccepted cases: more cases
# were refused when the requested amount was less than the offered amount, and fewer were
# refused when RequestedAmount was greater than OfferedAmount.
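
The cells above compare raw counts, which depend on how many rows fall in each group. A minimal sketch of computing an acceptance rate per group instead, assuming one row per offer with a boolean `Accepted` flag. The toy data and the `relation` column are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up offers, two per relation between offered and requested amount.
offers = pd.DataFrame({
    'RequestedAmount': [5000, 5000, 10000, 8000, 8000, 12000],
    'OfferedAmount':   [5000, 6000, 8000, 8000, 9000, 10000],
    'Accepted':        [True, True, False, True, False, False],
})

diff = offers['OfferedAmount'] - offers['RequestedAmount']
# Label each offer by how the offered amount compares with the requested amount.
offers['relation'] = np.select(
    [diff > 0, diff < 0],
    ['offered > requested', 'offered < requested'],
    default='equal',
)

# The mean of a boolean column is the share of True values, i.e. the acceptance rate.
rate = offers.groupby('relation')['Accepted'].agg(['count', 'mean'])
print(rate)
```

On the event-level table each offer appears in several rows, so the frame would first have to be reduced to one row per offer before the rate is meaningful.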

In [None]:
success_app=accepted.groupby('case:concept:name').Accepted.agg(['count']).sort_values(by=['count'],ascending=False)
success_app.head(10)
# The application with the most accepted cases is Application_423354116, with 10. We will analyze
# this application specifically to understand the behaviour of acceptance in the loan process.

In [None]:
best_app=bpi[bpi['case:concept:name']=='Application_423354116']
best_app

In [None]:
plt.figure(figsize=(12,8))
data=[best_app['RequestedAmount'],best_app['FirstWithdrawalAmount'],best_app['NumberOfTerms'],
      best_app['MonthlyCost'],best_app['CreditScore'],best_app['OfferedAmount']]
sns.lineplot(data=data)
# The key to this applicant's acceptances appears to be the offers he received from the bank for
# his loan applications, but he has no credit score at all, perhaps because his loan goals are unknown.
# His application types are all New credit, most of his application lifecycles are complete, and
# he has a good number of offers.
# Interestingly, OfferedAmount and FirstWithdrawalAmount follow the same pattern in the graph, which
# implies that after each offer the applicant makes a first withdrawal of at least 33% of the
# offered amount and leaves the rest.

In [None]:
loan_goals=accepted.groupby('case:LoanGoal').Accepted.count()
loan_goals
plt.figure(figsize=(16,8))
ax = sns.barplot(x=loan_goals.index, y=loan_goals.values)
ax.set(ylabel="Accepted cases", xlabel = "Loan goal")
ax.set_title("Number of accepted cases for every loan goal")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
# Customers who wanted to buy a car had the most accepted cases.

In [None]:
# Earlier we saw that applicants with a higher credit score were more likely to be accepted.
# If the Car loan goal has more accepted cases than any other, does that correlate with credit
# score as well?
cr=accepted.groupby('case:LoanGoal').CreditScore.sum()
plt.figure(figsize=(16,8))
ax = sns.barplot(x=cr.index, y=cr.values)
ax.set(ylabel="Summed credit score", xlabel = "Loan goal")
ax.set_title("Summed credit score for every loan goal")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
# We were right: customers whose loan goal was a car have the highest summed credit score, which is
# consistent with their higher number of accepted cases. Note that a sum also grows with the
# number of rows per loan goal.

In [None]:
app_type=accepted.groupby('case:ApplicationType').Accepted.agg(['count']).sort_values(by=['count'],ascending=False)
app_type

In [None]:
app_credit=accepted.groupby('case:LoanGoal').CreditScore.agg(['sum']).sort_values(by=['sum'],ascending=False)
app_credit
# Customers who took a loan for a car, home improvement, or existing loan takeover have the highest
# summed credit scores, which suggests they were more reliable in repaying bills and loans.

In [None]:
unsuccess_app=unaccepted.groupby('case:concept:name').Accepted.agg(['count']).sort_values(by=['count'],ascending=False)
unsuccess_app.head(10)
# Let's take the top one and analyze why this loan application was not accepted.

In [None]:
fail_app=bpi[bpi['case:concept:name']=='Application_1867093856']
fail_app

In [None]:
fail_app['concept:name'].value_counts()

In [None]:
# Like the most accepted applicant, he has no credit score; his only loan goal is loan takeover.
# His application types are all New credit, and most of his application lifecycles are complete,
# but the main reason the application was not accepted might be the W_Call incomplete files events shown above.

In [None]:
unaccepted.groupby('case:LoanGoal').Accepted.agg(['count']).sort_values(by='count',ascending=False)
# unaccepted cases for each loan goal

Predicting the time from application submitted to offer sent

Potential features of interest:
1. Timestamp
2. concept:name
3. case:LoanGoal
4. case:concept:name
5. RequestedAmount
6. Accepted
7. MonthlyCost
8. CreditScore
9. OfferedAmount

Feature details
1. W_Validate application, W_Call after offers, W_Call incomplete files, and W_Complete application (concept:name) took the most time; all four are workflow events.
2. Workflow: 768823 events, Application: 239595, Offer: 193849. Workflow takes the most time, but Application and Offer are important as well. (EventOrigin)
3. The Car, Home improvement, and Existing loan takeover loan goals take the most time, and this is partially correlated with the requested amount: the larger the amount requested for a loan goal, the longer it takes to process the application. (case:LoanGoal)
4. Accepted applications took more time than unaccepted applications. (Accepted)
5. Applicants with a higher credit score were more likely to get their loan application accepted. (CreditScore)
6. W_Validate application, W_Call after offers, W_Call incomplete files, and W_Complete application account for most of the weeks.

Next we start the feature engineering for predicting the time from application submitted to offer sent.

In [None]:
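# Copy the slice so that the feature-engineering cells below modify df_model and not a view of pred_df.
# Note: only the first 20423 rows carry a computed 'Days' value; rows 20423 to 20429 hold random placeholders.
df_model=(pred_df[['concept:name','case:LoanGoal','RequestedAmount','Accepted',
                   'Days']][0:20430]).copy()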

In [None]:
df_model

In [None]:
# Forward-fill the missing values in Accepted. Accepted applications take more time, so the flag
# is a useful feature: encode True as 1 and False as 0, then drop the original column.
df_model['Accepted'] = df_model['Accepted'].ffill()
df_model["Accept_time"] = np.where(df_model["Accepted"] == True,1, 0)
df_model.drop('Accepted',axis=1,inplace=True)

In [None]:
hightime_events = ['W_Validate application', 'W_Call after offers',
                   'W_Call incomplete files', 'W_Complete application']
# isin() flags a row when its event is any of the four. Assigning np.where() four times in a
# row would keep only the result of the last comparison.
df_model["hightime_event"] = df_model["concept:name"].isin(hightime_events).astype(int)
df_model.drop('concept:name',axis=1,inplace=True)
# These four events had the highest time consumption, so a feature built from them should benefit
# our model: rows with any of these four events get 1, all others get 0, and the column is then dropped.

In [None]:
hightime_goals = ['Car', 'Home improvement', 'Existing loan takeover']
# As with the events, isin() flags a row when its loan goal is any of the three.
df_model["hightime_loan"] = df_model["case:LoanGoal"].isin(hightime_goals).astype(int)
df_model.drop('case:LoanGoal',axis=1,inplace=True)
# These three loan goals were also the highest in time consumption, so we created a feature from
# them in the same way and then dropped the column.
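
A small sketch of building one flag from several categories, on a toy series. It contrasts repeated assignment, where each `np.where` call replaces the previous result, with a single membership test.

```python
import numpy as np
import pandas as pd

goals = pd.Series(['Car', 'Home improvement', 'Boat', 'Existing loan takeover'])

# Reassigning the same variable keeps only the result of the last np.where call.
flag = np.where(goals == 'Car', 1, 0)
flag = np.where(goals == 'Existing loan takeover', 1, 0)
print(flag.tolist())       # [0, 0, 0, 1]

# isin tests membership in the whole set in one pass.
slow_goals = ['Car', 'Home improvement', 'Existing loan takeover']
flag_any = goals.isin(slow_goals).astype(int)
print(flag_any.tolist())   # [1, 1, 0, 1]
```

An equivalent alternative is to combine the individual comparisons with `|`, but `isin` stays readable as the list of categories grows.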

In [None]:
# We expect applications with a large requested amount to take longer. Let's see if that is true.
sns.lmplot(x='RequestedAmount',y='Days',data=df_model)
# There is some correlation between Days and RequestedAmount, but not enough to get the best
# results out of the models. We will sort the requested amounts and see whether that increases
# the correlation.

In [None]:
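# Replace RequestedAmount with a sorted copy of its values, stored as 'RA'.
# NOTE: sorting one column on its own breaks the row-wise pairing between an amount and the
# 'Days' value of the same row, so any relationship between RA and Days should be read with caution.
df_model['RA']=np.sort(df_model['RequestedAmount'].to_numpy())
df_model.drop('RequestedAmount',axis=1,inplace=True)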

In [None]:
# df_model.hightime_loan.value_counts()
df_model.head(10)

In [None]:
sns.lmplot(x='RA',y='Days',data=df_model)
# That went quite well, as the plot shows: the correlation looks strong, and there seem to be
# outliers as well.
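
A hedged sketch of quantifying the relationship between amount and days while keeping each row's values paired. It reports the Pearson correlation and a rank-based (Spearman-style) correlation computed from ranks; the five data points are invented.

```python
import pandas as pd

# Toy rows: each amount stays paired with the days of the same row.
df = pd.DataFrame({
    'RequestedAmount': [5000, 20000, 10000, 15000, 30000],
    'Days':            [3, 9, 4, 10, 12],
})

# Pearson measures linear association on the raw values.
pearson = df['RequestedAmount'].corr(df['Days'])

# Correlating the ranks gives a Spearman-style coefficient that is less sensitive to outliers.
spearman = df['RequestedAmount'].rank().corr(df['Days'].rank())

print(round(pearson, 3), round(spearman, 3))
```

A rank correlation is one option when the plot shows outliers or a monotonic but non-linear trend, since it depends only on the ordering of the values.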

Simple linear regression

In [None]:
# train = df_model[1:15000]
# test = df_model[15001:20430]
target_df=(df_model[['RA','Days']])
target_df

In [None]:
# Select the feature and the label by name rather than by position.
X = target_df[['RA']].values
y = target_df['Days'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
lr = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

In [None]:
lr

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=lr)

In [None]:
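# score() returns the coefficient of determination (R^2) on the test set; shown here as a percentage.
regressor.score(X_test,y_test)*100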
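
`score()` reports R^2 only. The metrics imported at the top of the notebook (`mean_absolute_error`, `mean_squared_error`, `r2_score`) express the error in days as well. A minimal sketch on synthetic data, which is generated here purely for illustration and is not the BPI data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: days grow linearly with the amount, plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(1_000, 50_000, size=(200, 1))
y = 5 + X[:, 0] / 5_000 + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

mae = mean_absolute_error(y_te, pred)            # average absolute error, in days
rmse = np.sqrt(mean_squared_error(y_te, pred))   # penalizes large errors more strongly
r2 = r2_score(y_te, pred)                        # same value as model.score(X_te, y_te)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

MAE and RMSE are in the unit of the target, which makes them easier to interpret for a throughput-time prediction than R^2 alone.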

Multiple linear regression

In [None]:
X2=df_model[['Accept_time','hightime_event','hightime_loan', 'RA']][0:20430]
y2=df_model['Days'][0:20430]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.3,random_state=42)

In [None]:
regressor2 = LinearRegression()
regressor2.fit(X_train, y_train)

In [None]:
y_pred2 = regressor2.predict(X_test)

In [None]:
lr2 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred2})

In [None]:
lr2

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=lr2)

In [None]:
regressor2.score(X_test,y_test)*100

Support vector regressor

In [None]:
# Create and Train the Support Vector Machine (Regression) using radial basis function
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.00001)
svr_rbf.fit(X_train, y_train)

In [None]:
y_pred3 = svr_rbf.predict(X_test)

In [None]:
svr = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred3})

In [None]:
svr

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=svr)

In [None]:
svr_rbf.score(X_test,y_test)*100

Decision tree regressor

In [None]:
dtr = DecisionTreeRegressor(random_state=0)
dtr.fit(X_train, y_train)

In [None]:
dtr_pred = dtr.predict(X_test)

In [None]:
dtr_g = pd.DataFrame({'Actual': y_test, 'Predicted': dtr_pred})

In [None]:
dtr_g

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=dtr_g)

In [None]:
dtr.score(X_test,y_test)*100

XGBoost regressor

In [None]:
# random_state is the scikit-learn style name for the seed parameter.
xgb_r = xg.XGBRegressor(objective ='reg:squarederror',n_estimators = 10, random_state = 123)
xgb_r.fit(X_train, y_train)

In [None]:
xgb_pred = xgb_r.predict(X_test)

In [None]:
xgb_df = pd.DataFrame({'Actual': y_test, 'Predicted': xgb_pred})

In [None]:
xgb_df

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=xgb_df)

In [None]:
xgb_r.score(X_test,y_test)*100
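
The models above are each scored on a single train/test split. A sketch of comparing regressors with `cross_val_score` (imported at the top of the notebook but not used so far), assuming 5-fold cross-validation and mean absolute error as the metric. The data is synthetic and the two models are only examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=300)

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(max_depth=4, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so the error metric is returned negated.
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    results[name] = -scores.mean()

for name, mae in results.items():
    print(f"{name}: cross-validated MAE = {mae:.2f}")
```

Averaging over folds makes the comparison less dependent on one particular split; the same loop could include the SVR and XGBoost models used above.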