# Analyzing loan approval decisions automated by IBM DBA through Business Automation Insights time series
## Analyzing your decisions in Python with pandas DataFrames and Brunel

This Python notebook shows how to load a decision set produced by IBM ODM, and how to apply analytics with the Brunel library to get insights on the decisions.
The decisions were produced by running business rules on randomly generated loan applications. The decision set is stored in a JSON format.

This notebook is built on pandas DataFrames and runs in Python 3.

The intent of applying data science to decisions is to check that decision automation works as expected. In other words, we want to check that the executed rules fit well with the segmentation of the data. From there we can potentially find optimizations to better automate decision making. You can extend the notebook to create new views on your decisions by using pandas DataFrames and Brunel visualization capabilities.
    
To get the most out of this notebook, you should have some familiarity with the Python programming language.

## Contents 
This notebook contains the following main sections:

1. [Load the loan validation decision set.](#overview)
2. [View an approval distribution pie chart.](#viewapprovaldistribution)
3. [View approvals in a chord chart.](#viewapprovaldistributionincordchart)
4. [View the income on credit score distribution.](#incomeoncreditscoredistribution)
5. [View the loan amount on credit score distribution.](#loanamountoncreditscoredistribution)
6. [View the loan amount distribution.](#viewamountdistribution)
7. [Summary and next steps.](#next)

<a id="overview"></a>
## 1. Load the Loan Validation decision set.
The loan validation dataset has been generated by using the decision capability of the IBM Digital Business Automation platform, also known as Operational Decision Manager. The dataset contains a list of decisions captured as JSON text fragments.

This section shows the steps to access this dataset file and to construct two dataframes: one for the decision envelopes and a second that focuses on the decision details, meaning the input and output parameters.

In [1]:
import requests

target_url = "https://raw.githubusercontent.com/ODMDev/decisions-on-spark/master/data/loanvalidation/loanvalidation-with-score-grade-bai-timeserie-850.json"

response = requests.get(target_url)
print('Reading the JSONL file : ', response.text[:500],'...')

Reading the JSONL file :  {"version":"1.0.1","id":"fb081660-eb09-40e3-ab97-fcec8899632c0","timestamp":"2019-07-02T18:43:43.130+02:00","type":"EXECUTION_SUCCESS","odmType":"ruleset","rulesetPath":"/test_deployment/1.0/loan_validation_with_score_and_grade/1.0","offset":5,"partition":1,"duration":5,"data":{"test_deployment.loan_validation_with_score_and_grade.in.loan.numberOfMonthlyPayments":306,"test_deployment.loan_validation_with_score_and_grade.in.loan.startDate":1697068800000,"test_deployment.loan_validation_with_score ...


In [2]:
import json

#Build a list of JSON strings from a string that contains multiple JSON entries
def get_multi_json(json_mono_str):
    json_entries = []
    stack = 0
    json_entry = ''
    
    for letter in json_mono_str:
        json_entry += letter
    
        if letter == '{':
            stack += 1
        if letter == '}':
            stack -= 1
    
        if stack == 0:
            json_entry = json_entry.strip()
            json_entry = json_entry.strip('\n')
            if len(json_entry) > 0:
                json_entries.append(json_entry)
                json_entry = ''
        
        #print('l:', letter)
        #print('entry', json_entry)
        
    return json_entries
    

# Build a dictionary with shortened keys from the raw dictionary read from a JSON payload
def compact_dictionnary(json_dict):
    
    prefix = "test_deployment.loan_validation_with_score_and_grade."

    dict2 = dict()

    for key, val in json_dict.items():
        # Strip the ruleset prefix only when it is present, so that keys such as 'id' are kept intact
        short_key = key[len(prefix):] if key.startswith(prefix) else key
        new_key = short_key.replace(".", "_")
    
        # Merge keys that have a list value to avoid an expansion when building a dataframe row
        if new_key == 'out_report_messages':
            dict2[new_key] = ''.join(val)
        else:
            # Read the value from the function argument rather than from a global variable
            dict2[new_key] = val
            
    return dict2

# Build a dictionary from a JSON string
def make_dictionnary(json_str):

    json_dict = json.loads(json_str)
    
    return json_dict

In [3]:
import json

#Segment the monolithic JSON text into a list of JSON texts
json_entries = get_multi_json(response.text)

print(len(json_entries), ' JSON entries parsed')

850  JSON entries parsed
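
The brace-counting approach in `get_multi_json` works as long as braces never appear unbalanced inside a string value. As a hedged alternative, the sketch below segments concatenated JSON documents with the standard library's `json.JSONDecoder.raw_decode`, which parses one document at a time and reports where it ended. The function name `split_concatenated_json` and the sample text are hypothetical and only serve the illustration.

```python
import json

def split_concatenated_json(text):
    """Decode back-to-back JSON documents and return them as a list of objects."""
    decoder = json.JSONDecoder()
    entries = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it explicitly
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the decoded object and the index just past its end
        obj, pos = decoder.raw_decode(text, pos)
        entries.append(obj)
    return entries

# Hypothetical sample: the first document has an unbalanced brace inside a string
sample = '{"id": "a", "msg": "open { only"}\n{"id": "b", "data": {"x": 1}}'
parsed = split_concatenated_json(sample)
print(len(parsed), 'JSON entries parsed')
```

Unlike `get_multi_json`, this variant returns dictionaries directly, so a separate `json.loads` call per entry would not be needed.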


In [4]:
dict_envelope_entries = []
dict_details_entries = []

#Build a list of dictionaries
for json_entry in json_entries:
    #print('JSON Entry: ' + json_entry)
    dict_envelope = make_dictionnary(json_entry)
    dict_envelope_entries.append(dict_envelope)
    
    #Zoom in on the data sub-dictionary
    decision_data = dict_envelope['data']
    
    #Add the id to allow a later join between the two dictionaries and the resulting dataframes
    decision_data['id'] = dict_envelope['id']
    #print('id:', decision_data['id'])
    
    dict_details_entries.append(compact_dictionnary(decision_data))

In [5]:
import pandas as pd

#Build a dataframe with a row for each top-level JSON dictionary
df_envelope = pd.DataFrame(dict_envelope_entries)
#print('The column of the envelope dataframe : ', list(df_envelope))

The decisions as captured by DBA Business Automation Insights are now in a dataframe. All columns except the "data" one describe the envelope of the decisions.

In [6]:
df_envelope.iloc[:3]

Unnamed: 0,version,id,timestamp,type,odmType,rulesetPath,offset,partition,duration,data,trace.task.names,trace.task.durations,trace.rule.names
0,1.0.1,fb081660-eb09-40e3-ab97-fcec8899632c0,2019-07-02T18:43:43.130+02:00,EXECUTION_SUCCESS,ruleset,/test_deployment/1.0/loan_validation_with_scor...,5,1,5,{'test_deployment.loan_validation_with_score_a...,"[loanvalidation, loanvalidation>initResult, lo...","[4, 0, 1, 1, 1]","[validation.borrower.checkSSNareanumber, valid..."
1,1.0.1,3335b91a-1819-4983-a5ee-f155bc6aa2220,2019-07-02T18:43:52.569+02:00,EXECUTION_SUCCESS,ruleset,/test_deployment/1.0/loan_validation_with_scor...,6,1,12,{'test_deployment.loan_validation_with_score_a...,"[loanvalidation, loanvalidation>initResult, lo...","[8, 0, 1, 1, 1, 1]","[validation.borrower.checkSSNareanumber, valid..."
2,1.0.1,9c8ddde9-da2b-427d-9019-0c916f8d28890,2019-07-02T18:43:52.604+02:00,EXECUTION_SUCCESS,ruleset,/test_deployment/1.0/loan_validation_with_scor...,7,1,4,{'test_deployment.loan_validation_with_score_a...,"[loanvalidation, loanvalidation>initResult, lo...","[4, 0, 1, 2, 1]","[validation.borrower.checkSSNareanumber, valid..."
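
Because the envelope carries a `timestamp` column, the decisions can be treated as a time series. The following is a minimal sketch, assuming a small hand-made DataFrame (`sample_envelope`, hypothetical) that mirrors the `id`, `timestamp` and `duration` columns shown above. It parses the ISO 8601 strings with `pd.to_datetime` and aggregates the decisions per minute.

```python
import pandas as pd

# Hypothetical envelope rows mirroring the columns of df_envelope
sample_envelope = pd.DataFrame({
    "id": ["a", "b", "c"],
    "timestamp": ["2019-07-02T18:43:43.130+02:00",
                  "2019-07-02T18:43:52.569+02:00",
                  "2019-07-02T18:43:52.604+02:00"],
    "duration": [5, 12, 4],
})

# Parse the ISO 8601 strings into timezone-aware UTC timestamps
sample_envelope["timestamp"] = pd.to_datetime(sample_envelope["timestamp"], utc=True)

# Index the durations by time so that they can be resampled
series = sample_envelope.set_index("timestamp")["duration"].sort_index()

# Count the decisions and average their duration per one-minute bucket
per_minute = series.resample("1min").agg(["count", "mean"])
print(per_minute)
```

The same steps should carry over to `df_envelope` itself. The bucket size is an arbitrary choice for the illustration.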


In [7]:
#Build a dataframe with a row for each decision details dictionary
df_details = pd.DataFrame(dict_details_entries)
print('The column of the decision details dataframe : ')
list(df_details)

The column of the decision details dataframe : 


['in_loan_numberOfMonthlyPayments',
 'in_loan_startDate',
 'in_loan_amount',
 'in_loan_loanToValue',
 'in_loan_duration',
 'in_borrower_firstName',
 'in_borrower_lastName',
 'in_borrower_birth',
 'in_borrower_yearlyIncome',
 'in_borrower_zipCode',
 'in_borrower_creditScore',
 'in_borrower_spouse',
 'in_borrower_latestBankruptcy',
 'in_borrower_ssn_areaNumber',
 'in_borrower_ssn_groupCode',
 'in_borrower_ssn_serialNumber',
 'in_borrower_ssn_digits',
 'in_borrower_ssn_fullNumber',
 'in_borrower_birthDate',
 'in_borrower_ssncode',
 'in_borrower_latestBankruptcyDate',
 'in_borrower_latestBankruptcyReason',
 'in_borrower_latestBankruptcyChapter',
 'out_score',
 'out_grade',
 'out_report_borrower_firstName',
 'out_report_borrower_lastName',
 'out_report_borrower_birth',
 'out_report_borrower_yearlyIncome',
 'out_report_borrower_zipCode',
 'out_report_borrower_creditScore',
 'out_report_borrower_spouse',
 'out_report_borrower_latestBankruptcy',
 'out_report_borrower_ssn_areaNumber',
 'out_rep

In [8]:
df_details[:3]

Unnamed: 0,in_loan_numberOfMonthlyPayments,in_loan_startDate,in_loan_amount,in_loan_loanToValue,in_loan_duration,in_borrower_firstName,in_borrower_lastName,in_borrower_birth,in_borrower_yearlyIncome,in_borrower_zipCode,...,out_report_insuranceRequired,out_report_insuranceRate,out_report_approved,out_report_messages,out_report_yearlyInterestRate,out_report_monthlyRepayment,out_report_insurance,out_report_message,out_report_yearlyRepayment,Unnamed: 21
0,306,1697068800000,490000,0.5,26,John,Smith,-2308867200000,35000,74162,...,False,0.0,False,Average risk loanToo big Debt/Income ratio: 1....,0.081,3791.428956,none,Average risk loan\nToo big Debt/Income ratio: ...,45497.147473,fb081660-eb09-40e3-ab97-fcec8899632c0
1,269,1776297600000,79000,0.5,23,John,Smith,862099200000,52000,12345,...,True,0.02,True,Very low risk loanCongratulations! Your loan h...,0.068,572.980467,2%,Very low risk loan\nCongratulations! Your loan...,6875.765598,3335b91a-1819-4983-a5ee-f155bc6aa2220
2,198,1655510400000,270000,0.6,17,Betty,Smith,250560000000,27000,45695,...,False,0.0,False,Low risk loanToo big Debt/Income ratio: 0.98We...,0.064,2211.376379,none,Low risk loan\nToo big Debt/Income ratio: 0.98...,26536.516543,9c8ddde9-da2b-427d-9019-0c916f8d28890


We now have a dataframe that holds the details of the loan approval decisions automated with business rules.
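
The decision identifier was copied into the details so that the two dataframes can be joined later. A minimal sketch of such a join follows. It uses two small hypothetical DataFrames and assumes that the identifier column is named `id` on both sides (rename it first if it is not).

```python
import pandas as pd

# Hypothetical stand-ins for df_envelope and df_details
sample_envelope = pd.DataFrame({
    "id": ["a", "b", "c"],
    "duration": [5, 12, 4],
})
sample_details = pd.DataFrame({
    "id": ["a", "b", "c"],
    "in_loan_amount": [490000, 79000, 270000],
    "out_report_approved": [False, True, False],
})

# Inner join on the decision id; validate raises if an id is duplicated on either side
joined = sample_envelope.merge(sample_details, on="id", how="inner",
                               validate="one_to_one")
print(joined)
```

The `validate="one_to_one"` argument is optional. It is used here as a cheap guard against duplicated decision identifiers.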

For decision automation we used business rules to determine eligibility, mainly based on credit score, loan amount, and debt-to-income ratio. Decision outcomes are represented by the out_report_approved and out_report_yearlyRepayment columns.

In [9]:
total_rows = df_details.shape[0]
#print("The size of the decision set is " + str(total_rows))
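
The approval outcome can also be summarized numerically before charting it. The sketch below is an illustration on hypothetical data. It assumes a boolean `out_report_approved` column like the one in `df_details`.

```python
import pandas as pd

# Hypothetical outcomes standing in for df_details['out_report_approved']
sample_details = pd.DataFrame(
    {"out_report_approved": [False, True, False, True, True]}
)

# Absolute and relative frequency of each outcome
counts = sample_details["out_report_approved"].value_counts()
shares = sample_details["out_report_approved"].value_counts(normalize=True)
summary = pd.DataFrame({"count": counts, "share": shares})

# The mean of a boolean column is the fraction of True values
approval_rate = sample_details["out_report_approved"].mean()

print(summary)
print("Approval rate:", approval_rate)
```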

<a id="viewapprovaldistribution"></a>
## 2. View the loan approval distribution in a pie chart.
A simple pie chart that shows the approval distribution in the decision set.

In [10]:
import brunel

%brunel data('df_details') stack polar bar x("const") y(#count) color(out_report_approved) legends(none) label(out_report_approved) :: width=200, height=300


<a id="viewapprovaldistributionincordchart"></a>
## 3. View the loan approval distribution per credit score in a chord chart.
A chord chart that shows the approval count per credit score. The distribution of processed credit scores looks uniform, which is explained by the fact that the loan applications were synthetically created with random credit score values.

In [11]:
%brunel data('df_details') chord x(out_report_approved) y(in_borrower_creditScore) color(#count) tooltip(#all)


Visualize the mean credit score for approved and rejected loan applications. As expected, the mean is higher for approved applications.

In [12]:
%brunel data('df_details') bar x(out_report_approved) y(in_borrower_creditScore) mean(in_borrower_creditScore) sort(in_borrower_creditScore)

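
The mean credit score per outcome can be cross-checked with a pandas `groupby`. This is a minimal sketch on hypothetical values. The column names are taken from `df_details`, while the numbers are invented for the illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from df_details
sample_details = pd.DataFrame({
    "in_borrower_creditScore": [500, 720, 610, 800, 450],
    "out_report_approved": [False, True, False, True, False],
})

# Mean credit score for rejected (False) and approved (True) applications
mean_score = (sample_details
              .groupby("out_report_approved")["in_borrower_creditScore"]
              .mean())
print(mean_score)
```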

<a id="incomeoncreditscoredistribution"></a>
## 4. View the income on credit score distribution.
Do we see trends or limits in credit score or income for accepted loan applications? The chart shows that the higher the credit score and income, the more applications are approved.

This Brunel chart lets you zoom and pan in the dataset. Hovering over a point displays the coordinates of the decision in the decision space.

In [13]:
%brunel data('df_details') x(in_borrower_yearlyIncome) y(in_borrower_creditScore) color(out_report_approved:yellow-green) tooltip(#all):: width=800, height=300

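
The visual trend can be quantified by grouping applications into income and credit score bands and computing the approval rate per band. The sketch below is an assumption-laden illustration: the band boundaries, the labels, and the sample rows are all hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names come from df_details
sample_details = pd.DataFrame({
    "in_borrower_yearlyIncome": [20000, 35000, 52000, 90000, 120000, 150000],
    "in_borrower_creditScore": [400, 450, 650, 700, 500, 780],
    "out_report_approved": [False, False, True, True, False, True],
})

# Arbitrary band boundaries chosen for the illustration (right edge inclusive)
income_band = pd.cut(sample_details["in_borrower_yearlyIncome"],
                     bins=[0, 50000, 100000, 200000],
                     labels=["low", "medium", "high"])
score_band = pd.cut(sample_details["in_borrower_creditScore"],
                    bins=[0, 600, 900],
                    labels=["up to 600", "above 600"])

banded = sample_details.assign(
    income_band=income_band,
    score_band=score_band,
    approved=sample_details["out_report_approved"].astype(float),
)

# Share of approved applications for each observed (score band, income band) pair
approval_rate = (banded
                 .groupby(["score_band", "income_band"], observed=True)["approved"]
                 .mean())
print(approval_rate)
```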

<a id="loanamountoncreditscoredistribution"></a>
## 5. View the loan amount on credit score distribution.
Do we see limits in score or amount for accepted loan applications? We observe that:
- the higher the loan amount, the higher the rejection rate.
- the lower the credit score, the higher the rejection rate.

No green points appear for loan amounts greater than USD 1,000,000. This is consistent with a rule that rejects applications for amounts above this threshold.

In [14]:
%brunel data('df_details') x(in_loan_amount) y(in_borrower_creditScore) color(out_report_approved:yellow-green) tooltip(#all):: width=800, height=300

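
The observation about the USD 1,000,000 threshold can be verified directly on the data instead of being read off the chart. A minimal sketch on hypothetical rows follows. The threshold is the amount cited in the text, and the sample values are invented.

```python
import pandas as pd

THRESHOLD = 1_000_000  # amount cited in the text, in USD

# Hypothetical rows; only the column names come from df_details
sample_details = pd.DataFrame({
    "in_loan_amount": [250000, 900000, 1200000, 1500000],
    "out_report_approved": [True, True, False, False],
})

# Applications above the threshold and how many of them were approved
above = sample_details[sample_details["in_loan_amount"] > THRESHOLD]
approved_above = int(above["out_report_approved"].sum())

# Largest amount among the approved applications
max_approved = sample_details.loc[
    sample_details["out_report_approved"], "in_loan_amount"
].max()

print("Applications above threshold:", len(above))
print("Approved above threshold:", approved_above)
print("Largest approved amount:", max_approved)
```

If a rejection rule on the amount is in place, the count of approved applications above the threshold should be zero on the real decision set as well.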

<a id="viewamountdistribution"></a>
## 6. Loan amount distribution.
The loan amounts of the applications, visualized in a bar chart.
The bar chart shows a balanced distribution because the input data was randomly generated. In a real-life context we would expect a nonzero minimum amount.

In [15]:
%brunel data('df_details') bar x(in_loan_amount) y(#count) bin(in_loan_amount) style("size:100%") :: width=800, height=300

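
The binned counts behind such a bar chart can be reproduced with `numpy.histogram`. This sketch uses hypothetical amounts and an arbitrary number of bins, so the bin edges will generally differ from the ones Brunel picks.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts standing in for df_details['in_loan_amount']
sample_amounts = pd.Series(
    [10000, 150000, 300000, 450000, 600000, 750000, 900000, 990000],
    name="in_loan_amount",
)

# Five equal-width bins between the smallest and the largest amount
counts, edges = np.histogram(sample_amounts, bins=5)

histogram = pd.DataFrame({
    "bin_start": edges[:-1],
    "bin_end": edges[1:],
    "count": counts,
})
print(histogram)
```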

<a id="next"></a>
## 7. Summary and next steps
You have manipulated dataframes and views of a decision set produced by IBM DBA and captured in a JSONL format. You can extend this notebook by adapting the views and adding new ones to get more insight into your decisions and to make better decisions in the future.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.

<a id="authors"></a>
## Authors

Marie-Francoise Lim Meffre and Pierre Feillet are engineers at the IBM Decision Lab. Marie-Francoise is a senior developer taking care of the decision automation samples.
Pierre is an architect in decision automation, and is passionate about data science and machine learning.
