In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score

import warnings
def ignore_warn(*args, **kwargs):
    pass

from IPython.display import Image
import pickle
import os 

# Features / Target

In [2]:
# Import the prepared client data
df = pd.read_csv('./dataset/data_prepared.csv', index_col=0)

# Split features and target
X = df.drop(labels=["_will_pay"], axis=1)
y = df['_will_pay']

# Drop the features that are not used by the models
X = X.drop(['_is_account_recent', '_is_common_type', 'log_income'], axis=1)

# Standardize the features (note: the scaler is fit on all rows, before the split)
scaler = StandardScaler()
scaler.fit(X)

# Persist the fitted scaler
with open('./scaler/' + 'scaler', 'wb') as f:
    pickle.dump(scaler, f)

X_SS = pd.DataFrame(scaler.transform(X), index=X.index, columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_SS, y, test_size=0.2, random_state=42)



Mined / Engineered Features

- monthly_income:                
- monthly_outcome: 
- ratio_outcome_income: monthly_income/monthly_outcome (note the order: the report rows give, e.g., 3500/7444 ≈ 0.470)
- sqrt_income: square root of monthly income
- sqrt_outcome:
- log_income: natural log of monthly income
- log_outcome:
- total_credit_payments: Total number of payments
- payments_per_year: Number of payments per year
- loan_term: Length of the payment period in years
- loan_amount: Total amount lent

- worst_previous_delinquency: Worst registered loan delinquency (amount) before account opening
- worst_previous_fraction: Worst registered loan delinquency (amount) before account opening
- _is_account_recent: Opened after 2012
- _is_common_type: Belongs to the most typical operations (75th percentile)
         


Boolean Target Feature:

- _will_pay = (worst_delinquency_past_due_estimated / loan_amount) < 0.15
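The target definition above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the `threshold` parameter and the guard against non-positive amounts are hypothetical and not taken from the notebook that prepared the data.

```python
def will_pay(worst_delinquency_past_due_estimated, loan_amount, threshold=0.15):
    """Return True when the worst past-due amount is a small fraction of the loan.

    Mirrors: _will_pay = (worst_delinquency_past_due_estimated / loan_amount) < 0.15
    The ValueError guard is an assumption added for this sketch.
    """
    if loan_amount <= 0:
        raise ValueError("loan_amount must be positive")
    return (worst_delinquency_past_due_estimated / loan_amount) < threshold


# Hypothetical numbers, for illustration only
small_delinquency = will_pay(100.0, 1000.0)   # fraction 0.10
large_delinquency = will_pay(200.0, 1000.0)   # fraction 0.20
```

With these made-up inputs the first client is labeled as paying and the second is not.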

Important: the good/bad client class was completely ignored
because it had no clear meaning.

This model was chosen so that the provided data can be used to simulate the
lending business: given a client with these data who asks us for
a certain {loan_amount, loan_term, payments_per_year}, predict
whether the client will pay.



#### Important notes

Important: the only users kept were those that provided enough
information for loan_amount and loan_term to be estimated.
These are essential features that we need to reach the objectives of this
project. See data_understanding.ipyn for details.

In particular, the loan_amount was estimated as follows:

- If the loan IS delinquent:

$\text{loan_amount} = \text{total_credit_payments} \times \text{past_due_balance} / \text{number_of_payments_due}$

$\text{loan_amount} = \text{total_credit_payments} \times \text{worst_delinquency_past_due_balance} / \text{worst_delinquency}$

- If the loan is NOT delinquent and current_balance $\ne$ 0:

$\text{loan_amount} = \text{total_credit_payments} \times \text{amount_to_pay_next_payment}$

- If the loan is NOT delinquent and current_balance $=$ 0:

loan_amount =? maximum_credit_amount

where =? means that maximum_credit_amount is used as the maximum (upper-bound) estimate of the loan.
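The three estimation cases can be sketched as one function. This is a minimal sketch, not the code used to prepare the data: the function name is hypothetical, and treating "delinquent" as `number_of_payments_due > 0` is an assumption made only for this illustration.

```python
def estimate_loan_amount(total_credit_payments, number_of_payments_due,
                         past_due_balance, current_balance,
                         amount_to_pay_next_payment, maximum_credit_amount):
    """Estimate the loan amount following the three cases listed above."""
    # Case 1 (assumption: a loan is delinquent when payments are due)
    if number_of_payments_due > 0:
        per_payment = past_due_balance / number_of_payments_due
        return total_credit_payments * per_payment
    # Case 2: not delinquent, balance still open
    if current_balance != 0:
        return total_credit_payments * amount_to_pay_next_payment
    # Case 3: not delinquent, nothing left to pay; use the upper-bound estimate
    return maximum_credit_amount


# Hypothetical inputs, for illustration only
delinquent = estimate_loan_amount(24, 2, 1000.0, 5000.0, 500.0, 20000.0)
open_loan = estimate_loan_amount(12, 0, 0.0, 3000.0, 250.0, 10000.0)
closed_loan = estimate_loan_amount(12, 0, 0.0, 0.0, 0.0, 8000.0)
```

In the delinquent case each due payment is taken to be 1000 / 2 = 500, so 24 payments give an estimate of 12000.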

# Load models

In [3]:
filenames_models = os.listdir('./trained_models/')
filenames_tables = [i for i in os.listdir('./plots_tables/') if '.csv' in i]
filenames_plots = [i for i in os.listdir('./plots_tables/') if '.pdf' in i]

In [4]:
# Load every pickled model, keyed by its file name
trained_models_dic = {}
for name in filenames_models:
    with open('./trained_models/' + name, 'rb') as f:
        trained_models_dic[name] = pickle.load(f)

In [5]:
# Collect cross-validation (training) and test precision for every model

scores_train = [] ; std_train = [] ; scores_test = [] ; models_names = []

for mod in trained_models_dic:
    models_names.append(mod)

    # 4-fold cross-validated precision on the training set
    prec = cross_val_score(trained_models_dic[mod], X_train, y_train, scoring="precision", cv=4)
    scores_train.append(prec.mean())
    std_train.append(prec.std())

    # Precision of the already-trained model on the held-out test set
    prec = precision_score(y_test, trained_models_dic[mod].predict(X_test))
    scores_test.append(prec)



In [6]:
X_test.head()

Unnamed: 0,monthly_income,monthly_outcome,total_credit_payments,payments_per_year,loan_term,loan_amount,worst_previous_delinquency,worst_previous_fraction,ratio_outcome_income,sqrt_income,sqrt_outcome,log_outcome
11124,-0.287868,-0.180919,-0.254921,-0.008374,-0.319724,-0.008868,-0.153549,-0.247961,-0.182469,-0.492472,-0.349963,-0.187827
13377,-0.284321,-0.181144,-0.254921,-0.008374,-0.319724,-0.041213,-0.155076,-0.251331,-0.180725,-0.455618,-0.353315,-0.199716
14903,-0.280203,-0.186192,-0.364267,1.687265,-0.695949,-0.235358,-0.128861,-0.234484,-0.164823,-0.41845,-0.4475,-0.627711
16081,0.328234,0.873751,3.389944,-0.008374,2.919304,1.52861,-0.155076,-0.251331,-0.184802,1.06651,2.162979,1.883948
7515,-0.288603,-0.168647,0.80209,1.687265,-0.21757,-0.117514,0.358847,6.218002,-0.185282,-0.500883,-0.209534,0.202135


# Evaluation of will_pay classification model(s) 

The classification metric I use is precision, because we are trying to
minimize the false positives: every client wrongly predicted to pay costs us money! The rationale behind this choice is
that, in the absence of expertise, I aim to minimize possible losses instead of maximizing possible gains.
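To make the metric concrete: precision is TP / (TP + FP), the fraction of clients predicted to pay who actually paid. Below is a minimal pure-Python sketch with made-up labels; returning 0.0 when no client is predicted positive is an assumption of this sketch.

```python
def precision(y_true, y_pred):
    """Precision = TP / (TP + FP) for binary labels (1 = will pay)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        return 0.0  # assumption: no positive predictions counts as 0 precision
    return tp / (tp + fp)


# Made-up labels: 4 clients predicted to pay, 3 of them actually paid
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
score = precision(y_true, y_pred)  # 3 / (3 + 1)
```

The single false positive (the third client) is the case that would cost money, which is why it enters the denominator.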

- The precision reached by the models used is similar: approximately 89% ± 4%.
- The standard deviation of the blended model is small (about 0.005) because it results from combining the 6 best models through soft voting.
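As a reminder of what soft voting does: the class probabilities predicted by the individual models are averaged, and the class with the highest average wins. The sketch below uses plain Python and made-up probabilities for three hypothetical models; it is not the blended model itself.

```python
def soft_vote(probas, weights=None):
    """Average per-model class probabilities and return (label, averages).

    probas: one [p_class0, p_class1, ...] list per model.
    weights: optional per-model weights (assumed equal when omitted).
    """
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    averages = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averages[c])
    return label, averages


# Made-up probabilities from three hypothetical models for one client
label, averages = soft_vote([[0.4, 0.6], [0.7, 0.3], [0.2, 0.8]])
```

Here one model votes against the client, but the averaged probability of class 1 is still the larger one, so the blended decision is 1.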

The details in modeling.ipyn can be summarized by saying that some models
certainly have a variance problem and that there is an imbalance in will_pay.




In [7]:
# Creating a table of results, ranked highest to lowest
results = pd.DataFrame({
    'Model': models_names,
    'Cross-validation set precision score': scores_train,
    'Cross-validation std': std_train,
    'Test set precision score': scores_test})

result_df = results.sort_values(by='Test set precision score', ascending=False).reset_index(drop=True)


result_df.to_csv('./plots_tables/test_set_scores.csv')

result_df.head(12)



Unnamed: 0,Model,Cross-validation set precision score,Cross-validation std,Test set precision score
0,K_Nearest_Neighbour,0.903034,0.011705,0.93133
1,blended,0.897466,0.004884,0.910781
2,XGBoost,0.904422,0.003665,0.906367
3,Random_Forest,0.897872,0.003632,0.904412
4,Gradient_Boosting,0.905802,0.006588,0.901515
5,Bagging_Classifier,0.878798,0.00553,0.889734
6,Extra_Trees,0.874488,0.006986,0.889286
7,AdaBoost,0.889859,0.006729,0.886861
8,SVC,0.876788,0.008946,0.882784
9,Gaussian_Process,0.845644,0.003902,0.863014


# Answer 1st question: Pick the best clients

Given some new clients (say, our test set) who provide the features of our model, I
select the best ones with the classifier:

In [8]:
report = pd.DataFrame(scaler.inverse_transform(X_test), index=X_test.index, columns=X_test.columns)

report.index.name='user_id'

report.reset_index(level=0, inplace=True)

report['_will_pay_predicted'] = trained_models_dic['blended'].predict(X_test)

report.to_csv('./dataset/report.csv')

In [9]:
report.head(1)

Unnamed: 0,user_id,monthly_income,monthly_outcome,total_credit_payments,payments_per_year,loan_term,loan_amount,worst_previous_delinquency,worst_previous_fraction,ratio_outcome_income,sqrt_income,sqrt_outcome,log_outcome,_will_pay_predicted
0,11124,3500.0,7444.0,19.0,24.0,0.791667,46588.0,25.0,0.0125,0.470177,59.160798,86.278618,8.915164,1


# Answer 2nd question: Propose amount and term

Notice that our (business) model tells us, for each new client who requests a certain

- total_credit_payments
- payments_per_year
- loan_term
- loan_amount = (money_requested)*(1+interest)

whether this client is likely to pay back or not. More on the interest below.

In fact, if our client does not qualify, we can even modify the requested terms
so that our prediction system gives us a high probability of payback.

As an example, let us take a bad client:

In [10]:
report[report._will_pay_predicted==0].head(6)

Unnamed: 0,user_id,monthly_income,monthly_outcome,total_credit_payments,payments_per_year,loan_term,loan_amount,worst_previous_delinquency,worst_previous_fraction,ratio_outcome_income,sqrt_income,sqrt_outcome,log_outcome,_will_pay_predicted
5,7512,3245.0,18604.0,1.0,12.0,0.083333,350.416667,0.0,0.0,0.174425,56.964901,136.396481,9.831132,0
18,4668,4020.0,3834.0,18.0,12.0,1.5,4674.0,0.0,0.0,1.048513,63.40347,61.919302,8.251664,0
31,4983,105763.0,379769.0,1.0,12.0,0.083333,114.0,0.0,0.0,0.278493,325.212238,616.254006,12.847318,0
40,6442,3813.0,4392.0,1.0,12.0,0.083333,350.153846,0.0,0.0,0.868169,61.749494,66.272166,8.38754,0
44,12626,156100.0,143383.0,1.0,12.0,0.083333,747.0,0.0,0.0,1.088693,395.094925,378.659478,11.873275,0
69,14128,2011464.0,547658.0,24.0,24.0,1.0,12000.0,0.0,0.0,3.672847,1418.260907,740.039188,13.213406,0


In [11]:
client = report[report.user_id==14128]

client.to_csv('./dataset/client.csv')
client

Unnamed: 0,user_id,monthly_income,monthly_outcome,total_credit_payments,payments_per_year,loan_term,loan_amount,worst_previous_delinquency,worst_previous_fraction,ratio_outcome_income,sqrt_income,sqrt_outcome,log_outcome,_will_pay_predicted
69,14128,2011464.0,547658.0,24.0,24.0,1.0,12000.0,0.0,0.0,3.672847,1418.260907,740.039188,13.213406,0


#### By changing the loan_amount, our algorithm tells us what we can offer to this client

In [12]:
client = pd.read_csv('./dataset/client.csv', index_col = 0)

# Note: `mod` is the last key left over from the scoring loop in In [5],
# so the model queried here is whichever file os.listdir returned last.
for i in range(1,9):
    client.at[69,'loan_amount'] = 10**i
    decision=trained_models_dic[mod].predict( 
        scaler.transform(client.drop(columns = ['user_id', '_will_pay_predicted'])))
    print('loan_amount:', 10**i, ' _will_pay_prediction',decision)


loan_amount: 10  _will_pay_prediction [0]
loan_amount: 100  _will_pay_prediction [0]
loan_amount: 1000  _will_pay_prediction [1]
loan_amount: 10000  _will_pay_prediction [0]
loan_amount: 100000  _will_pay_prediction [0]
loan_amount: 1000000  _will_pay_prediction [0]
loan_amount: 10000000  _will_pay_prediction [0]
loan_amount: 100000000  _will_pay_prediction [0]


#### Insight: the prediction is non-linear (indeed non-monotonic) in the loan_amount.
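Since only one of the eight powers of ten is accepted, a finer scan could help locate the band of amounts we can offer. The following is a minimal sketch, assuming a hypothetical `predict` callable that would wrap the scaler and model calls; `toy_predict` is a made-up stand-in with an arbitrary accepted band, not the trained model.

```python
def accepted_amounts(predict, candidates):
    """Return the candidate loan amounts for which predict(amount) == 1."""
    return [amount for amount in candidates if predict(amount) == 1]


def toy_predict(amount):
    # Hypothetical stand-in for the classifier: accepts a middle band only
    return 1 if 500 <= amount <= 5000 else 0


# Log-spaced grid from 10 to 100000, four points per decade
candidates = [10 ** (k / 4) for k in range(4, 21)]
offers = accepted_amounts(toy_predict, candidates)
```

With the real model, `predict` would set `loan_amount` on the client row, apply `scaler.transform` and call the classifier, as in the loop above.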

In [13]:
for i in range(1,5):
    client.at[69,'loan_amount'] = 10**4
    client.at[69,'loan_term'] = i/20  # loan_term is expressed in years
    decision=trained_models_dic[mod].predict( 
        scaler.transform(client.drop(columns = ['user_id', '_will_pay_predicted'])))
    # Print the value actually passed to the model
    print('loan_term in years:', i/20, ' _will_pay_prediction',decision)

loan_term in years: 0.05  _will_pay_prediction [0]
loan_term in years: 0.1  _will_pay_prediction [0]
loan_term in years: 0.15  _will_pay_prediction [0]
loan_term in years: 0.2  _will_pay_prediction [0]


#### Insight: short loan_terms are not necessarily good!


# Answer 3rd question: Annual interest rate the lent amount must carry in order to be profitable

To estimate the interest, we first need to estimate the losses.

For this, we have to use a large test set and continually estimate how
much we are losing, because our model is only an approximation
of a stochastic process.

In [14]:
tmp=pd.DataFrame(y_test)
tmp.index.name='user_id'
tmp.reset_index(level=0, inplace=True)

report['did_pay'] = tmp['_will_pay']

report.to_csv('./dataset/interest.csv')

In [15]:
losses_condition= (report['_will_pay_predicted']==1) & (report['did_pay']==0)

gains_condition= (report['_will_pay_predicted']==1) & (report['did_pay']==1)

losses = report[losses_condition]['loan_amount'].sum()

paid_back = report[gains_condition]['loan_amount'].sum()



## The condition for the business to be profitable is that the interest covers the losses:

$ paid_{back} = loan_{WithoutInterest}  (1 + interest_{rate})$

$ loan_{WithoutInterest} \times interest_{rate} \ge losses $
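Substituting the first relation into the second removes the unknown principal: the principal equals paid_back / (1 + rate), so the condition becomes rate × paid_back ≥ losses × (1 + rate), that is, rate ≥ losses / (paid_back − losses) = (paid_back / losses − 1)^(−1). A minimal sketch of that bound, with hypothetical numbers and a hypothetical function name:

```python
def minimum_interest_rate(paid_back, losses):
    """Smallest rate (as a fraction) at which the interest covers the losses.

    Derived from: paid_back = principal * (1 + rate)
                  principal * rate >= losses
    """
    if losses <= 0:
        return 0.0  # nothing to cover
    if paid_back <= losses:
        raise ValueError("losses exceed what was paid back; no finite rate works")
    return losses / (paid_back - losses)


# Hypothetical totals, for illustration only
rate = minimum_interest_rate(paid_back=1050.0, losses=50.0)
principal = 1050.0 / (1 + rate)   # amount lent without interest
interest_earned = principal * rate
```

With these made-up totals the principal is 1000 and the interest earned is exactly the 50 lost, so the rate of 5% is the break-even point.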

In [16]:
minimum_interest_rate=(paid_back / losses -1 )**(-1)  *100

print("""
This means that for this business to be profitable we need to set the interest rate above 
""",minimum_interest_rate, '%')


This means that for this business to be profitable we need to set the interest rate above 
 5.303309570812077 %


Excellent result, if you ask me.