# Give Me Some Credit
![](https://www.freshfacs.com/v/vspfiles/photos/D3-2.jpg)

Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.


##### Around 6% of samples defaulted
- MonthlyIncome and NumberOfDependents have 29731 (19.82%) and 3924 (2.61%) null values respectively
- When NumberOfTimes90DaysLate is above 17, there are 267 instances where the three columns NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTime30-59DaysPastDueNotWorse share the same value, either 96 or 98.
    - These values are not plausible: each 30-59 day past-due episode lasts at least 30 days, so at most about 24 such episodes fit into a two-year (roughly 730-day) window, far fewer than 96.
- RevolvingUtilizationOfUnsecuredLines
    - Defined as ratio of the total amount of money owed to total credit limit
    - Distribution of values is right-skewed; consider removing outliers
    - It is expected that as this value increases, the proportion of people defaulting should increase as well
    - However, among samples where this value is at least 13, the proportion of defaulters is smaller than among clients whose total amount owed does not exceed their total credit limit.
    - Thus we should remove samples whose RevolvingUtilizationOfUnsecuredLines value is greater than or equal to 13
- age
    - Younger people seem to default more often, and the distribution looks fine on the whole
- NumberOfTimes90DaysLate
    - It is interesting to note that no one is 90 or more days past due between 17 and 96 times.
- NumberOfTime60-89DaysPastDueNotWorse
    - It is interesting to note that no one is 60-89 days past due between 11 and 96 times.
- NumberOfTime30-59DaysPastDueNotWorse
    - It is interesting to note that no one is 30-59 days past due between 13 and 96 times.
- DebtRatio
    - 2.5% of clients have a DebtRatio of around 3490 or more
    - Within this 2.5%, only 185 people have a MonthlyIncome value, and those values are either 0 or 1.
    - Of these 185 people, 164 fall into one of two groups: no monthly income and no default, or a monthly income and a default.
- MonthlyIncome
    - The distribution of values is skewed, so we can consider imputing with the median.
    - We can also consider imputing with values drawn from a normal distribution with the column's mean and standard deviation.
- NumberOfDependents
    - We can consider imputing with its mode, which is zero.
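
The cleaning ideas in the notes above can be sketched on a small synthetic frame. This is only an illustration and is not the preprocessing applied later in the notebook: the toy values and the helper name `clean_credit_frame` are assumptions, while the column names and thresholds (the 96/98 counts, a utilization of at least 13, median and mode imputation) come from the notes.

```python
import numpy as np
import pandas as pd

PAST_DUE_COLS = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]


def clean_credit_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning ideas from the notes."""
    # Rows where any past-due counter holds the implausible values 96 or 98.
    implausible = df[PAST_DUE_COLS].isin([96, 98]).any(axis=1)
    # Rows whose revolving utilization is 13 or more.
    too_high = df["RevolvingUtilizationOfUnsecuredLines"] >= 13
    out = df.loc[~implausible & ~too_high].copy()

    # Impute the skewed MonthlyIncome with its median ...
    out["MonthlyIncome"] = out["MonthlyIncome"].fillna(out["MonthlyIncome"].median())
    # ... and NumberOfDependents with its mode.
    mode_dependents = out["NumberOfDependents"].mode().iloc[0]
    out["NumberOfDependents"] = out["NumberOfDependents"].fillna(mode_dependents)
    return out.reset_index(drop=True)


# Made-up rows: one with 98s, one with a utilization of 50, two with gaps.
toy = pd.DataFrame({
    "RevolvingUtilizationOfUnsecuredLines": [0.2, 0.9, 50.0, 0.4, 0.1],
    "NumberOfTime30-59DaysPastDueNotWorse": [0, 98, 0, 1, 0],
    "NumberOfTime60-89DaysPastDueNotWorse": [0, 98, 0, 0, 0],
    "NumberOfTimes90DaysLate": [0, 98, 0, 0, 0],
    "MonthlyIncome": [3000.0, 4000.0, 5000.0, np.nan, 7000.0],
    "NumberOfDependents": [0.0, 1.0, 2.0, 0.0, np.nan],
})
cleaned = clean_credit_frame(toy)
print(cleaned)
```

Computing the median and mode after the filtering step keeps the removed rows from influencing the imputed values; whether that ordering is desirable on the real data is a judgment call.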

# Imports

In [None]:
import numpy as np # linear algebra
import json
import os
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

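# Use the fully qualified option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 50)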

In [None]:
from platform import python_version
print(python_version())

In [None]:
## https://github.com/SoftwareAG/nyoka
    
!pip install pypmml
!pip install --upgrade nyoka

In [None]:
!pip install numpy protobuf==3.16.0
!pip install onnx

In [None]:
!pip install pydot

In [None]:
!pip install onnxmltools

In [None]:
!pip install onnxruntime
!pip install skl2onnx

In [None]:
## ONNX 
import onnxruntime as rt
import onnx
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import convert_sklearn, update_registered_converter
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes  # noqa
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost  # noqa
from onnx.tools.net_drawer import GetPydotGraph, GetOpNodeProducer


In [None]:
from sklearn.model_selection import train_test_split
from sklearn import model_selection
import joblib
from sklearn.model_selection import GridSearchCV, StratifiedKFold, KFold
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from pypmml import Model
import numpy as np

### Read Data

In [None]:
%%time
sampleEntry = pd.read_csv('/kaggle/input/GiveMeSomeCredit/sampleEntry.csv')
train = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv')
test = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv')
del test['Unnamed: 0']
del test['SeriousDlqin2yrs']
del train['Unnamed: 0']

| Variable Name | Description | Type |
| --- | --- | --- |
| ``SeriousDlqin2yrs`` | Person experienced 90 days past due delinquency or worse | Y/N |
| ``RevolvingUtilizationOfUnsecuredLines`` | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
| ``age`` | Age of borrower in years | integer |
| ``NumberOfTime30-59DaysPastDueNotWorse`` | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| ``DebtRatio`` | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
| ``MonthlyIncome`` | Monthly income | real |
| ``NumberOfOpenCreditLinesAndLoans`` | Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards) | integer |
| ``NumberOfTimes90DaysLate`` | Number of times borrower has been 90 days or more past due. | integer |
| ``NumberRealEstateLoansOrLines`` | Number of mortgage and real estate loans including home equity lines of credit | integer |
| ``NumberOfTime60-89DaysPastDueNotWorse`` | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
| ``NumberOfDependents`` | Number of dependents in family excluding themselves (spouse, children etc.) | integer |

In [None]:
print('train shape  ',train.shape)

In [None]:
train['SeriousDlqin2yrs'] = train['SeriousDlqin2yrs'].astype(np.int32)
train['age'] = train['age'].astype(np.int32)
train['NumberOfTime30-59DaysPastDueNotWorse'] = train['NumberOfTime30-59DaysPastDueNotWorse'].astype(np.int32)
train['NumberOfOpenCreditLinesAndLoans'] = train['NumberOfOpenCreditLinesAndLoans'].astype(np.int32)
train['NumberOfTimes90DaysLate'] = train['NumberOfTimes90DaysLate'].astype(np.int32)
train['NumberRealEstateLoansOrLines'] = train['NumberRealEstateLoansOrLines'].astype(np.int32)
train['NumberOfTime60-89DaysPastDueNotWorse'] = train['NumberOfTime60-89DaysPastDueNotWorse'].astype(np.int32)

target = 'SeriousDlqin2yrs'
features = ['RevolvingUtilizationOfUnsecuredLines',
            'age', 'NumberOfTime30-59DaysPastDueNotWorse',
 'DebtRatio', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
 'NumberOfDependents']

In [None]:
pipeline = Pipeline([
    ('LGBMC_preprocess',LGBMClassifier(n_estimators=5))
])
pipeline.fit(train[features], train[target])

### Convert the model to PMML
Now we can convert the model to PMML using nyoka:

In [None]:
from nyoka import lgb_to_pmml
lgb_to_pmml(pipeline,features,target,"lgbcreditmodel.pmml")

# XGB Classifier

In [None]:
def pipeline_train(train):
    """Train an XGBoost classifier on the rows of ``train`` that have no missing values.

    Parameters:
    train (dataframe): training data, including the SeriousDlqin2yrs target column

    Returns:
    XGBClassifier: fitted model
    """
    train = train.dropna()
    train_x, val_x, train_y, val_y=train_test_split(train.drop('SeriousDlqin2yrs',axis=1),train['SeriousDlqin2yrs'].astype('uint8'),test_size=.2,random_state=2021)
    xgb_cfl = xgb.XGBClassifier(n_jobs = -1, 
                                n_estimators = 50)
    xgb_cfl.fit(train_x, train_y)
    return xgb_cfl
model = pipeline_train(train)
type(model)

# Simulation - test deploy

In [None]:
# Despite the variable name, this path points to a serialized (joblib) model, not a PMML file
filename_pmml = '../input/model-credito/model_credito.sav'

In [None]:
json_simulation = test.head(1).to_json()
json_simulation
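
The string produced by `to_json()` is column-oriented, which is why the entry points in the following cells can rebuild a frame with `pd.DataFrame(json.loads(...))`. Here is a minimal sketch of that round trip on a made-up single-row frame (the values are illustrative and not taken from the test set). It also shows that a missing value travels as JSON `null` and comes back as `None`, so a numeric cast may be needed before scoring.

```python
import json

import pandas as pd

# Made-up single-row frame with one missing value.
row = pd.DataFrame({
    "age": [45],
    "DebtRatio": [0.8],
    "MonthlyIncome": [float("nan")],
})

payload = row.head(1).to_json()
print(payload)

# Column-oriented layout: {column: {row_label: value}}; NaN was written as null.
decoded = json.loads(payload)

# Rebuilding gives the same columns, with the row labels now strings ("0").
rebuilt = pd.DataFrame(decoded)

# Cast back to numeric so that None becomes NaN again.
rebuilt = rebuilt.astype(float)
print(rebuilt)
```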

### Using script

In [None]:
def entry_point_script(data, train_data):
    """Train a model on ``train_data`` and score the client(s) in ``data``.

    Parameters:
    data (json): json with client information
    train_data (dataframe): dataframe with train data

    Returns:
    json: probability that each client will experience financial distress in the next two years.
    """
    model = pipeline_train(train_data)
    jdata = json.loads(data)
    escoragem = pd.DataFrame(jdata)
    res = model.predict_proba(escoragem)[:,1]
    result = pd.DataFrame(res.tolist(), columns=['probability'])
    return result.to_json(orient = "records", lines=False)

In [None]:
%%time
res = entry_point_script(json_simulation, train)
res

# Using binary model

In [None]:
def entry_point_binarymodel(data, filename):
    """Score the client(s) in ``data`` with a serialized (joblib) model.

    Parameters:
    data (json): json with client information
    filename (str): path to the binary model

    Returns:
    json: probability that each client will experience financial distress in the next two years.
    """
    jdata = json.loads(data)
    escoragem = pd.DataFrame(jdata)
    # joblib.load accepts a path directly, so no file handle is left open
    loaded_model = joblib.load(filename)
    res = loaded_model.predict_proba(escoragem)[:,1]
    result = pd.DataFrame(res.tolist(), columns=['probability'])
    return result.to_json(orient = "records", lines=False)

In [None]:
%%time
res = entry_point_binarymodel(json_simulation, filename_pmml)
res

# Using PMML model
Validate whether the predictions of PMML are the same as ones produced by the Python model.


In [None]:
def entry_point_pmml(data):
    """Score the client(s) in ``data`` with the exported PMML model.

    Parameters:
    data (json): json with client information

    Returns:
    json: model outputs for each client, including the predicted class probabilities.
    """
    jdata = json.loads(data)
    escoragem = pd.DataFrame(jdata)
    loaded_model = Model.fromFile("lgbcreditmodel.pmml")
    # Score with the PMML model that was just loaded, not the global XGBoost `model`
    res = loaded_model.predict(escoragem)
    # pypmml returns a DataFrame for DataFrame input; its column names come from
    # the PMML output fields, so all of them are returned here
    return res.to_json(orient="records", lines=False)

In [None]:
%%time
res = entry_point_pmml(json_simulation)
res
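
One way to carry out the validation mentioned at the top of this section is to compare the class-1 probabilities from the two models within a tolerance. The sketch below uses only NumPy. The helper name `probabilities_agree`, the example arrays, and the default tolerance are all assumptions; small numerical differences between runtimes are to be expected, so the tolerance may need adjusting.

```python
import numpy as np


def probabilities_agree(reference, candidate, atol=1e-5):
    """Hypothetical helper: return (agree, max_abs_diff) for two probability vectors."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    if reference.shape != candidate.shape:
        # Different numbers of rows can never count as agreement.
        return False, float("inf")
    if reference.size == 0:
        return True, 0.0
    max_diff = float(np.max(np.abs(reference - candidate)))
    return max_diff <= atol, max_diff


# Made-up probabilities standing in for the Python model and the PMML model.
python_probs = [0.081, 0.412, 0.950]
pmml_probs = [0.0810001, 0.4119999, 0.950]

ok, diff = probabilities_agree(python_probs, pmml_probs)
print(ok, diff)
```

In the notebook, `reference` would come from `pipeline.predict_proba(...)[:, 1]` and `candidate` from the class-1 probability column returned by the PMML model. The name of that column depends on the PMML file, so inspect the output first.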

# Using ONNX

#### Model ONNX- generate pipeline

In [None]:
test.columns = [x.lower() for x in test.columns]
train.columns = [x.lower() for x in train.columns]

# Map the lower-cased column names to short feature names (f1..f10)
rename_map = {
    'seriousdlqin2yrs': 'target',
    'revolvingutilizationofunsecuredlines': 'f1',
    'age': 'f2',
    'numberoftime30-59dayspastduenotworse': 'f3',
    'debtratio': 'f4',
    'monthlyincome': 'f5',
    'numberofopencreditlinesandloans': 'f6',
    'numberoftimes90dayslate': 'f7',
    'numberrealestateloansorlines': 'f8',
    'numberoftime60-89dayspastduenotworse': 'f9',
    'numberofdependents': 'f10',
}
train.rename(columns=rename_map, inplace=True)
test.rename(columns=rename_map, inplace=True)

train.fillna(-1, inplace=True)
test.fillna(-1, inplace=True)

# Cast the target and every feature to int32. Note that this truncates
# fractional values, e.g. in the ratio columns f1 and f4.
for col in ['target'] + [f'f{i}' for i in range(1, 11)]:
    train[col] = train[col].astype(np.int32)

for col in [f'f{i}' for i in range(1, 11)]:
    test[col] = test[col].astype(np.int32)

target = 'target'
features = ['f1','f2','f3','f4','f5','f6','f7','f8','f9', 'f10']

pipeline2 = Pipeline([
    ('XGB_preprocess',XGBClassifier(n_estimators=5))
])
pipeline2.fit(train[features], train[target])


In [None]:

update_registered_converter(
    XGBClassifier, 'XGBoostXGBClassifier',
    calculate_linear_classifier_output_shapes, convert_xgboost,
    options={'nocl': [True, False], 'zipmap': [True, False, 'columns']})



model_onnx = convert_sklearn(
    pipeline2, 'pipeline_xgboost',
    [('input', FloatTensorType([None, 10]))],
    target_opset=10)

# And save.
with open("pipeline_xgboost.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())
    

## Display the ONNX graph

In [None]:
pydot_graph = GetPydotGraph(
    model_onnx.graph, name=model_onnx.graph.name, rankdir="TB",
    node_producer=GetOpNodeProducer(
        "docstring", color="yellow",
        fillcolor="yellow", style="filled"))
pydot_graph.write_dot("pipeline.dot")

os.system('dot -O -Gdpi=300 -Tpng pipeline.dot')

image = plt.imread("pipeline.dot.png")
fig, ax = plt.subplots(figsize=(40, 20))
ax.imshow(image)
ax.axis('off')

In [None]:
def entry_point_onnx(data):
    """Score the client in ``data`` with the exported ONNX pipeline.

    Parameters:
    data (json): data for prediction

    Returns:
    json: probabilities of both classes; class 1 means the client will experience financial distress in the next two years.
    """
    jdata = json.loads(data)
    escoragem = pd.DataFrame(jdata)
    sess = rt.InferenceSession("pipeline_xgboost.onnx")
    pred_onx = sess.run(None, {"input": escoragem.values.astype(np.float32)})

    # pred_onx[1] holds one {class: probability} mapping per row; only the first row is returned
    probabilities = pred_onx[1][0]
    result = pd.DataFrame([{"probability_target_0": probabilities[0],
                            "probability_target_1": probabilities[1]}])
    return result.to_json(orient="records", lines=False)


In [None]:
%%time
res = entry_point_onnx(json_simulation)
res
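
The indexing `pred_onx[1][0][0]` in `entry_point_onnx` relies on the layout that skl2onnx classifiers typically produce when the default ZipMap operator is kept: the first output holds the predicted labels, and the second is a list with one `{class_label: probability}` dictionary per row. The sketch below mimics that layout with a hand-written stand-in and extends the extraction to any number of rows. The numbers are invented, no ONNX runtime is called, and the helper name `probabilities_to_json` is hypothetical.

```python
import json


def probabilities_to_json(outputs):
    """Hypothetical helper: turn a [labels, list-of-dicts] output into JSON records."""
    _, prob_maps = outputs
    records = [
        {
            "probability_target_0": float(probs[0]),
            "probability_target_1": float(probs[1]),
        }
        for probs in prob_maps
    ]
    return json.dumps(records)


# Stand-in for the result of sess.run(...) on two rows (invented values).
fake_outputs = [
    [0, 1],                                        # predicted labels
    [{0: 0.93, 1: 0.07}, {0: 0.35, 1: 0.65}],      # per-row class probabilities
]
scored = probabilities_to_json(fake_outputs)
print(scored)
```

If the model is converted with the ZipMap operator disabled, the second output is a plain array instead, so it is worth checking the actual output type before indexing.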

## End notebook

# Final