# Loan Default Prediction - Imperial College London


# Objective of Competition

This competition asks you to determine whether a loan will default, as well as the loss incurred if it does default. Unlike traditional finance-based approaches to this problem, where one distinguishes between good and bad counterparties in a binary way, we seek to anticipate and incorporate both the default and the severity of the losses that result. In doing so, we are building a bridge between traditional banking, where the aim is to reduce the consumption of economic capital, and an asset-management perspective, where we optimize the risk to the financial investor.

# About Data

This data corresponds to a set of financial transactions associated with individuals. The data has been standardized, de-trended, and anonymized. You are provided with over two hundred thousand observations and nearly 800 features. Each observation is independent of the previous one.

For each observation, it was recorded whether a default was triggered. In case of a default, the loss was measured. This quantity lies between 0 and 100. It has been normalised, considering that the notional of each transaction at inception is 100. For example, a loss of 60 means that only 40 is reimbursed. If the loan did not default, the loss was 0. You are asked to predict the losses for each observation in the test set.
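
The normalisation described above can be restated as a small sketch. The function name `normalised_loss` and its arguments are hypothetical and do not come from the competition data; the code only reproduces the arithmetic for a notional of 100.

```python
def normalised_loss(notional: float, reimbursed: float) -> float:
    """Loss as a percentage of the notional at inception (0 = no loss, 100 = total loss)."""
    if notional <= 0:
        raise ValueError("notional must be positive")
    if not 0 <= reimbursed <= notional:
        raise ValueError("reimbursed must lie between 0 and the notional")
    return 100.0 * (notional - reimbursed) / notional


# A loan with notional 100 on which only 40 is reimbursed has a loss of 60.
print(normalised_loss(100, 40))   # 60.0
# A loan that did not default is fully reimbursed, so the loss is 0.
print(normalised_loss(100, 100))  # 0.0
```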

Missing feature values have been kept as is, so that the competing teams can really use the maximum data available, implementing a strategy to fill the gaps if desired. Note that some variables may be categorical (e.g. f776 and f777).
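
One possible gap-filling strategy is median imputation, sketched below in plain Python. This is an illustration only and is not used elsewhere in this notebook; the helper `fill_with_median` and the sample values are made up.

```python
import math
from statistics import median


def fill_with_median(values):
    """Replace missing entries (None or NaN) with the median of the observed ones."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    if not observed:
        return list(values)  # no observed value to compute a median from
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]


column = [1.0, None, 3.0, float("nan"), 10.0]  # made-up feature values
print(fill_with_median(column))  # [1.0, 3.0, 3.0, 3.0, 10.0]
```

For a pandas DataFrame the same idea is usually written as `df.fillna(df.median(numeric_only=True))`. Categorical columns would need a different rule, such as filling with the most frequent value.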

The competition sponsor has worked to remove time-dimensionality from the data. However, the observations are still listed in order from old to new in the training set. In the test set they are in random order.
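
Because the training rows are listed from old to new, a validation set can be taken from the most recent rows instead of a random sample. The sketch below shows such a split; `chronological_split` is a hypothetical helper, and whether a time-ordered holdout is preferable here is an assumption, since the test set is in random order.

```python
def chronological_split(rows, holdout_fraction=0.2):
    """Split an ordered sequence into (older, newer) parts without shuffling."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    n_holdout = round(len(rows) * holdout_fraction)
    cut = len(rows) - n_holdout
    return rows[:cut], rows[cut:]


ids = list(range(1, 11))  # stand-in for training rows listed from old to new
older, newer = chronological_split(ids, holdout_fraction=0.2)
print(older)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(newer)  # [9, 10]
```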


#### Unique Identifier :- 

* id

#### Features :- 

* f1-f778

#### Target Variable :-

* loss 

# Evaluation metric

This competition is evaluated on the mean absolute error (MAE).
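
For $n$ observations with true losses $y_i$ and predictions $\hat{y}_i$, the metric is $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. A minimal sketch of the computation follows; the sample values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [0, 0, 60, 0]      # most loans do not default, so their loss is 0
predicted = [0, 5, 50, 0]
print(mean_absolute_error(actual, predicted))  # 3.75
```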

In [None]:
!pip install pyspark==3.0.0
!pip install h2o_pysparkling_3.0

In [None]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

In [None]:
spark = SparkSession.builder.master("local[2]").appName("Loan Loss Prediction").getOrCreate()
sc = spark.sparkContext
sc

# Initialising Sparkling Water

In [None]:
from pysparkling import *
import h2o
hc = H2OContext.getOrCreate()

# H2O AutoML Approach

In [None]:
import h2o
print(h2o.__version__)
from h2o.automl import H2OAutoML

h2o.init(max_mem_size='16G')

# Data Loading

In [None]:
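def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns of a pandas DataFrame to the smallest dtype whose range covers the column's min and max."""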
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
%%time
train = h2o.import_file("/kaggle/input/loan-default-prediction/train_v2.csv.zip")
test = h2o.import_file("/kaggle/input/loan-default-prediction/test_v2.csv.zip")

In [None]:
print(f'Size of training set: {train.shape[0]} rows and {train.shape[1]} columns')

In [None]:
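# Use every column except the target and the row identifier as predictors
y = 'loss'
x = [col for col in train.columns if col not in (y, 'id')]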

# H2O AutoML 

In [None]:
aml = H2OAutoML(max_runtime_secs = 3500, seed = 1, project_name = "lb_frame")
aml.train(x = x, y = y, training_frame = train)

# Leaderboard
Next, we will view the AutoML Leaderboard. Since we did not pass a leaderboard_frame to the H2OAutoML.train() method, the AutoML leaderboard ranks the models by their cross-validation metrics.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. A different ranking metric, such as MAE for this competition, can be requested through the sort_metric argument of H2OAutoML.
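
Since the competition is scored on MAE, the leaderboard could be ranked by that metric instead of the default. The sketch below reuses the settings of the AutoML call in this notebook and adds `sort_metric`. The names `AUTOML_SETTINGS` and `build_automl` are hypothetical, and `h2o` is imported lazily so that the settings can be inspected without a running cluster.

```python
# Settings mirror the AutoML call in this notebook, plus sort_metric.
AUTOML_SETTINGS = {
    "max_runtime_secs": 3500,
    "seed": 1,
    "project_name": "lb_frame",
    "sort_metric": "MAE",  # rank the leaderboard by the competition metric
}


def build_automl(settings=AUTOML_SETTINGS):
    """Create an H2OAutoML object from a settings dictionary."""
    from h2o.automl import H2OAutoML  # requires the h2o package
    return H2OAutoML(**settings)


# Usage, assuming x, y and train are defined as in the cells above:
# aml = build_automl()
# aml.train(x=x, y=y, training_frame=train)
```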

In [None]:
lb = aml.leaderboard
lb.head() 

In [None]:

aml.leader

# Ensemble Exploration

To understand how the ensemble works, let's take a peek inside the Stacked Ensemble "All Models" model. The "All Models" ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.

In [None]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])

Examine the variable importance of the metalearner (combiner) algorithm in the ensemble. This shows us how much each base learner is contributing to the ensemble.

In [None]:
metalearner.coef_norm()

In [None]:
metalearner.std_coef_plot()
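
To make the role of these coefficients concrete: assuming the metalearner is a linear model (H2O's default metalearner for stacked ensembles is a GLM), a regression ensemble predicts an intercept plus a weighted sum of the base-model predictions. The sketch below uses made-up model names and coefficients, not values from this run.

```python
def combine(base_predictions, coefficients, intercept=0.0):
    """Linear metalearner: intercept + sum of coefficient * base-model prediction."""
    missing = set(coefficients) - set(base_predictions)
    if missing:
        raise KeyError(f"no prediction for base models: {sorted(missing)}")
    return intercept + sum(
        coefficients[name] * base_predictions[name] for name in coefficients
    )


# Hypothetical coefficients; a weight of 0 means the base model is ignored.
coefficients = {"gbm": 0.5, "drf": 0.25, "glm": 0.0}
base_predictions = {"gbm": 4.0, "drf": 8.0, "glm": 20.0}
print(combine(base_predictions, coefficients, intercept=1.0))  # 5.0
```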

In [None]:
pred = aml.predict(test)
pred.head()

In [None]:
h2o.save_model(aml.leader, path = "/kaggle/output/")

In [None]:
fnl = test[['id']]

In [None]:
fnl.as_data_frame()

In [None]:
test_fnl = pd.concat([fnl.as_data_frame(),pred.as_data_frame()],axis=1)

In [None]:
test_fnl.to_csv("/kaggle/output/result.csv",index=False)
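
The loss lies between 0 and 100, so regression predictions outside that range can be clipped before submission. The sketch below does this in plain Python and writes a file with the columns `id` and `loss`. The helper names are hypothetical, the sample predictions are made up, and the column names are an assumption about the expected submission format.

```python
import csv
import io


def clip_loss(value, low=0.0, high=100.0):
    """Restrict a predicted loss to the valid range [low, high]."""
    return max(low, min(high, value))


def write_submission(ids, predictions, handle):
    """Write id/loss rows to an open text handle, clipping each prediction."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "loss"])
    for row_id, pred in zip(ids, predictions):
        writer.writerow([row_id, clip_loss(pred)])


buffer = io.StringIO()  # stands in for an open CSV file
write_submission([1, 2, 3], [-0.4, 12.5, 130.0], buffer)
print(buffer.getvalue())
# id,loss
# 1,0.0
# 2,12.5
# 3,100.0
```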

In [None]:
# Reference: https://docs.h2o.ai/h2o-tutorials/latest-stable/tutorials/pysparkling/Chicago_Crime_Demo.html