
In [1]:
# Module: Data Science in Finance, AutoML
# Version 1.0
# Topic :  AutoML - H2O
# Example source: https://www.kaggle.com/wendykan/lending-club-loan-data
#####################################################################
# For support or questions, contact QuantUniversity at
# info@qusandbox.com
# Copyright 2020 QuantUniversity LLC.
#####################################################################

# AutoML with H2O

AutoML is the process of automating an end-to-end Machine Learning pipeline. The [H2O AutoML](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained. It also provides a UI called H2O Flow for monitoring model metrics.

### Imports

In [2]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
import h2o
from h2o.automl import H2OAutoML
h2o.init(max_mem_size='3G')

Looking in links: http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html
Collecting h2o
  Downloading https://files.pythonhosted.org/packages/26/c5/d63a8bfdbeb4ebfb709c010af3e061d89a363204c437cb5527431f6de3d2/h2o-3.32.0.2.tar.gz (164.6MB)
Collecting colorama>=0.3.8
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.32.0.2-py2.py3-none-any.whl size=164620456 sha256=e5c3516be9d0dea42718bb75c8d4f50b458a989f7ee9b8e8d2927793ed973f3c
  Stored in directory: /root/.cache/pip/wheels/42/bd/ea/218fd15724eddf6fa7fc8fab802b6fa592e623d87199679721
Successfully built h2o
Installing collected packages: colorama, h2o
Successfully installed colorama-0.4.4 h2o-3.32.0.2
Checking whether there 

| Property | Value |
|---|---|
| H2O_cluster_uptime | 02 secs |
| H2O_cluster_timezone | Etc/UTC |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.32.0.2 |
| H2O_cluster_version_age | 28 days, 7 hours and 55 minutes |
| H2O_cluster_name | H2O_from_python_unknownUser_jsddr0 |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 3 Gb |
| H2O_cluster_total_cores | 2 |
| H2O_cluster_allowed_cores | 2 |


In [3]:
# for numerical analysis and data processing
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.spatial.distance import cdist

import requests
from io import StringIO

### Dataset

The data set contains LendingClub lending data for a subset of borrowers, covering August 2011 to December 2011. Descriptions of the features are provided with it. The original data file contains redundant features that are not needed for making predictions; the file used here has already been cleaned and keeps only the relevant ones. The features are of two types: numerical and categorical.

Read the loan data from a CSV file and the feature descriptions from an Excel file.

In [4]:
orig_url_data='https://drive.google.com//file//d//1yG-JxC1Br3c8u3cfmKQWC9pgz6Pqggw5//view?usp=sharing'
file_id = orig_url_data.split('//')[-2]
dwn_url='https://drive.google.com//uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
df = pd.read_csv(csv_raw)

orig_url_description='https://drive.google.com//file//d//1HFd4gKbknC28rHTWysec48NqfB6g3ZHx//view?usp=sharing'
file_id = orig_url_description.split('//')[-2]
dwn_url='https://drive.google.com//uc?export=download&id=' + file_id
# read_excel fetches the file from the URL itself, so no separate download is needed
df_description = pd.read_excel(dwn_url)


del df['issue_d']  # drop the issue date; it is treated as a redundant feature here

print(df_description.head())

               LoanStatNew                                        Description
0               addr_state  The state provided by the borrower in the loan...
1               annual_inc  The self-reported annual income provided by th...
2         annual_inc_joint  The combined self-reported annual income provi...
3         application_type  Indicates whether the loan is an individual ap...
4  collection_recovery_fee                     post charge off collection fee
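
The loading cell above turns a Google Drive share link into a direct-download link by picking the file id out of the URL. A minimal sketch of that step on its own, using only the standard library, might look like the following. The helper name `drive_download_url` is hypothetical, and the sketch assumes share links of the `file/d/<id>/view` form used in this notebook.

```python
from urllib.parse import urlparse


def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive 'file/d/<id>/view' share link into a direct-download link."""
    # Drop empty segments so '/file/d/<id>' and '//file//d//<id>' parse the same way.
    parts = [p for p in urlparse(share_url).path.split("/") if p]
    if "d" not in parts or parts.index("d") + 1 >= len(parts):
        raise ValueError(f"no file id found in {share_url!r}")
    file_id = parts[parts.index("d") + 1]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share = "https://drive.google.com/file/d/1yG-JxC1Br3c8u3cfmKQWC9pgz6Pqggw5/view?usp=sharing"
print(drive_download_url(share))
```

Looking up the segment after `d` rather than counting from the end of the string keeps the helper working if the trailing `/view?...` part changes.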


In [5]:
df.head()

| Unnamed: 0 | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_length | home_ownership | annual_inc | verification_status | purpose | addr_state | dti | delinq_2yrs | inq_last_6mths | loan_status_Binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000 | 36 months | 10.65 | 162.87 | B | B2 | 10+ years | RENT | 24000.0 | Verified | credit_card | AZ | 27.65 | 0 | 1 | 0 |
| 1 | 2500 | 60 months | 15.27 | 59.83 | C | C4 | < 1 year | RENT | 30000.0 | Source Verified | car | GA | 1.0 | 0 | 5 | 1 |
| 2 | 2400 | 36 months | 15.96 | 84.33 | C | C5 | 10+ years | RENT | 12252.0 | Not Verified | small_business | IL | 8.72 | 0 | 2 | 0 |
| 3 | 10000 | 36 months | 13.49 | 339.31 | C | C1 | 10+ years | RENT | 49200.0 | Source Verified | other | CA | 20.0 | 0 | 1 | 0 |
| 4 | 3000 | 60 months | 12.69 | 67.79 | B | B5 | 1 year | RENT | 80000.0 | Source Verified | other | OR | 17.94 | 0 | 0 | 0 |


In [6]:
y = 'int_rate'  # response column: the interest rate of the loan

### Data preprocessing
The H2O library handles missing data well through its H2OFrame data structure. It also provides some preprocessing tools.

In [7]:
hf = h2o.H2OFrame(df)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Split the H2OFrame into training (80%) and test (20%) sets.

In [8]:
splits = hf.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
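
H2O's documentation describes `split_frame` as a probabilistic split rather than an exact one, so the resulting frames are only approximately 80% and 20% of the rows. The toy sketch below illustrates that idea with the standard library; it is not H2O's implementation, and the name `probabilistic_split` is hypothetical.

```python
import random


def probabilistic_split(n_rows, train_ratio=0.8, seed=1):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    train_idx, test_idx = [], []
    for i in range(n_rows):
        (train_idx if rng.random() < train_ratio else test_idx).append(i)
    return train_idx, test_idx


train_idx, test_idx = probabilistic_split(10_000)
# The sizes land close to 8000 / 2000 but are usually not exact.
print(len(train_idx), len(test_idx))
```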

### The following is all the code needed to find the best model:

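**The quality of the models H2OAutoML finds depends on how much time it is given to search.**

In [9]:
# Search for up to 600 seconds; x is omitted, so every column except y is a predictor
aml = H2OAutoML(max_runtime_secs=600, seed=1, project_name="H2O_finance")
# The leaderboard ranks models by their scores on the held-out test frame
aml.train(y=y, training_frame=train, leaderboard_frame=test)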

AutoML progress: |████████████████████████████████████████████████████████| 100%


#### H2O leaderboards
H2O also provides a leaderboard that lists all the model and hyperparameter combinations it has tried. For a regression problem such as this one, it is sorted by the 'mean_residual_deviance' metric by default.

In [10]:
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| StackedEnsemble_AllModels_AutoML_20201215_233431 | 0.0681843 | 0.261121 | 0.0681843 | 0.192558 | 0.020832 |
| StackedEnsemble_BestOfFamily_AutoML_20201215_233431 | 0.0682217 | 0.261193 | 0.0682217 | 0.193088 | 0.0208399 |
| XGBoost_grid__1_AutoML_20201215_233431_model_10 | 0.0691716 | 0.263005 | 0.0691716 | 0.193754 | 0.0209764 |
| GLM_1_AutoML_20201215_233431 | 0.0708695 | 0.266213 | 0.0708695 | 0.211654 | 0.0215047 |
| GBM_grid__1_AutoML_20201215_233431_model_8 | 0.0712286 | 0.266887 | 0.0712286 | 0.210957 | 0.0215014 |
| GBM_grid__1_AutoML_20201215_233431_model_10 | 0.0725182 | 0.269292 | 0.0725182 | 0.211208 | 0.0217611 |
| GBM_grid__1_AutoML_20201215_233431_model_1 | 0.0729342 | 0.270063 | 0.0729342 | 0.213446 | 0.0217789 |
| XGBoost_grid__1_AutoML_20201215_233431_model_11 | 0.0733434 | 0.27082 | 0.0733434 | 0.20191 | 0.0218869 |
| XGBoost_3_AutoML_20201215_233431 | 0.0738917 | 0.27183 | 0.0738917 | 0.20928 | 0.0217003 |
| XGBoost_grid__1_AutoML_20201215_233431_model_5 | 0.0739634 | 0.271962 | 0.0739634 | 0.210699 | 0.021624 |
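
The ranking depends on the metric used to sort. As a small illustration, the sketch below re-sorts the ten rows shown above by MAE instead of mean residual deviance, using plain Python on values copied from the output. The two stacked ensembles stay on top, but several of the lower-ranked models change places. Within H2O itself, the `sort_metric` argument of `H2OAutoML` is the usual way to choose the ranking metric; that option is not used in this notebook.

```python
# (model_id, mean_residual_deviance, mae), copied from the leaderboard output above
rows = [
    ("StackedEnsemble_AllModels_AutoML_20201215_233431", 0.0681843, 0.192558),
    ("StackedEnsemble_BestOfFamily_AutoML_20201215_233431", 0.0682217, 0.193088),
    ("XGBoost_grid__1_AutoML_20201215_233431_model_10", 0.0691716, 0.193754),
    ("GLM_1_AutoML_20201215_233431", 0.0708695, 0.211654),
    ("GBM_grid__1_AutoML_20201215_233431_model_8", 0.0712286, 0.210957),
    ("GBM_grid__1_AutoML_20201215_233431_model_10", 0.0725182, 0.211208),
    ("GBM_grid__1_AutoML_20201215_233431_model_1", 0.0729342, 0.213446),
    ("XGBoost_grid__1_AutoML_20201215_233431_model_11", 0.0733434, 0.20191),
    ("XGBoost_3_AutoML_20201215_233431", 0.0738917, 0.20928),
    ("XGBoost_grid__1_AutoML_20201215_233431_model_5", 0.0739634, 0.210699),
]

by_deviance = sorted(rows, key=lambda r: r[1])  # the default order shown above
by_mae = sorted(rows, key=lambda r: r[2])       # alternative ranking by MAE

for rank, (model_id, deviance, mae) in enumerate(by_mae, start=1):
    print(f"{rank:2d}  mae={mae:.6f}  {model_id}")
```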




**`leader` is the best model among all the models the pipeline tried, as ranked on the leaderboard.    
`model_performance()` reports the main metrics for a given model on the frame passed to it.**

In [11]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.06818429186731996
RMSE: 0.26112122063769533
MAE: 0.19255751807496377
RMSLE: 0.02083198943390966
R^2: 0.99631715869537
Mean Residual Deviance: 0.06818429186731996
Null degrees of freedom: 1978
Residual degrees of freedom: 1966
Null deviance: 36693.46803231744
Residual deviance: 134.9367136054262
AIC: 329.4729464178579
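
Several numbers in the report above are tied together, which gives a quick sanity check. The sketch below uses values copied from the output and assumes the usual Gaussian-regression relationships: RMSE is the square root of MSE, the residual deviance is the sum of squared errors (so dividing it by MSE recovers the number of test rows), and the null degrees of freedom are one less than the row count.

```python
import math

# Values copied from the model_performance(test) output above
mse = 0.06818429186731996
rmse = 0.26112122063769533
residual_deviance = 134.9367136054262
null_dof = 1978

# RMSE should be the square root of MSE
print(math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9))

# Residual deviance / MSE estimates the number of rows scored
n_rows = residual_deviance / mse
print(round(n_rows), null_dof + 1)
```

Both estimates of the row count agree, which suggests the test frame holds 1979 rows.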




We can pass an H2OFrame to the leader to get predictions. Here, only the first row of the test set is used.

In [12]:
pred = aml.leader.predict(test[0,:])
pred

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict
15.7205




In [13]:
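# H2O models live in the H2O cluster, so save the leader with H2O's own
# save function rather than pickle. `path` is a directory; the return value
# is the full path of the saved model file.
model_path = h2o.save_model(model=aml.leader, path='h2o_pipeline.model', force=True)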

### MAPE (Mean Absolute Percentage Error)

In [14]:
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
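
Written out, the function above computes the following quantity, where $y_i$ is the actual value, $\hat{y}_i$ the prediction, and $n$ the number of rows. The symbols are introduced here for illustration. The ratio is undefined when an actual value is zero, which is not expected for interest rates.

$$
\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$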

In [15]:
y_test = test[y]
y_train = train[y]

In [16]:
y_test_vals = y_test.as_data_frame().values.ravel()
y_test_pred_vals = aml.leader.predict(test).as_data_frame().values.ravel()

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [17]:
y_train_vals = y_train.as_data_frame().values.ravel()
y_train_pred_vals = aml.leader.predict(train).as_data_frame().values.ravel()

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [18]:
mape_test = mean_absolute_percentage_error(y_test_vals,y_test_pred_vals )
mape_train = mean_absolute_percentage_error(y_train_vals,y_train_pred_vals )

In [19]:
print("Training-set MAPE: "+str(mape_train))
print("Test-set MAPE: "+str(mape_test))

Training-set MAPE: 0.8457119597240534
Test-set MAPE: 1.6558838799077962


### Actual values

In [20]:
y_test_vals[0:5]

array([15.96, 12.69, 15.27,  7.9 , 12.69])

### Predicted values

In [21]:
y_test_pred_vals[0:5]

array([15.72052164, 12.3919578 , 15.07716178,  7.86833589, 12.70272403])
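
As a worked example of the MAPE calculation, the sketch below applies the same formula to just the five actual and predicted values shown above, using plain Python instead of NumPy. The result describes only these five rows, so it is not expected to match the test-set MAPE reported earlier.

```python
# First five actual and predicted interest rates, copied from the outputs above
actual = [15.96, 12.69, 15.27, 7.9, 12.69]
predicted = [15.72052164, 12.3919578, 15.07716178, 7.86833589, 12.70272403]

# Absolute percentage error for each row, then the mean over the rows
ape = [abs((a - p) / a) * 100 for a, p in zip(actual, predicted)]
mape = sum(ape) / len(ape)

for a, p, e in zip(actual, predicted, ape):
    print(f"actual={a:6.2f}  predicted={p:8.4f}  error={e:.2f}%")
print(f"MAPE over these five rows: {mape:.2f}%")
```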