
In [1]:
# Module: Data Science in Finance, AutoML 
# Version 1.0
# Topic : AutoML - auto-sklearn
# Example source: https://www.kaggle.com/wendykan/lending-club-loan-data
#####################################################################
# For support or questions, contact QuantUniversity at
# info@qusandbox.com
# Copyright 2020 QuantUniversity LLC.
#####################################################################

# AutoML with auto-sklearn

AutoML is the process of automating an end-to-end machine learning pipeline. [auto-sklearn](https://automl.github.io/auto-sklearn/stable/index.html) uses Bayesian optimization, meta-learning, and ensemble construction to optimise these pipelines by selecting the best model and its hyperparameters.

This notebook explains the basic workflow of an AutoML pipeline with auto-sklearn.

### Imports

In [2]:
!sudo apt-get install build-essential swig
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
!pip install auto-sklearn
!pip install scikit-learn==0.23.0

Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.4ubuntu1).
swig is already the newest version (3.0.12-1).
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   205  100   205    0     0    791      0 --:--:-- --:--:-- --:--:--   794


In [3]:
# for numerical analysis and data processing
import numpy as np
import pandas as pd

#AutoML
import sklearn.metrics
import autosklearn.regression

import requests
from io import StringIO

import warnings
warnings.filterwarnings('ignore')

### Dataset

The data set contains LendingClub lending data for a subset of borrowers from August 2011 to December 2011. Feature descriptions are provided alongside the data. Not all features in the original data file are needed for making predictions, and some are redundant; the provided file has already been cleaned and contains only the relevant features. There are two types of features: numerical and categorical.

Read the input data from the CSV file.

In [4]:
orig_url_data='https://drive.google.com//file//d//1yG-JxC1Br3c8u3cfmKQWC9pgz6Pqggw5//view?usp=sharing'
file_id = orig_url_data.split('//')[-2]
dwn_url='https://drive.google.com//uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
df = pd.read_csv(csv_raw)

orig_url_description='https://drive.google.com//file//d//1HFd4gKbknC28rHTWysec48NqfB6g3ZHx//view?usp=sharing'
file_id = orig_url_description.split('//')[-2]
dwn_url='https://drive.google.com//uc?export=download&id=' + file_id
df_description = pd.read_excel(dwn_url) # read_excel fetches the spreadsheet from the URL itself


del df['issue_d'] # remove the issue date, as it won't affect the prediction (redundant feature)

print (df_description.head())

               LoanStatNew                                        Description
0               addr_state  The state provided by the borrower in the loan...
1               annual_inc  The self-reported annual income provided by th...
2         annual_inc_joint  The combined self-reported annual income provi...
3         application_type  Indicates whether the loan is an individual ap...
4  collection_recovery_fee                     post charge off collection fee


In [5]:
df.info()
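# one entry per input feature (every column except int_rate), in column order; annual_inc is numerical
feature_types = ['numerical']+['categorical']+(['numerical']*1)+(['categorical']*4)+['numerical']+(['categorical']*3)+(['numerical']*4)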

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   loan_amnt            9999 non-null   int64  
 1   term                 9999 non-null   object 
 2   int_rate             9999 non-null   float64
 3   installment          9999 non-null   float64
 4   grade                9999 non-null   object 
 5   sub_grade            9999 non-null   object 
 6   emp_length           9644 non-null   object 
 7   home_ownership       9999 non-null   object 
 8   annual_inc           9999 non-null   float64
 9   verification_status  9999 non-null   object 
 10  purpose              9999 non-null   object 
 11  addr_state           9999 non-null   object 
 12  dti                  9999 non-null   float64
 13  delinq_2yrs          9999 non-null   int64  
 14  inq_last_6mths       9999 non-null   int64  
 15  loan_status_Binary   9999 non-null   i

In [6]:
numeric_columns = df.select_dtypes(include=['float64','int64']).columns
categorical_columns = df.select_dtypes(include=['object']).columns
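
The hand-written `feature_types` list above has to stay in step with the column order. A minimal sketch of deriving the same kind of list from the dtypes instead, using a small made-up frame (`toy`, `features` and `feature_types_toy` are illustrative names, and the values are not taken from the data set):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up rows that only mimic the shape of the lending data.
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "term": ["36 months", "60 months", "36 months"],
    "int_rate": [10.65, 15.27, 7.90],
    "annual_inc": [24000.0, 30000.0, 65000.0],
    "grade": ["B", "C", "A"],
})

# The target column is not an input feature, so it is dropped first.
features = toy.drop(columns=["int_rate"])

# Numeric dtypes count as numerical; everything else is treated as categorical.
feature_types_toy = [
    "numerical" if is_numeric_dtype(features[col]) else "categorical"
    for col in features.columns
]
print(feature_types_toy)
```

On the real data this would have to run before the categorical columns are converted to integer codes, because every column is numeric after that step.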

In [7]:
for col in categorical_columns:
    df[col] = df[col].astype('category')

#### Dictionary for categorical features.

In [8]:
categories={}
for cat in categorical_columns:
    categories[cat] = df[cat].cat.categories.tolist()

In [9]:
p_categories = df['purpose'].cat.categories.tolist()
s_categories = df['addr_state'].cat.categories.tolist()
df[categorical_columns] = df[categorical_columns].apply(lambda x: x.cat.codes)
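
Two properties of `cat.codes` are worth keeping in mind: the codes index into the stored category list, and pandas assigns the code -1 to missing values. The second point would explain why `emp_length` goes from 9644 to 9999 non-null entries once it is encoded. A small sketch on a made-up series (`emp`, `cats`, `codes` and `decoded` are illustrative names):

```python
import pandas as pd

# Made-up employment lengths with one missing entry.
emp = pd.Series(["10+ years", None, "< 1 year", "10+ years"], name="emp_length")
emp_cat = emp.astype("category")

# Same bookkeeping as the categories dictionary above.
cats = emp_cat.cat.categories.tolist()

# Integer codes; the missing entry becomes -1 rather than NaN.
codes = emp_cat.cat.codes
print(cats)
print(codes.tolist())

# The stored category list turns codes back into labels (-1 maps back to NaN).
decoded = pd.Categorical.from_codes(codes, categories=cats)
print(list(decoded))
```

Keeping the category lists around is what makes it possible to map model inputs back to readable labels later.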

In [10]:
df.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'inq_last_6mths',
       'loan_status_Binary'],
      dtype='object')

Storing interest rate statistics

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   loan_amnt            9999 non-null   int64  
 1   term                 9999 non-null   int8   
 2   int_rate             9999 non-null   float64
 3   installment          9999 non-null   float64
 4   grade                9999 non-null   int8   
 5   sub_grade            9999 non-null   int8   
 6   emp_length           9999 non-null   int8   
 7   home_ownership       9999 non-null   int8   
 8   annual_inc           9999 non-null   float64
 9   verification_status  9999 non-null   int8   
 10  purpose              9999 non-null   int8   
 11  addr_state           9999 non-null   int8   
 12  dti                  9999 non-null   float64
 13  delinq_2yrs          9999 non-null   int64  
 14  inq_last_6mths       9999 non-null   int64  
 15  loan_status_Binary   9999 non-null   i

In [12]:
min_rate= df['int_rate'].min()
max_rate= df['int_rate'].max()
print(min_rate, max_rate, max_rate- min_rate)

5.42 24.11 18.689999999999998


In [13]:
df_max = df.max()
df_min = df.min()

## Preparing the dataset 

The data is split into training and testing sets. x holds the input features, whereas y holds the output, i.e. the interest rate. As a rule of thumb, we use 80% of the data for training and 20% for testing or validation. The split below is sequential: the first 80% of the rows form the training set and the remaining rows form the test set.

In [14]:
y = df.iloc[:,df.columns.isin(["int_rate"])]
x = df.loc[:, ~df.columns.isin(["int_rate"])]

total_samples=len(df)
split = 0.8

x_train = x[0:int(total_samples*split)]
x_test = x[int(total_samples*split):total_samples]
y_train = y[0:int(total_samples*split)]
y_test = y[int(total_samples*split):total_samples]
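
The split above keeps the rows in file order. If the file happens to be sorted (for example by date), a shuffled split may give a more representative test set. A minimal sketch of that alternative on a made-up frame; `shuffled_split`, its parameters and the seed are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def shuffled_split(frame, target, train_fraction=0.8, seed=0):
    """Split frame into train/test parts after a seeded random shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))          # random row order
    cut = int(len(frame) * train_fraction)       # same cut rule as above
    train_idx, test_idx = order[:cut], order[cut:]
    x_all = frame.drop(columns=[target])
    y_all = frame[[target]]
    return (x_all.iloc[train_idx], x_all.iloc[test_idx],
            y_all.iloc[train_idx], y_all.iloc[test_idx])

# Made-up data: ten rows, one feature, one target.
toy = pd.DataFrame({"loan_amnt": np.arange(10) * 1000,
                    "int_rate": np.linspace(5.0, 14.0, 10)})
x_tr, x_te, y_tr, y_te = shuffled_split(toy, "int_rate")
print(len(x_tr), len(x_te))
```

Fixing the seed keeps the split reproducible between runs.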

## AutoML

### The following is all the code needed to find the best model:

In [16]:
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=600, # in seconds
    per_run_time_limit=60, # in seconds
)
automl.fit(x_train, y_train, dataset_name='finance')#,
          #  feat_type=feature_types)

AutoSklearnRegressor(load_models=None, per_run_time_limit=60,
                     time_left_for_this_task=600)

### AutoML training details

#### The models in the final ensemble

In [17]:
automl.show_models()

"[(0.460000, SimpleRegressionPipeline({'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'most_frequent', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:__choice__': 'no_preprocessing', 'regressor:__choice__': 'random_forest', 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.010000000000000004, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 1818, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'normal', 'regressor:random_forest:bootstrap': 'True', 'regressor:random_forest:criterion': 'mse', 'regressor:random_forest:max_depth': 'None', 'regressor:random_forest:max_features': 

#### A summary of all the algorithm runs

In [18]:
print(automl.sprint_statistics())

auto-sklearn results:
  Dataset name: finance
  Metric: r2
  Best validation score: 0.999018
  Number of target algorithm runs: 31
  Number of successful target algorithm runs: 26
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 5
  Number of target algorithms that exceeded the memory limit: 0



### Using the best pipeline to make predictions

In [19]:
predictions = automl.predict(x_test)
predictions_train = automl.predict(x_train)

#### Best model performance

In [20]:
print("MAE score:", sklearn.metrics.mean_absolute_error(y_test, predictions))
print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))

MAE score: 0.5524993579149245
R2 score: 0.9793720543216555
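
For reference, both scores can be written out directly from their definitions. MAE is expressed in the units of the target (here presumably percentage points of interest), while R2 is unitless. A short sketch on made-up numbers; the helper names `mae` and `r2` are illustrative and the inputs are not taken from the data set:

```python
import numpy as np

def _as_1d(values):
    # Flatten to 1-D so an (n, 1) column and an (n,) vector do not broadcast to (n, n).
    return np.asarray(values, dtype=float).ravel()

def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals."""
    y_true, y_pred = _as_1d(y_true), _as_1d(y_pred)
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true, y_pred = _as_1d(y_true), _as_1d(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_toy = [10.0, 12.0, 14.0, 16.0]
y_pred_toy = [11.0, 12.0, 13.0, 16.0]
print("MAE:", mae(y_true_toy, y_pred_toy))
print("R2:", r2(y_true_toy, y_pred_toy))
```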


### Export the best model

In [21]:
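import pickle
# a context manager closes the file handle once the model is written
with open('automl.model','wb') as model_file:
    pickle.dump(automl, model_file)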
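
Loading the model back is the mirror image of the export. The sketch below shows the round trip with a stand-in object, since the fitted `automl` instance only exists after the training cell has run; `stand_in_model`, the temporary directory and `restored` are assumptions for illustration:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted automl object (hypothetical placeholder).
stand_in_model = {"name": "stand-in for the fitted model", "version": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "automl.model")

    # Export, as in the cell above.
    with open(model_path, "wb") as model_file:
        pickle.dump(stand_in_model, model_file)

    # Import: open in binary read mode and unpickle.
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == stand_in_model)
```

Unpickling a real model requires the same libraries to be installed in the loading environment, and pickle files should only be loaded from trusted sources.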

### MAPE (Mean Absolute Percentage Error)

In [22]:
# y_true and y_pred must be 1-D arrays of equal length; y_true must not contain zeros
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [23]:
mape_test = mean_absolute_percentage_error(y_test.values.ravel(), predictions)
mape_train = mean_absolute_percentage_error(y_train.values.ravel(), predictions_train)

In [24]:
print("Training-set MAPE: "+str(mape_train))
print("Test-set MAPE: "+str(mape_test))

Training-set MAPE: 0.4204883227702125
Test-set MAPE: 5.02911371792291


In [25]:
y_test.values[0:5].ravel()

array([13.49, 11.49, 13.99, 10.59,  7.49])

In [26]:
predictions[0:5]

array([14.22410762, 12.3641277 , 14.62936765, 10.63200432,  7.85569295])
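
The five rows printed above can be checked by hand. A short sketch that recomputes the absolute and percentage errors for just these rows (the two arrays are copied from the outputs above; `abs_err` and `pct_err` are illustrative names):

```python
import numpy as np

# First five true interest rates and predictions, copied from the outputs above.
y_true_head = np.array([13.49, 11.49, 13.99, 10.59, 7.49])
y_pred_head = np.array([14.22410762, 12.3641277, 14.62936765, 10.63200432, 7.85569295])

# Per-row absolute error and absolute percentage error.
abs_err = np.abs(y_true_head - y_pred_head)
pct_err = abs_err / y_true_head * 100

print("absolute errors:  ", abs_err.round(3))
print("percentage errors:", pct_err.round(2))
print("MAE over these rows: ", abs_err.mean())
print("MAPE over these rows:", pct_err.mean())
```

All five predictions sit above the true rate, but five rows are too few to say whether that holds across the whole test set.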