# Welcome to Paul Kang's Data Science practice

Let's import the basic libraries (the usual stack I use for non-big-data ML projects) and make sure the data I am dealing with has no missing or non-numeric values (categoricals excepted). This has become a habit because skipping it has bitten me many times when doing advanced analytics work on the job...

In [None]:
# import necessary libraries: this will get updated as I go along
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import shap
from sklearn import metrics, model_selection, preprocessing
import warnings 
import xgboost
from statsmodels.graphics import tsaplots as tsa
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
train_csv_route = '../input/optiver-realized-volatility-prediction/train.csv'
test_csv_route = '../input/optiver-realized-volatility-prediction/test.csv'
book_train = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet')
book_test = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_test.parquet')
trade_train = pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_train.parquet')
trade_test = pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_test.parquet')

train = pd.read_csv(train_csv_route)
test = pd.read_csv(test_csv_route)
data = {
    'book_train':book_train,
    'book_test':book_test,
    'trade_train':trade_train,
    'trade_test':trade_test,
    'train':train,
    'test':test
}

Let's look at what the data I am going to deal with looks like...

In [None]:
def set_check(df,df_name):
    print("-"*40,"For ",df_name," Set","-"*40)
    print(f"column names: {list(df.columns)}\n")
    print(f"data types: {list(df.dtypes)}\n")
    print(f"shape of data that we are dealing with in this dataset: {df.shape}\n")
    print(f"Null data for each columns?: \n{df.isnull().sum()}\n")

for df_name,df in data.items():
    set_check(df,df_name)

In [None]:
print(sorted(list(train['stock_id'].unique())), "\n unique values: ", len(list(train['stock_id'].unique())))
print(sorted(list(trade_train['stock_id'].unique())), "\n unique values: ", len(list(trade_train['stock_id'].unique())))

# 0. Why am I doing this as a petrochemical process engineer?

In most cases, when data scientists (DS) try to solve a business problem, they need at least a little understanding of the domain they are dealing with (domain knowledge) to know what data they need and what each feature in the "prepared" data means. As someone who comes from petroleum/oil and is studying data science, I can understand that it can be quite difficult for professional data scientists to actually "perform their magic" without understanding the underlying meaning of each attribute and phenomenon.

Every domain has its own expertise; that's why we have so many different fields of study: stocks, engineering, education, medicine, pharmacy, finance, fitness, and so on. It's endless. In each of these fields, experts have built up knowledge over hundreds of years until it became a castle. Domain experts are the ones who have actually studied those mountains, and the people who have real experience applying that knowledge to the real world are called engineers. Along the way, engineers often dive deep into data analysis and do a data analyst's job: finding answers and asking the right questions, which lead to further answers and another chain of questions, whether that involves interpolating to describe a phenomenon or extrapolating to predict the future with known techniques such as multivariate analysis, you name it.

To me, data science is a field every engineer should learn if they want to become a really good engineer. Unless someone wants to go into research and design novel ML algorithms, it is a skill that helps engineers get better at applying their knowledge to their subject matter.

Let me hone my data skills on Kaggle, using datasets that do not belong to my subject matter.

# 1. Understanding the Data

Now, I am dealing with 6 datasets, split into train/test: book_train, trade_train, train, book_test, trade_test, test. The train and test sets are CSV files, and the book and trade sets are parquet files.

The test sets have only 3 rows; they carry the same attributes as the booking and trading sets, and our goal is to predict the realized volatility for the test sets.

* For booking, we have: 

>     time_id              ID of the time bucket in which the data was recorded. just think of this as a time, represented by a plain integer instead of the widely accepted ISO 8601 date and time
>     seconds_in_bucket    yeah. these guys reinvented the wheel on how to record the time. this is just the number of seconds within the time_id bucket.
>     bid_price1           obvious. read the introductory notebook and you will understand what this means - it is the best bid price recorded in the order book
>     ask_price1           it is the best ask price recorded in the order book
>     bid_price2           second best bid (buy)
>     ask_price2           second best ask (sell)
>     bid_size1            volume at the best bid (how many shares are wanted at the best buying price 1)
>     ask_size1            volume at the best ask (how many shares are offered at the best selling price 1)
>     bid_size2            volume at the second best bid (how many shares are wanted at the second best buying price 2)
>     ask_size2            volume at the second best ask (how many shares are offered at the second best selling price 2)
>     stock_id             think of this as the company, like TSLA for Tesla on NASDAQ. here it is just an integer ID for the stock

* For trading we have: 

>     time_id              same as booking set
>     seconds_in_bucket    same as booking set
>     price                price of a stock when the trade finally got executed. 
>     size                 total number of shares traded at that time, at the price it was executed
>     order_count          The number of unique trade orders taking place.
>     stock_id             company name, same as booking set

Okay, we have lots of data, and kindly enough Optiver gave us great hints: bid/ask spread, WAP, log-return, and calculated realized volatility. Before going on to calculate them, let's study the relationships here.

The realized volatility seems to be calculated for each stock_id and each time_id using the data inside that bucket. For example, for company 0 at time 5 (stock_id=0, time_id=5), a certain number of trades happened, and the realized volatility is calculated from the trade data within that bucket. So this is simple: keyed by stock_id and time_id, the data is in a "long" format.
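
To make the hints concrete, here is a minimal sketch of the four quantities for a single (stock_id, time_id) bucket. The snapshot values are made up for illustration; the formulas follow the helper functions used later in this notebook (spread as ask/bid - 1, WAP with each price weighted by the opposite side's size, log-return as the difference of log WAP, and realized volatility as the square root of the sum of squared log-returns).

```python
import numpy as np

# Toy order-book snapshots for one (stock_id, time_id) bucket; the numbers are made up.
bid_price1 = np.array([1.000, 1.001, 1.000, 1.002])
ask_price1 = np.array([1.002, 1.003, 1.002, 1.004])
bid_size1 = np.array([100, 80, 120, 90])
ask_size1 = np.array([50, 70, 60, 110])

# Bid/ask spread as a ratio: best ask over best bid, minus 1.
spread = ask_price1 / bid_price1 - 1

# WAP: each price is weighted by the size on the opposite side of the book.
wap = (bid_price1 * ask_size1 + ask_price1 * bid_size1) / (bid_size1 + ask_size1)

# Log-return between consecutive snapshots (one fewer value than there are snapshots).
log_return = np.diff(np.log(wap))

# Realized volatility of the bucket: square root of the sum of squared log-returns.
realized_vol = np.sqrt(np.sum(log_return ** 2))

print(wap.round(5), realized_vol)
```

Because the WAP is a weighted average of the best bid and the best ask, it always lands between the two prices, which is a handy sanity check on real data.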

For each time_id, how can I then represent the "unique" order counts, price, size, and the first/second bid/ask volumes and prices? Using WAP? Each row will have a different WAP (except for the first row), so we need some creativity here.

I have just checked that every company stock_id has the same time_id values, and confirmed they are consistent. This means the row key can be just the time_id, because I also want to study the interaction among the companies and how the parameters of other companies affect the realized volatility of company '0'.
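
One way such a consistency check could look, sketched on a toy frame. The helper name `same_time_ids` and the toy values are hypothetical; on the real data the same function could be called on `train`.

```python
import pandas as pd

# Toy stand-in for train.csv: two stocks, made-up values.
toy = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1, 1],
    "time_id":  [5, 11, 16, 5, 11, 16],
    "target":   [0.004, 0.001, 0.002, 0.003, 0.002, 0.005],
})

def same_time_ids(df):
    """Return True if every stock_id has exactly the same set of time_id values."""
    ids_per_stock = {
        stock: frozenset(group["time_id"]) for stock, group in df.groupby("stock_id")
    }
    # If all stocks share one set of time_ids, only one distinct set remains.
    return len(set(ids_per_stock.values())) == 1

print(same_time_ids(toy))                 # all stocks share the same time_ids
print(same_time_ids(toy.drop(index=5)))   # stock 1 is now missing time_id 16
```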

For each company, we have 9 basic traits in the booking data and 4 in the trading data (excluding company id and time id), across 112 different companies. If we pivot this, we get:

9 traits × 112 companies = 1008 basic traits from the booking data, and 4 traits × 112 companies = 448 basic traits from the trading data. What am I going to do with the other companies' stock data if our target is only to predict for stock_id=0? Well, I want to see how the other companies interact with and affect company 0. But in our test data we only have the booking and trading information for company '0', which means that if I want to predict with the other companies' interactions involved, Optiver needs to provide the trade and booking data for the other stock ids as well.
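
As a rough sketch of what that pivot could look like, here is a toy long table reshaped so that each time_id becomes one row and each (feature, stock) pair becomes one column. The feature names `wap_mean` and `size_sum` and the `_s0`/`_s1` suffixes are made up for illustration.

```python
import pandas as pd

# Toy per-(time_id, stock_id) features; names and values are made up.
long_df = pd.DataFrame({
    "time_id":  [5, 5, 11, 11],
    "stock_id": [0, 1, 0, 1],
    "wap_mean": [1.001, 0.999, 1.002, 1.000],
    "size_sum": [300, 150, 220, 410],
})

# One row per time_id, one column per (feature, stock_id) pair.
wide = long_df.pivot(index="time_id", columns="stock_id", values=["wap_mean", "size_sum"])

# Flatten the two-level column index into names such as "wap_mean_s0".
wide.columns = [f"{feature}_s{stock}" for feature, stock in wide.columns]

print(wide.shape)  # 2 time_ids by (2 features * 2 stocks) columns
print(list(wide.columns))
```

With the real data the same reshape would give 9 × 112 columns for the booking features, which is where the 1008 figure comes from.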

for feature engineering:

    for trading sets:

        sum the unique order_count, 
        sum the size, 
        summed_size/summed_order_count, 
        weighted average of seconds_in_bucket according to the size, 
        mean the price
        Log-return of prices
        highest_price-lowest_price

    for booking sets:
    
        mean of:
            BAS_1
            BAS_2
            WAP_1
            WAP_2
            Log_return_1
            Log_return_2
            Calculated_volatility_1 (sigma_1)
            Calculated_volatility_2 (sigma_2)
            bid_price1/2
            ask_price1/2
            seconds_in_bucket
        sum of:
            bid_size1/2
            ask_size1/2
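
The trading part of the plan above can be sketched with a single `groupby` on toy data. The trade values and the output column names are made up; the size-weighted time is computed as sum(seconds × size) / sum(size) per time_id.

```python
import pandas as pd

# Toy trades for two time_id buckets; values are made up.
trades = pd.DataFrame({
    "time_id":           [5, 5, 5, 11, 11],
    "seconds_in_bucket": [10, 200, 400, 30, 500],
    "price":             [1.000, 1.002, 0.999, 1.001, 1.003],
    "size":              [100, 50, 150, 80, 20],
    "order_count":       [3, 1, 4, 2, 1],
})

# Pre-multiply so the size-weighted time can be obtained from two plain sums.
trades["seconds_x_size"] = trades["seconds_in_bucket"] * trades["size"]

features = trades.groupby("time_id").agg(
    order_count_sum=("order_count", "sum"),
    size_sum=("size", "sum"),
    seconds_x_size_sum=("seconds_x_size", "sum"),
    price_mean=("price", "mean"),
    price_max=("price", "max"),
    price_min=("price", "min"),
)

# Derived features from the aggregated columns.
features["size_per_order"] = features["size_sum"] / features["order_count_sum"]
features["weighted_seconds"] = features["seconds_x_size_sum"] / features["size_sum"]
features["price_range"] = features["price_max"] - features["price_min"]

print(features[["size_per_order", "weighted_seconds", "price_range"]])
```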
        
# 2.0 Data wrangling / transforming / feature engineering

In [None]:
book_train = book_train[book_train['stock_id']==0]
trade_train = trade_train[trade_train['stock_id']==0]
train = train[train['stock_id']==0]
print(f"booking_training set of stock id 0 is: {book_train.shape}")
print(f"trading_training set of stock id 0 is: {trade_train.shape}")
print(f"actual recorded volatility training set of stock id 0 is: {train.shape}")

In [None]:
# configuring class that groups the functions
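class Optiver_feature_engineered:
    
    """
    Collection of feature-engineering helpers for the order book and trade data
    (docstring work in progress).
    """
    # TODO: complete the docstring for this class
    
    def __init__(self,df=None,df_name=None):
        self.df = df
        self.df_name = df_name
        
    def BAS(self,ask_price,bid_price):
        # bid/ask spread expressed as a ratio: ask/bid - 1
        return [ask_p/bid_p - 1 for ask_p,bid_p in zip(ask_price,bid_price)]

    def WAP(self,df):
        # expects the columns ordered as [bid_price, ask_size, ask_price, bid_size],
        # so that each price is weighted by the size on the opposite side of the book
        wap = (df[df.columns[0]] * df[df.columns[1]] + df[df.columns[2]]*df[df.columns[3]])/(df[df.columns[1]]+df[df.columns[3]])
        return wap

    def log_return(self,list_stock_prices):
        # expects a pandas Series; the first element of the result is NaN
        return np.log(list_stock_prices).diff() 

    def realized_volatility(self,series_log_return):
        # square root of the sum of squared log returns
        return np.sqrt(np.sum(series_log_return**2))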


In [None]:
# preprocessing for booking dataset:
# BAS
# WAP
# Log return
# Calculated volatility


# the helper methods do not use the instance attributes, so no dataframe is passed in
fe = Optiver_feature_engineered()
def preprocessings_book(df):
    
    df['seconds_in_bucket'] = df['seconds_in_bucket'] + 1 # cuz 0 seconds will mess up the data internally
    df['seconds_bids'] = df['seconds_in_bucket']*(df['bid_size1']+df['bid_size2'])
    df['seconds_asks'] = df['seconds_in_bucket']*(df['ask_size1']+df['ask_size2'])
    df['BAS1'] = fe.BAS(df['ask_price1'],df['bid_price1'])
    df['BAS2'] = fe.BAS(df['ask_price2'],df['bid_price2'])
    df['WAP1'] = fe.WAP(df[['bid_price1','ask_size1','ask_price1','bid_size1']])
    df['WAP2'] = fe.WAP(df[['bid_price2','ask_size2','ask_price2','bid_size2']])
    # transform keeps the original row index, so the result lines up with df
    df['logr1'] = df.groupby(['time_id'])['WAP1'].transform(fe.log_return)
    df['logr2'] = df.groupby(['time_id'])['WAP2'].transform(fe.log_return)
    apply_functions = {"seconds_in_bucket":"mean",
                       "bid_price1":"mean",
                       "bid_price2":"mean",
                       "ask_price1":"mean",
                       "ask_price2":"mean",
                       "BAS1":"mean",
                       "BAS2":"mean",
                       "WAP1":"mean", # null values to be ignored when taking mean
                       "WAP2":"mean", # null values to be ignored when taking mean
                       "logr1":"mean",
                       "logr2":"mean",
                       "seconds_bids":"sum",
                       "seconds_asks":"sum",
                       'bid_size1':"sum",
                       'bid_size2':"sum",
                       'ask_size1':"sum",
                       'ask_size2':"sum"
                      }
    df_feature = df.groupby(['time_id']).agg(apply_functions)
    df_feature['vol_1'] = df.groupby(['time_id'])['logr1'].apply(fe.realized_volatility)
    df_feature['vol_2'] = df.groupby(['time_id'])['logr2'].apply(fe.realized_volatility)
    df_feature['seconds_bids'] = df_feature['seconds_bids']/(df_feature['bid_size1'] + df_feature['bid_size2'])
    df_feature['seconds_asks'] = df_feature['seconds_asks']/(df_feature['ask_size1'] + df_feature['ask_size2'])
    df_feature.reset_index(inplace=True)
    df_feature.drop(columns='seconds_in_bucket',axis=1,inplace=True)
    return df_feature

# Preprocessing for trading dataset:

def preprocessings_trade(df):

    df['seconds_in_bucket'] = df['seconds_in_bucket'] + 1
    df['seconds_size'] = df['seconds_in_bucket']*df['size']
    # transform keeps the original row index, so the result lines up with df
    df['logr_p'] = df.groupby(['time_id'])['price'].transform(fe.log_return)
    apply_func = {
        'order_count':'sum',
        'seconds_in_bucket':'mean',
        'size':'sum',
        'seconds_size':'sum',
        'price':'mean',
        'logr_p':'mean'
    }

    df_feature = df.groupby(['time_id']).agg(apply_func)
    df_feature['spread'] = df.groupby(['time_id'])['price'].max() - df.groupby(['time_id'])['price'].min()
    df_feature['vol_p'] = df.groupby(['time_id'])['logr_p'].apply(fe.realized_volatility)
    df_feature['seconds_size'] = df_feature['seconds_size']/df_feature['size']
    df_feature.reset_index(inplace=True)
    df_feature.drop(columns='seconds_in_bucket',axis=1,inplace=True)
    return df_feature

book_train_feature = preprocessings_book(book_train)
book_test_feature = preprocessings_book(book_test)
trade_train_feature = preprocessings_trade(trade_train)
trade_test_feature = preprocessings_trade(trade_test)



In [None]:
dataset = pd.merge(book_train_feature,trade_train_feature,how='left',on=['time_id'])
df = pd.merge(dataset,train,how='right',on=['time_id'])
df.drop(columns='stock_id',inplace=True)
df.dropna(inplace=True)
dataset_val = pd.merge(book_test_feature,trade_test_feature,how='left',on=['time_id'])
df_val = pd.merge(dataset_val,test,how='right',on=['time_id'])
df_val.drop(columns=['stock_id','row_id'],inplace=True)

In [None]:
# book_train_feature.describe(percentiles = [i/10 for i in range(1,10)])

from IPython.display import display

display(df.head())
df_val.head()

# 3.0 Exploratory Data Analysis (EDA) and visualizations

For me, the easiest way to take a head start is to:

1. study the correlation heatmaps and sort the predictors by Pearson correlation in descending order,

2. look at the scatter matrix to see how the data points are distributed in x and y,

3. check each predictor and the response variable for seasonality or cyclicity (ACF, PACF) to detect autocorrelation (and find out whether it is suitable to apply a multivariate RNN/LSTM; even though this is not a sequential dataset, we can make it look like one),

4. draw boxplots for the predictors and the target,

5. draw a distribution plot/histogram for each predictor and the target to see skewness and kurtosis.

# 3.1 Correlation heatmaps / Scatter Matrix, listings of correlations

In [None]:
import plotly.figure_factory as ff

fig = ff.create_annotated_heatmap(df.corr().apply(lambda x: round(x,2)).to_numpy(),x = list(df.columns),y=list(df.columns))
fig.update_layout(
    margin=dict(l=40, r=40, t=40, b=40),
    autosize=False,
    width=1500,
    height=1500,
    paper_bgcolor="LightSteelBlue",
)

fig.show()

In [None]:
fig = px.scatter_matrix(df)
fig.update_layout(
    margin=dict(l=40, r=40, t=40, b=40),
    autosize=False,
    width=1500,
    height=1500,
    paper_bgcolor="LightSteelBlue",
)
fig.update_traces(marker=dict(size=2))

fig.show()

Sure... so what correlates best with the target realized volatility? Let's shortlist them.
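
As a compact illustration of the idea, a ranking against one column can be read straight off the correlation matrix. The frame below is synthetic (random numbers, with column names borrowed from this notebook), so only the mechanics carry over to the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the merged feature table; "target" depends mostly on "vol_1".
vol_1 = rng.normal(size=n)
toy = pd.DataFrame({
    "vol_1": vol_1,
    "size": 0.5 * vol_1 + rng.normal(size=n),
    "price": rng.normal(size=n),
    "target": vol_1 + 0.1 * rng.normal(size=n),
})

# Take the "target" column of the correlation matrix, drop the self-correlation,
# and sort from most positive to most negative.
ranking = toy.corr()["target"].drop("target").sort_values(ascending=False)
print(ranking)
```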

In [None]:
def correlation_list(df,var_name):
    # unwind the matrix to 1d array with the var1,var2 multiindex, descending order
    corr_list = df.corr().unstack().sort_values(kind='quicksort').iloc[::-1]
    # reset the indexes
    corr_list = corr_list.reset_index()
    # rename the indexes to var_1 and var_2
    corr_list.rename(columns={'level_0':'var_1','level_1':'var_2',0:'correlation'},inplace=True)
    # take only the half of the triangle from the correlation matrix, since it is just a duplicate.
    corr_list.drop_duplicates(subset=['correlation'],inplace=True)
    corr_list.reset_index(drop=True,inplace=True)
    # remove the middle diagonal numbers (correlation = 1.0)
    drop_index = []
    for i in range(len(corr_list)):
        if corr_list.iloc[i,0] == corr_list.iloc[i,1]:
            drop_index.append(i)
    corr_list.drop(labels=drop_index,axis=0,inplace=True)
    # check if there is var1 and var 2 overlaps
    count = (corr_list['var_1'] == corr_list['var_2'])
    print(f"duplicated number: {count.sum()}")
    if count.sum() != 0:
        raise Exception('Check your function code. something is not working with ur dataset')
    else:
        corr_var1 = corr_list[corr_list['var_1']==var_name]
        corr_var2 = corr_list[corr_list['var_2']==var_name]
        # swap the labels so that var_name always ends up in var_1
        corr_var2 = corr_var2.rename(columns={'var_1':'var_2','var_2':'var_1'})
        # pd.concat aligns on column names; DataFrame.append was removed in pandas 2.0
        corr_var = pd.concat([corr_var1,corr_var2],ignore_index=True)
        corr_var.sort_values(by=['correlation'],inplace=True,ascending=False)
        corr_var.reset_index(drop=True,inplace=True)
    return corr_var

In [None]:
corr_var = correlation_list(df,'target')
corr_var

Visualize it:

In [None]:
px.bar(corr_var,x='var_2',y='correlation',title='correlation with the target')

Okay, so the target realized volatility is heavily correlated with the volatilities calculated from the booking and trade sets, and with the bid/ask spread metric. The "size" from the trade set also matters, of course: larger and more frequent trades may well go along with higher realized volatility. I am just surprised that price barely correlates with the target realized volatility. But let's go ahead with this for now.

# 3.2 Autocorrelation function / partial autocorrelation function (ACF/PACF) and cyclicity

In my previous work, I did some LSTM/RNN studies on predicting Tesla stock. For a dataset with time-series / sequence characteristics, it is worthwhile to check the autocorrelation. I am going to copy what I did in the past and revisit the concept of autocorrelation. (original workbook: https://www.kaggle.com/pkang0831/failed-lstm-rnn-with-new-recommendation )

**Auto Correlation**

In time series problems, it is important to look at the autocorrelation and test whether there are time-dependent repeated patterns (cyclicity). For this, the pandas lag_plot will be used.

> A linear shape to the plot suggests that an autoregressive model is probably a better choice.

> An elliptical plot suggests that the data comes from a single-cycle sinusoidal model.

If your data shows a linear pattern, it suggests autocorrelation is present. A positive linear trend (i.e. going upwards from left to right) is suggestive of positive autocorrelation; a negative linear trend (going downwards from left to right) is suggestive of negative autocorrelation. The tighter the data is clustered around the diagonal, the more autocorrelation is present; perfectly autocorrelated data will cluster in a single diagonal line.

![](https://www.statisticshowto.com/wp-content/uploads/2016/11/lag-plot-linear.png)


Data can be checked for seasonality by plotting observations for a greater number of periods (lags). Data with seasonality will repeat itself periodically in a sine or cosine-like wave.

![](https://www.statisticshowto.com/wp-content/uploads/2016/11/seasons-lp.png)
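
A lag plot is a scatter of y(t) against y(t + lag), so the patterns described above can also be put into numbers by correlating a series with a shifted copy of itself. Here is a small sketch on synthetic series; the helper `lag_correlation` and all three series are made up for illustration.

```python
import numpy as np

def lag_correlation(y, lag):
    """Pearson correlation between y(t) and y(t + lag)."""
    return np.corrcoef(y[:-lag], y[lag:])[0, 1]

rng = np.random.default_rng(0)
n = 2000

# White noise: no memory, so the lag plot is a shapeless cloud.
noise = rng.normal(size=n)

# AR(1) series: each value is 0.9 times the previous one plus noise.
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

# Seasonal series: a sine wave with a period of 50 steps plus a little noise.
seasonal = np.sin(2 * np.pi * np.arange(n) / 50) + 0.1 * rng.normal(size=n)

print(lag_correlation(noise, 1))      # close to 0
print(lag_correlation(ar1, 1))        # close to 0.9
print(lag_correlation(seasonal, 25))  # strongly negative (half a period)
print(lag_correlation(seasonal, 50))  # strongly positive (one full period)
```

The sign flip between lag 25 and lag 50 in the seasonal series is the numeric counterpart of the repeating wave seen when the lag is varied.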

Let's try a lag offset and see if the series correlates with itself (the cell below uses lag=1700)...

In [None]:
n_cols = 3
n_rows = int(np.ceil(len(df.columns)/n_cols)) # subplot needs integer grid sizes
plt.figure(figsize=(20,60))
for i,var in enumerate(df.columns):
    plt.subplot(n_rows,n_cols,i+1)
    plt.title(var)
    pd.plotting.lag_plot(df[var],lag=1700)


So... nothing at lags of 30 or 80, but at 1750 we start to see something. As for the cross shape of the lag plot, I don't know what that refers to. If there is any expert who can interpret what the "+" shape means in a lag plot, please teach me...

Let's calculate the ACF and PACF and see roughly which lag is sufficient for detecting autocorrelation.

In [None]:
# autocorrelation function (ACF)
fig,ax = plt.subplots(len(df.columns),1,figsize=(20,60))
for i,var in enumerate(df.columns):
    tsa.plot_acf(df[var],lags=1750,ax=ax[i],title=df.columns[i])
    ax[i].set_ylim(0,0.25)

In [None]:
# partial autocorrelation function (PACF) - it takes 2.5 hrs to run.. if you can optimize this for me, I WOULD LOVE TO LEARN.
fig,ax = plt.subplots(len(df.columns),1,figsize=(20,60))
for i,var in enumerate(df.columns):
    tsa.plot_pacf(df[var],lags=1750,ax=ax[i],title=df.columns[i])
    ax[i].set_ylim(0,0.15)

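So, the way to read the ACF/PACF plots is that the shaded band is the 95% confidence interval. A spike that stays inside the band is not significantly different from zero, while a spike that reaches outside the band suggests real (partial) autocorrelation at that lag.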
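
To get a feel for what such a band means, here is a sketch using the simple large-sample approximation ±1.96/√n on white noise. This is an assumption-level illustration: statsmodels computes its own intervals, so the band drawn in the plots above need not match this number exactly. The helper `sample_acf` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)  # white noise, so there is no true autocorrelation

def sample_acf(y, max_lag):
    """Sample autocorrelation for lags 1..max_lag."""
    y = y - y.mean()
    denom = np.sum(y ** 2)
    return np.array([np.sum(y[:-k] * y[k:]) / denom for k in range(1, max_lag + 1)])

acf_values = sample_acf(x, max_lag=100)

# Approximate 95% band for a series with no autocorrelation: +/- 1.96 / sqrt(n).
band = 1.96 / np.sqrt(n)

outside = np.abs(acf_values) > band
print(f"band = +/-{band:.3f}, lags outside the band: {outside.sum()} of {len(acf_values)}")
```

Even for pure noise, roughly one lag in twenty is expected to poke outside a 95% band, so a few isolated spikes among 1750 lags are not evidence of structure on their own.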

In [None]:
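n_rows = 3
n_cols = int(np.ceil(len(df.columns)/n_rows)) # subplot needs integer grid sizes
plt.figure(figsize=(40,10))

for i,col in enumerate(df.columns):
    plt.subplot(n_rows,n_cols,i+1)
    plt.boxplot(x=df[col],showmeans = True, meanline = True)
    # jitter the x positions so that the individual points do not overlap
    x = np.random.normal(1, 0.04, size=len(df[col]))
    plt.plot(x,df[col],'r.',alpha=0.05)
    # mark the column mean with a blue dot
    plt.plot(1,df[col].mean(),'bo',alpha=1)
    plt.xlabel(col)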

In [None]:
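n_rows = 3
n_cols = int(np.ceil(len(df.columns)/n_rows)) # subplot needs integer grid sizes
plt.figure(figsize=(40,10))

for i,col in enumerate(df.columns):
    plt.subplot(n_rows,n_cols,i+1)
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(x=df[col],kde=True,stat='density')
    plt.xlabel(col)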

In [None]:
n_rows = 3
n_cols = int(np.ceil(len(df.columns)/n_rows)) # subplot needs integer grid sizes
plt.figure(figsize=(40,10))

for i,col in enumerate(df.columns):
    plt.subplot(n_rows,n_cols,i+1)
    sns.kdeplot(x=df[col])
    plt.xlabel(col)

So for the ones that have a decent correlation with the target, let's plot KDE (kernel density estimation) plots to interpret which ones affect it the most.

Recall corr_var: 

In [None]:
corr_var

Let us say, for example, that anything with a Pearson r lower than 0.3 is ignored, and see what the KDE looks like for the rest. From the list above, that takes us down to target vs size.
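
Instead of counting rows by hand, the cutoff can also be applied as a filter. This is a sketch on a made-up table with the same columns as corr_var; using the absolute value is my own choice here, so that strongly negative predictors are kept too.

```python
import pandas as pd

# Toy stand-in for corr_var; the numbers are made up.
toy_corr = pd.DataFrame({
    "var_1": ["target"] * 5,
    "var_2": ["vol_1", "BAS1", "size", "price", "logr1"],
    "correlation": [0.85, 0.60, 0.35, 0.02, -0.40],
})

threshold = 0.3

# Keep predictors whose absolute Pearson r reaches the threshold,
# so strong negative relationships are kept as well.
selected = toy_corr.loc[toy_corr["correlation"].abs() >= threshold, "var_2"].tolist()
print(selected)
```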

In [None]:
name_to_plot = list(corr_var['var_2'][0:7])
for i in range(0,7):
    sns.jointplot(x = name_to_plot[i],y = 'target',data = df)

So... most of them show a positive relationship.

So far, what I can say about this data is that the volatilities calculated from the first/second best bid/ask data and from the trade data are fairly representative of the actual target volatility. Furthermore, we cannot overlook the importance of the BAS (bid/ask spread). I do not know how important it is in the trading world, but it must be a well-known variable that tells a story about the liquidity of the market.

# 4.0 Modelling

So, I see this as a supervised regression problem. Let's use the famous sklearn library to split the data, scale it, fit it to a collection of models, and check the performance.

I will use tensorflow (tf) for the neural network, just because I love tf and am more familiar with building ANNs in it.

Anyways!

Predefined models to study :) (from https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)

1. Linear methods

    1.1 Ordinary Least Squares - basically the simple multiple linear regression that we all know: y = a*x1 + b*x2 + c*x3 + ... + e, with the weights [a,b,c,...] given by (X'X)^-1X'y
    
    1.2 Ridge regression - from what I know, this just applies a penalty alpha to the OLS method, (X'X+aI)^-1X'y, so that we introduce a slight bias to reduce high variance and avoid overfitting. The optimal alpha (or lambda) is determined by trying values out with CV; more about this later in hyperparameter tuning
    
    1.3 Lasso regression - this is very similar to ridge regression, but there is a slight difference in how the penalty term is set up: ridge penalizes the squared weights (L2), while Lasso penalizes their absolute values (L1). In ridge, the weights shrink but never reach exactly zero, so even if a variable is useless for predicting the target, ridge cannot eliminate it; it only reduces its weight. Lasso, however, can set such weights to exactly zero and thereby eliminate the variable. This means that Lasso tends to perform slightly better when the feature set includes many predictors that are terrible at predicting the target
    
    1.4 Elastic-Net regression - a combination of ridge and lasso. The lambda 1 and lambda 2 values (L1 - Lasso, L2 - Ridge) are determined using CV, and this model is said to be good at dealing with correlated predictors
    
    ~~1.5 Least Angle Regression (LARS) or LARS Lasso - This is mainly for high-dimensional data (features > observations). Since we are not dealing with that case, we will skip it~~
    
    1.6 Orthogonal Matching Pursuit - Honestly, I really don't know this algorithm, but let's try it...
    
    1.7 Bayesian Ridge Regression - again, I am not too sure about the theory behind this, but from what I know it is the Bayesian probabilistic version of ridge regression.
    
    ~~1.8 Logistic Regression - despite the name, from what I have learned this one is mainly used for classification. we will ignore this.~~
    
    1.9 Generalized Linear Model (GLM) - a generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows the response variable to have an error distribution other than the normal distribution. (Got this from Wikipedia. We are going to use the statsmodels library for this. Scikit-learn has this as well, but statsmodels puts the GLM in a separate class in its API.)
    
    1.10 Stochastic Gradient Descent - SGDRegressor basically implements a plain SGD learning routine supporting various loss functions and penalties to fit linear regression models. As for learning SGD: you know what gradient descent is, right? If not, just think of it as an algorithm for finding a local minimum of a function. SGD performs gradient descent stochastically (picking one data point or batch at each step), which makes each step much cheaper on large datasets. I suggest you also review Markov chains while you learn this.
    
    1.11 Perceptron - From what I have read in other resources, this is mainly used for classification. However, it will be discussed later when I deal with the multilayer perceptron or ANN modelling with tensorflow
    
    
2. Kernel ridge regression - Ridge regression with using kernel trick/method


3. Support Vector Machines
    
    3.1 Epsilon-Support Vector Regression.
    
    3.2 NuSVR
    
    3.3 LinearSVR
    

4. Stochastic Gradient Descent


5. Nearest Neighbors


6. Gaussian Processes


7. PLSRegression (Partial Least Squares...?)


8. Decision Trees



9.  Ensemble methods
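
To make the difference between OLS, ridge and lasso from 1.1 to 1.3 concrete, here is a small numpy-only sketch on synthetic data. It is not how scikit-learn implements these models: the closed forms are the ones quoted above, and the lasso is solved with a bare-bones coordinate descent written for this example (the function name, the penalty values and the data are all made up).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Synthetic data: only the first two predictors matter; the other three are noise.
X = rng.normal(size=(n, p))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.5 * rng.normal(size=n)

# OLS: w = (X'X)^-1 X'y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X'X + alpha*I)^-1 X'y
alpha = 50.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise 0.5*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    w = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Residual with predictor j's own contribution added back in.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            # Soft-thresholding: a small rho snaps the weight to exactly zero.
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

w_lasso = lasso_coordinate_descent(X, y, lam=50.0)

print("OLS:  ", w_ols.round(3))
print("Ridge:", w_ridge.round(3))
print("Lasso:", w_lasso.round(3))
```

In this toy setup ridge shrinks all five weights but leaves the three noise predictors with small non-zero values, while the lasso's soft-thresholding step sets them to exactly zero.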

In [None]:
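try:
    # only needed on scikit-learn < 1.0, where HistGradientBoostingRegressor was still experimental
    from sklearn.experimental import enable_hist_gradient_boosting
except ImportError:
    pass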
from sklearn import linear_model, kernel_ridge, svm, neighbors, gaussian_process, cross_decomposition, tree, ensemble, neural_network
import xgboost
Regress_alg = [
    # Linear Models
    linear_model.LinearRegression(), # parameter for later gridsearch: fit_intercept=True, normalize=False, copy_X=True, n_jobs=None, positive=False
    linear_model.RidgeCV(), # alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None
    linear_model.LassoCV(), # alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic'
    linear_model.ElasticNet(), # alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, precompute=False, max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic'
    linear_model.OrthogonalMatchingPursuit(), # n_nonzero_coefs=None, tol=None, fit_intercept=True, normalize=True, precompute='auto'
    linear_model.BayesianRidge(), # n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, alpha_init=None, lambda_init=None, compute_score=False, fit_intercept=True, normalize=False, copy_X=True, verbose=False
    linear_model.GammaRegressor(), # alpha=1.0, fit_intercept=True, max_iter=100, tol=0.0001, warm_start=False, verbose=0
    linear_model.PoissonRegressor(), # alpha=1.0, fit_intercept=True, max_iter=100, tol=0.0001, warm_start=False, verbose=0
    linear_model.TweedieRegressor(), # power=0.0, alpha=1.0, fit_intercept=True, link='auto', max_iter=100, tol=0.0001, warm_start=False, verbose=0
    # Stochastic Gradient Descent (SGD)
    linear_model.SGDRegressor(), # loss='squared_loss', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False
    
    # Kernel Ridge Regression
#     kernel_ridge.KernelRidge(), # alpha=1, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None
    
    # Support Vector Machine - Regressions
#     svm.SVR(), # kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=- 1
    svm.NuSVR(), # nu=0.5, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, tol=0.001, cache_size=200, verbose=False, max_iter=- 1
    svm.LinearSVR(), # epsilon=0.0, tol=0.0001, C=1.0, loss='epsilon_insensitive', fit_intercept=True, intercept_scaling=1.0, dual=True, verbose=0, random_state=None, max_iter=1000

    # Nearest Neighbours Regression
    neighbors.KNeighborsRegressor(), # n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None
#     neighbors.RadiusNeighborsRegressor(), # radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None
    
    # Gaussian Process Regression (GPR)
    gaussian_process.GaussianProcessRegressor(), # kernel=None, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True, random_state=None
    
    # PLS Regression
#     cross_decomposition.PLSRegression(), # n_components=2, scale=True, max_iter=500, tol=1e-06, copy=True
    
    # Decision Tree
    tree.DecisionTreeRegressor(), # criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0
    # Ensemble methods
    ensemble.AdaBoostRegressor(), # base_estimator=None, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None
    ensemble.BaggingRegressor(), # base_estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0
    ensemble.ExtraTreesRegressor(), # n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None
    ensemble.RandomForestRegressor(), # n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None
    ensemble.HistGradientBoostingRegressor(), # loss='least_squares', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, categorical_features=None, monotonic_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None
    ensemble.GradientBoostingRegressor(), # loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0
    
    
    
    # Neural Network, multilayer perceptron
#     neural_network.MLPRegressor(), # hidden_layer_sizes=100, activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000
    xgboost.XGBRegressor()
]

Data prep for fitting the models

1. Split the data

2. Normalize the training set and apply the fitted normalization parameters to the test set

Before I do this, I will drop the 'time_id' column: to me it looks like a plain time index, so it does not make sense as a predictor.
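
To make step 2 concrete, here is a minimal NumPy sketch of what "fit on train, apply to test" means. The data is synthetic and the names (`X_tr`, `X_te`, `mu`, `sigma`) are made up for illustration; the notebook itself uses `StandardScaler`, which should give the same numbers as far as I know.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the predictors: 100 rows, 3 features
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_te = X[:80], X[80:]

# The statistics are learned from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma  # reuse the training statistics, never refit on test

print(X_tr_scaled.mean(axis=0).round(6))  # ~0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to 0, but not exactly
```

The test set ends up only approximately centered, which is expected: it never contributed to `mu` and `sigma`.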

In [None]:
# bring the right tool for splitting the data and normalizing the data
from sklearn.model_selection import train_test_split, cross_validate
from sklearn import preprocessing

# Split the data
# X_train, X_test, Y_train, Y_test, = train_test_split(df.drop(columns='target',axis=1),df['target'], test_size = .33)
X_train, X_test, Y_train, Y_test = train_test_split(df.drop(columns='target'), df['target'], test_size = .2)


# Normalizing the train,test predictor variables.
scaler = preprocessing.StandardScaler()

# Normalize the train predictors
X_train[X_train.columns] = scaler.fit_transform(X_train[X_train.columns]) 

# Apply normalization traits to the test predictors
X_test[X_test.columns] = scaler.transform(X_test[X_test.columns])

print(f'Train dataset shape: {X_train.shape}')
print(f'Test dataset shape: {X_test.shape}')
print(f'Train target dataset shape: {Y_train.shape}')
print(f'Test target dataset shape: {Y_test.shape}')

def rmspe(y_true, y_pred):
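    '''
    Compute the negated Root Mean Square Percentage Error between two arrays.
    The sign is flipped so that make_scorer (greater is better by default) can maximize it;
    multiply the result by -1 to recover the usual positive RMSPE.
    '''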
    loss = -np.sqrt(np.mean(np.square(((y_true - y_pred) / y_true))))

    return loss

rmspe_loss = metrics.make_scorer(rmspe)
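
The `rmspe` helper above folds the sign flip into the metric itself. An alternative sketch, which is not what this notebook uses, keeps the metric positive and lets `make_scorer(..., greater_is_better=False)` do the negation. The name `rmspe_positive` and the toy data are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer

def rmspe_positive(y_true, y_pred):
    """Plain RMSPE: sqrt(mean(((y_true - y_pred) / y_true) ** 2)); lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# Relative errors are -10%, +10%, -10%, so the RMSPE is about 0.1
demo_value = rmspe_positive([1.0, 2.0, 4.0], [1.1, 1.8, 4.4])
print(demo_value)

# greater_is_better=False makes the scorer return -RMSPE,
# so "higher score is better" still holds for cross_validate / GridSearchCV
rmspe_scorer = make_scorer(rmspe_positive, greater_is_better=False)

X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([1.0, 2.1, 2.9, 4.2])
demo_model = LinearRegression().fit(X_demo, y_demo)
demo_score = rmspe_scorer(demo_model, X_demo, y_demo)
print(demo_score)  # a negative number
```

Either way the scores that come back from cross validation are negative, so they need a sign flip before being reported as RMSPE.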

In [None]:
# Pay Homage to Titanic dataset kaggler
MLA_columns = ['algorithm Name', 'algorithm Parameters','algorithm Train R2', 'algorithm Test R2','algorithm Test RMSPE','algorithm Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = Y_train

#index through MLA and save performance to table
row_index = 0

scoring = {
    'r2': 'r2',
    'rmspe': rmspe_loss
}

for alg in Regress_alg:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'algorithm Name'] = MLA_name
    MLA_compare.loc[row_index, 'algorithm Parameters'] = str(alg.get_params())
    
    #score model with cross validation:
    cv_results = cross_validate(alg, X_train, Y_train, return_train_score=True, scoring=scoring) # 5 fold 
    
    MLA_compare.loc[row_index, 'algorithm Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'algorithm Train R2'] = cv_results['train_r2'].mean()
    MLA_compare.loc[row_index, 'algorithm Test R2'] = cv_results['test_r2'].mean()
    MLA_compare.loc[row_index, 'algorithm Test RMSPE'] = cv_results['test_rmspe'].mean()
    
    #save MLA predictions - see section 6 for usage
#     alg.fit(X_train, Y_train)
#     MLA_predict[MLA_name] = alg.predict(X_train)
    
    row_index+=1

# The RMSPE scores are negated (higher is better), so descending order puts the best model first
MLA_compare.sort_values(by = ['algorithm Test RMSPE'], ascending = False, inplace = True)
MLA_compare

In [None]:
# Fit every algorithm on the training set and collect its test-set predictions, one column per algorithm
predicted = pd.DataFrame({
    alg.__class__.__name__: alg.fit(X_train,Y_train).predict(X_test)
    for alg in Regress_alg
})

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

predicted_graph = pd.concat([predicted,Y_test.reset_index(drop=True)],axis=1)

# End points of the 45 degree reference line (prediction == true value).
# Kept out of `df` so the training dataframe is not overwritten.
x = [0,0.018]
y = x

fig1 = px.scatter(predicted_graph, x='GradientBoostingRegressor', y='target', opacity=0.65, trendline='ols', trendline_color_override='blue')
fig2 = px.scatter(x=x,y=y,opacity=1.0,trendline='ols',trendline_color_override='red')

fig = make_subplots()

fig.add_trace((go.Scatter(x=fig1.data[0]['x'],y=fig1.data[0]['y'],name='data',mode='markers',opacity=0.65)))
fig.add_trace((go.Scatter(x=fig1.data[1]['x'],y=fig1.data[1]['y'],name='linear fit to pred vs true',fill='tonexty')))
fig.add_trace((go.Scatter(x=fig2.data[0]['x'],y=fig2.data[0]['y'],name='reference, 45 deg line',fill='tonexty')))

# Artificial Neural Network Regression Mapping, Hyperparameter tuning, Visualizing Hyperparameter tuning with Tensorboard

Going to construct the ANN (fully connected) model architecture

In [None]:
%load_ext tensorboard

In [None]:
# # Artificial Neural Network mapping with tensorflow:

import tensorflow as tf # import tensorflow

from tensorflow import feature_column
from tensorboard.plugins.hparams import api as hp

# Delete the previous logs
!rm -rf ./logs/

# TODO: explain why you are doing this

# Hidden layer 1: number of neurons
HP_NUM_UNITS1 = hp.HParam('num_units_1', hp.Discrete([8,16,24,32,40])) 
# Hidden layer 2: number of neurons
HP_NUM_UNITS2 = hp.HParam('num_units_2', hp.Discrete([4,8,12,16,20,24]))
# Optimizer model selection
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd','RMSprop']))

METRIC_MAPE = 'mean_absolute_percentage_error'

with tf.summary.create_file_writer('logs/hparam_tuning1').as_default():
    hp.hparams_config(
        hparams=[HP_NUM_UNITS1,HP_NUM_UNITS2 ,HP_OPTIMIZER],
        metrics=[hp.Metric(METRIC_MAPE, display_name='MAPE')]
    )


def Neural_Net_Train(hparams):
    
    X_initial = tf.keras.layers.Input(shape=[26])
    H_Initial = tf.keras.layers.Dense(hparams[HP_NUM_UNITS1])(X_initial)
    H_Initial = tf.keras.layers.BatchNormalization()(H_Initial)
    H_Initial = tf.keras.layers.Activation('swish')(H_Initial)

    H_Initial = tf.keras.layers.Dense(hparams[HP_NUM_UNITS2])(H_Initial)
    H_Initial = tf.keras.layers.BatchNormalization()(H_Initial)
    H_Initial = tf.keras.layers.Activation('swish')(H_Initial)

    Y_Initial = tf.keras.layers.Dense(1)(H_Initial)

    model_ANN = tf.keras.models.Model(X_initial,Y_Initial)
    model_ANN.compile(optimizer=hparams[HP_OPTIMIZER], loss='mse', metrics=[tf.keras.metrics.MeanAbsolutePercentageError()])

    model_ANN.fit(X_train, Y_train, epochs=100,verbose=0)
    _, MAPE = model_ANN.evaluate(X_test, Y_test)
    return MAPE


def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        MAPE = Neural_Net_Train(hparams)
        tf.summary.scalar(METRIC_MAPE, MAPE, step=1)

        
session_num = 0

for num_units1 in HP_NUM_UNITS1.domain.values:
    for num_units2 in HP_NUM_UNITS2.domain.values:
        for optimizer in HP_OPTIMIZER.domain.values:
            hparams = {
                HP_NUM_UNITS1: num_units1,
                HP_NUM_UNITS2: num_units2,
                HP_OPTIMIZER: optimizer
            }
            run_name = "run-%d" % session_num
            print('--- Starting trial: %s' % run_name)
            print({h.name: hparams[h] for h in hparams})
            run('logs/hparam_tuning1/' + run_name, hparams)
            session_num += 1
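
The three nested loops above walk through every combination of the hyperparameter values. A small sketch with `itertools.product` shows the same enumeration and how many trials it produces. The values are copied into plain lists here (an assumption for the example, so it runs without TensorBoard), and the iteration order may differ from the order of `HParam.domain.values`.

```python
from itertools import product

# Same candidate values as the HParam domains above, written as plain lists
units_1 = [8, 16, 24, 32, 40]
units_2 = [4, 8, 12, 16, 20, 24]
optimizers = ['adam', 'sgd', 'RMSprop']

trials = [
    {'num_units_1': u1, 'num_units_2': u2, 'optimizer': opt}
    for u1, u2, opt in product(units_1, units_2, optimizers)
]

print(len(trials))  # 5 * 6 * 3 = 90 combinations
print(trials[0])
```

With 100 epochs per trial, 90 trials is a lot of training, which is worth keeping in mind before widening any of the lists.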

In [None]:
# I don't know why it is not working here, but if you try it in your own Jupyter notebook, it works.
%tensorboard --logdir logs/hparam_tuning1


In [None]:
rmspe_loop = []
# rmspe_val = 1

# Modelling after grid search
X_initial = tf.keras.layers.Input(shape=[26])
H_Initial = tf.keras.layers.Dense(27)(X_initial)
H_Initial = tf.keras.layers.BatchNormalization()(H_Initial)
H_Initial = tf.keras.layers.Activation('relu')(H_Initial)

H_Initial = tf.keras.layers.Dense(4)(H_Initial)
H_Initial = tf.keras.layers.BatchNormalization()(H_Initial)
H_Initial = tf.keras.layers.Activation('relu')(H_Initial)

Y_Initial = tf.keras.layers.Dense(1)(H_Initial)

model_ANN = tf.keras.models.Model(X_initial,Y_Initial)

model_ANN.compile(optimizer='adam', loss='mse')

# while rmspe_val > 0.8:
for _ in range(0,200):

    model_ANN.fit(X_train,Y_train,epochs=1,verbose=0)
    ANN_prediction = model_ANN.predict(X_test)
    rmspe_val = -rmspe(Y_test.to_numpy(),ANN_prediction.ravel())  # flatten the (n, 1) predictions to avoid (n, n) broadcasting
    rmspe_loop.append(rmspe_val)

print(min(rmspe_loop))
px.line(rmspe_loop)
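
One thing to watch with shapes: Keras `predict` returns an array of shape `(n, 1)`, while `Y_test.to_numpy()` has shape `(n,)`. Feeding the two into an element-wise metric without flattening makes NumPy broadcast them into an `(n, n)` matrix, so the metric averages over every true/predicted pair. A tiny sketch with made-up numbers (`rmspe_value` is a hypothetical positive-valued helper, not the notebook's `rmspe`):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 4.0])           # shape (3,), like Y_test.to_numpy()
y_pred_2d = np.array([[1.1], [1.8], [4.4]])  # shape (3, 1), like a Keras prediction

diff_broadcast = y_true - y_pred_2d          # broadcasts to shape (3, 3)
diff_aligned = y_true - y_pred_2d.ravel()    # shape (3,), element by element

print(diff_broadcast.shape, diff_aligned.shape)

def rmspe_value(y_true, y_pred):
    """Positive RMSPE, used only for this illustration."""
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

wrong = rmspe_value(y_true, y_pred_2d)          # inflated by the cross terms
right = rmspe_value(y_true, y_pred_2d.ravel())  # about 0.1
print(wrong, right)
```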

In [None]:
from sklearn.metrics import r2_score
print(r2_score(Y_test,ANN_prediction))
# Flatten the (n, 1) predictions and line them up with the true values by position
plot_data = pd.DataFrame({'Predicted': ANN_prediction.ravel(), 'True_val': Y_test.to_numpy()})
px.line(plot_data, y=plot_data.columns)

So... the artificial neural network has its limits, even after going through a grid search. I am going to come back to it after the feature selection. For now, let us move on to grid search with the scikit-learn models.

In [None]:
for alg in Regress_alg:
    print(f'Predictors: {alg.__class__.__name__}, \n\nParameter list: {alg.get_params()}\n')
    

So here, we have 23 ML regression algorithms on which we can perform a grid search. However, each algorithm has its own set of hyperparameters to tune, so I am going to make a list of dictionaries, one grid of search parameters per algorithm
(e.g. a GridSearchCV grid for sklearn.linear_model.LinearRegression()).
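
To show the shape of one such dictionary before the long list, here is a minimal sketch on synthetic data. I swapped in `Ridge` instead of `LinearRegression` only because its `alpha` makes for an obvious grid; the data, the grid values, and the built-in `'neg_root_mean_squared_error'` scoring are all assumptions for the example, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=120)

# One dictionary per estimator: key = constructor argument, value = candidates to try
param_grid_demo = {
    'alpha': [0.01, 0.1, 1.0, 10.0],
    'fit_intercept': [True, False],
}

search = GridSearchCV(Ridge(), param_grid=param_grid_demo, cv=3,
                      scoring='neg_root_mean_squared_error')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_['params']))  # 4 * 2 = 8 combinations
```

The number of fits grows as the product of the list lengths times the number of folds, so the bigger grids below get expensive quickly.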

In [None]:
param_grid_LinearRegression = {
    'fit_intercept': [True,False],
    'normalize': [True,False], 
    'positive': [True,False]
}

param_grid_RidgeCV = {
    'alphas': [np.array([0.001,0.01]),np.array([0.1,1]),np.array([2,3]),np.array([4,5]),np.array([6,7]),np.array([8,9]),np.array([10,11])],
    'cv': [3,4,5], 
    'fit_intercept': [True,False], 
    'gcv_mode': ['auto', 'svd', 'eigen'], 
    'normalize': [True,False], 
    'store_cv_values': [False]
}

param_grid_LassoCV = {
    'alphas': [None,np.array([0.001,0.01]),np.array([0.1,1]),np.array([2,3]),np.array([4,5]),np.array([6,7]),np.array([8,9]),np.array([10,11])], 
    'fit_intercept': [True,False], 
    'max_iter': [1000], 
    'n_alphas': [0,1,5,10,20],  
    'normalize': [True,False], 
    'positive': [True,False], 
    'random_state': [42]
}

param_grid_ElasticNet = {
    'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0],
    'fit_intercept': [True,False],
    'l1_ratio': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
    'normalize': [True,False],
    'positive': [True,False],
    'precompute': [True,False],
    'random_state': [12,42],
    'tol': [0.0001,0.001], 
    'warm_start': [True,False]
}

param_grid_OrthogonalMatchingPursuit = {
    'fit_intercept': [True,False], 
    'n_nonzero_coefs': [0,1], 
    'tol': [0.0001,0.001,0.01]
}

param_grid_BayesianRidge = {
    'alpha_1': [1e-02,1e-01], 
    'alpha_2': [1e-02,1e-01], 
    'compute_score': [True,False],
    'fit_intercept':[True,False],
    'lambda_1': [1e-06,1e-01],
    'lambda_2': [1e-06,1e-01],
    'n_iter': [100,200], 
    'normalize': [True,False],
    'tol': [0.0001,0.001]
}

param_grid_GammaRegressor = {
    'alpha': [0.01,0.1,1.0], 
    'fit_intercept': [True,False], 
    'max_iter': [100,200], 
    'tol': [0.0001,0.001,0.01], 
}

param_grid_PoissonRegressor = {
    'alpha': [0.01,0.1,1.0,10,100], 
    'fit_intercept': [True,False], 
    'max_iter': [100,200,300], 
    'tol': [0.0001,0.001,0.01], 
    'warm_start': [True,False]
}

param_grid_TweedieRegressor = {
    'alpha': [0,0.1,1],  # the key was listed twice; only this later list was ever used
    'fit_intercept': [True,False], 
    'max_iter': [100,200,300], 
    'tol': [0.0001,0.001,0.01], 
    'warm_start': [True,False]
}

param_grid_SGDRegressor = {
    'alpha': [0.0001], 
    'average': [False],
    'early_stopping': [False], 
    'epsilon': [0.01], 
    'eta0': [0.01,0.1],
    'fit_intercept': [True], 
    'l1_ratio': [0.35], 
    'learning_rate': ['invscaling'], 
    'loss': ['huber'],
    'penalty': ['l1'], 
    'max_iter': [3000],
    'power_t': [0.35],
    'shuffle': [True], 
    'validation_fraction': [0.1], 
    'warm_start': [True]
}

# param_grid_KernelRidge = { # Not working at all...
#     'alpha': [1,10,100], 
#     'coef0': [0.01,0.001], 
#     'degree': [0.1,1], 
#     'gamma': [None], 
#     'kernel': ['linear', 'rbf','poly']
# }

# param_grid_SVR = { # Not working at all...
#     'C': [0.001,1,1.5], 
#     'epsilon': [0.1,0.2,0.3],
#     'gamma': [1e-4,0.01],
#     'kernel': ['linear', 'rbf','poly']
# }

param_grid_NuSVR = {
    'C': [0.001,1,1.5], 
    'gamma': [1e-4,0.01],
    'nu': [0.5,1],
    'kernel': ['linear', 'rbf','poly']
}

param_grid_LinearSVR = {
    'C': [0.001], 
    'fit_intercept': [True],
    'max_iter': [1000],
    'random_state': [42]
}

param_grid_KNeighborsRegressor = {
    'algorithm': ['ball_tree', 'kd_tree', 'brute'], 
    'leaf_size': [1,10],  
    'n_neighbors': [3,4,5], 
    'p': [1,2]
}

param_grid_GaussianProcessRegressor = {
    'alpha': [1e-10,1e-7,1e-4], 
    'normalize_y': [True,False], 
    'n_restarts_optimizer': [0,2], 
    'random_state': [12,42]
}

param_grid_DecisionTreeRegressor = {
    "splitter":["best","random"],
    "max_depth" : [1,3,5,7,9,11,12],
    "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
    "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4],
    "max_features":["auto","log2","sqrt",None],
    "max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90]
}

param_grid_AdaBoostRegressor = {
    'base_estimator': [tree.DecisionTreeRegressor()],
    'n_estimators':[1000,1500],
    'learning_rate':[.0001,0.001,.01],
    'random_state':[42]
}

param_grid_BaggingRegressor = {
    'bootstrap': [True], 
    'bootstrap_features': [False], 
    'max_features': [10],
    'max_samples': [1.0], 
    'n_estimators': [100,200,1000], 
    'oob_score': [True], 
    'warm_start': [False]
}

param_grid_ExtraTreesRegressor = {
    'bootstrap': [True,False], # False
    'ccp_alpha': [0.0,0.1,0.2], # 0.0
    'max_depth': [None,1,10,100], # 10
    'max_features': ['auto','sqrt', 'log2'], # 'auto'
    'min_samples_leaf': [1,2,4,6,8,10], # 2
    'min_samples_split': [2,4,6,8,10], # 2
    'n_estimators': [10,100,1000], # 10
    'warm_start': [True,False] # False
}

param_grid_RandomForestRegressor = {
    'bootstrap': [True,False], 
    'max_depth': [10, 20],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 5],
    'n_estimators': [10,100]
}

param_grid_HistGradientBoostingRegressor = {
    'categorical_features': [None], # we got no categorical feature, no one-hot-encoding involved.
    'early_stopping': [True,False], # True
    'l2_regularization':[0.0,0.5,1.0], # 0.5
    'learning_rate': [0.1,1], # 0.1
    'loss': ['least_squares','least_absolute_deviation','poisson'], # poisson
    'max_bins':[100, 255], # 100
    'max_depth': [10, 100, None], # 100
    'max_iter': [500], 
    'max_leaf_nodes': [31,None], # 31
    'min_samples_leaf': [20,10], # 10
    'validation_fraction': [0.1,0.2] # 0.2
}

param_grid_GradientBoostingRegressor = {
    'alpha': [0.1,0.5,0.9], # 0.9
    'criterion': ['friedman_mse','mse'], # friedman
    'learning_rate': [0.05,0.1,0.2], # 0.05
    'loss': ['ls', 'huber', 'quantile'], # huber
    'max_depth': [3,5], # 3
    'max_features': ['sqrt'], # sqrt
    'min_samples_leaf': [10,20],# 10
    'min_samples_split': [20,30], # 20
    'n_estimators': [150,160], # 150
    'subsample': [1.0], # 1.0
    'validation_fraction': [0.2] # 0.2
}
param_grid_lists = [param_grid_LinearRegression, param_grid_RidgeCV, param_grid_LassoCV, param_grid_ElasticNet
                   , param_grid_OrthogonalMatchingPursuit, param_grid_BayesianRidge, param_grid_GammaRegressor
                   , param_grid_PoissonRegressor, param_grid_TweedieRegressor, param_grid_SGDRegressor
                   , param_grid_NuSVR, param_grid_LinearSVR, param_grid_KNeighborsRegressor
                   , param_grid_GaussianProcessRegressor, param_grid_DecisionTreeRegressor, param_grid_AdaBoostRegressor
                   , param_grid_BaggingRegressor, param_grid_ExtraTreesRegressor, param_grid_RandomForestRegressor
                   , param_grid_HistGradientBoostingRegressor, param_grid_GradientBoostingRegressor]

In [None]:
# Import the grid search class from the scikit learn model selection module
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Linear Regression estimator has following parameters

# Parameter List: {
#     'copy_X': True, 
#     'fit_intercept': True, 
#     'n_jobs': None, 
#     'normalize': False, 
#     'positive': False
# }

# Define parameter lists for grid searching. this will be defined as dictionaries.

# I suggest you run 3 algorithms at a time (change x below) to avoid kernel death and overloading the computer

tuned_params = pd.DataFrame(columns = ['name','pre tuned RMSPE','post tuned RMSPE','params'])
params = []
x=1
for i , alg in enumerate(Regress_alg[x:x+3]):
    grid = GridSearchCV(estimator=alg, param_grid=param_grid_lists[i+x], cv=5, n_jobs = -1, scoring = scoring, refit = 'rmspe')
    grid_result = grid.fit(X_train, Y_train)
    tuned_params.loc[i,'name'] = alg.__class__.__name__
    # MLA_compare stores the negated RMSPE, so flip the sign to match the post tuned column
    tuned_params.loc[i,'pre tuned RMSPE'] = -MLA_compare['algorithm Test RMSPE'].loc[MLA_compare['algorithm Name'] == alg.__class__.__name__].values[0]
    tuned_params.loc[i,'post tuned RMSPE'] = grid_result.best_score_*-1
    tuned_params.loc[i,'params'] = str(grid_result.best_params_)
    params.append(grid_result.best_params_)

tuned_params
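
The loop above pairs each estimator with its grid by list position (`param_grid_lists[i+x]`), which silently breaks if either list is reordered. A possible alternative, sketched here with made-up grids and a hypothetical `chunks` helper, is to key the grids by class name and walk the estimators in small batches:

```python
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.tree import DecisionTreeRegressor

estimators_demo = [Ridge(), ElasticNet(), DecisionTreeRegressor()]

# Hypothetical grids, keyed by class name instead of list position
grids_by_name = {
    'Ridge': {'alpha': [0.1, 1.0]},
    'ElasticNet': {'alpha': [0.1, 1.0], 'l1_ratio': [0.2, 0.8]},
    'DecisionTreeRegressor': {'max_depth': [3, 5]},
}

def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

for batch in chunks(estimators_demo, 2):
    for est in batch:
        name = est.__class__.__name__
        print(name, grids_by_name[name])
```

A missing grid then fails loudly with a `KeyError` instead of tuning an estimator with the wrong dictionary.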

In [None]:
for i in range(len(tuned_params)):
    print(tuned_params.loc[i,'params'],'\n\n Vs \n')
    print(MLA_compare['algorithm Parameters'].loc[MLA_compare['algorithm Name'] == tuned_params.loc[i,'name']].values[0],'\n')
    
# We will update our model's parameter as per tuned parameters..

Regress_alg_tuned = [
    # Linear Models
    linear_model.LinearRegression(fit_intercept= True, normalize= True),
    linear_model.RidgeCV(alphas = np.array([0.001, 0.01 ]), cv = 5, fit_intercept = True, gcv_mode = 'auto', normalize= False, store_cv_values = False),
    linear_model.LassoCV(alphas = None, fit_intercept= True, max_iter= 1000, n_alphas= 1, normalize= False, positive= False, random_state= 42), 
    linear_model.ElasticNet(alpha= 1e-05, fit_intercept= True, l1_ratio= 0.7, normalize= False, positive= False, precompute= False, random_state= 12, tol= 0.001, warm_start= True), 
    linear_model.OrthogonalMatchingPursuit(fit_intercept= True, n_nonzero_coefs= 0, tol= 0.0001),
    linear_model.BayesianRidge(alpha_1= 0.1, alpha_2= 0.01, compute_score= True, fit_intercept= True, lambda_1= 1e-06, lambda_2= 0.1, n_iter= 100, normalize= False, tol= 0.0001),
    linear_model.GammaRegressor(alpha= 1.0, fit_intercept= True, max_iter= 100, tol= 0.0001),
    linear_model.PoissonRegressor(alpha= 0.01, fit_intercept= True, max_iter= 100, tol= 0.0001, warm_start= True),
    linear_model.TweedieRegressor(alpha= 0, fit_intercept= True, max_iter= 100, tol= 0.0001, warm_start= True),
    # Stochastic Gradient Descent (SGD)
    linear_model.SGDRegressor(alpha= 0.0001, average= False, early_stopping= False, epsilon= 0.01, eta0= 0.1, fit_intercept= True, l1_ratio= 0.35, learning_rate= 'invscaling', loss= 'huber', max_iter= 3000, penalty= 'l1', power_t= 0.35, shuffle= True, validation_fraction= 0.1, warm_start= True),
    
    # Kernel Ridge Regression
#     kernel_ridge.KernelRidge(), # alpha=1, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None
    
    # Support Vector Machine - Regressions
#     svm.SVR(), # kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=- 1
    svm.NuSVR(C= 1.5, gamma= 0.0001, kernel= 'rbf', nu= 1), 
    svm.LinearSVR(C= 0.001, fit_intercept= True, max_iter= 1000, random_state= 42), 

    # Nearest Neighbours Regression
    neighbors.KNeighborsRegressor(algorithm= 'ball_tree', leaf_size= 1, n_neighbors= 5, p= 1), 
#     neighbors.RadiusNeighborsRegressor(), # radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None
    
    # Gaussian Process Regression (GPR)
    gaussian_process.GaussianProcessRegressor(alpha= 1e-10, n_restarts_optimizer= 0, normalize_y= True, random_state= 12),
    
    # PLS Regression
#     cross_decomposition.PLSRegression(), # n_components=2, scale=True, max_iter=500, tol=1e-06, copy=True
    
    # Decision Tree
    tree.DecisionTreeRegressor(max_depth= 7, max_features= 'auto', max_leaf_nodes= 80, min_samples_leaf= 8, min_weight_fraction_leaf= 0.1, splitter= 'best'), 
    # Ensemble methods
    ensemble.AdaBoostRegressor(base_estimator= tree.DecisionTreeRegressor(), learning_rate= 0.0001, n_estimators= 1500, random_state= 42),
    ensemble.BaggingRegressor(bootstrap= True, bootstrap_features= False, max_features= 10, max_samples= 1.0, n_estimators= 200, oob_score= True, warm_start= False), 
    ensemble.ExtraTreesRegressor(bootstrap= False, ccp_alpha= 0.0, max_depth= 10, max_features= 'auto', min_samples_leaf= 4, min_samples_split= 6, n_estimators= 100, warm_start= True), 
    ensemble.RandomForestRegressor(bootstrap= True, max_depth= 10, max_features= 'sqrt', min_samples_leaf= 2, min_samples_split= 5, n_estimators= 100),
    ensemble.HistGradientBoostingRegressor(early_stopping=True,l2_regularization=0.5,learning_rate=0.1,loss='poisson',max_bins=100,max_depth=100,max_iter=500,max_leaf_nodes=31,min_samples_leaf=10,validation_fraction=0.2), 
    ensemble.GradientBoostingRegressor(alpha=0.9,criterion='friedman_mse',learning_rate=0.05,loss='huber',max_depth=3,max_features='sqrt',min_samples_leaf=10,min_samples_split=20,n_estimators=150,subsample=1.0,validation_fraction=0.2),

    # Neural Network, multilayer perceptron
#     neural_network.MLPRegressor(), # hidden_layer_sizes=100, activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000
    xgboost.XGBRegressor()
]

In [None]:
# Pay Homage to Titanic dataset kaggler
MLA_columns_tuned = ['algorithm Name', 'algorithm Parameters_tuned','algorithm Train R2_tuned', 'algorithm Test R2_tuned','algorithm Test RMSPE_tuned','algorithm Time_tuned']
MLA_compare_tuned = pd.DataFrame(columns = MLA_columns_tuned)

#create table to compare MLA predictions
MLA_predict_tuned = Y_train

#index through MLA and save performance to table
row_index = 0

scoring = {
    'r2': 'r2',
    'rmspe': rmspe_loss
}

for alg in Regress_alg_tuned:

    #set name and parameters
    MLA_name_tuned = alg.__class__.__name__
    MLA_compare_tuned.loc[row_index, 'algorithm Name'] = MLA_name_tuned
    MLA_compare_tuned.loc[row_index, 'algorithm Parameters_tuned'] = str(alg.get_params())
    
    #score model with cross validation:
    cv_results = cross_validate(alg, X_train, Y_train, return_train_score=True, scoring=scoring) # 5 fold 
    
    MLA_compare_tuned.loc[row_index, 'algorithm Time_tuned'] = cv_results['fit_time'].mean()
    MLA_compare_tuned.loc[row_index, 'algorithm Train R2_tuned'] = cv_results['train_r2'].mean()
    MLA_compare_tuned.loc[row_index, 'algorithm Test R2_tuned'] = cv_results['test_r2'].mean()
    MLA_compare_tuned.loc[row_index, 'algorithm Test RMSPE_tuned'] = cv_results['test_rmspe'].mean()*-1

    # look the untuned score up by algorithm name (stored negated, so flip the sign)
    MLA_compare_tuned.loc[row_index, 'RMSPE_pretuned_result'] = -MLA_compare['algorithm Test RMSPE'].loc[MLA_compare['algorithm Name'] == MLA_name_tuned].values[0]
    
    #save MLA predictions - see section 6 for usage
#     alg.fit(X_train, Y_train)
#     MLA_predict[MLA_name] = alg.predict(X_train)
    
    row_index+=1

MLA_compare_tuned.sort_values(by = ['algorithm Test RMSPE_tuned'], ascending = True, inplace = True)
MLA_compare_tuned

In [None]:
# Long-format table: one row per (algorithm, tuned / not tuned) pair
a = MLA_compare_tuned[['algorithm Name','algorithm Test RMSPE_tuned']].rename(columns={'algorithm Test RMSPE_tuned':'RMSPE'})
a['Tuned/not_Tuned'] = 'Tuned'
b = MLA_compare_tuned[['algorithm Name','RMSPE_pretuned_result']].rename(columns={'RMSPE_pretuned_result':'RMSPE'})
b['Tuned/not_Tuned'] = 'Not_Tuned'
graph = pd.concat([a,b])  # DataFrame.append was removed in pandas 2.0
px.bar(graph, x="RMSPE", y="algorithm Name",color='Tuned/not_Tuned', orientation='h').update_layout(barmode='group')

Marginal improvement... except for a few algorithms...

How about I stack them? - Work in Progress
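
Since stacking is still a work in progress here, this is only a rough sketch of what it could look like with scikit-learn's `StackingRegressor`. The synthetic data, the choice of base models, and the `RidgeCV` blender are all assumptions for the example, not results from this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
        ('gbr', GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=RidgeCV(),  # blends the out-of-fold predictions of the base models
    cv=3,
)
stack.fit(X_tr, y_tr)
stack_pred = stack.predict(X_te)
print(stack.score(X_te, y_te))  # R^2 on the held-out rows
```

In this notebook the natural candidates for `estimators` would be the best few rows of the tuned comparison table.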

Let's pick one and submit

In [None]:
df_val.fillna(0,inplace=True)
ans = test
for alg in Regress_alg_tuned:
    model = alg.fit(X_train,Y_train)
    predicted = model.predict(df_val)
    predicted = pd.DataFrame(predicted,columns=['predicted '+str(alg.__class__.__name__)])
    ans = pd.concat([ans,predicted],axis=1)
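# Display transposed for readability only; `ans` keeps one row per sample so the submission cell can select 'row_id'
ans.transpose()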

In [None]:
import ipywidgets as widgets

# Create the list of all labels for the drop down list
list_of_labels = X_train.columns.to_list()

# Create a list of tuples so that the index of the label is what is returned
tuple_of_labels = list(zip(list_of_labels, range(len(list_of_labels))))

# Create a widget for the labels and then display the widget
current_label = widgets.Dropdown(options=tuple_of_labels,
                              value=0,
                              description='Select Label:'
                              )

# Display the dropdown list (Note: access index value with 'current_label.value')
current_label

explainer = shap.KernelExplainer(model = Regress_alg_tuned[18].predict, data = X_train.head(200), link = "identity")
shap_values = explainer.shap_values(X = X_train.iloc[0:200,:], nsamples = 300)
shap.initjs()

print(f'Current Label Shown: {list_of_labels[current_label.value]}\n')


shap.summary_plot(shap_values = shap_values[0:200,:],
                  features = X_train.iloc[0:200,:]
                  )

It's quite interesting to see the directionality of each variable's impact on the final prediction... hmm.. interesting.

Also, this result only uses the data grouped (aggregated) at stock_id = 0, and does not consider the data where stock_id != 0. Also, this plot and result were produced with a single model, `Regress_alg_tuned[18]` (the RandomForestRegressor in the tuned list), but I see that a lot of people are using gradient boosting methods and getting way better results than I do.

I will post another version that uses PySpark.

In [None]:
ans[['row_id','predicted GradientBoostingRegressor']].to_csv('submission.csv',index = False)