# General information about this notebook

Hi,

I am Thomas, creator of the automl library e2eml. In this notebook I give you a little walkthrough and example run using a feature called Timewalk.

I created this library for personal development, but also to give something back to the data science community. You can install the library with `!pip install e2eml` (here I installed it without internet access, which is painful).

# What does e2eml offer?

e2eml has two major goals: you can either speed up your prototyping and exploration (with Timewalk) or create a full pipeline with just a few lines of code. e2eml will take care of:

- data preprocessing
- model training
- model fine-tuning
- model evaluation
- logging file

It can handle datetime, categorical and numerical data and allows you to build classification & regression models. For NLP tasks you can even let it create a full BERT model for you. It is NOT built for time series at all currently, but I wanted to give it a shot at least.

# Why e2eml and not any other automl framework?

This decision is fully up to you. They all have their strengths and weaknesses. Choose what suits your needs best. e2eml is just one option for you.

# What is the spirit of e2eml?

This library tries to maximize model performance. It is meant to help you see how far you can get with your data. A side effect is that it creates huge notebooks, because we deliberately print a lot of output: we want you to be able to actually see what is happening under the hood. That comes at a cost here. If you prefer very elegant and silent implementations, check out the fantastic PyCaret library. Additionally, having BERT models fine-tuned for you can be a life saver. We also provide some GPU acceleration with RAPIDS, although this is currently implemented in only a few spots.


In [None]:
!pip install interface_meta --no-index --find-links=file:../input/e2eml-inc-dependencies/interface_meta-1.2.4-py2.py3-none-any.whl
!pip install astor --no-index --find-links=file:../input/e2eml-inc-dependencies/astor-0.8.1-py2.py3-none-any.whl
!pip install formulaic --no-index --find-links=file:../input/formulaic/formulaic-0.2.4-py3-none-any.whl
!pip install autograd --no-index --find-links=file:../input/e2eml-inc-dependencies/autograd-1.3-py3-none-any.whl
!pip install autograd-gamma --no-index --find-links=file:../input/e2eml-inc-dependencies/autograd-gamma-0.5.0/dist/autograd-gamma-0.5.0.tar
!pip install lifelines --no-index --find-links=file:../input/lifelines/lifelines-0.26.4-py3-none-any.whl
!pip install ngboost --no-index --find-links=file:../input/e2eml-inc-dependencies/ngboost-0.3.12-py3-none-any.whl
!pip install boostaroota --no-index --find-links=file:../input/e2eml-inc-dependencies/boostaroota-1.3-py2.py3-none-any.whl
!pip install matplotlib --no-index --find-links=file:../input/e2eml-inc-dependencies/matplotlib-3.1.3-cp38-cp38-manylinux1_x86_64.whl
!pip install catboost --no-index --find-links=file:../input/e2eml-inc-dependencies/catboost-0.21-cp38-none-manylinux1_x86_64.whl
!pip install pytorch_tabnet --no-index --find-links=file:../input/e2eml-inc-dependencies/pytorch_tabnet-3.1.1-py3-none-any.whl
!pip install shap --no-index --find-links=file:../input/e2eml-inc-dependencies/shap-0.39.0-cp35-cp35m-linux_armv6l.whl
!pip install e2eml --no-index --no-dependencies --find-links=file:../input/e2eml-inc-dependencies/e2eml-2.10.4-py3-none-any.whl

I tried to install RAPIDS here, but could not make it work.
Otherwise we could use RAPIDS to accelerate the clustering parts considerably.

In [None]:
import sys
!cp ../input/rapids/rapids.21.06 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
conda uninstall pyarrow

In [None]:
pip install pyarrow

In [None]:
# load libraries
import gc
import sys

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from e2eml.full_processing import postprocessing
from e2eml.regression import regression_blueprints
from e2eml.timetravel import timetravel

In [None]:
target = "target"

In [None]:
df_train = pd.read_csv("../input/ubiquant-market-prediction/train.csv", nrows=40000)
df_train

# Train-test split

We hold back the newest data points as unseen holdout data for validation.
There are two reasons for this:
- Without any unseen data we cannot detect overfitting.
- The data might not behave in a static way. In real-world applications patterns can change over time, and with time series we especially want to make sure not to overfit.
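
The idea of holding back the newest rows can be sketched on a toy frame. This is only an illustration: the helper `time_ordered_split`, the `holdout_fraction` parameter and the toy values are assumptions made for this sketch, and the `time_id` column stands in for whatever column orders your data in time. The notebook itself simply takes the first and last rows of the loaded sample.

```python
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "time_id": [3, 0, 1, 2, 4, 5, 6, 7],
    "target": [0.3, 0.0, 0.1, 0.2, 0.4, 0.5, 0.6, 0.7],
})


def time_ordered_split(df, time_col, holdout_fraction=0.25):
    """Hypothetical helper: return (train, holdout), holdout = most recent rows."""
    # Sort by time first so that "tail" really means "newest".
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_holdout = int(len(ordered) * holdout_fraction)
    split_at = len(ordered) - n_holdout
    return ordered.iloc[:split_at].copy(), ordered.iloc[split_at:].copy()


toy_train, toy_holdout = time_ordered_split(toy, "time_id")

# Every training row is older than every holdout row.
print(toy_train["time_id"].max(), toy_holdout["time_id"].min())
```

Sorting before splitting matters whenever the file is not already ordered by time; otherwise "newest" rows could leak into the training part.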

In [None]:
train_df = df_train.head(30000).copy()
val_df = df_train.tail(10000).copy()

In [None]:
val_df_target = val_df[target]
val_df = val_df.drop(target, axis=1)
val_df

# Automl using e2eml

We use e2eml here. There are plenty of fantastic frameworks, so choose whichever you like.
Here we instantiate the base class that is needed for the Timewalk function and for all individual pipelines.

In [None]:
market_ml = regression_blueprints.RegressionBluePrint(datasource=train_df, 
                        target_variable=target,
                        train_split_type='cross',
                        rapids_acceleration=True,
                        preferred_training_mode='auto',
                        ml_task='regression')

From here we have two general options. We can:
- run a certain blueprint straight away (this is great to create a ready-to-use pipeline for prediction)
- run Timewalk to fully explore many algorithms and preprocessing combinations

Here we assume that we don't know much about what works for this dataset, so we choose Timewalk. Timewalk takes a long time to run (this can be controlled by manually choosing the algorithms and preprocessing steps to use). As we have plenty of data, Timewalk will automatically switch off some algorithms. For example, XGBoost will not run here (this is by design, due to problems with releasing memory again and the high consumption of system resources).

In fact I decided to choose fast algorithms here. XGBoost has been left out because of memory, SVM regression because of speed and scalability, and TabNet because of speed.
As this is for showcasing only, we also use a micro sample relative to the amount of data available, so do not expect any good performance here.

In [None]:
results = timetravel.timewalk_auto_exploration(class_instance=market_ml,
                                   holdout_df=val_df,
                                   holdout_target=val_df_target,
                                   algs_to_test=["linear_regression","vowpal_wabbit", "ridge", "lgbm"],
                                   speed_up_model_tuning=True,
                                   experiment_comment='First short glimpse',
                                   experiment_name="market_prediction.pkl")

Timewalk returns a result dataframe. Performance has been poor in all instances, but LGBM already grabbed some signal at least. So in another notebook I will use a much bigger sample and try to run the LGBM blueprint using our winning preprocessing steps.
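
For reference, the "Mean absolute error" column used for ranking below is the average absolute difference between predictions and true targets. A minimal NumPy sketch of that definition follows. It is not e2eml's internal implementation, the function name `mean_abs_error` is chosen for this sketch, and the numbers are made up for illustration.

```python
import numpy as np


def mean_abs_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up targets and predictions.
y_true = [0.5, -0.25, 1.0, 0.0]
y_pred = [0.25, 0.0, 0.5, 0.0]

# Absolute errors are 0.25, 0.25, 0.5 and 0.0, so the mean is 0.25.
mae = mean_abs_error(y_true, y_pred)
print(mae)
```

Lower values are better, which is why the results are sorted in ascending order.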

In [None]:
results.sort_values(by=["Mean absolute error"], ascending=[True])

In [None]:
# show the preprocessing steps of our best iteration and model (lowest mean absolute error)
best_params = results.sort_values(by=["Mean absolute error"], ascending=[True]).head(1)["Preprocessing applied"].values.tolist()[0]
best_params
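
One way to pull out the row with the lowest error is `idxmin`, sketched here on a toy frame. Only the column names "Mean absolute error" and "Preprocessing applied" come from the notebook above. The "Algorithm" column, the scores and the preprocessing step names are made up for this sketch and do not reflect real Timewalk output.

```python
import pandas as pd

# Toy stand-in for the Timewalk results frame; all values are invented.
toy_results = pd.DataFrame({
    "Algorithm": ["ridge", "lgbm", "linear_regression"],  # hypothetical column
    "Mean absolute error": [0.71, 0.62, 0.74],
    "Preprocessing applied": [["scaling"], ["scaling", "clustering"], []],
})

# idxmin returns the index label of the smallest error, i.e. the best run.
best_row = toy_results.loc[toy_results["Mean absolute error"].idxmin()]
best_steps = best_row["Preprocessing applied"]

print(best_row["Algorithm"], best_steps)
```

Compared with sorting and taking the first row, `idxmin` states the intent directly and avoids getting the sort direction wrong.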