# Experimenting with AutoML
Explore different AutoML libraries for scikit-learn to see whether they can improve our predictions.

### Table of contents
* [Data loading, performance metric & preprocessing](#loaddata)
* [AutoML libraries](#automl)
    * [TPOT](#tpot)
    * [Auto-Sklearn](#autosklearn)
    * [HyperOpt](#hyperopt)
* [Conclusion](#conclusion)

In [1]:
import pandas as pd
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate, train_test_split
from tpot import TPOTRegressor
from sklearn.pipeline import Pipeline

train_data_file = "../data/train.csv"
test_data_file = "../data/test.csv"
tpot_predictions_file = "../results/02_predictions_tpot.csv"
SEED = 0
CV = 5

## Data loading, performance metric & preprocessing <a class="anchor"  id="loaddata"></a>
The code for loading the data, defining the performance metric and applying basic preprocessing is copied from [01_basic.ipynb](jupyter_notebooks/01_basic.ipynb).

In [2]:
# load data
train_df = pd.read_csv(train_data_file)
train_df.set_index("Id", inplace=True)
target_col = "SalePrice"
y_train = train_df[target_col]
X_train = train_df.drop(columns=[target_col])
cat_cols = X_train.select_dtypes(include=["object"]).columns
num_cols = X_train.select_dtypes(exclude=["object"]).columns

X_test = pd.read_csv(test_data_file)
X_test.set_index("Id", inplace=True)

In [3]:
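# define performance metric (negated RMSLE, so that higher is better)
# note: `squared=False` is deprecated in newer scikit-learn releases,
# which provide `root_mean_squared_log_error` instead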
neg_RMSLE_scorer = make_scorer(
    mean_squared_log_error, greater_is_better=False, squared=False
)


def measure_performance(estimator, X, y, scorer=neg_RMSLE_scorer, cv=CV):
    """Calculate negative RMSLE for train and test set via cross validation."""
    cv_results = cross_validate(
        estimator=estimator,
        X=X,
        y=y,
        cv=cv,
        scoring=scorer,
        return_train_score=True,
    )
    test_error = cv_results["test_score"].mean()
    train_error = cv_results["train_score"].mean()
    return train_error, test_error
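
To make the metric concrete, here is a minimal pure-Python sketch of the RMSLE that the scorer above negates. The `rmsle` helper and the sample prices are hypothetical and exist only for illustration. The notebook itself relies on scikit-learn's `mean_squared_log_error`.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative targets."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical sale prices, chosen only for illustration.
y_true = [100_000, 200_000, 300_000]
y_pred = [110_000, 180_000, 300_000]

error = rmsle(y_true, y_pred)
# A scorer built with greater_is_better=False reports the negated error,
# so values closer to zero indicate a better model.
neg_error = -error
print(f"RMSLE: {error:.4f}; scorer value: {neg_error:.4f}")
```

Because the error is computed on log-transformed prices, it penalizes relative rather than absolute deviations, which suits targets that span a wide range.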

In [4]:
# basic preprocessing
simple_imputer = ColumnTransformer(
    transformers=[
        (
            "num_imputer",
            SimpleImputer(strategy="mean", keep_empty_features=True),
            num_cols,
        ),
        (
            "cat_imputer",
            SimpleImputer(strategy="most_frequent", keep_empty_features=True),
            cat_cols,
        ),
    ],
    verbose_feature_names_out=False,
)
simple_imputer.set_output(transform="pandas")

ordinal_encoder = ColumnTransformer(
    transformers=[
        (
            "ordinal_encoder",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            cat_cols,
        )
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)
ordinal_encoder.set_output(transform="pandas")

preprocessing = Pipeline(
    steps=[("imputer", simple_imputer), ("encoder", ordinal_encoder)]
)
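
As a rough illustration of what the two preprocessing steps do, the following pure-Python sketch imitates mean imputation and ordinal encoding on toy columns. The helper functions and the sample values are hypothetical. The pipeline above uses scikit-learn's `SimpleImputer` and `OrdinalEncoder`, and also imputes categorical columns with the most frequent value, which this sketch leaves out.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def fit_ordinal_codes(values):
    """Assign integer codes to the sorted categories seen during fitting."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, codes, unknown_value=-1):
    """Encode categories; anything unseen during fitting gets unknown_value."""
    return [codes.get(value, unknown_value) for value in values]


# Hypothetical columns, made up for illustration only.
lot_frontage_train = [60.0, None, 80.0]
street_train = ["Pave", "Grvl", "Pave"]
street_test = ["Pave", "Dirt"]  # "Dirt" never appears in the training column

codes = fit_ordinal_codes(street_train)
print(impute_mean(lot_frontage_train))
print(encode(street_train, codes))
print(encode(street_test, codes))
```

Mapping unseen categories to `-1` keeps the transformation of the test set from failing when it contains a category that was absent from the training data.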

## AutoML libraries <a class="anchor"  id="automl"></a>
AutoML stands for Automated Machine Learning. It automatically searches for a machine learning pipeline with relatively good performance by optimizing model selection, hyperparameter tuning and some data preprocessing steps. Different optimization techniques are used for this, such as grid or random search, Bayesian optimization or evolutionary algorithms.


We will test and compare multiple AutoML libraries for scikit-learn: TPOT, Auto-Sklearn and HyperOpt-Sklearn.
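
To illustrate the difference between two of the search strategies mentioned above, here is a minimal sketch of grid search and random search over two hyperparameters. The `validation_error` function is a hypothetical stand-in for a cross-validated error and the value ranges are arbitrary. It is not how any of the libraries below implement their search.

```python
import itertools
import random


def validation_error(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated error (lower is better)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2


# Grid search: evaluate every combination of a fixed set of values.
learning_rates = [0.01, 0.1, 0.5]
max_depths = [2, 6, 10]
grid = list(itertools.product(learning_rates, max_depths))
best_grid = min(grid, key=lambda params: validation_error(*params))

# Random search: draw the same number of candidates from the full ranges.
rng = random.Random(0)
samples = [(rng.uniform(0.01, 0.5), rng.randint(2, 10)) for _ in range(len(grid))]
best_random = min(samples, key=lambda params: validation_error(*params))

print("grid search best:  ", best_grid)
print("random search best:", best_random)
```

Grid search can only return values that were listed up front, whereas random search explores the whole range with the same evaluation budget. Bayesian optimization and evolutionary algorithms go one step further and use earlier evaluations to decide which candidates to try next.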

### TPOT <a class="anchor"  id="tpot"></a>
TPOT stands for Tree-based Pipeline Optimization Tool. It uses genetic programming (an evolutionary algorithm) for optimization.
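
The general idea of an evolutionary search can be sketched with a toy loop. This is a simplified illustration under stated assumptions, not TPOT's implementation: TPOT evolves whole pipelines, while this sketch only evolves two numbers, and the `fitness` function is a hypothetical replacement for a cross-validation score. The argument names `generations` and `population_size` were chosen to mirror the `TPOTRegressor` parameters used below.

```python
import random


def fitness(candidate):
    """Hypothetical score to maximise; a real run would use a CV score."""
    learning_rate, max_depth = candidate
    return -((learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2)


def mutate(candidate, rng):
    """Return a slightly perturbed copy of a candidate."""
    learning_rate, max_depth = candidate
    learning_rate = min(1.0, max(0.001, learning_rate + rng.gauss(0, 0.05)))
    max_depth = min(10, max(1, max_depth + rng.choice([-1, 0, 1])))
    return (learning_rate, max_depth)


def evolve(generations=5, population_size=50, seed=0):
    """Run a toy evolutionary search and track the best score per generation."""
    rng = random.Random(seed)
    population = [
        (rng.uniform(0.001, 1.0), rng.randint(1, 10))
        for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        # Selection: keep the better half of the population unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        offspring = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + offspring
        best_per_generation.append(max(fitness(c) for c in population))
    return max(population, key=fitness), best_per_generation


best, history = evolve()
print("best candidate:", best)
print("best fitness per generation:", history)
```

Because the survivors are carried over unchanged, the best score in this sketch never decreases from one generation to the next.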

The input data must be numerical only. Therefore, before running TPOT, missing values are imputed and categorical features are encoded.

In [5]:
# run TPOT:
tpot = TPOTRegressor(
    generations=5,
    population_size=50,
    cv=CV,
    scoring=neg_RMSLE_scorer,
    early_stop=3,
    verbosity=2,
    random_state=SEED,
    n_jobs=-1,
)
X_train_processed = preprocessing.fit_transform(X_train)
tpot.fit(X_train_processed, y_train)
tpot_best_model = tpot.fitted_pipeline_
tpot_best_model

                                                                             
Generation 1 - Current best internal CV score: -0.13356697090230968
Generation 2 - Current best internal CV score: -0.13356697090230968
Generation 3 - Current best internal CV score: -0.1332053119755611
Generation 4 - Current best internal CV score: -0.13114245280134335
Generation 5 - Current best internal CV score: -0.13114245280134335

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=6, min_child_weight=18, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.55, verbosity=0)


In [6]:
# measure performance:
train_error, test_error = measure_performance(
    tpot_best_model, X_train_processed, y_train
)
print(f"Train error: {train_error}; Test error: {test_error}")

Train error: -0.08749420601136189; Test error: -0.13114245280134335


In [7]:
# make test predictions:
X_test_processed = preprocessing.transform(X_test)
tpot_best_model.fit(X=X_train_processed, y=y_train)
pred_test = tpot_best_model.predict(X=X_test_processed)
prediction_df = pd.DataFrame({"Id": X_test.index, "SalePrice": pred_test})
prediction_df.to_csv(tpot_predictions_file, index=False)
prediction_df.head()

|   | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 125455.976562 |
| 1 | 1462 | 160246.015625 |
| 2 | 1463 | 181554.75 |
| 3 | 1464 | 183471.78125 |
| 4 | 1465 | 192670.5 |


### Auto-Sklearn <a class="anchor"  id="autosklearn"></a>

### HyperOpt <a class="anchor"  id="hyperopt"></a>

## Conclusion <a class="anchor"  id="conclusion"></a>