# AutoML tools: TPOT

In this notebook, we will explore how to use [**TPOT**](https://epistasislab.github.io/tpot/) to automatically optimize machine learning pipelines.

TPOT, which stands for **Tree-based Pipeline Optimization Tool**, is an open-source AutoML library in Python. It utilizes **genetic programming** to automate the process of feature engineering, model selection, and hyperparameter tuning. TPOT generates and evaluates a population of pipelines, evolving them over generations to identify the most effective combination of data preprocessing steps and machine learning models.
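
To make the generate, evaluate, evolve loop concrete, here is a toy sketch of an evolutionary search in plain Python. It is not TPOT's implementation. The "pipeline" is just a dictionary of two hypothetical settings, `toy_score` is a made-up fitness function standing in for cross-validated performance, and the search uses only mutation with elitist selection.

```python
import random

# Hypothetical search space standing in for pipeline choices.
SEARCH_SPACE = {
    "max_depth": list(range(1, 11)),
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.5],
}


def toy_score(candidate):
    """Made-up fitness (higher is better); peaks at max_depth=6, learning_rate=0.1."""
    return -abs(candidate["max_depth"] - 6) - 10 * abs(candidate["learning_rate"] - 0.1)


def random_candidate(rng):
    """Draw one random value for every setting in the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def mutate(candidate, rng):
    """Copy the candidate and resample one randomly chosen setting."""
    child = dict(candidate)
    name = rng.choice(list(SEARCH_SPACE))
    child[name] = rng.choice(SEARCH_SPACE[name])
    return child


def evolve(generations=5, population_size=10, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    best_per_generation = [max(toy_score(c) for c in population)]
    for _ in range(generations):
        # Keep the better half (elitism), then refill with mutated survivors.
        ranked = sorted(population, key=toy_score, reverse=True)
        survivors = ranked[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
        best_per_generation.append(max(toy_score(c) for c in population))
    return max(population, key=toy_score), best_per_generation


best, history = evolve()
print(best, history)
```

Because the best candidates always survive, the best score in `history` never gets worse from one generation to the next. TPOT applies the same idea to whole pipelines, scoring each candidate by cross-validation rather than with a fixed formula.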

Before we proceed, let's install the TPOT library.

In [0]:
!pip install -q tpot

In [0]:
# You only need to run this cell on Databricks, after installing the tpot package
dbutils.library.restartPython()

Let's start by importing the necessary libraries. We will load the Boston dataset for regression in the next section.

In [0]:
import numpy as np
import pandas as pd
import tpot
from tpot import TPOTRegressor, TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

pd.options.mode.chained_assignment = None

## Regression with TPOT

In [0]:
# Load the Boston housing dataset
boston_df = pd.read_csv("../../../../Data/Boston.csv")

# Features: columns 1 through 13; target: the last column
X = boston_df.iloc[:, 1:14]
y = boston_df.iloc[:, -1]

# Split the Boston data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

The TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. The TPOTRegressor will also search over the hyperparameters of all objects in the pipeline.

[TPOT Regressor provides various parameters](https://epistasislab.github.io/tpot/api/#regression) to control the optimization process, including:

* generations: The number of generations (iterations) for the genetic optimization process.
* population_size: The number of pipelines to maintain in each generation.
* max_time_mins: The maximum time (in minutes) that TPOT should run for optimization.
* scoring: The performance metric used to evaluate the pipelines (e.g., 'neg_mean_squared_error', 'r2', etc.).
* cv: The number of cross-validation folds to use during pipeline evaluation.
* verbosity: The level of verbosity for output during optimization (higher values provide more details).
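
These settings also determine how much work the search does. As far as the TPOT API documentation describes it, TPOT evaluates `population_size + generations * offspring_size` pipelines in total, with `offspring_size` defaulting to `population_size`. The sketch below does that arithmetic. `search_budget` is a hypothetical helper, not part of TPOT, and the default of 5 folds is an assumption based on the documented `cv` default.

```python
def search_budget(generations, population_size, offspring_size=None, cv=5):
    """Return (pipelines evaluated, model fits) for a TPOT-style search.

    Hypothetical helper: assumes every pipeline is scored with `cv`-fold
    cross-validation and that no time limit cuts the run short.
    """
    if offspring_size is None:
        # Assumed default: as many offspring as the population size.
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    return pipelines, pipelines * cv


pipelines, fits = search_budget(generations=5, population_size=50)
print(f"{pipelines} pipelines, about {fits} model fits")
```

With `generations=5` and `population_size=50`, that is 300 pipelines and roughly 1500 model fits, so even a small search takes a while. Setting `max_time_mins` puts a wall-clock cap on the run.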

Now, let's create an instance of TPOTRegressor and let it search for the best regression pipeline on the Boston dataset:

In [0]:
# Create a TPOTRegressor instance
tpot_reg = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)

# Fit TPOT on the training data for regression
tpot_reg.fit(X_train, y_train)

In [0]:
# Evaluate the best regression pipeline on the test set
best_pipeline_reg = tpot_reg.fitted_pipeline_
test_score_reg = best_pipeline_reg.score(X_test, y_test)
print(f'Test Set R^2 Score (Regression): {test_score_reg:.3f}')
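
For a regression pipeline, scikit-learn's `score` method returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`. As a quick check on what that number means, here is a minimal plain-Python sketch. The target values and predictions are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up target values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

print(f"R^2 = {r2_score_manual(y_true, y_pred):.3f}")
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the targets, and negative values mean it does worse than that baseline.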

## Classification with TPOT

*You may use the documentation as a guide:*

*http://epistasislab.github.io/tpot/api/*

In [0]:
# Task: Import titanic.csv dataset

titanic_df = pd.read_csv("../../../../Data/titanic.csv")

In [0]:
X = titanic_df[['Sex', 'Embarked', 'Pclass', 'Age']]
y = titanic_df['Survived']

In [0]:
# Task: split the dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [0]:
# We need to deal with the missing data and transform categorical attributes

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

numerical_transformer = Pipeline(steps=[("imputer", SimpleImputer())])

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, ["Sex", "Embarked"]),
        ("num", numerical_transformer, ["Age"])
    ],
    remainder="passthrough",
)

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
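
To see what the categorical branch does, here is a plain-Python sketch of constant imputation followed by one-hot encoding that ignores unseen categories. It mimics, on a single column, the behaviour of `SimpleImputer(strategy="constant", fill_value="missing")` and `OneHotEncoder(handle_unknown="ignore")`. The helper names and sample values are hypothetical.

```python
def impute_constant(values, fill_value="missing"):
    """Replace None entries with a constant placeholder category."""
    return [fill_value if v is None else v for v in values]


def fit_categories(values):
    """Learn the sorted list of categories seen during fitting."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 row; unseen categories become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]


train_embarked = impute_constant(["S", "C", None, "Q", "S"])
test_embarked = impute_constant(["Q", None, "X"])  # "X" never appears in training

# Categories are learned from the training column only.
categories = fit_categories(train_embarked)
print(categories)
print(one_hot(test_embarked, categories))
```

The categories come from the training data alone, which mirrors calling `fit_transform` on the training set and only `transform` on the test set. A value that was never seen during fitting gets a row of zeros instead of raising an error.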

In [0]:
# Task: Create a TPOTClassifier instance
tpot_clas = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)

# Task: Fit TPOT on the training data for classification
tpot_clas.fit(X_train, y_train)

In [0]:
# Task: Evaluate the best classification pipeline on the test set
best_pipeline_clas = tpot_clas.fitted_pipeline_
test_score_clas = best_pipeline_clas.score(X_test, y_test)
print(f'Test Set Accuracy (Classification): {test_score_clas*100:.2f}%')

In this notebook, we used TPOT, an AutoML library, to automate the optimization of machine learning pipelines for both regression and classification tasks. We applied TPOT to two datasets: the Boston dataset for regression and the Titanic dataset for classification.

TPOT took over the pipeline construction and hyperparameter tuning that we would otherwise do by hand. To judge how competitive the resulting pipelines are, compare the test scores above against a manually built baseline. TPOT is worth exploring further on other datasets and tasks.

To learn more about TPOT's features and options, we recommend reading its documentation.

**Documentation:**

http://epistasislab.github.io/tpot/



Happy automating!