# Parameter Optimization with Optuna

In this example, we train a regression model and optimize its hyperparameters using [Optuna](https://optuna.readthedocs.io/en/stable/).
It is adapted from the Optuna [Basic Concept example](https://optuna.readthedocs.io/en/stable/#basic-concepts).



In [1]:
# Setup temporary directory and initialize git and dvc
from zntrack import config

config.nb_name = "parameter_optimization.ipynb"

from zntrack.utils import cwd_temp_dir

temp_dir = cwd_temp_dir()

!git init
!dvc init

Initialized empty Git repository in /tmp/tmpexwfpu4k/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


## Workflow
Our workflow consists of four steps:
- Download the dataset
- Split it into train and test data
- Train a model on the train data (initially a RandomForest)
- Evaluate the model on the test data
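
The steps above form a small dependency graph. As a rough sketch (the step labels `download`, `split`, `train` and `evaluate` are hypothetical names chosen for this illustration, and the edges are inferred from the list), the execution order can be derived with the standard library:

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
# The edges are an assumption inferred from the step list above.
dependencies = {
    "download": set(),
    "split": {"download"},
    "train": {"split"},
    "evaluate": {"train", "split"},
}

# A topological sort yields an order in which every step runs after its inputs.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

DVC resolves a comparable order from the stage dependencies, so a stage only needs to be re-run when one of its inputs has changed.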

We want to compare two different models, RandomForest and LinearSVR, each with its own hyperparameters.
The `Evaluate` Node computes the mean squared error (MSE) on the test data, which serves as the objective value for Optuna.
We will use DVC [Experiments](https://dvc.org/doc/start/experiments) to track each run.
In combination with Optuna, this allows us not only to optimize the parameters but also easily store and access the trained models afterwards.
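
Both the mean squared error and its square root (RMSE) are error metrics, so lower values indicate a better model. A minimal sketch in plain Python, using hypothetical helper names and toy values that are not taken from the dataset, shows how the two relate:

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the targets."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Toy values for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4 + 0) / 4
rmse = root_mean_squared_error(y_true, y_pred)
print(mse, rmse)
```

Because the square root is monotonic, ranking models by MSE or by RMSE gives the same order.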


[![](https://mermaid.ink/img/pako:eNp1jz0PgjAQhv8KuVkG1InBCY2zsFmHCz2wSXslpdUYwn_3YpSwuF2e572vCVqvCUrorH-2dwwxayrFKrLg4qrg-ECbMJKC2w9vBZ99Gg33FUasKa7kTuQFWXt38oHGtdqLagIaboTXgzVruc3zwye0DPqCYmn_B3aKYQOOgkOj5ZVJcZYpiHdycnYppaYOk5V1imeJYoq-fnELZQyJNpAGLR9WBvuADsoO7UjzG6bTY5I?type=png)](https://mermaid.live/edit#pako:eNp1jz0PgjAQhv8KuVkG1InBCY2zsFmHCz2wSXslpdUYwn_3YpSwuF2e572vCVqvCUrorH-2dwwxayrFKrLg4qrg-ECbMJKC2w9vBZ99Gg33FUasKa7kTuQFWXt38oHGtdqLagIaboTXgzVruc3zwye0DPqCYmn_B3aKYQOOgkOj5ZVJcZYpiHdycnYppaYOk5V1imeJYoq-fnELZQyJNpAGLR9WBvuADsoO7UjzG6bTY5I)

In [2]:
import optuna
import sklearn
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
import sklearn.svm  # required for sklearn.svm.LinearSVR used below
import zntrack


class HousingDataSet(zntrack.Node):
    """Download and prepare the California housing dataset."""

    data = zntrack.dvc.outs("scikit_learn_data")

    def run(self) -> None:
        _ = sklearn.datasets.fetch_california_housing(
            data_home=self.data, return_X_y=True
        )

    @property
    def labels(self) -> dict:
        _, labels = sklearn.datasets.fetch_california_housing(
            data_home=self.data, return_X_y=True
        )
        return labels

    @property
    def features(self) -> dict:
        features, _ = sklearn.datasets.fetch_california_housing(
            data_home=self.data, return_X_y=True
        )
        return features


class TrainTestSplit(zntrack.Node):
    """Split the dataset into train and test sets."""

    labels = zntrack.zn.deps()
    features = zntrack.zn.deps()
    seed = zntrack.zn.params(1234)

    train_features = zntrack.zn.outs()
    test_features = zntrack.zn.outs()
    train_labels = zntrack.zn.outs()
    test_labels = zntrack.zn.outs()

    def run(self) -> None:
        self.train_features, self.test_features, self.train_labels, self.test_labels = (
            sklearn.model_selection.train_test_split(
                self.features, self.labels, test_size=0.2, random_state=self.seed
            )
        )


class RandomForest(zntrack.Node):
    """Train a random forest model."""

    train_features = zntrack.zn.deps()
    train_labels = zntrack.zn.deps()
    seed = zntrack.zn.params(1234)
    max_depth = zntrack.zn.params()

    model = zntrack.zn.outs()

    def run(self) -> None:
        self.model = sklearn.ensemble.RandomForestRegressor(
            random_state=self.seed, max_depth=self.max_depth
        )
        self.model.fit(self.train_features, self.train_labels)


class LinearSVR(zntrack.Node):
    """Train a linear support vector regression (SVR) model."""

    train_features = zntrack.zn.deps()
    train_labels = zntrack.zn.deps()
    C = zntrack.zn.params()

    model = zntrack.zn.outs()

    def run(self) -> None:
        self.model = sklearn.svm.LinearSVR(C=self.C)
        self.model.fit(self.train_features, self.train_labels)


class Evaluate(zntrack.Node):
    """Evaluate the model on a test set."""

    model = zntrack.zn.deps()
    test_features = zntrack.zn.deps()
    test_labels = zntrack.zn.deps()

    score = zntrack.zn.metrics()

    def run(self) -> None:
        prediction = self.model.predict(self.test_features)
        self.score = sklearn.metrics.mean_squared_error(self.test_labels, prediction)
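
The `seed` parameter of `TrainTestSplit` makes the split reproducible. The following sketch illustrates the idea without scikit-learn. `simple_split` is a hypothetical helper that does not reproduce the exact shuffling of `sklearn.model_selection.train_test_split`, and the data are toy values:

```python
import random


def simple_split(features, labels, test_size=0.2, seed=1234):
    """Shuffle indices with a seeded RNG and split them into train and test parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # same seed, same permutation
    n_test = round(test_size * len(indices))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Toy data: each label is ten times its feature, so alignment is easy to check.
features = list(range(10))
labels = [f * 10 for f in features]

train_f, test_f, train_l, test_l = simple_split(features, labels)
print(len(train_f), len(test_f))
```

Keeping the seed as a tracked parameter means that a changed split shows up as a parameter change rather than as unexplained variation in the score.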

We use the `zntrack.Project` to create our workflow as usual.
To use DVC Experiments, we need to create an initial commit.
Therefore, we run the project directly and make an initial git commit afterwards.

In [3]:
with zntrack.Project() as project:
    data = HousingDataSet()
    split = TrainTestSplit(labels=data.labels, features=data.features)
    model = RandomForest(
        train_features=split.train_features,
        train_labels=split.train_labels,
        max_depth=2,
        name="model",
    )
    evaluate = Evaluate(
        model=model.model,
        test_features=split.test_features,
        test_labels=split.test_labels,
    )

project.run()

Running DVC command: 'stage add --name HousingDataSet --force ...'
Jupyter support is an experimental feature! Please save your notebook before running this command!
Submit issues to https://github.com/zincware/ZnTrack.
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'repro'


In [4]:
RandomForest.from_rev(name="model").state

NodeStatus(loaded=True, results=<NodeStatusResults.UNKNOWN: 0>, remote=None, rev=None)

In [5]:
!git add .
!git commit -m "initial commit"

[main (root-commit) 9a1b65d] initial commit
 20 files changed, 1994 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
 create mode 100644 .gitignore
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 create mode 100644 nodes/Evaluate/score.json
 create mode 100644 nodes/TrainTestSplit/.gitignore
 create mode 100644 nodes/model/.gitignore
 create mode 100644 parameter_optimization.ipynb
 create mode 100644 params.yaml
 create mode 100644 src/Evaluate.py
 create mode 100644 src/HousingDataSet.py
 create mode 100644 src/RandomForest.py
 create mode 100644 src/TrainTestSplit.py
 create mode 100644 src/__pycache__/Evaluate.cpython-310.pyc
 create mode 100644 src/__pycache__/HousingDataSet.cpython-310.pyc
 create mode 100644 src/__pycache__/RandomForest.cpython-310.pyc
 create mode 100644 src/__pycache__/TrainTestSplit.cpython-310.pyc
 create mode 100644 zntrack.json


## Optimize

For Optuna, we need to define an objective function to optimize.
We use the `project.create_experiment` API from ZnTrack to change the model and its parameters, and return the score from the `Evaluate` stage as the metric to optimize.
To identify the experiments later, we name them according to the `trial.number` from Optuna.
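
The shape of such an objective, with a search space that depends on the chosen model, can be sketched without Optuna or ZnTrack. The example below uses plain random search as a stand-in for Optuna's sampler. `sample_params` and `toy_objective` are hypothetical helpers, and the toy objective only mimics an error value without training a model:

```python
import math
import random


def sample_params(rng):
    """Draw one configuration from a conditional search space."""
    regressor = rng.choice(["SVR", "RandomForest"])
    if regressor == "SVR":
        # Log-uniform sampling: draw the exponent uniformly, then exponentiate.
        exponent = rng.uniform(-10, 10)
        return {"regressor": regressor, "svr_c": 10.0**exponent}
    # Integer hyperparameter, both bounds inclusive.
    return {"regressor": regressor, "max_depth": rng.randint(2, 32)}


def toy_objective(params):
    """Hypothetical stand-in for 'train and evaluate'; returns an error to minimize."""
    if params["regressor"] == "SVR":
        return abs(math.log10(params["svr_c"])) + 1.0
    return abs(params["max_depth"] - 20) / 10.0


rng = random.Random(0)
trials = []
for number in range(10):
    params = sample_params(rng)
    trials.append((number, params, toy_objective(params)))

# The objective is an error, so the best trial is the one with the smallest value.
best_number, best_params, best_value = min(trials, key=lambda t: t[2])
print(f"exp-{best_number}", best_params, best_value)
```

Only the parameters of the selected model are sampled in each trial, which is why each trial reports either `svr_c` or `max_depth`, never both.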

In [6]:
def objective(trial):
    with project.create_experiment(queue=False, name=f"exp-{trial.number}") as exp:
        regressor_name = trial.suggest_categorical("classifier", ["SVR", "RandomForest"])

        # we need to replace the existing model on the graph with a new model.

        project.remove("model")

        if regressor_name == "SVR":
            svr_c = trial.suggest_float("svr_c", 1e-10, 1e10, log=True)
            model = LinearSVR(
                train_features=split.train_features,
                train_labels=split.train_labels,
                C=svr_c,
                name="model",
            )
        else:
            max_depth = trial.suggest_int("max_depth", 2, 32)
            model = RandomForest(
                train_features=split.train_features,
                train_labels=split.train_labels,
                max_depth=max_depth,
                name="model",
            )

        # need to let the evaluate node know which model to evaluate
        evaluate.model = model.model

    return exp[evaluate].score


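# Note: the score is an error metric, so direction="minimize" would select the
# model with the lowest error. The recorded output was produced with "maximize".
study = optuna.create_study(direction="maximize")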
study.optimize(objective, n_trials=10)

[I 2023-05-26 14:19:10,122] A new study created in memory with name: no-name-47e7cf69-b7c1-425e-acf9-396a86518b6a
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-0'
[I 2023-05-26 14:19:21,149] Trial 0 finished with value: 11.100880417561175 and parameters: {'classifier': 'SVR', 'svr_c': 12078.934744589767}. Best is trial 0 with value: 11.100880417561175.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-1'
[I 2023-05-26 14:19:31,962] Trial 1 finished with value: 526.9862708662988 and parameters: {'classifier': 'SVR', 'svr_c': 5051.58980837076}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-2'
[I 2023-05-26 14:19:47,429] Trial 2 finished with value: 0.2779238523214554 and parameters: {'classifier': 'RandomForest', 'max_depth': 12}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-3'
[I 2023-05-26 14:20:05,182] Trial 3 finished with value: 0.2627596918267919 and parameters: {'classifier': 'RandomForest', 'max_depth': 27}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-4'
[I 2023-05-26 14:20:21,223] Trial 4 finished with value: 0.2663470933213184 and parameters: {'classifier': 'RandomForest', 'max_depth': 15}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-5'
[I 2023-05-26 14:20:31,632] Trial 5 finished with value: 0.9854179960940935 and parameters: {'classifier': 'SVR', 'svr_c': 1.0818236715785393e-05}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-6'
[I 2023-05-26 14:20:44,455] Trial 6 finished with value: 0.47413665705821306 and parameters: {'classifier': 'RandomForest', 'max_depth': 5}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-7'
[I 2023-05-26 14:20:54,709] Trial 7 finished with value: 3.12196099352187 and parameters: {'classifier': 'SVR', 'svr_c': 1.0824013441742968e-10}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-8'
[I 2023-05-26 14:21:05,653] Trial 8 finished with value: 5.3304504202911795 and parameters: {'classifier': 'SVR', 'svr_c': 35.29665207527377}. Best is trial 1 with value: 526.9862708662988.
Running DVC command: 'stage add --name HousingDataSet --force ...'
Running DVC command: 'stage add --name TrainTestSplit --force ...'
Running DVC command: 'stage add --name model --force ...'
Running DVC command: 'stage add --name Evaluate --force ...'
Running DVC command: 'exp apply exp-9'
[I 2023-05-26 14:21:13,172] Trial 9 finished with value: 0.2627596918267919 and parameters: {'classifier': 'RandomForest', 'max_depth': 27}. Best is trial 1 with value: 526.9862708662988.

## Evaluate

We can now inspect the parameters of the best trial via `study.best_params`.
Because each trial was stored as a DVC experiment, we can also access the corresponding experiment directly by the name we assigned.

In [7]:
study.best_params

{'classifier': 'SVR', 'svr_c': 5051.58980837076}

In [8]:
project.experiments.keys()

dict_keys([None, 'exp-9', 'exp-8', 'exp-7', 'exp-6', 'exp-5', 'exp-4', 'exp-3', 'exp-2', 'exp-1', 'exp-0'])
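
Which experiment counts as the best one depends on the study direction. The sketch below copies the objective values from the trial log above and maps the selected trial to its `exp-{number}` name. `best_experiment` is a hypothetical helper, and on ties it simply returns the first matching trial:

```python
# Objective values copied from the trial log above, keyed by trial number.
trial_values = {
    0: 11.100880417561175,
    1: 526.9862708662988,
    2: 0.2779238523214554,
    3: 0.2627596918267919,
    4: 0.2663470933213184,
    5: 0.9854179960940935,
    6: 0.47413665705821306,
    7: 3.12196099352187,
    8: 5.3304504202911795,
    9: 0.2627596918267919,
}


def best_experiment(values, direction):
    """Return the experiment name of the best trial for the given direction."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction!r}")
    select = max if direction == "maximize" else min
    number = select(values, key=values.get)
    return f"exp-{number}"


print(best_experiment(trial_values, "maximize"))  # exp-1
print(best_experiment(trial_values, "minimize"))  # exp-3
```

With `"maximize"`, the trial with the largest error is selected, which matches the `study.best_params` output shown above. With `"minimize"`, a RandomForest trial with `max_depth` 27 would be selected instead.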

In [9]:
exp = project.experiments[f"exp-{study.best_trial.number}"]
best_model = exp["model"]
best_model

NodeNotAvailableError: Node model is not available.

In [None]:
# we load split data into memory to compute the score.
split.load()

best_score = evaluate.from_rev(rev=f"exp-{study.best_trial.number}").score
initial_score = evaluate.from_rev(rev="HEAD").score
print(f"Best score: {best_score:.3f} compared to initial score: {initial_score:.3f}")

Best score: 4.757 compared to initial score: 0.750


temp_dir.cleanup()