# Parameter Optimization with Optuna

In this example we train a RandomForest model and optimize its parameters using [Optuna](https://optuna.readthedocs.io/en/stable/).
It is adapted from the Optuna [Basic Concept example](https://optuna.readthedocs.io/en/stable/#basic-concepts).



In [None]:
# Setup temporary directory and initialize git and dvc
from zntrack import config

config.nb_name = "parameter_optimization.ipynb"

from zntrack.utils import cwd_temp_dir

temp_dir = cwd_temp_dir()

!git init
!dvc init

## Workflow
Our Workflow consists of multiple steps:
- Download the dataset
- Split into train / test data
- Train a RandomForest model on the train data
- Evaluate the model on the test data

We want to optimize over two different models, RandomForest and LinearSVR, each with its own hyperparameters.
The `Evaluate` Node computes the mean squared error (MSE) on the test data, which is the value Optuna optimizes.
We will use DVC [Experiments](https://dvc.org/doc/start/experiments) to track each run.
In combination with Optuna, this allows us not only to optimize the parameters but also easily store and access the trained models afterwards.
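
Stripped of ZnTrack and DVC, the steps listed above reduce to a few scikit-learn calls. The following is only a minimal sketch: it assumes a synthetic dataset from `make_regression` as a stand-in for the housing data (to avoid a download), and the variable names and `n_estimators=20` are illustrative choices rather than part of the workflow below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic regression data stands in for the housing dataset.
features, labels = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=1234
)

# Split into train / test data (80 % / 20 %).
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=1234
)

# Train a RandomForest model on the train data.
model = RandomForestRegressor(max_depth=2, n_estimators=20, random_state=1234)
model.fit(train_x, train_y)

# Evaluate the model on the test data.
mse = mean_squared_error(test_y, model.predict(test_x))
rmse = float(np.sqrt(mse))  # the RMSE is the square root of the MSE
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```

Each of these calls becomes its own Node in the workflow, so that DVC can cache and track the intermediate results separately.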


[![](https://mermaid.ink/img/pako:eNp1j7sOgkAQRX-FTC0FYEVhhYmNFXSuxQQG2GQfZJnVGMK_OzFKaKxmcu6981ig9R1BCb3xz3bEwElTKafYCc5uCs4PNBGZFNx_OBd88XHWbqiQsSbeiYWIV6lmx47CmoDaNTRzPRm9D-RpevqYtglfkG3xf6BQDg5gKVjUnfywKJckCngkK_eW0nbUYzSyTrlVrBjZ1y_XQskh0gHi1MlrlcYhoIWyRzPT-gaiDmCv?type=png)](https://mermaid.live/edit#pako:eNp1j7sOgkAQRX-FTC0FYEVhhYmNFXSuxQQG2GQfZJnVGMK_OzFKaKxmcu6981ig9R1BCb3xz3bEwElTKafYCc5uCs4PNBGZFNx_OBd88XHWbqiQsSbeiYWIV6lmx47CmoDaNTRzPRm9D-RpevqYtglfkG3xf6BQDg5gKVjUnfywKJckCngkK_eW0nbUYzSyTrlVrBjZ1y_XQskh0gHi1MlrlcYhoIWyRzPT-gaiDmCv)

In [None]:
import optuna
import sklearn
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
import sklearn.svm  # required for sklearn.svm.LinearSVR below
import zntrack


class HousingDataSet(zntrack.Node):
    """Download and prepare the California housing dataset."""

    data = zntrack.dvc.outs("scikit_learn_data")

    def run(self) -> None:
        _ = sklearn.datasets.fetch_california_housing(
            data_home=self.data, return_X_y=True
        )

    @property
    def labels(self) -> dict:
        _, labels = sklearn.datasets.fetch_california_housing(
            data_home=self.data, return_X_y=True
        )
        return labels

    @property
    def features(self) -> dict:
        features, _ = sklearn.datasets.fetch_california_housing(
            data_home=self.data, return_X_y=True
        )
        return features


class TrainTestSplit(zntrack.Node):
    """Split the dataset into train and test sets."""

    labels = zntrack.zn.deps()
    features = zntrack.zn.deps()
    seed = zntrack.zn.params(1234)

    train_features = zntrack.zn.outs()
    test_features = zntrack.zn.outs()
    train_labels = zntrack.zn.outs()
    test_labels = zntrack.zn.outs()

    def run(self) -> None:
        self.train_features, self.test_features, self.train_labels, self.test_labels = (
            sklearn.model_selection.train_test_split(
                self.features, self.labels, test_size=0.2, random_state=self.seed
            )
        )


class RandomForest(zntrack.Node):
    """Train a random forest model."""

    train_features = zntrack.zn.deps()
    train_labels = zntrack.zn.deps()
    seed = zntrack.zn.params(1234)
    max_depth = zntrack.zn.params()

    model = zntrack.zn.outs()

    def run(self) -> None:
        self.model = sklearn.ensemble.RandomForestRegressor(
            random_state=self.seed, max_depth=self.max_depth
        )
        self.model.fit(self.train_features, self.train_labels)


class LinearSVR(zntrack.Node):
    """Train a SVR model."""

    train_features = zntrack.zn.deps()
    train_labels = zntrack.zn.deps()
    C = zntrack.zn.params()

    model = zntrack.zn.outs()

    def run(self) -> None:
        self.model = sklearn.svm.LinearSVR(C=self.C)
        self.model.fit(self.train_features, self.train_labels)


class Evaluate(zntrack.Node):
    """Evaluate the model on a test set."""

    model = zntrack.zn.deps()
    test_features = zntrack.zn.deps()
    test_labels = zntrack.zn.deps()

    score = zntrack.zn.metrics()

    def run(self) -> None:
        prediction = self.model.predict(self.test_features)
        self.score = sklearn.metrics.mean_squared_error(self.test_labels, prediction)

We use the `zntrack.Project` to create our workflow as usual.
To use DVC Experiments, we need to create an initial commit.
Therefore, we run the project directly and make an initial git commit afterwards.

In [None]:
with zntrack.Project() as project:
    data = HousingDataSet()
    split = TrainTestSplit(labels=data.labels, features=data.features)
    model = RandomForest(
        train_features=split.train_features,
        train_labels=split.train_labels,
        max_depth=2,
        name="model",
    )
    evaluate = Evaluate(
        model=model.model,
        test_features=split.test_features,
        test_labels=split.test_labels,
    )

project.run()

In [None]:
RandomForest.from_rev(name="model").state

In [None]:
!git add .
!git commit -m "initial commit"

## Optimize

For Optuna we need to define an objective we want to optimize.
We use the `project.create_experiment` API from ZnTrack to change the model parameter and return the score from the `Evaluate` stage as final metric to optimize.
To identify the experiments later, we name them according to the `trial.number` from Optuna.

In [None]:
def objective(trial):
    with project.create_experiment(queue=False, name=f"exp-{trial.number}") as exp:
        regressor_name = trial.suggest_categorical("regressor", ["SVR", "RandomForest"])

        # we need to replace the existing model on the graph with a new model.

        project.remove("model")

        if regressor_name == "SVR":
            svr_c = trial.suggest_float("svr_c", 1e-10, 1e10, log=True)
            model = LinearSVR(
                train_features=split.train_features,
                train_labels=split.train_labels,
                C=svr_c,
                name="model",
            )
        else:
            max_depth = trial.suggest_int("max_depth", 2, 32)
            model = RandomForest(
                train_features=split.train_features,
                train_labels=split.train_labels,
                max_depth=max_depth,
                name="model",
            )

        # need to let the evaluate node know which model to evaluate
        evaluate.model = model.model

    return exp[evaluate].score


# The score is an error metric, so lower is better and the study must minimize it.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=3)

## Evaluate

We can now inspect the best parameters via `study.best_params`.
Additionally, because we used DVC experiments, we can directly access the experiment with the best parameters by the name we gave it.

In [None]:
study.best_params

In [None]:
project.experiments.keys()

We can load the Node either via the experiment or by its name using `zntrack.from_rev()`.
The node should not be loaded via `model.load()`, because the `model` instance could be a `RandomForest` while the best model is a `LinearSVR`, or *vice versa*.

In [None]:
exp = project.experiments[f"exp-{study.best_trial.number}"]
best_model = exp["model"]

In [None]:
f"exp-{study.best_trial.number}"

In [None]:
# we load split data into memory to compute the score.
split.load()

best_score = evaluate.from_rev(rev=f"exp-{study.best_trial.number}").score
initial_score = evaluate.from_rev(rev="HEAD").score
print(f"Best score: {best_score:.3f} compared to initial score: {initial_score:.3f}")

temp_dir.cleanup()