<a href="https://colab.research.google.com/github/williamtbarker/ML4Molecules/blob/main/09_Hyperparameter_Tuning_complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Optimal hyperparameters

The model weights are the parameters of an ML model; they are learned from the data during training. The arguments used to configure the model (for example, the number of hidden layers or the type of kernel) are set before training and influence which parameters are learned. These arguments are called hyperparameters.

Hyperparameters should be optimized too. Here, we look at one such example using the QM9 dataset and a support vector regression (SVR) model.
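To make the distinction concrete, here is a minimal sketch using a toy one-weight ridge regression rather than SVR. The function `fit_ridge_1d` and the data are hypothetical and only for illustration: `alpha` plays the role of a hyperparameter (chosen before fitting), while `w` is a parameter (computed from the data).

```python
# Toy one-parameter ridge regression without intercept: y ≈ w * x.
# `alpha` is a hyperparameter: it is chosen before fitting.
# `w` is a model parameter: it is computed from the data during fitting.

def fit_ridge_1d(xs, ys, alpha):
    """Return the weight w that minimizes sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x

# The same data gives a different learned weight for each hyperparameter value.
for alpha in (0.0, 1.0, 10.0):
    w = fit_ridge_1d(xs, ys, alpha)
    print(f"alpha={alpha}: learned w = {w:.3f}")
```

With `alpha=0.0` the fit recovers `w = 2.0`; larger values of `alpha` shrink the weight, which shows how a hyperparameter changes the parameters the model ends up with.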

In [1]:
# install rdkit and deepchem
! pip install rdkit
! pip install deepchem

# install Fast-ML
! pip install fast_ml

Collecting rdkit
  Downloading rdkit-2023.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2023.9.4
Collecting deepchem
  Downloading deepchem-2.7.1-py3-none-any.whl (693 kB)
Collecting scipy<1.9 (from deepchem)
  Downloading scipy-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
Installing collected packages: scipy, deepchem
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.4
    Uninstalling scipy-1.11.4:
      Successfully uninstalled scipy-1.11.4
ERROR: pip's dependency resolver doe

In [2]:
# import the pandas library
import pandas as pd

# load the QM9 CSV from the URL into a dataframe
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# we will use a random 5% of the dataset for this demo
dataset = df[["smiles","gap"]].sample(frac=0.05)

# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we will set the radius=2, size=100 as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)

# apply to the dataset
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)

# the fp is a multi-dimensional array, but we want a plain list for training
dataset["fp"] = dataset["fp"].apply(lambda x: list(x[0]))


# import the function to split into train-valid-test
from fast_ml.model_development import train_valid_test_split

# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, \
X_test, y_test = train_valid_test_split(dataset[["fp","gap"]], target = "gap", train_size=0.8,
                                        valid_size=0.1, test_size=0.1)

Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead
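The 0.8:0.1:0.1 split matters for tuning: the training set fits the model, the validation set scores each hyperparameter choice, and the test set stays untouched. The sketch below only illustrates the proportions on plain indices; `split_indices` is a hypothetical helper and not how `fast_ml` implements `train_valid_test_split`.

```python
import random

def split_indices(n, train=0.8, valid=0.1, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/valid/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_valid = int(n * valid)
    # whatever remains after train and valid becomes the test set
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 800 100 100
```

The three index lists are disjoint, so no sample used to score a hyperparameter choice is also used to fit the model.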


## Hyperparameter tuning

Several Python packages automate this search. Here, we use [optuna](https://optuna.org/).

In [3]:
# install optuna
!pip install optuna

Collecting optuna
  Downloading optuna-3.5.0-py3-none-any.whl (413 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.0-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.13.1 colorlog-6.8.0 optuna-3.5.0


Optuna works with an objective function, which can be either minimized or maximized. Here, we will maximize the R<sup>2</sup> score on the validation set. The workflow in pseudocode:

1. import the libraries

2. define the objective function - this should train the model with the chosen hyperparameters and return the score

3. create the study object

4. optimize
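Conceptually, a study repeats three actions: propose hyperparameters, evaluate the objective, and remember the best trial. The sketch below shows that loop with plain random search over the same search space. It is a simplified stand-in, not Optuna's implementation (Optuna's default sampler is more sophisticated), and `toy_objective` and `random_search` are hypothetical names used only for illustration.

```python
import random

def toy_objective(params):
    """Hypothetical score to maximize; highest for C = 0.3 with the 'rbf' kernel."""
    bonus = 0.5 if params["kernel"] == "rbf" else 0.0
    return bonus - (params["C"] - 0.3) ** 2

def random_search(objective, n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        # propose a set of hyperparameters from the search space
        params = {
            "kernel": rng.choice(["rbf", "linear", "poly", "sigmoid"]),
            "C": rng.uniform(0.1, 1.0),
        }
        # evaluate the objective for this proposal
        value = objective(params)
        # keep the proposal if it beats the best value seen so far
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value

best_params, best_value = random_search(toy_objective, n_trials=50)
print(best_params, best_value)
```

In the real workflow, the objective trains a model and returns its validation score instead of evaluating a closed-form expression.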



In [4]:
# import the model class and optuna
from sklearn.svm import SVR
import optuna

# create the objective function (train one model and return its score)
def objective(trial):
  # we will have kernel, and C as the hyperparameters
  kernel = trial.suggest_categorical("kernel",["rbf","linear","poly","sigmoid"])
  C = trial.suggest_float("C",0.1,1)

  # create the model and fit
  svr = SVR(kernel=kernel, C=C)
  model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

  # compute the R^2 score on the validation set
  score = model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

  return score
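For scikit-learn regressors, `score` returns the coefficient of determination R<sup>2</sup>, defined as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal pure-Python sketch of that definition follows; `r2_score` here is a hand-written illustration and the data is made up.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score(y_true, y_true)               # predictions match exactly
mean_only = r2_score(y_true, [2.5] * 4)          # always predict the mean
reversed_ = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than predicting the mean

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

The best possible value is 1, a model that always predicts the mean scores 0, and there is no lower bound, so a poorly suited hyperparameter choice can produce a large negative score.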

Let's start the optimization.

In [5]:
# create the study object
study = optuna.create_study(direction='maximize')

# run optimization
study.optimize(objective, n_trials=10)

[I 2024-01-09 17:22:55,895] A new study created in memory with name: no-name-8532b5de-eb30-4159-89fc-6f2382e1fa8d
[I 2024-01-09 17:22:55,979] Trial 0 finished with value: 0.17453615647587895 and parameters: {'kernel': 'poly', 'C': 0.2354012397479017}. Best is trial 0 with value: 0.17453615647587895.
[I 2024-01-09 17:23:00,531] Trial 1 finished with value: -297.5149984579547 and parameters: {'kernel': 'sigmoid', 'C': 0.2963463332250755}. Best is trial 0 with value: 0.17453615647587895.
[I 2024-01-09 17:23:04,943] Trial 2 finished with value: -3088.154417039526 and parameters: {'kernel': 'sigmoid', 'C': 0.9508891707312666}. Best is trial 0 with value: 0.17453615647587895.
[I 2024-01-09 17:23:04,998] Trial 3 finished with value: 0.2410063671033833 and parameters: {'kernel': 'rbf', 'C': 0.927505709341099}. Best is trial 3 with value: 0.2410063671033833.
[I 2024-01-09 17:23:05,142] Trial 4 finished with value: 0.4099380073024903 and parameters: {'kernel': 'sigmoid', 'C': 0.19682713707536031

Getting the best score and the corresponding hyperparameters

In [6]:
study.best_value

0.4099380073024903

In [7]:
study.best_trial.params

{'kernel': 'sigmoid', 'C': 0.19682713707536031}
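A common follow-up, not shown in the notebook, is to build a final model from the best hyperparameters and score it once on the held-out test set. The sketch below assumes synthetic data from `make_regression` as a stand-in for the fingerprints so that it runs on its own; in the notebook, `study.best_trial.params` and the `X_test`/`y_test` split would take the place of the hard-coded dictionary and the synthetic arrays.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: replaces the QM9 fingerprints and gaps)
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
X_fit, X_test, y_fit, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Best hyperparameters as reported by the study above
best_params = {"kernel": "sigmoid", "C": 0.19682713707536031}

# Build the final model by unpacking the dictionary into the constructor
final_model = SVR(**best_params).fit(X_fit, y_fit)

# R^2 on data that played no role in choosing the hyperparameters
test_score = final_model.score(X_test, y_test)
print(f"Test R^2: {test_score:.3f}")
```

Because the validation score was used to pick the hyperparameters, it tends to be optimistic; the test score gives a less biased estimate of performance.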