<a href="https://colab.research.google.com/github/vinayak2019/ml_for_molecules/blob/main/Hyperparameter_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Optimal hyperparameters

A model's weights are its parameters: they are learned from the data during training. The arguments used to configure the model (for example, the number of hidden layers or the type of kernel) are set before training and influence which parameter values are learned. These arguments are called hyperparameters.

Hyperparameters should be optimized as well. Here, we look at one such example using the QM9 dataset and a support vector regression (SVR) model.
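
To make the distinction between parameters and hyperparameters concrete, here is a minimal sketch using a toy one-dimensional ridge regression. The function `fit_slope`, the penalty `alpha`, and the data are illustrative assumptions and are not part of the QM9 example. The penalty `alpha` is chosen by us, while the slope `w` is learned from the data.

```python
# Toy 1-D ridge regression without an intercept: y ≈ w * x.
# alpha is a hyperparameter (chosen by us before training);
# w is a parameter (learned from the data).

def fit_slope(xs, ys, alpha):
    """Return the slope w that minimizes sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # made-up data following y = 2x

# the same data gives a different learned slope for each alpha
for alpha in (0.0, 1.0, 10.0):
    w = fit_slope(xs, ys, alpha)
    print(f"alpha={alpha:>4}: learned w={w:.3f}")
```

Changing `alpha` changes the learned `w` even though the data stay the same, which is why hyperparameters need to be tuned separately from training.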

In [None]:
# install rdkit and deepchem
! pip install rdkit
! pip install deepchem

# install Fast-ML
! pip install fast_ml

In [None]:
# import the pandas library
import pandas as pd

# load the dataframe as CSV from URL. 
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# we will use 5% of the dataset for the demo
dataset = df[["smiles","gap"]].sample(frac=0.05)

# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we will set the radius=2, size=100 as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)

# apply to the dataset
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)

# the fp is a multi-dimensional array, but we want a list for training
dataset["fp"] = dataset["fp"].apply(lambda x: list(x[0]))


# import the function to split into train-valid-test
from fast_ml.model_development import train_valid_test_split

# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, \
X_test, y_test = train_valid_test_split(dataset[["fp","gap"]], target = "gap", train_size=0.8,
                                        valid_size=0.1, test_size=0.1) 
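
For intuition, a random 0.8:0.1:0.1 split can be sketched in plain Python as below. This is only an illustration under stated assumptions, not fast_ml's implementation; `split_indices` and its `seed` argument are hypothetical names.

```python
import random

def split_indices(n, train_size=0.8, valid_size=0.1, seed=0):
    """Shuffle range(n) and cut it into train/valid/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_size)
    n_valid = round(n * valid_size)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(100)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index lists are disjoint, so no molecule used for training is also used for validation or testing.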

## Hyperparameter tuning

Several Python packages automate this search. Here, we use [Optuna](https://optuna.org/).

In [None]:
# install optuna
!pip install optuna

Optuna works with an objective function whose return value is either minimized or maximized. Here, we will maximize the R<sup>2</sup> score on the validation set. The pseudocode is below.

1. Import the libraries.
2. Define the objective function. It should train the model with the chosen hyperparameters and return the score.
3. Create the study object.
4. Optimize.



In [None]:
# import the model class and optuna
from sklearn.svm import SVR
import optuna

# create the objective (essentially training and scoring)
def objective(trial):
  # kernel and C are the hyperparameters to tune
  kernel = trial.suggest_categorical("kernel",["rbf","linear","poly","sigmoid"])
  C = trial.suggest_float("C",0.1,1)

  # create the model and fit
  svr = SVR(kernel=kernel, C=C)
  model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

  # compute the R^2 score on the validation set
  score = model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

  return score
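
The `score` method of a scikit-learn regressor returns the coefficient of determination, R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub>. A minimal plain-Python sketch of that formula is shown below; the helper `r_squared` and the target values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]  # made-up targets

print(r_squared(y_true, y_true))                # perfect predictions
print(r_squared(y_true, [2.5] * 4))             # always predicting the mean
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than that scores below 0. This is why the study is set up to maximize the score.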

Let's start the optimization.

In [None]:
# create the study object
study = optuna.create_study(direction='maximize')

# run optimization
study.optimize(objective, n_trials=10)
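
Conceptually, each trial proposes hyperparameters, evaluates the objective, and keeps track of the best result so far. The sketch below shows that loop with plain random search. Note that this is a swapped-in, simpler technique: Optuna's default sampler (TPE) uses earlier trials to guide later proposals. The names `random_search` and `toy_score` are hypothetical, and `toy_score` is a made-up stand-in for training the SVR.

```python
import random

def random_search(objective, n_trials, seed=0):
    """Maximize objective(params) by sampling params at random."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        # propose a candidate from the same search space as above
        params = {
            "kernel": rng.choice(["rbf", "linear", "poly", "sigmoid"]),
            "C": rng.uniform(0.1, 1.0),
        }
        value = objective(params)
        # keep the candidate only if it beats the best so far
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value

def toy_score(params):
    # made-up stand-in for "train the SVR and return the validation R^2"
    bonus = 0.1 if params["kernel"] == "rbf" else 0.0
    return bonus - (params["C"] - 0.5) ** 2

found_params, found_value = random_search(toy_score, n_trials=10)
print(found_params, found_value)
```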

Get the best validation score and the corresponding hyperparameters.

In [None]:
study.best_value

In [None]:
study.best_trial.params
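
The notebook sets aside a test split, but the cells above only use the training and validation splits. A common follow-up is to refit the model with the best hyperparameters and score it once on the held-out test data. The sketch below is self-contained and therefore uses synthetic data and a placeholder `best_params` dictionary; both are assumptions, and in the notebook one would use `study.best_trial.params` together with the real splits.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# synthetic stand-in for the fingerprint features and gap values
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# hypothetical placeholder; in the notebook this would be study.best_trial.params
best_params = {"kernel": "linear", "C": 0.5}

# refit with the chosen hyperparameters and score once on the held-out test set
final_model = SVR(**best_params).fit(X_tr, y_tr)
test_r2 = final_model.score(X_te, y_te)
print(f"test R^2: {test_r2:.3f}")
```

Because the validation set was used to pick the hyperparameters, the test score gives a less biased estimate of how the tuned model performs on unseen data.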