### Training classical ML models

In the QM9 dataset, every entry has a HOMO-LUMO gap, which is a continuous value. We therefore treat the prediction as a supervised learning problem with a regression task.

Classical ML models include linear models, support vector machines, decision trees, etc. A list of the algorithms available in the ``scikit-learn`` package can be found [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).

Here, we will train some of those ML models to predict the HOMO-LUMO gap.

In [None]:
# import the pandas library
import pandas as pd

# load the CSV from the URL into a dataframe
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# we will use 20% of the dataset for demo
dataset = df[["smiles","gap"]].sample(frac=0.2)

### Molecular Representation

We will use molecular fingerprints as the representation of the molecules. The fingerprints are computed with the featurizer from DeepChem.

In [None]:
# install rdkit and deepchem
! pip install rdkit
! pip install deepchem

In [None]:
# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we will set the radius=2, size=100 as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)

# apply to the dataset
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)

# the fp is a multi-dimensional array, but we want a flat list for training
dataset["fp"] = dataset["fp"].apply(lambda x: list(x[0]))

We will split the dataset randomly using Fast-ML.

In [None]:
# install Fast-ML
! pip install fast_ml

In [None]:
# import the function to split into train-valid-test
from fast_ml.model_development import train_valid_test_split

# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, \
X_test, y_test = train_valid_test_split(dataset[["fp","gap"]], target = "gap", train_size=0.8,
                                        valid_size=0.1, test_size=0.1) 

# look at the dataset
X_train
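To make the 0.8:0.1:0.1 ratio concrete, here is a minimal sketch of a random three-way split written in plain Python. The function name `train_valid_test_indices` and the `seed` argument are illustrative assumptions; this is not necessarily how Fast-ML implements the split internally.

```python
import random

def train_valid_test_indices(n_rows, train_size=0.8, valid_size=0.1, seed=0):
    """Shuffle row indices and cut them into train/valid/test parts.

    Hypothetical helper for illustration only; the test part receives
    whatever remains after the train and valid parts are taken.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility

    n_train = round(n_rows * train_size)
    n_valid = round(n_rows * valid_size)

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

# split 1000 hypothetical rows
train_idx, valid_idx, test_idx = train_valid_test_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index lists do not overlap, so no molecule is used for both training and evaluation.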

### Linear regression model
We see that the new X dataframes contain a column with the fingerprint. We will use it as the input for training the ML models.

Let us begin with the ``LinearRegression`` model, which implements the ordinary least squares method. You can find more details [here](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).
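As a short reminder, ordinary least squares chooses the weights $w$ and intercept $b$ that minimize the sum of squared residuals. The symbols here are our own notation: $x_i$ is the fingerprint vector of molecule $i$, $y_i$ is its HOMO-LUMO gap, and $n$ is the number of training molecules.

$$
\min_{w,\,b} \; \sum_{i=1}^{n} \left( y_i - w^{\top} x_i - b \right)^2
$$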

In [None]:
# import the model
from sklearn.linear_model import LinearRegression

# create the model object
lr = LinearRegression()

# fit the model with x=fp and y=gap
model = lr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

To check the accuracy of the linear fit, we can use the validation dataset. The ``score`` function computes the R<sup>2</sup> value (the coefficient of determination). The closer R<sup>2</sup> is to 1, the better the fit.

In [None]:
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())
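To see what the R<sup>2</sup> value measures, here is a minimal sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2_score_manual` and the toy values are assumptions made for illustration; they are not taken from QM9.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# made-up toy values, not taken from QM9
y_true = [0.20, 0.25, 0.30, 0.35]

perfect = r2_score_manual(y_true, y_true)         # predictions match exactly
baseline = r2_score_manual(y_true, [0.275] * 4)   # always predict the mean

print(perfect, baseline)
```

A perfect prediction gives an R<sup>2</sup> of 1, always predicting the mean gives a value of about 0, and a model that does worse than the mean gives a negative value.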

Let the model predict the first 10 values from the test dataset.

In [None]:
model.predict(X_test["fp"].values.tolist())[:10]

The corresponding HOMO-LUMO gaps in the test dataset are:

In [None]:
y_test.values[:10]
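Comparing two arrays by eye is hard, so one option is to summarize the difference with the mean absolute error (MAE). The sketch below uses a hypothetical helper, `mean_absolute_error_manual`, and made-up numbers in place of the real predictions; in the notebook you would pass the predicted and reference gaps instead.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# made-up toy values, not taken from QM9
reference = [0.20, 0.25, 0.30]
predicted = [0.22, 0.24, 0.33]

mae = mean_absolute_error_manual(reference, predicted)
print(mae)
```

The MAE has the same units as the ``gap`` column, which makes it easier to interpret than R<sup>2</sup> alone.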

### Support vector machine regression (SVR) model

The code changes very little: we use ``SVR`` instead of ``LinearRegression``.

In [None]:
# import the model class
from sklearn.svm import SVR

# create the model object
svr = SVR()

# fit the model with x=fp and y=gap
model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

Again, we compute the R<sup>2</sup> on the validation dataset:

In [None]:
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

The R<sup>2</sup> is low with SVR. We can change the model parameters to see if we get any improvement. The model parameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR).

We will change the *kernel* to *linear* and see if that helps. The default is *rbf*.

In [None]:
svr = SVR(kernel="linear")
model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())
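To illustrate the difference between the two kernels, here is a minimal sketch of the linear and RBF kernel functions in plain Python. The fingerprints `fp_a` and `fp_b` are made-up 5-bit vectors, and `gamma=0.1` is an arbitrary value chosen for the example; it is not the value scikit-learn would pick by default.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# made-up 5-bit fingerprints for illustration
fp_a = [1, 0, 1, 1, 0]
fp_b = [1, 1, 0, 1, 0]

lin_ab = linear_kernel(fp_a, fp_b)   # counts the bits set in both vectors
rbf_aa = rbf_kernel(fp_a, fp_a)      # identical vectors give the maximum, 1.0
rbf_ab = rbf_kernel(fp_a, fp_b)      # decays with the number of differing bits

print(lin_ab, rbf_aa, rbf_ab)
```

For binary fingerprints, the linear kernel counts shared bits, while the RBF kernel depends only on how many bits differ and on the choice of ``gamma``.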