<a href="https://colab.research.google.com/github/henrymoss/BOSS/blob/master/Molecule_prediction_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Molecule Prediction Demo with String Kernels on a GPU

## This notebook is designed to be run on Google Colab

Demonstration of GPU support for the string subsequence kernel (SSK). Remember to turn on the Colab GPU before running any cells (Runtime > Change runtime type).

We fit our string kernel to approximately 600 strings, each at most 85 characters long.
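
To give a feel for what a gap-weighted subsequence kernel measures, here is a minimal brute-force sketch for a single subsequence length. It is an illustration under assumptions, not the `Batch_SSK` implementation: the function name `gap_weighted_kernel` is hypothetical, and the exact weighting, the sum over several subsequence lengths, and any normalization used in BOSS may differ. Each shared subsequence contributes a weight that shrinks with `match_decay` per matched token and with `gap_decay` per skipped position.

```python
from itertools import combinations


def gap_weighted_kernel(s, t, order, match_decay, gap_decay):
    """Brute-force gap-weighted subsequence kernel for one subsequence length.

    Hypothetical helper for illustration only; it enumerates every index
    combination, so it is only practical for very short token lists.
    """
    def weighted_subsequences(tokens):
        # Map each subsequence (tuple of tokens) to its accumulated weight.
        weights = {}
        for idx in combinations(range(len(tokens)), order):
            # Number of positions skipped between the first and last match.
            gaps = idx[-1] - idx[0] + 1 - order
            sub = tuple(tokens[i] for i in idx)
            w = match_decay ** order * gap_decay ** gaps
            weights[sub] = weights.get(sub, 0.0) + w
        return weights

    ws = weighted_subsequences(s)
    wt = weighted_subsequences(t)
    # Inner product over the subsequences that occur in both strings.
    return sum(w * wt[u] for u, w in ws.items() if u in wt)


# Space-separated tokens, in the same format the notebook feeds to the kernel.
s = "C C x".split()
t = "C x".split()
k_value = gap_weighted_kernel(s, t, order=2, match_decay=0.5, gap_decay=0.5)
print(k_value)
```

The cost of this enumeration grows combinatorially with string length, which is why practical implementations rely on dynamic programming and, as in this demo, batching on a GPU.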

In [1]:
!git clone https://github.com/henrymoss/BOSS
!pip install gpflow

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from time import time
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import gpflow
from gpflow.mean_functions import Constant
from gpflow import set_trainable
from gpflow.utilities import positive
from BOSS.boss.code.GPflow_wrappers.Batch_SSK import Batch_SSK

Cloning into 'BOSS'...
remote: Enumerating objects: 662, done.
remote: Counting objects: 100% (662/662), done.
remote: Compressing objects: 100% (427/427), done.
remote: Total 662 (delta 402), reused 485 (delta 225), pack-reused 0
Receiving objects: 100% (662/662), 7.67 MiB | 2.33 MiB/s, done.
Resolving deltas: 100% (402/402), done.
Collecting gpflow
  Downloading https://files.pythonhosted.org/packages/dd/21/63557b5ba63e3b8c9ca6a82989e5b1b91ac41a5d593482a4c5ce8360b0e6/gpflow-2.1.2-py3-none-any.whl (253kB)
Collecting multipledispatch>=0.6
  Downloading https://files.pythonhosted.org/packages/89/79/429ecef45fd5e4504f7474d4c3c3c4668c267be3370e4c2fd33e61506833/multipledispatch-0.6.0-py3-none-any.whl
Installing collected packages: multipledispatch, gpflow
Successfully installed gpflow-2.1.2 multipledispatch-0.6.0


### Download and prep data

In [3]:
# load the FreeSolv data from the cloned repository
df = pd.read_csv("BOSS/example_data/FreeSolv.csv")
smiles_full = df['smiles'].to_list()
property_vals = df['expt'].to_numpy()

# Delete NaN values 
smiles_full = list(np.delete(np.array(smiles_full), np.argwhere(np.isnan(property_vals))))
y_full = np.delete(property_vals, np.argwhere(np.isnan(property_vals)))

# remove all molecules with long strings and format for string kernel
smiles=[]
y=[]
for i in range(len(smiles_full)):
    # only keep strings with at most 85 characters (all but one datapoint)
    if len(smiles_full[i])<=85:
        # split all characters with a space
        smile = " ".join(smiles_full[i])
        # map the two-character atoms Br and Cl to the single characters x and y
        smile = smile.replace("B r","x")
        smile = smile.replace("C l","y")
        smiles.append(smile)
        y.append(y_full[i])
smiles=np.array(smiles,dtype=object).reshape(-1,1)
y=np.array(y).reshape(-1,1)
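
The formatting step above can be tried on a couple of strings in isolation. The following is a small sketch of the same logic as a standalone function; the name `format_smiles` and the example inputs are illustrative assumptions, not part of the BOSS repository.

```python
def format_smiles(smiles, maxlen=85):
    """Return a space-separated SMILES string, or None if it is too long.

    Hypothetical helper mirroring the loop above: characters are separated
    by spaces, then Br and Cl are collapsed into the single tokens x and y.
    """
    if len(smiles) > maxlen:
        return None
    spaced = " ".join(smiles)
    return spaced.replace("B r", "x").replace("C l", "y")


print(format_smiles("CCBr"))   # C C x
print(format_smiles("ClCCl"))  # y C y
print(format_smiles("C" * 86))  # None
```

Collapsing `Br` and `Cl` keeps each atom as one token, so the kernel does not treat `B` and `r` as two separate symbols.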

### Split data and fit model

In [6]:
# scale and split data
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(smiles, y_scaled, test_size=0.2, random_state=42)
# keep the test targets in their original units so the RMSE is reported on that scale
y_test = y_scaler.inverse_transform(y_test)

# set up string kernel model
max_subsequence_length = 5
# alphabet of every character that appears in the training strings
alphabet = list(set("".join([x[0] for x in X_train])))
k = Batch_SSK(batch_size=3000, gap_decay=0.99, match_decay=0.53, alphabet=alphabet,
              max_subsequence_length=max_subsequence_length, maxlen=85)
cst = gpflow.kernels.Constant(1.77)
m = gpflow.models.GPR(data=(X_train, y_train), mean_function=Constant(0.2),
                      kernel=cst * k, noise_variance=0.003)
# log marginal likelihood at the preset hyper-parameters (not used below)
loss = m.log_marginal_likelihood()

# fit model (turned off for quick demo, good hyper-parameters are already selected)
# optimizer = gpflow.optimizers.Scipy()
# optimizer.minimize(m.training_loss , m.trainable_variables,options=dict(ftol=0.0001),compile=False)

# make predictions
y_pred, y_var = m.predict_f(X_test)
y_pred = y_scaler.inverse_transform(y_pred)
print(f"Test RMSE is {np.sqrt(mean_squared_error(y_test, y_pred))}")

Test RMSE is 1.209136637820394
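
The cell above maps the predictive mean back to the original units but leaves `y_var` on the standardized scale. As a rough sketch of how both could be converted, the snippet below reproduces the arithmetic of a standard scaler (centre on the mean, divide by the population standard deviation) in plain Python. The helper names `fit_scaler` and `unscale_prediction` and all numbers are made up for illustration; they are not taken from the notebook's results.

```python
import math
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return the centre and scale a standard scaler would use (population std)."""
    return fmean(values), pstdev(values)


def unscale_prediction(mean_scaled, var_scaled, center, scale):
    """Map a standardized predictive mean and variance back to original units.

    The mean is shifted and stretched; the variance is multiplied by scale**2,
    so the standard deviation is multiplied by scale.
    """
    mean = mean_scaled * scale + center
    std = math.sqrt(var_scaled) * scale
    return mean, std


# Made-up targets and a made-up standardized prediction, for illustration only.
targets = [-3.0, -1.0, 1.0, 3.0]
center, scale = fit_scaler(targets)
mean, std = unscale_prediction(0.5, 0.04, center, scale)
print(f"prediction: {mean:.3f} +/- {std:.3f}")
```

Note that `predict_f` returns the variance of the latent function; GPflow's `predict_y` additionally includes the observation noise, which may be the more appropriate quantity when comparing against measured values.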
