<a href="https://colab.research.google.com/github/williamtbarker/ML4Molecules/blob/main/exercise_3_complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task

For this task you will use the QM9 dataset with HOMO as the target value. Perform the following:

1. Split the dataset with `RandomSplitter`, `ScaffoldSplitter` and `MolecularWeightSplitter` from deepchem. You can limit this to an 80:20 train-test split.
2. For each of the above splits, compare the score on the test dataset with a `LinearRegression` model. Does the splitting method affect the model performance?



For the tasks that follow, use the BBBP dataset with p_np as the target value.

3. Split the dataset with `RandomSplitter`, `ScaffoldSplitter` and `MolecularWeightSplitter` from deepchem. You can limit this to an 80:20 train-test split.
4. For each of the above splits, compare the score on the test dataset with an `SVC` model. Does the splitting method affect the model performance?
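
To see why the splitting method can matter, here is a minimal sketch on made-up numbers (not QM9 or BBBP, and not deepchem's implementation). It contrasts a random 80:20 split with a split that first sorts samples by a property, which is the idea behind a molecular-weight-based split. In the sorted split the heaviest molecules end up only in the test set, so the model is evaluated outside the range it was trained on.

```python
import random

# Hypothetical molecular weights for 10 toy molecules (illustrative values only)
weights = [16.0, 30.1, 44.1, 58.1, 72.2, 86.2, 100.2, 114.2, 128.3, 142.3]
n_train = int(0.8 * len(weights))  # 80:20 train-test split

# Random split: shuffle the indices, then cut
rng = random.Random(0)
indices = list(range(len(weights)))
rng.shuffle(indices)
random_train, random_test = indices[:n_train], indices[n_train:]

# Weight-sorted split: order indices by weight, then cut
sorted_idx = sorted(range(len(weights)), key=lambda i: weights[i])
sorted_train, sorted_test = sorted_idx[:n_train], sorted_idx[n_train:]

max_train_weight = max(weights[i] for i in sorted_train)
min_test_weight = min(weights[i] for i in sorted_test)
print("random test indices:", sorted(random_test))
print("sorted test indices:", sorted_test)
print("heaviest training molecule:", max_train_weight)
print("lightest test molecule:", min_test_weight)
```

In the sorted split every test molecule is heavier than every training molecule, whereas the random split draws both sets from the same distribution.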

In [None]:
import deepchem as dc
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the QM9 dataset with raw SMILES strings
tasks, datasets, transformers = dc.molnet.load_qm9(featurizer='Raw', splitter=None)
qm9_dataset = datasets[0]

# Initialize the splitters
random_splitter = dc.splits.RandomSplitter()
scaffold_splitter = dc.splits.ScaffoldSplitter()
mw_splitter = dc.splits.MolecularWeightSplitter()

# Split the dataset using ScaffoldSplitter and MolecularWeightSplitter
# (train_test_split defaults to frac_train=0.8, i.e. an 80:20 split)
train_scaffold, test_scaffold = scaffold_splitter.train_test_split(qm9_dataset)
train_mw, test_mw = mw_splitter.train_test_split(qm9_dataset)

# Featurizer
featurizer = dc.feat.CircularFingerprint(size=1024, radius=2)

# Function to featurize datasets
def featurize_data(dataset):
    features = featurizer.featurize(dataset.ids)
    return dc.data.NumpyDataset(X=features, y=dataset.y)

# Featurize the datasets
train_data_scaffold = featurize_data(train_scaffold)
test_data_scaffold = featurize_data(test_scaffold)
train_data_mw = featurize_data(train_mw)
test_data_mw = featurize_data(test_mw)

# For RandomSplitter, use a pre-featurized dataset.
# load_qm9 returns (tasks, datasets, transformers), so the dataset is at [1][0].
qm9_featurized = dc.molnet.load_qm9(featurizer='ECFP', splitter=None)[1][0]
train_random, test_random = random_splitter.train_test_split(qm9_featurized)

# QM9 has several regression targets; select the HOMO column from the task list
homo_idx = tasks.index('homo')

# Define a function for training and evaluating a model on the HOMO target
def train_and_evaluate(train_dataset, test_dataset):
    model = LinearRegression()
    model.fit(train_dataset.X, train_dataset.y[:, homo_idx])
    predictions = model.predict(test_dataset.X)
    mse = mean_squared_error(test_dataset.y[:, homo_idx], predictions)
    return np.sqrt(mse)  # RMSE

# Evaluate the model for each split
rmse_random = train_and_evaluate(train_random, test_random)
rmse_scaffold = train_and_evaluate(train_data_scaffold, test_data_scaffold)
rmse_mw = train_and_evaluate(train_data_mw, test_data_mw)

# Print the RMSE values
print("RMSE with Random Split: ", rmse_random)
print("RMSE with Scaffold Split: ", rmse_scaffold)
print("RMSE with Molecular Weight Split: ", rmse_mw)


Remember to install deepchem. You can set `frac_valid = 0` here to get a plain train-test split with no validation set.
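
As a rough illustration of what setting `frac_valid = 0` means, here is a sketch of turning split fractions into index ranges. The helper `split_by_fractions` is hypothetical and written only for this example. In deepchem the fractions are passed as the `frac_train`, `frac_valid` and `frac_test` keyword arguments of a splitter's `split` method.

```python
def split_by_fractions(n_samples, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
    """Return (train, valid, test) index lists for n_samples ordered samples."""
    assert abs(frac_train + frac_valid + frac_test - 1.0) < 1e-9
    train_cutoff = int(frac_train * n_samples)
    valid_cutoff = int((frac_train + frac_valid) * n_samples)
    indices = list(range(n_samples))
    return (indices[:train_cutoff],
            indices[train_cutoff:valid_cutoff],
            indices[valid_cutoff:])

# 80:20 train-test split: the validation set is empty
train_idx, valid_idx, test_idx = split_by_fractions(
    100, frac_train=0.8, frac_valid=0.0, frac_test=0.2)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With `frac_valid=0.0` the validation list comes back empty and all remaining samples go to the test set.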

Hint: You can use matplotlib or seaborn to plot. [This](https://youtu.be/hNNRVRmZO1s?t=8315) workshop video shows how to plot.
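
One possible way to compare the splits visually is a simple bar chart. This is a sketch only: the scores below are placeholders invented for the example, so replace them with the values your own run produces.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Placeholder scores, for illustration only; replace with your own results
scores = {"Random": 0.9, "Scaffold": 0.7, "Molecular weight": 0.6}

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(list(scores.keys()), list(scores.values()))
ax.set_xlabel("Splitter")
ax.set_ylabel("Test score")
ax.set_title("Test score by splitting method")
fig.tight_layout()
```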

In [9]:
# 4. SVC model for the 3 splits
import deepchem as dc
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

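# Load the BBBP dataset
# Note: the loader returns (train, valid, test) subsets, so datasets[0]
# is the loader's training subset rather than the full dataset
tasks, datasets, transformers = dc.molnet.load_bbbp(featurizer='ECFP', split='index')
bbbp_dataset = datasets[0]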

# Initialize the splitters
random_splitter = dc.splits.RandomSplitter()
scaffold_splitter = dc.splits.ScaffoldSplitter()
mw_splitter = dc.splits.MolecularWeightSplitter()

# Function to create train and test datasets using indices
def create_datasets(dataset, split_indices):
    train_indices, _, test_indices = split_indices
    train_dataset = dc.data.NumpyDataset(X=dataset.X[train_indices], y=dataset.y[train_indices])
    test_dataset = dc.data.NumpyDataset(X=dataset.X[test_indices], y=dataset.y[test_indices])
    return train_dataset, test_dataset

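# Split the dataset using each splitter
# (split() defaults to frac_train=0.8, frac_valid=0.1, frac_test=0.1;
# the validation indices are discarded in create_datasets)
split_random = random_splitter.split(bbbp_dataset)
split_scaffold = scaffold_splitter.split(bbbp_dataset)
split_mw = mw_splitter.split(bbbp_dataset)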

# Create datasets
train_random, test_random = create_datasets(bbbp_dataset, split_random)
train_scaffold, test_scaffold = create_datasets(bbbp_dataset, split_scaffold)
train_mw, test_mw = create_datasets(bbbp_dataset, split_mw)

# Define a function for training and evaluating a model
def train_and_evaluate(train_dataset, test_dataset):
    model = SVC()
    model.fit(train_dataset.X, train_dataset.y.flatten())
    predictions = model.predict(test_dataset.X)
    accuracy = accuracy_score(test_dataset.y.flatten(), predictions)
    return accuracy

# Evaluate the model for each split
accuracy_random = train_and_evaluate(train_random, test_random)
accuracy_scaffold = train_and_evaluate(train_scaffold, test_scaffold)
accuracy_mw = train_and_evaluate(train_mw, test_mw)

# Print the accuracy values
print("Accuracy with Random Split: ", accuracy_random)
print("Accuracy with Scaffold Split: ", accuracy_scaffold)
print("Accuracy with Molecular Weight Split: ", accuracy_mw)


[02:19:35] Explicit valence for atom # 1 N, 4, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAtoms(RDKit::ROMol mol, bool breakTies=True, bool includeChirality=True, bool includeIsotopes=True)
[02:19:35] Explicit valence for atom # 6 N, 4, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAtoms(RDKit::ROMol mol, bool breakTies=True, bool includeChirality=True, bool includeIsotopes=True)
[02:19:35] Explicit valence for atom # 6 N, 4, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAtoms(RDKit::ROMol mol, bool breakTies=True, bool includeChirality=True, bool includeIsotopes=True)
[02:19:35] Explicit valence for atom # 11 N, 4, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAto

Accuracy with Random Split:  0.8597560975609756
Accuracy with Scaffold Split:  0.5487804878048781
Accuracy with Molecular Weight Split:  0.5426829268292683


Note: BBBP is a classification dataset, so we use a classification model (`SVC`) and report accuracy instead of RMSE.
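
Because the target is a binary label, accuracy is easiest to interpret next to a trivial baseline. The sketch below uses made-up labels (not BBBP data) to compute accuracy by hand and compare it with always predicting the most common class. The `accuracy` helper and the toy lists are names introduced only for this example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy binary labels and model predictions (illustrative values only)
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1]

# Baseline: always predict the most frequent class in y_true
majority_class = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority_class] * len(y_true)

model_acc = accuracy(y_true, y_pred)
baseline_acc = accuracy(y_true, baseline_pred)
print("model accuracy:", model_acc)
print("majority-class baseline:", baseline_acc)
```

If a model's test accuracy is close to the majority-class baseline for that test set, it has learned little beyond the class balance.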