### Splitting the dataset

Once the dataset is cleaned up, we can create the train, validation and test splits.

There are libraries available to split the dataset based on the output value, molecular weight, scaffold, etc. This approach requires converting the CSV file to a library-specific dataset format, which is sometimes cumbersome.

For simplicity, we will first split the dataset randomly. We will use the QM9 dataset with ``gap`` as the output (target).

In [None]:
# import pandas library
import pandas as pd

# load the dataframe as CSV from URL. 
# If you upload the file to Colab, replace the URL with the file name 
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# look at the top 5 entries
df.head()

The [Fast-ML](https://pypi.org/project/fast-ml/) package has built-in functionality for analyzing datasets, but it is not chemistry-aware. Since we are splitting the dataset randomly, we can use this package.

In [None]:
# install Fast-ML
! pip install fast_ml

In [None]:
# import the function to split into train-valid-test
from fast_ml.model_development import train_valid_test_split

In [None]:
# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    df[["smiles", "gap"]], target="gap",
    train_size=0.8, valid_size=0.1, test_size=0.1)


In [None]:
X_test

In [None]:
y_test
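
To make the idea of a random 0.8:0.1:0.1 split concrete, the following is a minimal sketch that uses only pandas. It is not how Fast-ML implements its split; the toy dataframe, its placeholder ``smiles`` strings, and the names ``toy_df``, ``train_df``, ``valid_df`` and ``test_df`` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the QM9 table; names and values are made up for illustration.
toy_df = pd.DataFrame({
    "smiles": [f"mol_{i}" for i in range(100)],  # placeholders, not real SMILES
    "gap": [0.01 * i for i in range(100)],
})

# Shuffle all rows once, with a fixed seed so the split is reproducible.
shuffled = toy_df.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Slice the shuffled rows into 80% train, 10% validation and the rest as test.
n_train = int(0.8 * len(shuffled))
n_valid = int(0.1 * len(shuffled))
train_df = shuffled.iloc[:n_train]
valid_df = shuffled.iloc[n_train:n_train + n_valid]
test_df = shuffled.iloc[n_train + n_valid:]

print(len(train_df), len(valid_df), len(test_df))
```

Because the rows are shuffled before slicing, each molecule lands in exactly one of the three splits.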

For more chemistry-aware dataset splitting, packages like [deepchem](https://deepchem.readthedocs.io/en/latest/index.html) can be used. However, the CSV dataset must be converted into a dataset class before the splitting can be performed.

Let's try splitting the dataset based on molecular weight in deepchem.

In [None]:
# install deepchem
!pip install deepchem

In [None]:
import deepchem as dc

As the kernel has restarted, we will reload the QM9 dataset.

In [None]:
# import the pandas library
import pandas as pd

# load the dataframe as CSV from URL. 
# If you upload the file to Colab, replace the URL with the file name 
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

We will use the ``smiles`` and ``gap`` values from the dataset as before and create a ``NumpyDataset`` object in deepchem. The documentation for datasets in deepchem can be found [here](https://deepchem.readthedocs.io/en/latest/api_reference/data.html#datasets).

In [None]:
# create the deepchem dataset object
# note ids arg is necessary for splitting
dataset = dc.data.NumpyDataset.from_dataframe(df[["smiles","gap"]],
                                              X="smiles",y="gap", ids="smiles")

One can look at the ``X`` and ``y`` values to ensure the dataset was loaded properly.

In [None]:
dataset.y

In [None]:
dataset.X

We will perform a molecular-weight-based split. More documentation on splitting methods in deepchem can be found [here](https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html).

In [None]:
# create the molecular weight splitter object
molecularweightsplitter = dc.splits.MolecularWeightSplitter()

train_dataset, valid_dataset, test_dataset \
 = molecularweightsplitter.train_valid_test_split(
    dataset=dataset, frac_train = 0.8, frac_valid = 0.1,
    frac_test = 0.1
 )
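
The general idea behind a weight-ordered split can be sketched with NumPy alone. The sketch below assumes the molecules are sorted by ascending molecular weight and then sliced in order, so the heaviest molecules end up in the test set; deepchem's actual implementation may differ in detail. The weights and variable names are hypothetical.

```python
import numpy as np

# Hypothetical molecular weights (g/mol) for ten molecules, in dataset order.
weights = np.array([78.1, 16.0, 106.2, 44.1, 18.0, 92.1, 30.1, 58.1, 46.1, 72.1])

# Indices that would sort the molecules from lightest to heaviest.
order = np.argsort(weights)

# Slice the sorted indices into 80% train, 10% validation and the rest as test.
n_train = int(0.8 * len(order))
n_valid = int(0.1 * len(order))
train_idx = order[:n_train]
valid_idx = order[n_train:n_train + n_valid]
test_idx = order[n_train + n_valid:]

print("heaviest train molecule:", weights[train_idx].max())
print("test molecules:", weights[test_idx])
```

Unlike a random split, this kind of split tests how well a model extrapolates to molecules heavier than any it saw during training.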

We can convert the dataset objects back to pandas dataframes with ``to_dataframe`` for easy analysis, if needed.

In [None]:
train_dataset, valid_dataset, test_dataset\
 = train_dataset.to_dataframe(), valid_dataset.to_dataframe(),\
  test_dataset.to_dataframe()

In [None]:
test_dataset
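
Once the splits are available as dataframes, a quick sanity check is to confirm that no molecule appears in more than one split and to compare the target values across splits. The sketch below uses small made-up dataframes; the column names ``ids`` and ``y`` and all values are assumptions for illustration, so adjust them to match the columns in your own dataframes.

```python
import pandas as pd

# Made-up stand-ins for the three split dataframes (hypothetical columns and values).
train_df = pd.DataFrame({"ids": ["C", "CC", "CCC", "CCO"], "y": [0.40, 0.35, 0.33, 0.30]})
valid_df = pd.DataFrame({"ids": ["CCCC"], "y": [0.32]})
test_df = pd.DataFrame({"ids": ["CCCCC"], "y": [0.31]})

# Check that the splits share no identifiers: the union must be as large as the sum.
id_sets = [set(d["ids"]) for d in (train_df, valid_df, test_df)]
no_overlap = len(set.union(*id_sets)) == sum(len(s) for s in id_sets)

# Stack the splits with a "split" index level and summarize the target per split.
combined = pd.concat({"train": train_df, "valid": valid_df, "test": test_df},
                     names=["split"])
summary = combined.groupby(level="split")["y"].agg(["count", "mean", "min", "max"])

print("no overlap between splits:", no_overlap)
print(summary)
```

Large differences between the per-split summaries are expected for a property-ordered split and are worth keeping in mind when interpreting test performance.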