<a href="https://colab.research.google.com/github/williamtbarker/ML4Molecules/blob/main/03_Molecular_Representation_complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Featurizers

Several featurizers are available in Python packages such as ```deepchem```. Here, we will look at some of them. Detailed documentation can be found [here](https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html).

Unlike dataset splitters, featurizers do not require us to convert our dataset into a deepchem object. We apply them directly to the pandas DataFrame.

In [1]:
# install deepchem and rdkit
! pip install deepchem
! pip install rdkit

Collecting deepchem
  Downloading deepchem-2.7.1-py3-none-any.whl (693 kB)
Collecting scipy<1.9 (from deepchem)
  Downloading scipy-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
Collecting rdkit (from deepchem)
  Downloading rdkit-2023.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.3 MB)
Installing collected packages: scipy, rdkit, deepchem
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.4
    Uninstalling scipy-1.11.4:
      Successfully uninstalled scipy-1.11.4
ERROR: pip's dependency resolver does not currently take into account all the packag

As before, we will use the QM9 dataset with the HOMO-LUMO gap as the target. We will apply the featurizer to the entire dataset (here, a random 10% sample of QM9) and then split it randomly.

In [2]:
# import the pandas library
import pandas as pd

# load the dataframe as CSV from URL.
# If you upload the file to Colab, replace the URL with the file name
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# keep only the smiles and gap columns, and draw a random 10% sample of the rows
dataset = df[["smiles","gap"]].sample(frac=0.1)
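Because `sample` is called without a seed, the 10% subset will differ from run to run. If reproducibility matters, one option is to pass `random_state`. The sketch below uses a small stand-in DataFrame built from five rows that appear later in this notebook; the seed value 42 and the fraction 0.4 are arbitrary choices for illustration, not settings from the original analysis.

```python
import pandas as pd

# Toy stand-in for the QM9 table (five rows shown later in this notebook)
toy = pd.DataFrame({
    "smiles": ["NC1=CC(=O)NC(=N)O1", "CCCC(C)(CC)C=O", "COCCc1ccco1",
               "CC1COC2C1NC2=O", "CC1=CNC(=N)C(F)=C1"],
    "gap": [0.2158, 0.223, 0.2359, 0.2558, 0.1615],
})

# random_state fixes the pseudo-random draw (42 is an arbitrary seed)
sample_a = toy.sample(frac=0.4, random_state=42)
sample_b = toy.sample(frac=0.4, random_state=42)

# Both draws select exactly the same rows
print(sample_a.equals(sample_b))
```

The same argument could be added to the `sample` call on the full QM9 DataFrame.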

We will use the ``CircularFingerprint`` (Morgan fingerprint) and ``RDKitDescriptors`` featurizers from deepchem. Documentation on the available featurizers is [here](https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html).

### CircularFingerprint

In [3]:
# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we set radius=2 and size=100, as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)


To test, we will apply the featurizer to ethane.

In [4]:
featurizer.featurize("CC")

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0.]])
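Only two of the 100 positions are set for ethane, which is consistent with the molecule having very few distinct atom environments. As a rough intuition for how identifiers end up at positions in a fixed-length vector, here is a toy sketch in pure Python. It is not RDKit's or DeepChem's algorithm: the environment strings, the `fold_identifiers` helper, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib

def fold_identifiers(identifiers, size=100):
    """Fold arbitrary string identifiers into a fixed-length 0/1 vector."""
    bits = [0] * size
    for ident in identifiers:
        # Deterministic hash of the identifier, reduced modulo the vector size
        digest = hashlib.sha256(ident.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        bits[index] = 1
    return bits

# Hypothetical stand-ins for the distinct atom environments in ethane
ethane_envs = ["CH3 (radius 0)", "CH3-CH3 (radius 1)"]

toy_fp = fold_identifiers(ethane_envs, size=100)
set_bits = [i for i, b in enumerate(toy_fp) if b]
print(len(toy_fp), set_bits)
```

With a small `size`, different identifiers can land on the same index, so a short fingerprint trades resolution for compactness. In the DeepChem output above, the set positions are indices 37 and 91.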

The featurizer returns a NumPy array of shape (1, 100), and the code is simpler than the pure RDKit implementation. We can now apply the featurizer to the dataset. This may take a while.

In [5]:
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)

In [6]:
# looking at the top 5 entries
dataset.head()

| | smiles | gap | fp |
|---|---|---|---|
| 27498 | NC1=CC(=O)NC(=N)O1 | 0.2158 | [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0,... |
| 117814 | CCCC(C)(CC)C=O | 0.223 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| 33206 | COCCc1ccco1 | 0.2359 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0,... |
| 94952 | CC1COC2C1NC2=O | 0.2558 | [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0,... |
| 132855 | CC1=CNC(=N)C(F)=C1 | 0.1615 | [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0,... |
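Each entry in the `fp` column is itself a 2D array with one row, so the column has to be stacked into a single feature matrix before it can be passed to most models. The following is a minimal sketch of that step and of a random split, using NumPy only. The random stand-in data, the 80/20 ratio, and the seed are assumptions for illustration; they are not taken from the original workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the "fp" column: one (1, 100) array per molecule
fp_column = [rng.integers(0, 2, size=(1, 100)).astype(float) for _ in range(10)]
# Toy stand-in for the "gap" column
gap = rng.random(10)

# Stack the per-molecule (1, 100) arrays into one (n_molecules, 100) matrix
X = np.vstack(fp_column)
y = np.asarray(gap)

# Random 80/20 split via a shuffled index (the ratio is an assumption)
order = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = order[:n_train], order[n_train:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X.shape, X_train.shape, X_test.shape)
```

With the real DataFrame, the stacking step would presumably look like `np.vstack(dataset["fp"].tolist())`, with `dataset["gap"].to_numpy()` as the target.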


### RDKitDescriptors

This featurizer uses RDKit to compute a list of chemical descriptors such as molecular weight, number of valence electrons, and maximum and minimum partial charge. By default, the list has 208 entries.

The code below featurizes ethane.

In [7]:
# create the featurizer
featurizer = dc.feat.RDKitDescriptors()

# apply it on ethane
featurizer.featurize("CC")

array([[ 2.        ,  2.        ,  2.        ,  2.        ,  0.37278556,
         3.        , 30.07      , 24.022     , 30.04695019, 14.        ,
         0.        , -0.06826238, -0.06826238,  0.06826238,  0.06826238,
         1.        ,  1.        ,  1.        , 13.011     , 11.011     ,
         0.93173762, -1.06826238,  1.1441    , -0.8559    ,  3.503     ,
         1.503     ,  1.        ,  1.        ,  0.        ,  2.        ,
         2.        ,  2.        ,  1.        ,  1.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  2.        ,  0.        ,
         0.        , 15.10419314,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        , 13.8474744 ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , 13.8474744 ,  0.        ,  0. 