# Quick Start

To get started with ChemicalDice, install the package by following the installation guide. You can use your own data: a CSV file containing SMILES strings and their labels. As an example, we use the free solvation data from MoleculeNet.

In [2]:
import pandas as pd
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv")
df

Unnamed: 0,iupac,smiles,expt,calc
0,"4-methoxy-N,N-dimethyl-benzamide",CN(C)C(=O)c1ccc(cc1)OC,-11.01,-9.625
1,methanesulfonyl chloride,CS(=O)(=O)Cl,-4.87,-6.219
2,3-methylbut-1-ene,CC(C)C=C,1.83,2.452
3,2-ethylpyrazine,CCc1cnccn1,-5.45,-5.809
4,heptan-1-ol,CCCCCCCO,-4.21,-2.917
...,...,...,...,...
637,methyl octanoate,CCCCCCCC(=O)OC,-2.04,-3.035
638,pyrrolidine,C1CCNC1,-5.48,-4.278
639,4-hydroxybenzaldehyde,c1cc(ccc1C=O)O,-8.83,-10.050
640,1-chloroheptane,CCCCCCCCl,0.29,1.467


The input file must contain a `SMILES` column, which is used to generate the descriptors, and a `labels` column holding the value of the molecular property.

In [3]:
df = df[['smiles', 'expt']].rename(columns={'smiles': 'SMILES', 'expt': 'labels'})
df.to_csv("freesolv.csv", index=False)
df


Unnamed: 0,SMILES,labels
0,CN(C)C(=O)c1ccc(cc1)OC,-11.01
1,CS(=O)(=O)Cl,-4.87
2,CC(C)C=C,1.83
3,CCc1cnccn1,-5.45
4,CCCCCCCO,-4.21
...,...,...
637,CCCCCCCC(=O)OC,-2.04
638,C1CCNC1,-5.48
639,c1cc(ccc1C=O)O,-8.83
640,CCCCCCCCl,0.29
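
Before running the longer descriptor steps, it can help to confirm that a file has the two required columns. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper written for this illustration and is not part of ChemicalDice.

```python
import csv
import io

# Column names required by the text above.
REQUIRED_COLUMNS = ("SMILES", "labels")

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required if name not in header]

sample = "SMILES,labels\nCCCCCCCO,-4.21\nC1CCNC1,-5.48\n"
print(missing_columns(sample))                   # []
print(missing_columns("smiles,expt\nCCO,1.0\n"))  # ['SMILES', 'labels']
```

The check is case-sensitive on purpose, because the original column names `smiles` and `expt` had to be renamed before saving.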


We saved the file "freesolv.csv", which we will use in the next steps to generate embeddings and descriptors.

In [4]:
from ChemicalDice import smiles_preprocess, bioactivity, chemberta, Grover, ImageMol, chemical, quantum

Some weights of RobertaModel were not initialized from the model checkpoint at DeepChem/ChemBERTa-77M-MLM and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.





## Quantum descriptors

To calculate quantum descriptors, we first need to generate a 3D structure for each molecule. This step saves mol2 files in the directory temp_data.

In [5]:
smiles_preprocess.create_mol2_files(input_file = "freesolv.csv")

100%|██████████| 642/642 [07:31<00:00,  1.42it/s]


To calculate quantum descriptors we need MOPAC (Molecular Orbital PACkage). The function `quantum.get_mopac_prerequisites` downloads the MOPAC executable.

In [6]:
quantum.get_mopac_prerequisites()

Mopac is downloaded
Morse is compiled


Create a directory in which to store the descriptor files.

In [7]:
import os
os.makedirs("data", exist_ok=True)  # does not fail if the directory already exists

Now we are set to calculate the quantum descriptors. The function `quantum.descriptor_calculator` takes two arguments: the input file path and the output file path.

In [8]:
quantum.descriptor_calculator(input_file = "freesolv.csv", output_file="data/mopac.csv")

## Mordred descriptors

Mordred needs SDF files to calculate descriptors. The function `smiles_preprocess.create_sdf_files` creates the SDF files from the mol2 files.

In [9]:
smiles_preprocess.create_sdf_files(input_file = "freesolv.csv")

making directory  temp_data/sdffiles


642it [00:00, 890.37it/s]


The function `chemical.descriptor_calculator` calculates Mordred descriptors.

In [10]:
chemical.descriptor_calculator(input_file = "freesolv.csv", output_file="data/mordred.csv")

642it [02:51,  3.75it/s]


## ChemBERTa embeddings

The ChemBERTa language model needs canonical SMILES to generate embeddings. The function `smiles_preprocess.add_canonical_smiles` adds canonical SMILES to the input file.

In [11]:
smiles_preprocess.add_canonical_smiles(input_file = "freesolv.csv")

The function `chemberta.smiles_to_embeddings` generates embeddings from the canonical SMILES.

In [12]:
chemberta.smiles_to_embeddings(input_file = "freesolv.csv", output_file = "data/Chemberta.csv")

100%|██████████| 642/642 [00:06<00:00, 102.77it/s]


## Signaturizer bioactivity signatures

The function `bioactivity.calculate_descriptors` generates bioactivity signatures from canonical SMILES.

In [13]:
bioactivity.calculate_descriptors(input_file = "freesolv.csv", output_file = "data/Signaturizer.csv")

Parsing SMILES: 642it [00:00, 12875.81it/s]
Generating signatures: 100%|██████████| 6/6 [00:15<00:00,  2.67s/it]


Descictors saved to  data/Signaturizer.csv


## ImageMol embeddings

The function `ImageMol.image_to_embeddings` generates 2D images of the molecules and then uses the ImageMol model to generate embeddings.

In [14]:
ImageMol.image_to_embeddings(input_file = "freesolv.csv", output_file_name="data/ImageMol.csv")

making directory  temp_data/images/
ImageMol model is downloaded




## Grover embeddings

The function `Grover.get_embeddings` generates graph embeddings from the canonical SMILES.

In [15]:
Grover.get_embeddings(input_file = "freesolv.csv",  output_file_name="data/Grover.csv")

Downloading temp_data/grover_large.tar.gz: 100%|██████████| 381M/381M [01:09<00:00, 5.74MB/s]


Grover model is downloaded
Grover model is extracted


100%|██████████| 642/642 [00:45<00:00, 14.05it/s]
Total size = 642
Generating...


Loading data


Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_k.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_k.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_v.act_func.weight".
...

## Data loading

To evaluate different fusion techniques, we first load the data into the ChemicalDice `fusionData` class.

In [18]:
from ChemicalDice.fusionData import fusionData

data_paths = {
    "Chemberta":"data/Chemberta.csv",
    "Grover":"data/Grover.csv",
    "mopac":"data/mopac.csv",
    "mordred":"data/mordred.csv",
    "Signaturizer":"data/Signaturizer.csv",
    "ImageMol": "data/ImageMol.csv"
}

fusiondata = fusionData(data_paths = data_paths, label_file_path="freesolv.csv")

Successfully loaded, processed for 'Chemberta'.
Successfully loaded, processed for 'Grover'.
Successfully loaded, processed for 'mopac'.
Successfully loaded, processed for 'mordred'.
Successfully loaded, processed for 'Signaturizer'.
Successfully loaded, processed for 'ImageMol'.


## Data cleaning

Data fusion needs every descriptor set for each molecule. Samples (SMILES) for which one or more descriptor sets are missing can be removed with the method `keep_common_samples`.

In [19]:
fusiondata.keep_common_samples()
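
The idea behind keeping common samples can be sketched as a set intersection over sample identifiers. This is only an illustration of the concept under simple assumptions (each table is a plain dictionary keyed by SMILES); `keep_common` is a hypothetical helper, and the source does not show how ChemicalDice implements `keep_common_samples` internally.

```python
def keep_common(tables):
    """Keep only the sample ids present in every table.

    `tables` maps a table name to a dict of {sample_id: feature_list}.
    """
    if not tables:
        return {}
    # Sample ids shared by all tables.
    common = set.intersection(*(set(rows) for rows in tables.values()))
    return {
        name: {sid: feats for sid, feats in rows.items() if sid in common}
        for name, rows in tables.items()
    }

tables = {
    "mopac": {"CCO": [0.1], "CCC": [0.2]},
    "mordred": {"CCO": [1.0], "CCN": [3.0], "CCC": [2.0]},
    "Grover": {"CCC": [5.0], "CCO": [4.0]},
}
common = keep_common(tables)
print(sorted(common["mordred"]))  # ['CCC', 'CCO']
```

In this toy input, `CCN` has no mopac or Grover entry, so it is dropped from every table.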

To check for missing values in the dataset, use the method `ShowMissingValues`.

In [20]:
fusiondata.ShowMissingValues()

Dataframe name: Chemberta
Missing values: 0

Dataframe name: Grover
Missing values: 0

Dataframe name: mopac
Missing values: 0

Dataframe name: mordred
Missing values: 279306

Dataframe name: Signaturizer
Missing values: 0

Dataframe name: ImageMol
Missing values: 0
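
The counts above are totals of missing cells per table. A minimal sketch of such a count, assuming each table is a list of rows and that missing cells are stored as `None` or NaN, could look like this; `count_missing` is a hypothetical helper and the toy values are made up.

```python
import math

def count_missing(rows):
    """Count cells that are None or NaN in a list of feature rows."""
    return sum(
        1
        for row in rows
        for value in row
        if value is None or (isinstance(value, float) and math.isnan(value))
    )

nan = float("nan")
tables = {
    "mopac": [[0.1, 0.2], [0.3, 0.4]],
    "mordred": [[1.0, nan], [nan, nan]],
}
for name, rows in tables.items():
    print(f"Dataframe name: {name}")
    print(f"Missing values: {count_missing(rows)}")
```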



To remove empty features, use the method `remove_empty_features`. It checks the percentage of NA values in each feature and also removes features that are commonly found empty.

In [21]:
fusiondata.remove_empty_features()
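
A percentage-based filter of this kind can be sketched as follows. The 0.2 threshold, the helper names, and the feature names are assumptions chosen for illustration; the source does not state which threshold ChemicalDice uses.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_sparse_columns(columns, max_missing_fraction=0.2):
    """Drop columns whose fraction of missing cells exceeds the threshold.

    `columns` maps a feature name to its list of values. The 0.2 default is
    an arbitrary illustrative choice, not a ChemicalDice setting.
    """
    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # a column with no rows carries no information
        fraction = sum(is_missing(v) for v in values) / len(values)
        if fraction <= max_missing_fraction:
            kept[name] = values
    return kept

nan = float("nan")
columns = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # 0% missing, kept
    "feat_b": [1.0, nan, 3.0, 4.0, 5.0],  # 20% missing, kept
    "feat_c": [nan, nan, nan, nan, 5.0],  # 80% missing, dropped
}
print(list(drop_sparse_columns(columns)))  # ['feat_a', 'feat_b']
```

A filter like this leaves some missing cells behind in the kept columns, which is why an imputation step is still needed afterwards.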

In [22]:
fusiondata.ShowMissingValues()

Dataframe name: Chemberta
Missing values: 0

Dataframe name: Grover
Missing values: 0

Dataframe name: mopac
Missing values: 0

Dataframe name: mordred
Missing values: 108534

Dataframe name: Signaturizer
Missing values: 0

Dataframe name: ImageMol
Missing values: 0



To impute the remaining missing values, use the method `ImputeData`. Its `method` argument selects the imputation method; here we use `"knn"`.

In [23]:
fusiondata.ImputeData(method="knn")

Imputation Done
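
To make the idea of k-nearest-neighbour imputation concrete, here is a simplified, standard-library sketch. It is not the implementation used by `ImputeData`, which the source does not show: the distance measure (root mean squared difference over features observed in both rows), the default `k`, and the name `knn_impute` are all assumptions made for this example.

```python
import math

def knn_impute(rows, k=2):
    """Fill NaN cells with the mean of that feature over the k nearest rows."""
    def distance(a, b):
        # Compare only features that are observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if not math.isnan(x) and not math.isnan(y)]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows that have feature j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and not math.isnan(other[j])
            )[:k]
            if donors:
                imputed[i][j] = sum(v for _, v in donors) / len(donors)
    return imputed

nan = float("nan")
rows = [
    [1.0, 2.0],
    [1.1, nan],
    [0.9, 2.2],
    [9.0, 9.0],
]
filled = knn_impute(rows, k=2)
print(round(filled[1][1], 6))  # 2.1
```

The missing cell is filled from the two rows closest in the first feature (values 2.0 and 2.2), while the distant row `[9.0, 9.0]` is ignored.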


In [24]:
fusiondata.ShowMissingValues()

Dataframe name: Chemberta
Missing values: 0

Dataframe name: Grover
Missing values: 0

Dataframe name: mopac
Missing values: 0

Dataframe name: mordred
Missing values: 0

Dataframe name: Signaturizer
Missing values: 0

Dataframe name: ImageMol
Missing values: 0



To normalize the data, use the method `scale_data` with the scaling type `'standardize'`.

In [25]:
fusiondata.scale_data(scaling_type = 'standardize')
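
Standardization conventionally means rescaling each feature to zero mean and unit standard deviation, z = (x - mean) / std. The sketch below shows that calculation for a single feature, assuming the population standard deviation is used; `standardize` is a hypothetical helper, and the source does not confirm the exact formula behind `scale_data`.

```python
import statistics

def standardize(values):
    """Rescale a feature to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

scaled = standardize([-11.01, -4.87, 1.83, -5.45, -4.21])
print(scaled)
```

Scaling matters here because the descriptor sets come from different sources with very different numeric ranges.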

## Evaluation of data fusion

To evaluate the fusion techniques, use the method `evaluate_fusion_models_scaffold_split`. Here we specify the fusion methods to evaluate, the number of components (`n_components = 10`), the AER embedding dimension (`AER_dim = 4096`), a regression task (`regression = True`), and a random split (`split_type = "random"`).

In [None]:
fusiondata.evaluate_fusion_models_scaffold_split( methods= ["pca","cca","kpca","plsda"],
                                                  n_components = 10,
                                                  AER_dim = 4096,
                                                  regression = True,
                                                  split_type = "random")
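
The result table reports R2 Score, MSE, RMSE and MAE for each model. As a rough illustration of what a random split and these metrics involve, here is a standard-library sketch. The helper names, the 0.2 test fraction, and the seed are assumptions for this example; they are not taken from ChemicalDice.

```python
import math
import random

def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2 Score": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

train_idx, test_idx = random_split(642)
print(len(train_idx), len(test_idx))  # 514 128

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

An R2 of 1 with an MSE of 0 means the predictions match the targets exactly on the evaluated set, so it is worth checking which split a given table row was computed on.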

In [31]:
fusiondata.scaffold_split_result

(                           Model Model type  R2 Score           MSE  \
 10  plsda_Support Vector Machine     linear  0.976584  3.077124e-01   
 4                    plsda_Ridge     linear  0.976958  3.028010e-01   
 3        plsda_Linear Regression     linear  0.976958  3.028010e-01   
 6               plsda_ElasticNet     linear  0.976958  3.028041e-01   
 5                    plsda_Lasso     linear  0.976942  3.030061e-01   
 8        plsda_Gradient Boosting     linear  0.989780  1.343062e-01   
 0                      plsda_MLP     linear  0.969295  4.035031e-01   
 9                 plsda_AdaBoost     linear  0.910396  1.177515e+00   
 7            plsda_Decision Tree     linear  1.000000  0.000000e+00   
 2             plsda_Kernel Ridge     linear  0.994067  7.797090e-02   
 11             plsda_K Neighbors     linear  1.000000  0.000000e+00   
 1         plsda_Gaussian Process     linear  1.000000  2.458359e-19   
 
             RMSE           MAE  
 10  5.547183e-01  4.025797e