<a href="https://colab.research.google.com/github/quantaosun/covid_classification/blob/main/classification_chemprop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Please note that this notebook is intended to be run in Google Colab rather than as a Jupyter notebook on your local machine. Please click the "Open in Colab" button.

## Setup

In [None]:
#@title Install dependencies
!pip install "chemprop<2"  # pin to v1: the Python API used below (chemprop.args, chemprop.train) changed in v2
!pip install rdkit-pypi  # should be included in above after Chemprop v1.6 release

# Download test files from GitHub
!apt install subversion
!svn export https://github.com/chemprop/chemprop.git/trunk/tests/data

import chemprop
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.decomposition import PCA

In [2]:
#@title ### **Import Google Drive**
from google.colab import drive

drive.flush_and_unmount()
drive.mount('/content/drive', force_remount=True)

Drive not mounted, so nothing to flush and unmount.
Mounted at /content/drive


## Copy example files into Google Drive

- Two COVID-related files, ```train_preprocessed.csv``` and ```test_nolabel.csv```, will be downloaded and copied to your Google Drive.


In [13]:
%cd /content/drive/MyDrive
!mkdir chemprop_covid_calssification
%cd chemprop_covid_calssification
!git clone https://github.com/quantaosun/covid_classification.git

/content/drive/MyDrive
mkdir: cannot create directory ‘chemprop_covid_calssification’: File exists
/content/drive/MyDrive/chemprop_covid_calssification
Cloning into 'covid_classification'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 24 (delta 9), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (24/24), 179.17 KiB | 3.14 MiB/s, done.
Resolving deltas: 100% (9/9), done.


## Please check that a new folder has been created in your Drive before moving on to the next step

- /content/drive/MyDrive/chemprop_covid_calssification/covid_classification

- For your own project, replace this path with the location of your input files instead of the examples.
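
The folder check described above can also be done in code. The following is a minimal sketch, not part of the original notebook: `missing_files` is a hypothetical helper, and the folder path and file names are the ones this example uses, so swap in your own.

```python
from pathlib import Path


def missing_files(folder, expected_names):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).is_file()]


# Path and file names as used in this notebook; replace them with your own.
DATA_DIR = '/content/drive/MyDrive/chemprop_covid_calssification/covid_classification'
EXPECTED = ['train_preprocessed.csv', 'test_nolabel.csv']

missing = missing_files(DATA_DIR, EXPECTED)
if missing:
    print('Missing files:', missing)
else:
    print('All example files found in', DATA_DIR)
```

If any file is reported missing, re-run the clone cell or check that Google Drive is mounted before continuing.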

## CSV format normalisation
- The example CSV file has an extra column called `ID`. The code below drops it so that the first column is the SMILES column.
- Modify the code below if your CSV file contains other special columns. For this example, you can run the cell as-is.

In [14]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/train_preprocessed.csv').drop('ID', axis=1)
df.to_csv('/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/train_preprocessed_processed.csv', index=False)
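
For CSV files with a different layout, the same idea can be written a little more defensively. This is a sketch under assumptions: `smiles_first` is a hypothetical helper, and the column names `SMILES`, `ID`, and `label` are assumed from this example's files. The `ID` values and 0/1 labels in the demo frame are made up for illustration.

```python
import pandas as pd


def smiles_first(df, smiles_col='SMILES', drop_cols=('ID',)):
    """Drop unwanted columns (if present) and move the SMILES column to the front."""
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    if smiles_col not in df.columns:
        raise KeyError(f'No column named {smiles_col!r} in {list(df.columns)}')
    other_cols = [c for c in df.columns if c != smiles_col]
    return df[[smiles_col] + other_cols]


# Tiny in-memory example. The SMILES strings appear in this notebook's output;
# the ID values and labels are invented purely for the demonstration.
example = pd.DataFrame({
    'ID': [1, 2],
    'label': [0, 1],
    'SMILES': ['Cc1ccncc1NC(=O)Cc1ccccc1N', 'Nc1ncnc2c1ncn2C1CCCO1'],
})
cleaned = smiles_first(example)
print(list(cleaned.columns))
```

Unlike a bare `.drop('ID', axis=1)`, this version does not fail when the `ID` column is absent, and it raises a clear error when the SMILES column is missing.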


## Training
- The default parameters are those documented on the official Chemprop site (for example, the default number of training epochs is 30). You can modify them if necessary.
- At the end of training, you should see output similar to:
```
Model 0 test auc = 0.932412
Ensemble test auc = 0.932412
1-fold cross validation
	Seed 0 ==> test auc = 0.932412
Overall test auc = 0.932412 +/- 0.000000
Elapsed time = 0:09:25
```
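
To change the defaults, extra options can be appended to the argument list. The sketch below only assembles that list, so it runs without Chemprop installed. `build_train_arguments` is a hypothetical helper, the paths are placeholders, and the flags `--epochs`, `--batch_size`, and `--class_balance` are assumed to match the Chemprop v1 command-line options.

```python
def build_train_arguments(data_path, save_dir, epochs=None, batch_size=None,
                          class_balance=False):
    """Assemble a Chemprop v1 style argument list, adding overrides only when given."""
    arguments = [
        '--data_path', data_path,
        '--dataset_type', 'classification',
        '--save_dir', save_dir,
    ]
    if epochs is not None:
        arguments += ['--epochs', str(epochs)]
    if batch_size is not None:
        arguments += ['--batch_size', str(batch_size)]
    if class_balance:
        arguments.append('--class_balance')  # boolean flag, takes no value
    return arguments


arguments = build_train_arguments(
    data_path='train_preprocessed_processed.csv',  # placeholder path
    save_dir='checkpoints',                         # placeholder directory
    epochs=50,
    class_balance=True,
)
print(arguments)
```

The resulting list can be passed to `chemprop.args.TrainArgs().parse_args(arguments)` in the same way as the hand-written list in the training cell.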

In [15]:
#https://chemprop.readthedocs.io/en/latest/tutorial.html#within-a-python-script
import chemprop

arguments = [
    '--data_path', '/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/train_preprocessed_processed.csv',
    '--dataset_type', 'classification',
    '--save_dir', '/content/drive/MyDrive/chemprop_covid_calssification/covid_classification'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

Command line
python /usr/local/lib/python3.10/dist-packages/ipykernel_launcher.py -f /root/.local/share/jupyter/runtime/kernel-1cf99942-4a42-4231-971e-c474bfeb06f3.json
Args
{'activation': 'ReLU',
 'adding_bond_types': True,
 'adding_h': False,
 'aggregation': 'mean',
 'aggregation_norm': 100,
 'atom_constraints': [],
 'atom_descriptor_scaling': True,
 'atom_descriptors': None,
 'atom_descriptors_path': None,
 'atom_descriptors_size': 0,
 'atom_features_size': 0,
 'atom_messages': False,
 'atom_targets': [],
 'batch_size': 50,
 'bias': False,
 'bias_solvent': False,
 'bond_constraints': [],
 'bond_descriptor_scaling': True,
 'bond_descriptors': None,
 'bond_descriptors_path': None,
 'bond_descriptors_size': 0,
 'bond_features_size': 0,
 'bond_targets': [],
 'cache_cutoff': 10000,
 'checkpoint_dir': None,
 'checkpoint_frzn': None,
 'checkpoint_path': None,
 'checkpoint_paths': None,
 'class_balance': False,
 'config_path': None,
 'constraints_path': None,
 'crossval_index_dir': None,
 '

## Prediction

## CSV file format normalisation
- As with the training file, we need to make sure the first column is SMILES.


In [16]:
df = pd.read_csv('/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel.csv').drop('ID', axis=1)
df.to_csv('/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel_processed.csv', index=False)

## Predict for the new molecules
- A new CSV file called `test_nolabel_predicted.csv` will be created, containing the predicted classification labels.

In [17]:
import chemprop

arguments = [
    '--test_path', '/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel_processed.csv',
    '--preds_path', '/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel_predicted.csv',
    '--checkpoint_dir', '/content/drive/MyDrive/chemprop_covid_calssification/covid_classification'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

Loading training args
Setting molecule featurization parameters to default.
Loading data


1059it [00:00, 120119.20it/s]
100%|██████████| 1059/1059 [00:00<00:00, 75594.27it/s]


Validating SMILES
Test size = 1,059


  0%|          | 0/1 [00:00<?, ?it/s]

Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "readout.1.weight".
Loading pretrained parameter "readout.1.bias".
Loading pretrained parameter "readout.4.weight".
Loading pretrained parameter "readout.4.bias".
Moving model to cuda



  0%|          | 0/22 [00:00<?, ?it/s]
  5%|▍         | 1/22 [00:02<00:50,  2.39s/it]
  9%|▉         | 2/22 [00:03<00:28,  1.41s/it]
 27%|██▋       | 6/22 [00:03<00:05,  2.76it/s]
 41%|████      | 9/22 [00:03<00:03,  3.69it/s]
 45%|████▌     | 10/22 [00:03<00:02,  4.06it/s]
 77%|███████▋  | 17/22 [00:04<00:00,  9.34it/s]
100%|██████████| 1/1 [00:04<00:00,  4.74s/it]

Saving predictions to /content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel_predicted.csv
Elapsed time = 0:00:05





## Print part of predictions

- By default, the `label` column holds the predicted probability that a molecule belongs to class 1.

- Modify the code if you want a 0 or 1 output instead.
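
One way to turn the probabilities into 0/1 labels is to apply a cut-off. This is a minimal sketch: the rows are copied from the prediction output shown in this notebook, while the 0.5 threshold and the `predicted_class` column name are assumptions chosen for illustration.

```python
import pandas as pd

# A few rows copied from the prediction output shown in this notebook.
preds = pd.DataFrame({
    'SMILES': [
        'Cc1ccncc1NC(=O)Cc1ccccc1N',
        'Nc1ncnc2c1ncn2C1CCCO1',
        'CCOC(=O)c1nc(S(C)(=O)=O)ncc1Cl',
        'O=c1n(Cl)c(=O)n(Cl)c(=O)n1Cl',
    ],
    'label': [0.003111, 0.002533, 0.983466, 0.511992],
})

THRESHOLD = 0.5  # assumed cut-off; adjust for your use case

# Probabilities above the threshold become class 1, all others class 0.
preds['predicted_class'] = (preds['label'] > THRESHOLD).astype(int)
print(preds[['label', 'predicted_class']])
```

With the real predictions, replace the hand-built frame with `pd.read_csv(...)` on the predicted CSV file. A threshold other than 0.5 may suit imbalanced data better.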

In [18]:
df = pd.read_csv('/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel_predicted.csv')
df

| | SMILES | label |
|---|---|---|
| 0 | `COc1nc2ncc(C(C(=O)NCCc3cccc(F)c3)N(C(=O)c3cocn...` | 0.035909 |
| 1 | `CC(C)(O)c1ccc(N(C(=O)c2cocn2)C(C(=O)NCCc2cccc(...` | 0.028230 |
| 2 | `N#CC(=C(/N)Sc1ccccc1N)/C(C#N)=C(\N)Sc1ccccc1N` | 0.016414 |
| 3 | `COCCNCCc1cccc(Nc2ncc3c(n2)-c2ccccc2[C@H](c2ccc...` | 0.005765 |
| 4 | `Cc1ccncc1NC(=O)Cc1ccccc1N` | 0.003111 |
| ... | ... | ... |
| 1054 | `CC(C)(C)c1cnc(N(C(=O)c2cocn2)C(C(=O)NCCc2cccc(...` | 0.035341 |
| 1055 | `Cc1nc(C2CCN(C(=O)c3ccc(Br)cc3)CC2)n[nH]1` | 0.006545 |
| 1056 | `Nc1ncnc2c1ncn2C1CCCO1` | 0.002533 |
| 1057 | `O=C(/C=C/c1cn(CC(O)CN2CCOCC2)c2ccccc12)c1cccs1` | 0.006571 |


## Print out those molecules with a probability greater than 0.5, which will be classified as label **1**

In [19]:
# Read the CSV file
df = pd.read_csv('/content/drive/MyDrive/chemprop_covid_calssification/covid_classification/test_nolabel_predicted.csv')

# Filter rows where the 'label' column is greater than 0.5
filtered_df = df[df['label'] > 0.5]

# Print the filtered data
print(filtered_df)


                                                SMILES     label
70        COCCOC(C(=O)Nc1cncc2ccccc12)c1ccc(Cl)c(Cl)c1  0.754378
79   O=[N+]([O-])c1cnc(Sc2nnc(O)n2-c2ccc3c(c2)OCCO3)s1  0.944961
130                  COc1ccccc1OCCC(=O)Nc1cncc2ccccc12  0.591564
246            O=C(CCl)N1CCCc2cc(NCc3ccc(F)cc3Cl)ccc21  0.551699
279                     CCOC(=O)c1nc(S(C)(=O)=O)ncc1Cl  0.983466
290  CC(NC(=O)CN)(C(=O)Nc1cncc2ccccc12)c1ccc(Cl)c(C...  0.680544
411         N#CC(C(=O)Nc1cncc2ccccc12)c1ccc(Cl)c(Cl)c1  0.636940
517          COC(C(=O)Nc1cncc2ccccc12)c1ccc(Cl)c(Cl)c1  0.865813
522          CNC(C(=O)Nc1cncc2ccccc12)c1ccc(Cl)c(Cl)c1  0.826046
536                       O=c1n(Cl)c(=O)n(Cl)c(=O)n1Cl  0.511992
602                            S=C(S)OC1CC2CC1C1CCCC21  0.625702
703  O=c1[nH]nc(Sc2ncc([N+](=O)[O-])s2)n1-c1ccc2c(c...  0.965370
735  O=[N+]([O-])c1cnc(Sc2nnc(-c3ccco3)n2-c2ccc(OCc...  0.752426
805  [C][C]([C])OC(=O)C(C(=O)O[C]([C])[C])=C1S[C]=[...  0.743927
874  CC(C)(C)OC(=O)NC1=CC