# Setup


If you are running this generator locally (e.g. in a Jupyter notebook inside a conda environment), make sure you have installed:
- RDKit
- DeepChem 2.5.0 or above
- TensorFlow 2.4.0 or above

Then skip the rest of this section and continue from `Data Preparations`.
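For a local setup, a small helper can confirm that an installed version meets the minimums listed above. This is only a sketch: the names `version_tuple` and `meets_minimum` are hypothetical helpers written for this illustration, and the comparison only looks at the leading numeric parts of the version string (so a suffix such as `.dev` is ignored).

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(minimum)


# Example with version strings written out by hand
print(meets_minimum("2.6.0.dev", "2.5.0"))
```

In a local notebook, one could call `meets_minimum(deepchem.__version__, "2.5.0")` and `meets_minimum(tf.__version__, "2.4.0")` after importing both packages.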

To increase efficiency, we recommend running this molecule generator in Colab.

In that case, first run the following lines of code. They install conda, together with the DeepChem environment, in Colab.

In [47]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3501  100  3501    0     0  33342      0 --:--:-- --:--:-- --:--:-- 33342


python version: 3.7.10
remove current miniconda
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added conda-forge to channels
added omnia to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [49]:
!pip install --pre deepchem
import deepchem
deepchem.__version__



'2.6.0.dev'

# Data Preparations

Now we are ready to import some useful functions/packages, along with our model.

## Import Data

In [3]:
import model  # our model

In [4]:
from rdkit import Chem
from rdkit.Chem import AllChem

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
import deepchem as dc 

Next, we load the dataset for training.

For demonstration, we use a dataset from an in-vitro assay that detects inhibition of the SARS-CoV 3CL protease via fluorescence.

The dataset originally comes from [PubChem AID1706](https://pubchem.ncbi.nlm.nih.gov/bioassay/1706). It was previously processed by the [JClinic AIcure](https://www.aicures.mit.edu/) team at MIT into this [binarized label form](https://github.com/yangkevin2/coronavirus_data/blob/master/data/AID1706_binarized_sars.csv).

In [10]:
df = pd.read_csv('AID1706_binarized_sars.csv')

In [11]:
df.head()

Unnamed: 0,smiles,activity
0,CC1=CC=C(O1)C(C(=O)NCC2=CC=CO2)N(C3=CC=C(C=C3)...,1
1,CC1=CC=C(C=C1)S(=O)(=O)N2CCN(CC2)S(=O)(=O)C3=C...,1
2,CC1=CC2=C(C=C1)NC(=O)C(=C2)CN(CCC3=CC=CC=C3)CC...,1
3,CC1=CC=C(C=C1)CN(C(C2=CC=CS2)C(=O)NCC3=CC=CO3)...,1
4,CCN1C2=NC(=O)N(C(=O)C2=NC(=N1)C3=CC=CC=C3)C,1


In [47]:
df.groupby('activity').count()

Unnamed: 0_level_0,smiles
activity,Unnamed: 1_level_1
0,290321
1,405


The data above contains a 'smiles' column, which holds the SMILES representation of each molecule, and an 'activity' column, which is the label specifying whether that molecule is considered a hit for the protein.

Here, we only need the 405 molecules considered hits. We'll extract features from them to generate new molecules that may also be hits.

In [12]:
true = df[df['activity']==1]
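To put the size of this subset in perspective, the counts from the `groupby` output above can be turned into a hit fraction. The variable names below are only for this illustration.

```python
# Counts copied from the groupby('activity') output above
n_inactive = 290321
n_active = 405

total = n_inactive + n_active
hit_fraction = n_active / total

print(f"{n_active} hits out of {total} molecules ({hit_fraction:.2%})")
```

The hits make up well under one percent of the dataset, so the generator is trained on a small subset of the full assay data.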

## Set Minimum Length for molecules

Since we'll be using a graph neural network, it is helpful and more efficient if all of our graphs are the same size. We therefore remove from the training set any molecules that are shorter (i.e. have fewer atoms) than our desired minimum size.

In [13]:
num_atoms = 6 #here the minimum length of molecules is 6

In [14]:
input_df = true['smiles']
# Number of atoms in each molecule, computed with RDKit
df_length = [Chem.MolFromSmiles(smiles).GetNumAtoms() for smiles in input_df]

In [15]:
# Add a column containing each molecule's length (number of atoms).
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
true = true.assign(length=df_length)


In [16]:
true = true[true['length'] > num_atoms]  # keep only molecules with more than num_atoms (6) atoms
input_df = true['smiles']
input_df_smiles = input_df.apply(Chem.MolFromSmiles)  # convert the SMILES strings into RDKit molecules


Now, we are ready to apply the `featurizer` function to our molecules to convert them into graphs with nodes and edges for training.

In [20]:
#input_df = input_df.apply(Chem.MolFromSmiles) 
train_set = input_df_smiles.apply(lambda x: model.featurizer(x, max_length=num_atoms))

In [21]:
train_set

0      ([6, 6, 6, 6, 6, 8], [[0, 2, 0, 0, 0, 0], [2, ...
1      ([6, 6, 6, 6, 6, 6], [[0, 2, 0, 0, 0, 0], [2, ...
2      ([6, 6, 6, 6, 6, 6], [[0, 2, 0, 0, 0, 0], [2, ...
3      ([6, 6, 6, 6, 6, 6], [[0, 2, 0, 0, 0, 0], [2, ...
4      ([6, 6, 7, 6, 7, 6], [[0, 2, 0, 0, 0, 0], [2, ...
                             ...                        
400    ([6, 6, 8, 6, 6, 8], [[0, 2, 0, 0, 0, 2], [2, ...
401    ([6, 8, 6, 8, 6, 6], [[0, 2, 0, 0, 0, 0], [2, ...
402    ([6, 8, 6, 6, 6, 6], [[0, 2, 0, 0, 0, 0], [2, ...
403    ([6, 7, 6, 6, 6, 7], [[0, 2, 0, 0, 0, 0], [2, ...
404    ([6, 6, 8, 6, 7, 16], [[0, 2, 0, 0, 2, 0], [2,...
Name: smiles, Length: 405, dtype: object
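The internals of `model.featurizer` are not shown in this notebook. As a rough illustration of the kind of output printed above (a list of atomic numbers plus a square matrix of bond codes, both limited to `max_length` atoms), here is a minimal pure-Python sketch. The function `toy_featurizer`, its inputs, and the bond codes are hypothetical and are not taken from the `model` module, whose actual encoding may differ.

```python
def toy_featurizer(atomic_numbers, bonds, max_length):
    """Hypothetical stand-in for a graph featurizer (NOT model.featurizer).

    atomic_numbers: list of ints, one per atom.
    bonds: list of (i, j, code) tuples, where `code` is a positive integer
           label for the bond type (an assumption made for this sketch).
    Returns (nodes, edges) restricted to the first `max_length` atoms.
    """
    nodes = list(atomic_numbers[:max_length])
    size = len(nodes)
    edges = [[0] * size for _ in range(size)]
    for i, j, code in bonds:
        # Skip bonds that involve atoms beyond the truncation point
        if i < size and j < size:
            edges[i][j] = code
            edges[j][i] = code  # undirected graph: keep the matrix symmetric
    return nodes, edges


# Ethanol-like toy input: C-C-O with two bonds labelled 1
nodes, edges = toy_featurizer([6, 6, 8], [(0, 1, 1), (1, 2, 1)], max_length=2)
print(nodes, edges)
```

Truncating to a fixed number of atoms is what gives every training example the same shape, at the cost of discarding the rest of the molecule.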

In [22]:
nodes_train, edges_train = list(zip(*train_set) )
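The `zip(*...)` idiom used here splits a sequence of `(nodes, edges)` pairs into one tuple of all nodes and one tuple of all edges. A small example with hand-written toy pairs (not real featurizer output) shows the effect:

```python
# Two toy (nodes, edges) pairs, written by hand for illustration
pairs = [
    ([6, 6, 8], [[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
    ([6, 7, 6], [[0, 1, 0], [1, 0, 2], [0, 2, 0]]),
]

# zip(*pairs) groups the first items together and the second items together
nodes, edges = zip(*pairs)

print(nodes)       # all node lists, one per molecule
print(len(edges))  # one adjacency matrix per molecule
```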

# Training

Now we're finally ready to generate new molecules. We'll first import some necessary functions from TensorFlow.

In [23]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

The network we'll be using is a Generative Adversarial Network (GAN); here's a great [introduction](https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/).
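As a reminder of how the two networks in a GAN compete, the sketch below computes the standard (non-saturating) GAN losses from discriminator scores in plain Python. It is only an illustration of the general idea: the function names are hypothetical, the scores are made up, and the `model` module used in this notebook may define its losses differently.

```python
import math

EPS = 1e-12  # small constant to avoid log(0)


def discriminator_loss(real_scores, fake_scores):
    """Binary cross-entropy: real samples should score 1, generated ones 0."""
    real_term = sum(-math.log(s + EPS) for s in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - s + EPS) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """The generator is rewarded when the discriminator scores its samples near 1."""
    return sum(-math.log(s + EPS) for s in fake_scores) / len(fake_scores)


# Scores are probabilities in (0, 1) that a sample is real
d_loss = discriminator_loss(real_scores=[0.9, 0.8], fake_scores=[0.2, 0.1])
g_loss = generator_loss(fake_scores=[0.2, 0.1])
print(d_loss, g_loss)
```

When the discriminator confidently rejects generated samples, as in these made-up scores, its own loss is low while the generator's loss is high, which is the signal that pushes the generator to improve.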


In [175]:
no, ed = generator(np.random.randint(0,2, size =(1,100)))

In [None]:
discriminator(generator(np.random.randint(0,2, size =(1,100))))

In [204]:
cat = de_featurizer(abs(no.numpy()*10).astype(int).reshape(11), abs(ed.numpy()*10).astype(int).reshape(11,11))

In [207]:
cat

<rdkit.Chem.rdchem.RWMol at 0x7fde641b1c70>

In [205]:
Chem.MolToSmiles(cat)

'[Li]123~b45678~B9%10%11%12%13~[H]4%14%15%16~[h]5=194%17%18~[li]=6159c=46%19(~B#%14=%10=%1724=[H]#%11=7316#c%18#%19#%12=%15=5=4)[H]=9=8%16%13'

In [206]:
abs(no.numpy()*10).astype(int).reshape(11), abs(ed.numpy()*10).astype(int).reshape(11,11)

(array([5, 1, 1, 6, 1, 1, 3, 6, 3, 5, 5]),
 array([[1, 1, 2, 4, 2, 4, 0, 0, 0, 1, 3],
        [3, 3, 1, 3, 2, 0, 8, 6, 7, 2, 4],
        [1, 1, 0, 5, 0, 0, 3, 3, 1, 5, 3],
        [1, 0, 1, 2, 9, 4, 0, 4, 3, 0, 3],
        [0, 2, 2, 7, 2, 8, 0, 2, 3, 3, 0],
        [0, 1, 0, 3, 5, 3, 2, 2, 2, 3, 3],
        [3, 2, 4, 0, 0, 4, 3, 0, 0, 1, 2],
        [2, 5, 4, 0, 3, 2, 4, 0, 5, 0, 1],
        [0, 1, 5, 5, 0, 2, 1, 4, 3, 3, 0],
        [4, 4, 6, 3, 0, 6, 7, 2, 7, 1, 0],
        [4, 0, 4, 5, 9, 2, 4, 7, 3, 7, 0]]))
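The expression `abs(x * 10).astype(int)` used above turns the generator's floating-point outputs into non-negative integer codes by scaling, taking the absolute value, and truncating toward zero. A plain-Python equivalent, with made-up input values and a hypothetical helper name `discretize`, looks like this:

```python
def discretize(values, scale=10):
    """Scale each float, take the absolute value, and truncate to an int."""
    return [int(abs(v * scale)) for v in values]


# Made-up raw outputs, standing in for generator values
raw_nodes = [0.53, -0.12, 0.69]
node_codes = discretize(raw_nodes)
print(node_codes)
```

Note that this step truncates rather than rounds, and that nothing in it guarantees the resulting node and edge codes describe a chemically valid molecule.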