# Combinatorial library data preparation

## Aim of this notebook

Extract information from the combinatorial library `json` file:

- Number of ligands
- Number of atoms for each recombined ligand
- Number of ligands that fulfill Lipinski's rule of five (Ro5), i.e. at least three of the four criteria below
- Number of ligands that fulfill each individual Ro5 criterion: (i) molecular weight <= 500 Da, (ii) number of hydrogen bond donors <= 5, (iii) number of hydrogen bond acceptors <= 10, and (iv) logP value <= 5
- Number of ligands per subpocket combination
- Ligands with exact matches in original KLIFS ligands
- Ligands with substructure matches in original KLIFS ligands
- Ligands with exact matches in ChEMBL
- Most similar ligand in ChEMBL for each recombined ligand (molecule ChEMBL ID and similarity value)

Since the `json` file holds several million ligands, we process the data once here and save the results to separate files, which are used for analysis and visualization in the following notebooks.

## Table of contents

1. Combinatorial library data
2. Get properties from `json` file
3. Show results overview
4. Save properties to `csv`/`json` files

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
from pathlib import Path

import ijson
import matplotlib.pyplot as plt
import pandas as pd
from rdkit import Chem

from utils import get_original_ligands, get_ro5_from_mol, read_fragment_library

## 1. Combinatorial library data

### Download data

The combinatorial library comes as a large `json` file (3.3 GB).

**Note 1**: Due to its size, this file is not included in this GitHub repository but can be downloaded alongside all output files from this notebook from https://osf.io/cy7tr/. Save files to `../data/combinatorial_library/` to run this notebook as-is.

**Note 2**: This notebook prepares data for the subsequent analysis notebooks. Since all output files from this notebook are included in the download, it is not necessary to rerun this notebook (which takes about 20 minutes) to continue with the analysis notebooks.

### Explain the data format

This `json` file contains a list of dictionaries, each describing a recombined ligand's properties:

`[ligand_1_dict, ligand_2_dict, ..., ligand_n_dict]`

This is an example dictionary:

```
{
    "bond_ids": [["AP_9", "SE_6"], ["GA_14", "AP_10"], ["B1_8", "GA_13"]], 
    "fragment_ids": ["B1_13", "SE_138", "GA_77", "AP_12"], 
    "hba": 1, 
    "hbd": 1, 
    "mwt": 1, 
    "logp": 1, 
    "n_atoms": 36, 
    "chembl_exact": 0, 
    "chembl_most_similar": ["CHEMBL4089123", 0.52], 
    "original_exact": 0, 
    "original_substructure": 0, 
    "inchi": "InChI=1S/C27H21FN6O2/c1-16-21(27(36)31-32(16)2)12-24(35)33-11-10-20-23(33)9-8-19(25(20)28)22-14-30-34-15-18(13-29-26(22)34)17-6-4-3-5-7-17/h3-11,13-15H,12H2,1-2H3,(H,31,36)"
}
```

The dictionary keys hold the following values:

- `bond_ids` and `fragment_ids`: 
  - Bond IDs (`bond_ids`), e.g. `[["AP_9", "SE_6"], ["GA_14", "AP_10"], ["B1_8", "GA_13"]]`: Atom (`<subpocket>_<atom ID>`) pairs per fragment bond
  - Fragment IDs (`fragment_ids`), e.g. `["B1_13", "SE_138", "GA_77", "AP_12"]` (`<subpocket>_<fragment index in subpocket pool>`)
  - With this information it is possible to construct the recombined ligand from the fragment library
- `hba`, `hbd`, `mwt`, and `logp`: Does the ligand fulfill the respective Lipinski's rule of five criterion? (`0` or `1`; these are flags, not the property values themselves)
  - Number of hydrogen bond acceptors (`hba`) <= 10
  - Number of hydrogen bond donors (`hbd`) <= 5
  - Molecular weight (`mwt`) <= 500
  - LogP value (`logp`) <= 5 
- `n_atoms`: Number of heavy atoms
- `chembl_exact`: Ligand has an exact match in ChEMBL? (`0` or `1`)
- `chembl_most_similar`: Most similar molecule in ChEMBL, e.g. `["CHEMBL4089123", 0.52]` (`[<molecule ChEMBL ID>, <Tanimoto similarity>]`)
- `original_exact`: Ligand has an exact match among the original ligands? (`0` or `1`)
- `original_substructure`: Ligand is a substructure of an original ligand? (`0` or `1`)
- `inchi`: InChI 
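
As a small illustration of the ID conventions above, the following sketch splits the IDs of the example dictionary into their subpocket and number parts. The helper name `split_id` is hypothetical and not part of this repository's `utils` module, and the sketch assumes that each subpocket occurs at most once per ligand, as in the example.

```python
# Example record, reduced to the two keys used here (values from the example above)
ligand = {
    "bond_ids": [["AP_9", "SE_6"], ["GA_14", "AP_10"], ["B1_8", "GA_13"]],
    "fragment_ids": ["B1_13", "SE_138", "GA_77", "AP_12"],
}


def split_id(identifier):
    """Split '<subpocket>_<number>' into (subpocket, number). Hypothetical helper."""
    subpocket, number = identifier.split("_")
    return subpocket, int(number)


# Fragment IDs: subpocket name -> fragment index in that subpocket's pool
fragments = dict(split_id(fragment_id) for fragment_id in ligand["fragment_ids"])

# Bond IDs: each bond is a pair of (subpocket, atom ID) tuples
bonds = [tuple(split_id(atom_id) for atom_id in bond) for bond in ligand["bond_ids"]]

# Which subpockets are connected by each bond
connections = [(first[0], second[0]) for first, second in bonds]

print(fragments)
print(connections)
```

For the example ligand, the three bonds connect AP with SE, GA with AP, and B1 with GA.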

### Set file/folder paths

In [3]:
PATH_FRAGMENT_LIBRARY = Path('../data/fragment_library/')
PATH_COMBINATORIAL_LIBRARY = Path('../data/combinatorial_library/combinatorial_library_deduplicated.json')

In order to access/filter the ligands' properties efficiently (time and memory), we use the `ijson` library:

> Ijson is an iterative JSON parser with standard Python iterator interfaces.

https://pypi.org/project/ijson/
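
A minimal sketch of the streaming pattern, run on a small in-memory JSON list instead of the real file. `ijson.items(file, 'item')` yields one element of the top-level list at a time, so only a single ligand dictionary is held in memory per iteration. The function name `iter_ligands` and the toy data are made up for illustration; the `json.load` fallback is only there so that the sketch also runs where `ijson` is not installed, and it would not be suitable for the 3.3 GB file because it loads everything at once.

```python
import io
import json

try:
    import ijson  # streaming parser used in this notebook
except ImportError:
    ijson = None  # illustration-only fallback, see iter_ligands


def iter_ligands(handle):
    """Yield ligand dictionaries one by one from a binary file handle."""
    if ijson is not None:
        # 'item' addresses each element of the top-level JSON list
        yield from ijson.items(handle, "item")
    else:
        # Not streaming: loads the whole list into memory first
        yield from json.load(handle)


# Tiny stand-in for the combinatorial library file
toy_json = b'[{"n_atoms": 36, "mwt": 1}, {"n_atoms": 22, "mwt": 0}]'

n_ligands = 0
n_atoms = []
with io.BytesIO(toy_json) as handle:
    for ligand in iter_ligands(handle):
        n_ligands += 1
        n_atoms.append(ligand["n_atoms"])

print(n_ligands, n_atoms)
```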

## 2. Get properties from `json` file

In [4]:
def get_properties(path_combinatorial_library):
    """
    Get a set of properties from the combinatorial library json file:
    
    - Number of ligands
    - Number of atoms for each recombined ligand
    - Number of ligands that fulfill Lipinski's rule of five (Ro5), i.e. at least three of the four criteria
    - Number of ligands that fulfill each Ro5 criterion: (i) molecular weight <= 500 Da, (ii) number of hydrogen bond donors <= 5, (iii) number of hydrogen bond acceptors <= 10, and (iv) logP value <= 5
    - Number of ligands per subpocket combination
    - Ligands with exact matches in original KLIFS ligands
    - Ligands with substructure matches in original KLIFS ligands
    - Ligands with exact matches in ChEMBL
    - Most similar ligand in ChEMBL for each recombined ligand (molecule ChEMBL ID and similarity value)
    
    Parameters
    ----------
    path_combinatorial_library : pathlib.Path
        Path to combinatorial library json file.
    
    Returns
    -------
    dict
        Combinatorial library properties.
    """
    
    print(datetime.now())

    # Get object generator from json
    f = open(path_combinatorial_library, 'rb')
    objects = ijson.items(f, 'item')

    # Filter objects
    properties = {
        'n_ligands': 0,
        'n_atoms': [],
        'lipinski': 0,
        'mw': 0, 
        'logp': 0, 
        'hbd': 0, 
        'hba': 0, 
        'subpockets': {},
        'original_exact': [], 
        'original_substructure': [], 
        'chembl_exact': [],
        'chembl_most_similar': []
    }
    
    for o in objects:
        
        # Get number of ligands and number of atoms per ligand
        properties['n_ligands'] += 1
        properties['n_atoms'].append(o['n_atoms'])
        properties['chembl_most_similar'].append([o['fragment_ids'], o['bond_ids'], o['chembl_most_similar'][0], o['chembl_most_similar'][1]])
        
        # Count ligands per subpocket combination (first two characters of each fragment ID)
        subpocket_key = "-".join(sorted(fragment_id[:2] for fragment_id in o['fragment_ids']))
        properties['subpockets'][subpocket_key] = properties['subpockets'].get(subpocket_key, 0) + 1
        
        # Count ligands per fulfilled Ro5 criterion; Ro5 itself allows one violation (>= 3 of 4 criteria met)
        if o['mwt'] == 1:
            properties['mw'] += 1
        if o['logp'] == 1:
            properties['logp'] += 1
        if o['hbd'] == 1:
            properties['hbd'] += 1
        if o['hba'] == 1:
            properties['hba'] += 1
        if o['hba']+o['hbd']+o['mwt']+o['logp'] >= 3:
            properties['lipinski'] += 1
            
        # Get KLIFS and ChEMBL matches
        if o['original_exact'] == 1:
            properties['original_exact'].append(o)
        if o['original_substructure'] == 1:
            properties['original_substructure'].append(o)
        if o['chembl_exact'] == 1:
            properties['chembl_exact'].append(o)

    properties['original_exact'] = pd.DataFrame(properties['original_exact'])
    properties['original_substructure'] = pd.DataFrame(properties['original_substructure'])
    properties['chembl_exact'] = pd.DataFrame(properties['chembl_exact'])
    properties['chembl_most_similar'] = pd.DataFrame(
        properties['chembl_most_similar'], 
        columns=['fragment_ids', 'bond_ids', 'chembl_id', 'similarity']
    )
        
    # Release the file handle opened for ijson
    f.close()

    print(datetime.now())
    
    return properties
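
The counting logic inside the loop can be checked on a few toy records without touching the large file. The sketch below mirrors the Ro5 condition used above (at least three of the four flags set) and the subpocket key; the function name `summarize` and the toy records are made up for illustration, and `collections.Counter` stands in for the manual dictionary bookkeeping.

```python
from collections import Counter


def summarize(ligands):
    """Tally Ro5 compliance and subpocket combinations for an iterable of ligand dicts."""
    n_ro5 = 0
    subpockets = Counter()
    for ligand in ligands:
        # Ro5 is fulfilled if at least three of the four criteria are met
        n_fulfilled = ligand["hba"] + ligand["hbd"] + ligand["mwt"] + ligand["logp"]
        if n_fulfilled >= 3:
            n_ro5 += 1
        # The first two characters of each fragment ID are the subpocket name
        key = "-".join(sorted(fragment_id[:2] for fragment_id in ligand["fragment_ids"]))
        subpockets[key] += 1
    return n_ro5, dict(subpockets)


# Toy records (flags chosen to cover 4, 2, and 3 fulfilled criteria)
toy_ligands = [
    {"hba": 1, "hbd": 1, "mwt": 1, "logp": 1, "fragment_ids": ["B1_13", "SE_138", "GA_77", "AP_12"]},
    {"hba": 1, "hbd": 1, "mwt": 0, "logp": 0, "fragment_ids": ["B1_19", "GA_35", "AP_6", "SE_1"]},
    {"hba": 1, "hbd": 1, "mwt": 0, "logp": 1, "fragment_ids": ["AP_99", "SE_16", "FP_25"]},
]

n_ro5, subpockets = summarize(toy_ligands)
print(n_ro5, subpockets)
```

Sorting the subpocket names before joining them makes the key independent of the order in which fragments are listed, so `["B1_13", "SE_138", "GA_77", "AP_12"]` and `["B1_19", "GA_35", "AP_6", "SE_1"]` both count towards `AP-B1-GA-SE`.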

In [5]:
# Takes about 20-25 minutes (see the timestamps below)
properties = get_properties(PATH_COMBINATORIAL_LIBRARY)

2020-07-16 22:00:08.906837
2020-07-16 22:24:57.114274


## 3. Show results overview

In [6]:
properties['n_ligands']

6720637

In [7]:
len(properties['n_atoms'])

6720637

In [8]:
properties['lipinski']

4260776

In [9]:
properties['chembl_exact'].shape[0]

298

In [10]:
properties['chembl_exact'].head()

| | bond_ids | fragment_ids | hba | hbd | mwt | logp | n_atoms | chembl_exact | chembl_most_similar | original_exact | original_substructure | inchi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [[GA_10, B1_7], [GA_9, AP_11], [AP_12, SE_13]] | [B1_19, GA_35, AP_6, SE_1] | 1 | 1 | 0 | 0 | 40 | 1 | [CHEMBL2347516, 1.0] | 0 | 0 | InChI=1S/C32H28ClFN4O2/c33-29-18-27(9-11-31(29... |
| 1 | [[SE_8, AP_8], [FP_6, AP_9]] | [AP_99, SE_16, FP_25] | 1 | 1 | 1 | 1 | 22 | 1 | [CHEMBL265923, 1.0] | 0 | 0 | InChI=1S/C16H13N3O2S/c1-21-13-6-4-12(5-7-13)18... |
| 2 | [[FP_6, AP_9], [AP_8, SE_6]] | [AP_99, FP_103, SE_138] | 1 | 1 | 1 | 1 | 20 | 1 | [CHEMBL1446166, 0.9400000000000001] | 0 | 0 | InChI=1S/C16H12N2OS/c19-15(17-13-9-5-2-6-10-13... |
| 3 | [[AP_8, SE_7], [FP_6, AP_9]] | [AP_99, FP_103, SE_38] | 1 | 1 | 1 | 1 | 21 | 1 | [CHEMBL1446166, 1.0] | 0 | 0 | InChI=1S/C16H11ClN2OS/c17-12-6-8-13(9-7-12)18-... |
| 4 | [[GA_10, B1_7], [GA_9, AP_11], [AP_12, SE_14]] | [GA_35, SE_2, AP_6, B1_19] | 1 | 1 | 0 | 0 | 41 | 1 | [CHEMBL2347518, 1.0] | 0 | 0 | InChI=1S/C32H26ClFN4O3/c33-28-18-26(9-11-30(28... |


In [11]:
properties['chembl_most_similar'].shape[0]

6720637

In [12]:
properties['chembl_most_similar'].head()

| | fragment_ids | bond_ids | chembl_id | similarity |
|---|---|---|---|---|
| 0 | [B1_13, SE_138, GA_77, AP_12] | [[AP_9, SE_6], [GA_14, AP_10], [B1_8, GA_13]] | CHEMBL4089123 | 0.52 |
| 1 | [FP_3, SE_2, GA_41, AP_14] | [[FP_7, GA_11], [AP_7, FP_6], [SE_14, AP_8]] | CHEMBL3641905 | 0.61 |
| 2 | [GA_9, AP_11, B1_1, SE_100] | [[GA_9, B1_11], [GA_10, AP_12], [AP_11, SE_13]] | CHEMBL3355540 | 0.63 |
| 3 | [FP_3, AP_21, SE_116, GA_16] | [[FP_7, GA_8], [SE_11, AP_14], [FP_6, AP_13]] | CHEMBL149168 | 0.5700000000000001 |
| 4 | [GA_49, SE_138, FP_130, AP_14] | [[FP_11, GA_13], [AP_7, FP_12], [AP_8, SE_6]] | CHEMBL3759442 | 0.65 |


In [13]:
properties['original_exact'].shape[0]

35

In [14]:
properties['original_exact'].head()

| | bond_ids | fragment_ids | hba | hbd | mwt | logp | n_atoms | chembl_exact | chembl_most_similar | original_exact | original_substructure | inchi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [[FP_10, AP_8], [SE_10, AP_9]] | [AP_95, SE_22, FP_14] | 1 | 1 | 1 | 1 | 28 | 1 | [CHEMBL2203552, 0.71] | 1 | 1 | InChI=1S/C21H17N5O2/c1-28-18-11-24-20(13-8-12-... |
| 1 | [[GA_10, B1_7], [GA_9, AP_12]] | [AP_134, GA_35, B1_19] | 1 | 1 | 1 | 1 | 28 | 0 | [CHEMBL205652, 0.6000000000000001] | 1 | 1 | InChI=1S/C22H17ClFN3O/c1-14-5-7-20-18(9-14)22(... |
| 2 | [[GA_10, AP_10]] | [AP_117, GA_37] | 1 | 1 | 1 | 1 | 20 | 1 | [CHEMBL1354522, 0.6900000000000001] | 1 | 1 | InChI=1S/C13H10ClFN4O/c14-9-3-1-4-10(15)8(9)7-... |
| 3 | [[SE_13, AP_8], [AP_7, FP_8]] | [AP_24, FP_180, SE_1] | 1 | 1 | 1 | 1 | 28 | 1 | [CHEMBL3971067, 1.0] | 1 | 1 | InChI=1S/C21H26N6O/c1-15(2)27-14-20(24-25-27)1... |
| 4 | [[SE_9, AP_10], [GA_6, AP_9]] | [GA_83, AP_69, SE_82] | 1 | 1 | 1 | 1 | 24 | 1 | [CHEMBL3425867, 1.0] | 1 | 1 | InChI=1S/C17H18N4O2S/c1-22-15-9-11(2-3-13(15)1... |


In [15]:
properties['original_substructure'].shape[0]

324

In [16]:
properties['original_substructure'].head()

| | bond_ids | fragment_ids | hba | hbd | mwt | logp | n_atoms | chembl_exact | chembl_most_similar | original_exact | original_substructure | inchi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [[FP_7, GA_4], [SE_10, AP_8], [AP_7, FP_6]] | [FP_3, GA_40, AP_14, SE_18] | 1 | 1 | 1 | 1 | 26 | 0 | [CHEMBL1922009, 0.72] | 0 | 1 | InChI=1S/C18H23N5O3/c1-25-14-6-5-13(10-15(14)2... |
| 1 | [[AP_8, SE_7], [AP_7, FP_8], [FP_7, GA_6]] | [FP_128, GA_45, AP_14, SE_27] | 1 | 1 | 1 | 1 | 27 | 0 | [CHEMBL3356000, 0.77] | 0 | 1 | InChI=1S/C21H23N5O/c22-15-8-10-16(11-9-15)25-2... |
| 2 | [[SE_10, FP_14], [AP_7, GA_3], [AP_6, SE_9]] | [SE_73, GA_14, AP_18, FP_83] | 1 | 1 | 1 | 1 | 32 | 0 | [CHEMBL1222710, 0.71] | 0 | 1 | InChI=1S/C22H23FN8O/c1-13(2)15-11-19(28-27-15)... |
| 3 | [[SE_12, AP_8], [AP_7, FP_6], [FP_7, GA_4]] | [FP_3, SE_15, GA_31, AP_14] | 1 | 1 | 1 | 1 | 29 | 0 | [CHEMBL1922009, 0.71] | 0 | 1 | InChI=1S/C20H27N5O4/c1-21-19(26)13-6-5-9-25(12... |
| 4 | [[FP_7, GA_4], [AP_7, FP_6], [B1_6, GA_5]] | [FP_3, GA_31, B1_23, AP_14] | 1 | 1 | 1 | 1 | 23 | 0 | [CHEMBL370861, 0.6900000000000001] | 0 | 1 | InChI=1S/C17H21N5O/c18-17-19-9-8-15(21-17)22-1... |


In [17]:
properties['subpockets']

{'AP-B1-GA-SE': 255096,
 'AP-FP-GA-SE': 5016284,
 'AP-B2-GA-SE': 359534,
 'AP-B2-FP-GA': 178027,
 'AP-B1-FP-GA': 209843,
 'AP-B1-B2-GA': 3498,
 'AP-GA-SE': 102733,
 'AP-FP-SE': 512671,
 'AP-FP-GA': 71885,
 'AP-B2-GA': 1812,
 'AP-B1-GA': 1279,
 'AP-GA': 682,
 'AP-FP': 5924,
 'AP-SE': 1369}
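
As a quick consistency check on the numbers above, the per-combination counts should add up to the total number of ligands, and relative shares follow directly from them. The sketch below hard-codes the values printed in this section; the variable names `shares` and `largest` are illustrative.

```python
# Values copied from the outputs shown in this section
n_ligands = 6720637
n_lipinski = 4260776
subpockets = {
    'AP-B1-GA-SE': 255096,
    'AP-FP-GA-SE': 5016284,
    'AP-B2-GA-SE': 359534,
    'AP-B2-FP-GA': 178027,
    'AP-B1-FP-GA': 209843,
    'AP-B1-B2-GA': 3498,
    'AP-GA-SE': 102733,
    'AP-FP-SE': 512671,
    'AP-FP-GA': 71885,
    'AP-B2-GA': 1812,
    'AP-B1-GA': 1279,
    'AP-GA': 682,
    'AP-FP': 5924,
    'AP-SE': 1369,
}

# Every ligand belongs to exactly one subpocket combination
assert sum(subpockets.values()) == n_ligands

# Relative share of each combination
shares = {key: count / n_ligands for key, count in subpockets.items()}
largest = max(shares, key=shares.get)

print(f"{largest}: {shares[largest]:.1%}")
print(f"Ro5-compliant: {n_lipinski / n_ligands:.1%}")
```

With these values, `AP-FP-GA-SE` accounts for about 74.6% of the library, and about 63.4% of all ligands fulfill the Ro5.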

## 4. Save properties to `csv`/`json` files

In [18]:
properties['original_exact'].to_json(
    '../data/combinatorial_library/original_exact.json'
)

In [19]:
properties['original_substructure'].to_json(
    '../data/combinatorial_library/original_substructure.json'
)

In [20]:
properties['chembl_exact'].to_json(
    '../data/combinatorial_library/chembl_exact.json'
)

In [21]:
# Takes a moment (20s)
properties['chembl_most_similar'].to_json(
    '../data/combinatorial_library/chembl_most_similar.json'
)

In [22]:
# Takes a moment (20s)
pd.DataFrame(properties['n_atoms']).to_csv(
    '../data/combinatorial_library/n_atoms.csv',
    index=False,
    header=False
)

In [23]:
subpockets = pd.DataFrame.from_dict(properties['subpockets'], orient="index")
subpockets = subpockets.rename(columns={0: 'count'})
subpockets.to_csv(
    '../data/combinatorial_library/subpockets.csv'
)

In [24]:
pd.Series(
    {key: properties[key] for key in ['mw', 'logp', 'hbd', 'hba', 'lipinski', 'n_ligands']}
).to_csv(
    '../data/combinatorial_library/properties.csv',
    header=False
)
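
The following notebooks need to read these files back. With default settings, `DataFrame.to_json` writes a column-oriented layout, `{column: {index: value}}`, which `pd.read_json` can parse. The sketch below uses only the standard library to show that layout on a two-row stand-in, with values taken from the `chembl_most_similar` preview above; the variable names are illustrative and the real file additionally holds the `fragment_ids` and `bond_ids` columns.

```python
import json

# Stand-in for the content of chembl_most_similar.json (two rows, two columns)
text = (
    '{"chembl_id": {"0": "CHEMBL4089123", "1": "CHEMBL3641905"}, '
    '"similarity": {"0": 0.52, "1": 0.61}}'
)

columns = json.loads(text)

# Rebuild row-wise records from the column-oriented layout
row_labels = list(columns["chembl_id"])
rows = [
    {name: values[label] for name, values in columns.items()}
    for label in row_labels
]

print(rows)
```

Note that the row labels are stored as strings (`"0"`, `"1"`, ...) because JSON object keys cannot be integers.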