# Combinatorial library data

## Aim of this notebook

Extract information from the combinatorial library `json` file:

- Number of ligands
- Number of ligands that fulfill Lipinski's rule of five (Ro5), i.e. at least three of the four criteria below
- Number of ligands that fulfill each individual Ro5 criterion: (i) molecular weight <= 500 Da, (ii) number of hydrogen bond donors <= 5, (iii) number of hydrogen bond acceptors <= 10, and (iv) logP value <= 5
- Ligands with exact matches in ChEMBL
- Ligands with exact matches in original KLIFS ligands
- Ligands with substructure matches in original KLIFS ligands

Since the `json` file holds several million ligands, we process the data once here and save the results to separate files, which are used for analysis and visualization in the following notebooks.

## Table of contents

1. Combinatorial library data
2. Get properties from `json` file
3. Show results overview
4. Save properties to `csv`/`json` files

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
from pathlib import Path

import ijson
import matplotlib.pyplot as plt
import pandas as pd
from rdkit import Chem

from utils import get_original_ligands, get_ro5_from_mol, read_fragment_library

In [3]:
PATH_FRAGMENT_LIBRARY = Path('../data/fragment_library/')
PATH_COMBINATORIAL_LIBRARY = Path('../data/combinatorial_library/combinatorial_library.json')

## 1. Combinatorial library data

The combinatorial library comes as a large JSON file. To access and filter ligands efficiently in terms of time and memory, we use the `ijson` library:

> Ijson is an iterative JSON parser with standard Python iterator interfaces.

https://pypi.org/project/ijson/

Here, we take a look at the first entry in the combinatorial library:

In [4]:
# open file
f = open(PATH_COMBINATORIAL_LIBRARY, 'rb')

# yield native Python objects out of a JSON stream located under a prefix
objects = ijson.items(f, 'item')

next(objects)

{'bond_ids': [['GA_14', 'AP_10'], ['B1_8', 'GA_13'], ['AP_9', 'SE_6']],
 'fragment_ids': ['GA_77', 'AP_12', 'B1_13', 'SE_138'],
 'hba': 1,
 'hbd': 1,
 'mwt': 1,
 'logp': 1,
 'n_atoms': 36,
 'chembl_exact': 0,
 'original_exact': 0,
 'original_substructure': 0,
 'inchi': 'InChI=1S/C27H21FN6O2/c1-16-21(27(36)31-32(16)2)12-24(35)33-11-10-20-23(33)9-8-19(25(20)28)22-14-30-34-15-18(13-29-26(22)34)17-6-4-3-5-7-17/h3-11,13-15H,12H2,1-2H3,(H,31,36)'}
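
The same streaming pattern can be tried on a small in-memory stand-in for the library file. The sketch below is an illustration only: the two toy entries are made up, and the fallback to the standard `json` module is an assumption added so that the example also runs where `ijson` is not installed.

```python
import io
import json

try:
    import ijson
except ImportError:  # assumption: fall back to the stdlib parser for this toy input
    ijson = None

# Two made-up entries standing in for the multi-million-entry library file
toy_library = io.BytesIO(
    b'[{"n_atoms": 36, "mwt": 1}, {"n_atoms": 41, "mwt": 0}]'
)

if ijson is not None:
    # 'item' is the prefix of each element of the top-level JSON array;
    # entries are parsed and yielded one at a time
    entries = ijson.items(toy_library, "item")
else:
    # Loads the whole array at once; only acceptable for tiny inputs
    entries = iter(json.load(toy_library))

n_atoms = [entry["n_atoms"] for entry in entries]
print(n_atoms)
```

With `ijson`, only one entry needs to be held in memory at a time, which is what makes iterating over the full library feasible.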

The combinatorial library contains the following information:
- `bond_ids` and `fragment_ids`: 
  - Bond IDs (`bond_ids`), e.g. `[["FP_6", "AP_10"], ["AP_11", "SE_13"]]`: Atom (`<subpocket>_<atom ID>`) pairs per fragment bond
  - Fragment IDs (`fragment_ids`), e.g. `["SE_2", "AP_0", "FP_2"]` (`<subpocket>_<fragment index in subpocket pool>`)
  - With this information it is possible to construct the recombined ligand from the fragment library
- `hba`, `hbd`, `mwt`, and `logp`: Ligand fulfills Lipinski's rule of five criteria? (`0` or `1`)
  - Number of hydrogen bond acceptors (`hba`) <= 10
  - Number of hydrogen bond donors (`hbd`) <= 5
  - Molecular weight (`mwt`) <= 500
  - LogP value (`logp`) <= 5 
- `n_atoms`: Number of heavy atoms
- `chembl_exact`: Ligand has exact match in ChEMBL? List of ChEMBL molecule indices (`[]` if no match)
- `chembl_most_similar`: Most similar molecule in ChEMBL, e.g. `[500021, 0.34]` (`[<ChEMBL ID without "CHEMBL" prefix>, <Tanimoto similarity>]`)
- `original_exact`: Ligand has exact match in original ligands? List of original ligand indices (`[]` if no match)
- `original_substructure`: Ligand is substructure of original ligands? List of original ligand indices (`[]` if no match)
- `inchi`: InChI

Note that in the entry shown above, `chembl_exact`, `original_exact`, and `original_substructure` are stored as `0`/`1` flags, which is also what the filtering code in the next section checks for, and `chembl_most_similar` is not present.
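
As a minimal sketch of how these fields can be read, the snippet below splits the IDs into subpocket and index and counts the fulfilled Ro5 criteria. The helper names `split_id` and `n_ro5_criteria_fulfilled` and the toy entry are hypothetical and not part of the notebook's `utils` module.

```python
def split_id(identifier):
    """Split an ID such as 'AP_10' into its subpocket and integer index."""
    subpocket, index = identifier.rsplit("_", 1)
    return subpocket, int(index)


def n_ro5_criteria_fulfilled(entry):
    """Count how many of the four Ro5 flags (each 0 or 1) are set."""
    return sum(entry[key] for key in ("hba", "hbd", "mwt", "logp"))


# Made-up entry using the example IDs from the description above
toy_entry = {
    "bond_ids": [["FP_6", "AP_10"], ["AP_11", "SE_13"]],
    "fragment_ids": ["SE_2", "AP_0", "FP_2"],
    "hba": 1,
    "hbd": 1,
    "mwt": 0,
    "logp": 1,
}

# (subpocket, fragment index) per fragment
fragments = [split_id(i) for i in toy_entry["fragment_ids"]]
# ((subpocket, atom ID), (subpocket, atom ID)) per fragment bond
bonds = [tuple(split_id(atom) for atom in bond) for bond in toy_entry["bond_ids"]]
n_fulfilled = n_ro5_criteria_fulfilled(toy_entry)

print(fragments, bonds, n_fulfilled)
```

This toy ligand violates only the molecular weight criterion, so three of the four flags are set.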

## 2. Get properties from `json` file

In [6]:
def get_properties(path_combinatorial_library):
    """
    Stream the combinatorial library and collect summary properties:
    counts per Ro5 criterion, number of heavy atoms per ligand, and
    ligands with matches in ChEMBL or in the original ligands.
    """

    print(datetime.now())

    # counters and collectors that are filled while streaming
    properties = {
        'mw': 0,
        'logp': 0,
        'hbd': 0,
        'hba': 0,
        'lipinski': 0,
        'n_ligands': 0,
        'n_atoms': [],
        'original_exact': [],
        'original_substructure': [],
        'chembl_exact': []
    }

    # stream ligands one by one; the file is closed when the block exits
    with open(path_combinatorial_library, 'rb') as f:
        for o in ijson.items(f, 'item'):

            properties['n_ligands'] += 1
            properties['n_atoms'].append(o['n_atoms'])

            # count ligands per fulfilled Ro5 criterion
            if o['mwt'] == 1:
                properties['mw'] += 1
            if o['logp'] == 1:
                properties['logp'] += 1
            if o['hbd'] == 1:
                properties['hbd'] += 1
            if o['hba'] == 1:
                properties['hba'] += 1
            # Ro5 is fulfilled if at least three of the four criteria are met
            if o['hba'] + o['hbd'] + o['mwt'] + o['logp'] >= 3:
                properties['lipinski'] += 1

            # keep full entries of ligands with matches
            if o['original_exact'] == 1:
                properties['original_exact'].append(o)
            if o['original_substructure'] == 1:
                properties['original_substructure'].append(o)
            if o['chembl_exact'] == 1:
                properties['chembl_exact'].append(o)

    properties['original_exact'] = pd.DataFrame(properties['original_exact'])
    properties['original_substructure'] = pd.DataFrame(properties['original_substructure'])
    properties['chembl_exact'] = pd.DataFrame(properties['chembl_exact'])

    print(datetime.now())

    return properties

In [7]:
properties = get_properties(PATH_COMBINATORIAL_LIBRARY)

2020-06-23 09:08:52.059705
2020-06-23 09:27:28.450966
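
The two timestamps printed above bracket the run, so their difference is the processing time. A short sketch of that arithmetic, with the timestamps copied from the output above:

```python
from datetime import datetime

# Start and end timestamps as printed by get_properties
start = datetime.fromisoformat("2020-06-23 09:08:52.059705")
end = datetime.fromisoformat("2020-06-23 09:27:28.450966")

# Subtracting two datetimes gives a timedelta
elapsed = end - start
print(elapsed)
```

The pass over the full library took roughly 18 and a half minutes, which is why the results are computed once and saved to disk.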


## 3. Show results overview

In [8]:
properties['n_ligands']

6752232

In [9]:
len(properties['n_atoms'])

6752232

In [10]:
properties['lipinski']

4285753

In [11]:
properties['chembl_exact'].shape[0]

299

In [12]:
properties['original_exact'].shape[0]

35

In [13]:
properties['original_substructure'].shape[0]

325
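
To put these counts into proportion, they can be expressed as percentages of the full library. The sketch below only reuses the numbers shown in the outputs above; the variable names are chosen for illustration.

```python
# Counts copied from the cell outputs above
n_ligands = 6752232
counts = {
    "lipinski": 4285753,
    "chembl_exact": 299,
    "original_exact": 35,
    "original_substructure": 325,
}

# Percentage of the full library for each count
percentages = {name: 100 * count / n_ligands for name, count in counts.items()}

for name, value in percentages.items():
    print(f"{name}: {value:.4f}%")
```

Roughly 63% of the recombined ligands fulfill the Ro5, whereas exact and substructure matches each make up less than 0.01% of the library.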

## 4. Save properties to `csv`/`json` files

In [14]:
properties['original_exact'].to_json(
    '../data/combinatorial_library/original_exact.json'
)
del properties['original_exact']

In [15]:
properties['original_substructure'].to_json(
    '../data/combinatorial_library/original_substructure.json'
)
del properties['original_substructure']

In [16]:
properties['chembl_exact'].to_json(
    '../data/combinatorial_library/chembl_exact.json'
)
del properties['chembl_exact']

In [22]:
# Takes a moment (20s)
pd.DataFrame(properties['n_atoms']).to_csv(
    '../data/combinatorial_library/n_atoms.csv',
    index=False,
    header=False
)
del properties['n_atoms']

In [25]:
pd.Series(properties).to_csv(
    '../data/combinatorial_library/properties.csv',
    header=False
)
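
A following notebook needs to read these counts back. The sketch below shows one way to do that with the standard `csv` module, assuming the file holds one `name,value` row per property and no header row (the layout expected from writing a `pd.Series` without a header). It uses made-up counts and a temporary file rather than the real `properties.csv`.

```python
import csv
import tempfile
from pathlib import Path

# Made-up counts; the real file is written by the cell above
toy_properties = {"mw": 3, "logp": 4, "hbd": 5, "hba": 5, "lipinski": 4, "n_ligands": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "properties.csv"

    # Write one "name,value" row per property, without a header row
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(toy_properties.items())

    # Read the rows back into a dictionary of integer counts
    with open(path, newline="") as f:
        loaded = {name: int(value) for name, value in csv.reader(f)}

print(loaded)
```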