# Combinatorial library data

## Aim of this notebook

Extract information from the combinatorial library `json` file:

- Number of ligands
- Number of ligands that fulfill Lipinski's rule of five (Ro5)
- Number of ligands that fulfill each individual Ro5 criterion: (i) molecular weight <= 500 Da, (ii) number of hydrogen bond donors <= 5, (iii) number of hydrogen bond acceptors <= 10, and (iv) logP value <= 5
- Ligands with exact matches in ChEMBL
- Ligands with exact matches in original KLIFS ligands
- Ligands with substructure matches in original KLIFS ligands

Since the `json` file holds several million ligands, we process the data once here and save the results to separate files, which the following notebooks use for analysis and visualization.
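To make the Ro5 criteria listed above concrete, here is a minimal sketch of how the four flags and the overall Ro5 decision could be derived from raw descriptor values. The function name `ro5_flags` and the example values are hypothetical. The "at least three of four criteria" threshold mirrors the check used in `get_properties` later in this notebook.

```python
def ro5_flags(mw, hbd, hba, logp):
    """Return the four Ro5 criterion flags (0 or 1) and the overall verdict."""
    flags = {
        'mwt': int(mw <= 500),    # molecular weight in Da
        'hbd': int(hbd <= 5),     # hydrogen bond donors
        'hba': int(hba <= 10),    # hydrogen bond acceptors
        'logp': int(logp <= 5),   # logP value
    }
    # Ro5 counts as fulfilled if at least 3 of the 4 criteria are met
    flags['lipinski'] = int(sum(flags.values()) >= 3)
    return flags


# Hypothetical ligand: too heavy, but all other criteria are met
example = ro5_flags(mw=520.6, hbd=2, hba=7, logp=4.1)
print(example)
```

A ligand with a single violation, like the one above, still counts as Ro5-compliant under this threshold.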

## Table of contents

1. Combinatorial library data
2. Get properties from `json` file
3. Show results overview
4. Save properties to `csv`/`json` files

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
from pathlib import Path

import ijson
import matplotlib.pyplot as plt
import pandas as pd
from rdkit import Chem

from utils import get_original_ligands, get_ro5_from_mol, read_fragment_library

In [3]:
PATH_FRAGMENT_LIBRARY = Path('../data/fragment_library/')
PATH_COMBINATORIAL_LIBRARY = Path('../data/combinatorial_library/combinatorial_library_deduplicated.json')

## 1. Combinatorial library data

The combinatorial library comes as a large JSON file. To access and filter ligands efficiently (in both time and memory), we use the `ijson` library:

> Ijson is an iterative JSON parser with standard Python iterator interfaces.

https://pypi.org/project/ijson/

Here, we take a look at the first entry in the combinatorial library:

The combinatorial library contains the following information:
- `bond_ids` and `fragment_ids`: 
  - Bond IDs (`bond_ids`), e.g. `[["FP_6", "AP_10"], ["AP_11", "SE_13"]]`: Atom (`<subpocket>_<atom ID>`) pairs per fragment bond
  - Fragment IDs (`fragment_ids`), e.g. `["SE_2", "AP_0", "FP_2"]` (`<subpocket>_<fragment index in subpocket pool>`)
  - With this information it is possible to construct the recombined ligand from the fragment library
- `hba`, `hbd`, `mwt`, and `logp`: Ligand fulfills Lipinski's rule of five criteria? (`0` or `1`)
  - Number of hydrogen bond acceptors (`hba`) <= 10
  - Number of hydrogen bond donors (`hbd`) <= 5
  - Molecular weight (`mwt`) <= 500
  - LogP value (`logp`) <= 5 
- `n_atoms`: Number of heavy atoms
- `chembl_exact`: Ligand has exact match in ChEMBL? List of ChEMBL molecule indices (`[]` if no match)
- `chembl_most_similar`: Most similar molecule in ChEMBL, e.g. `[500021, 0.34]` (`[<ChEMBL ID without "CHEMBL" prefix>, <Tanimoto similarity>]`)
- `original_exact`: Ligand has exact match in original ligands? List of original ligand indices (`[]` if no match)
- `original_substructure`: Ligand is substructure of original ligands? List of original ligand indices (`[]` if no match)
- `inchi`: InChI 
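The ID formats described above can be decoded with plain string handling. The following sketch uses the example values from the list. The helper names `parse_id` and `subpocket_key` are hypothetical, and splitting on `_` is an assumption based on the `<subpocket>_<number>` pattern (the notebook itself takes the first two characters instead).

```python
def parse_id(identifier):
    """Split e.g. 'AP_10' into ('AP', 10)."""
    subpocket, number = identifier.split('_')
    return subpocket, int(number)


def subpocket_key(fragment_ids):
    """Sorted, dash-joined subpockets of a ligand, e.g. 'AP-FP-SE'."""
    return '-'.join(sorted(parse_id(fragment_id)[0] for fragment_id in fragment_ids))


fragment_ids = ["SE_2", "AP_0", "FP_2"]
bond_ids = [["FP_6", "AP_10"], ["AP_11", "SE_13"]]

# (subpocket, fragment index in subpocket pool) per fragment
fragments = [parse_id(fragment_id) for fragment_id in fragment_ids]
# ((subpocket, atom ID), (subpocket, atom ID)) per fragment bond
bonds = [tuple(parse_id(atom) for atom in bond) for bond in bond_ids]
key = subpocket_key(fragment_ids)

print(fragments)
print(bonds)
print(key)
```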

## 2. Get properties from `json` file

In [4]:
def get_properties(path_combinatorial_library):
    """
    Stream the combinatorial library once and collect summary properties:
    Ro5 counts, ligand count, heavy atom counts, subpocket combination counts,
    and the ligands with KLIFS/ChEMBL matches (as DataFrames).
    """

    print(datetime.now())

    properties = {
        'mw': 0,
        'logp': 0,
        'hbd': 0,
        'hba': 0,
        'lipinski': 0,
        'n_ligands': 0,
        'n_atoms': [],
        'subpockets': {},
        'original_exact': [],
        'original_substructure': [],
        'chembl_exact': []
    }

    # Stream ligands one by one; the file is closed when the block exits
    with open(path_combinatorial_library, 'rb') as f:
        for o in ijson.items(f, 'item'):

            # Get number of ligands and number of atoms per ligand
            properties['n_ligands'] += 1
            properties['n_atoms'].append(o['n_atoms'])

            # Count ligands per subpocket combination, e.g. 'AP-FP-SE'
            subpocket_key = "-".join(sorted(fragment_id[:2] for fragment_id in o['fragment_ids']))
            properties['subpockets'][subpocket_key] = properties['subpockets'].get(subpocket_key, 0) + 1

            # Count ligands fulfilling each Ro5 criterion
            if o['mwt'] == 1:
                properties['mw'] += 1
            if o['logp'] == 1:
                properties['logp'] += 1
            if o['hbd'] == 1:
                properties['hbd'] += 1
            if o['hba'] == 1:
                properties['hba'] += 1
            # Ro5 counts as fulfilled if at least 3 of the 4 criteria are met
            if o['hba'] + o['hbd'] + o['mwt'] + o['logp'] >= 3:
                properties['lipinski'] += 1

            # Collect ligands with KLIFS and ChEMBL matches
            if o['original_exact'] == 1:
                properties['original_exact'].append(o)
            if o['original_substructure'] == 1:
                properties['original_substructure'].append(o)
            if o['chembl_exact'] == 1:
                properties['chembl_exact'].append(o)

    properties['original_exact'] = pd.DataFrame(properties['original_exact'])
    properties['original_substructure'] = pd.DataFrame(properties['original_substructure'])
    properties['chembl_exact'] = pd.DataFrame(properties['chembl_exact'])

    print(datetime.now())

    return properties

In [5]:
# Takes up to 20 minutes
properties = get_properties(PATH_COMBINATORIAL_LIBRARY)

2020-07-01 21:44:29.912817
2020-07-01 22:02:36.904108


## 3. Show results overview

In [6]:
properties['n_ligands']

6720637

In [7]:
len(properties['n_atoms'])

6720637

In [8]:
properties['lipinski']

4260776

In [9]:
properties['chembl_exact'].shape[0]

298

In [10]:
properties['original_exact'].shape[0]

35

In [11]:
properties['original_substructure'].shape[0]

324

In [12]:
properties['subpockets']

{'AP-B1-GA-SE': 255096,
 'AP-FP-GA-SE': 5016284,
 'AP-B2-GA-SE': 359534,
 'AP-B2-FP-GA': 178027,
 'AP-B1-FP-GA': 209843,
 'AP-B1-B2-GA': 3498,
 'AP-GA-SE': 102733,
 'AP-FP-SE': 512671,
 'AP-FP-GA': 71885,
 'AP-B2-GA': 1812,
 'AP-B1-GA': 1279,
 'AP-GA': 682,
 'AP-FP': 5924,
 'AP-SE': 1369}
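As a quick sanity check on the numbers above, the subpocket counts can be summed and grouped by the number of subpockets per ligand. This is a sketch that simply copies the values printed in this section; the variable names are hypothetical.

```python
from collections import Counter

n_ligands = 6720637
n_lipinski = 4260776
subpockets = {
    'AP-B1-GA-SE': 255096, 'AP-FP-GA-SE': 5016284, 'AP-B2-GA-SE': 359534,
    'AP-B2-FP-GA': 178027, 'AP-B1-FP-GA': 209843, 'AP-B1-B2-GA': 3498,
    'AP-GA-SE': 102733, 'AP-FP-SE': 512671, 'AP-FP-GA': 71885,
    'AP-B2-GA': 1812, 'AP-B1-GA': 1279,
    'AP-GA': 682, 'AP-FP': 5924, 'AP-SE': 1369,
}

# Every ligand belongs to exactly one subpocket combination,
# so the counts should add up to the total number of ligands
total = sum(subpockets.values())
print(total == n_ligands)

# Group ligands by the number of subpockets they span
ligands_by_size = Counter()
for key, count in subpockets.items():
    ligands_by_size[len(key.split('-'))] += count

for size in sorted(ligands_by_size):
    share = ligands_by_size[size] / n_ligands
    print(f"{size} subpockets: {ligands_by_size[size]} ligands ({share:.1%})")

print(f"Ro5-compliant: {n_lipinski / n_ligands:.1%}")
```

The counts do add up to the total, and ligands spanning four subpockets make up the large majority of the library.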

## 4. Save properties to `csv`/`json` files

In [13]:
properties['original_exact'].to_json(
    '../data/combinatorial_library/original_exact.json'
)
del properties['original_exact']

In [14]:
properties['original_substructure'].to_json(
    '../data/combinatorial_library/original_substructure.json'
)
del properties['original_substructure']

In [15]:
properties['chembl_exact'].to_json(
    '../data/combinatorial_library/chembl_exact.json'
)
del properties['chembl_exact']

In [16]:
# Takes a moment (20s)
pd.DataFrame(properties['n_atoms']).to_csv(
    '../data/combinatorial_library/n_atoms.csv',
    index=False,
    header=False
)
del properties['n_atoms']

In [17]:
properties['subpockets']

{'AP-B1-GA-SE': 255096,
 'AP-FP-GA-SE': 5016284,
 'AP-B2-GA-SE': 359534,
 'AP-B2-FP-GA': 178027,
 'AP-B1-FP-GA': 209843,
 'AP-B1-B2-GA': 3498,
 'AP-GA-SE': 102733,
 'AP-FP-SE': 512671,
 'AP-FP-GA': 71885,
 'AP-B2-GA': 1812,
 'AP-B1-GA': 1279,
 'AP-GA': 682,
 'AP-FP': 5924,
 'AP-SE': 1369}

In [18]:
subpockets = pd.DataFrame.from_dict(properties['subpockets'], orient="index")
subpockets = subpockets.rename(columns={0: 'count'})
subpockets.to_csv(
    '../data/combinatorial_library/subpockets.csv'
)
del properties['subpockets']

In [19]:
pd.Series(properties).to_csv(
    '../data/combinatorial_library/properties.csv',
    header=False
)
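Because `properties.csv` is written without a header row, reading it back needs `header=None`. The following sketch shows the round trip with an in-memory buffer in place of the real path. Only the two counts reported earlier in this notebook are included, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Subset of the remaining properties (values as reported above)
remaining = {'lipinski': 4260776, 'n_ligands': 6720637}

# Write as in the cell above, but to an in-memory buffer
buffer = io.StringIO()
pd.Series(remaining).to_csv(buffer, header=False)

# Read back: the first column becomes the index, the second holds the counts
buffer.seek(0)
loaded = pd.read_csv(buffer, header=None, index_col=0)[1]
print(loaded.to_dict())
```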