# Filter-based likelihood

2019.12.24
Stephen Wu

This sample code demonstrates how to construct a likelihood for filtering molecules based on fragments or other specific physical properties. The resulting likelihood can be combined with other likelihoods through the BaseLogLikelihoodSet class and used in the iQSPR inverse design loop.

In [1]:
# import basic libraries

import pandas as pd
import numpy as np

In [2]:
# examples of molecules used for testing

smis = ['CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C',
'CC(=O)OC(CC(=O)O)C[N+](C)(C)C',
'C1=CC(C(C(=C1)C(=O)O)O)O',
'CC(CN)O',
'C(C(=O)COP(=O)(O)O)N',
'C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl',
'CCN1C=NC2=C1N=CN=C2N',
'CCC(C)(C(C(=O)O)O)O',
'C1(C(C(C(C(C1O)O)OP(=O)(O)O)O)O)O',
'C1C(NC2=C(N1)NC(=NC2=O)N)CN(C=O)C3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O',
'C(CCl)Cl',
'C1=C(C=C(C(=C1O)O)O)O',
'C1=CC(=C(C=C1Cl)Cl)Cl',
'CCCCCC(=O)C=CC1C(CC(=O)C1CCCCCCC(=O)O)O',
'CC12CCC(=O)CC1CCC3C2CCC4(C3CCC4O)C',
'C1CCC(=O)NCCCCCC(=O)NCC1',
'C1C=CC(=NC1C(=O)O)C(=O)O',
'C(C(C(C(=O)C(=O)C(=O)O)O)O)O',
'C1=CC(=C(C(=C1)O)O)C(=O)O',
'C1=CC(=C(C(=C1)O)O)CCC(=O)O']

Thanks to the flexibility of XenonPy, there are many ways to achieve this goal. Here, we show only one possible way, chosen because it is easy to reuse across different projects or problem setups.

For example, suppose we want to pick out molecules that contain a benzene ring and a carbonyl group and whose molecular weight is less than 200. First, we make the corresponding BaseFeaturizer classes so that each featurizer returns 1 (True) when its condition is met and 0 (False) otherwise. For reusability, we write separate featurizers for checking molecular weight and fragments, and then combine them with the BaseDescriptor class in XenonPy. Second, we make a BaseLogLikelihood that converts the False values to negative infinity (or a very large negative value) and the True values to 0.
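
The reasoning behind this mapping can be sketched without XenonPy. A condition that holds has probability 1 and log(1) = 0, while a condition that fails has probability 0 and log(0) = -inf. Because log-likelihoods add, the total for a molecule is 0 only when every condition holds. The function name `binary_log_likelihood`, the flag values, and the finite stand-in `log_0` below are illustrative assumptions, not part of the XenonPy API.

```python
def binary_log_likelihood(flags, log_0=-1000.0):
    """Map each boolean condition to a log-likelihood term.

    True  -> log(1) = 0
    False -> log(0) = -inf, replaced here by the finite stand-in `log_0`
    """
    return [0.0 if flag else log_0 for flag in flags]


# Hypothetical flags for one molecule:
# (weight below 200, has a benzene ring, has a carbonyl group)
flags = [True, True, False]
terms = binary_log_likelihood(flags)
total = sum(terms)

print(terms)       # [0.0, 0.0, -1000.0]
print(total == 0)  # False: one condition failed, so the molecule is rejected
```

A molecule passes the filter exactly when `total == 0`, which is the test applied to the real log-likelihood table at the end of this notebook.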

First, let us try to make the featurizers.

In [3]:
# BaseFeaturizer for checking molecular weight

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors as rdMol
from xenonpy.descriptor.base import BaseFeaturizer

class MolWeight_binary(BaseFeaturizer):

    def __init__(self,
                 n_jobs=-1,
                 *,
                 target=None,
                 input_type='any',
                 on_errors='raise',
                 return_type='any'):
        """
        Binary check of whether the molecular weight lies within a target range.

        Parameters
        ----------
        n_jobs: int
            The number of jobs to run in parallel for both fit and predict.
            Can be -1 or # of cpus. Set -1 to use all cpu cores (default).
        target: (float, float)
            Lower and upper bounds of the molecular weight (both exclusive).
        input_type: string
            Set the specific type of transform input.
            Set to ``mol`` to use ``rdkit.Chem.rdchem.Mol`` objects as input.
            When set to ``smiles``, the ``transform`` method takes a SMILES list as input.
            Set to ``any`` (default) to accept both.
            If the input is SMILES, ``Chem.MolFromSmiles`` will be used internally;
            if it returns ``None``, a ``ValueError`` exception will be raised.
        on_errors: string
            How to handle exceptions in feature calculations. Can be 'nan', 'keep', 'raise'.
            When 'nan', return a column with ``np.nan``.
            The length of the column corresponds to the number of feature labels.
            When 'keep', return a column with exception objects.
            The default is 'raise', which will raise the exception.
        """
        super().__init__(n_jobs=n_jobs, on_errors=on_errors, return_type=return_type)
        self.input_type = input_type
        if target is None:
            raise RuntimeError('<target> is empty')
        else:
            self.target = target
        self.__authors__ = ['Stephen Wu']

    def featurize(self, x):
        if self.input_type == 'smiles':
            x_ = x
            x = Chem.MolFromSmiles(x)
            if x is None:
                raise ValueError('can not convert Mol from SMILES %s' % x_)
        if self.input_type == 'any':
            if not isinstance(x, Chem.rdchem.Mol):
                x_ = x
                x = Chem.MolFromSmiles(x)
                if x is None:
                    raise ValueError('can not convert Mol from SMILES %s' % x_)
                    
        # CalcExactMolWt returns the exact (monoisotopic) molecular weight
        tmp = rdMol.CalcExactMolWt(x)

        # both bounds are exclusive
        return (tmp > self.target[0]) and (tmp < self.target[1])

    @property
    def feature_labels(self):
        return ["MolW_bin"]
    

In [4]:
# testing if it works

fea = MolWeight_binary(target=(-np.inf,200), on_errors='nan', return_type='df')
fea.transform(smis)


Unnamed: 0,MolW_bin
0,False
1,False
2,True
3,True
4,True
5,False
6,True
7,True
8,False
9,False
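
The range check in `featurize` uses strict inequalities, so a weight exactly equal to either bound is rejected, and passing `-np.inf` as the lower bound leaves only the upper bound active. A minimal standalone sketch of that comparison follows. The helper name `in_open_interval` and the sample weights are hypothetical and are not taken from the molecules above.

```python
import math


def in_open_interval(value, bounds):
    """Return True only if lower < value < upper (both bounds exclusive)."""
    lower, upper = bounds
    return lower < value < upper


bounds = (-math.inf, 200)  # no effective lower bound, upper bound of 200

print(in_open_interval(154.03, bounds))  # True
print(in_open_interval(200.0, bounds))   # False: the bound itself is excluded
print(in_open_interval(215.1, bounds))   # False
```

If an inclusive upper bound is wanted, the comparison in `featurize` would need `<=` instead of `<`.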


In [5]:
# BaseFeaturizer for checking existence of fragments

from rdkit import Chem
from xenonpy.descriptor.base import BaseFeaturizer

class Frag_binary(BaseFeaturizer):

    def __init__(self,
                 n_jobs=-1,
                 *,
                 target=None,
                 input_type='any',
                 on_errors='raise',
                 return_type='any'):
        """
        Binary check of whether a molecule contains each of the target fragments.

        Parameters
        ----------
        n_jobs: int
            The number of jobs to run in parallel for both fit and predict.
            Can be -1 or # of cpus. Set -1 to use all cpu cores (default).
        target: str or list(str)
            Fragment(s) in SMARTS. A single string is treated as a one-element list.
        input_type: string
            Set the specific type of transform input.
            Set to ``mol`` to use ``rdkit.Chem.rdchem.Mol`` objects as input.
            When set to ``smiles``, the ``transform`` method takes a SMILES list as input.
            Set to ``any`` (default) to accept both.
            If the input is SMILES, ``Chem.MolFromSmiles`` will be used internally;
            if it returns ``None``, a ``ValueError`` exception will be raised.
        on_errors: string
            How to handle exceptions in feature calculations. Can be 'nan', 'keep', 'raise'.
            When 'nan', return a column with ``np.nan``.
            The length of the column corresponds to the number of feature labels.
            When 'keep', return a column with exception objects.
            The default is 'raise', which will raise the exception.
        """
        super().__init__(n_jobs=n_jobs, on_errors=on_errors, return_type=return_type)
        self.input_type = input_type
        if target is None:
            raise RuntimeError('<target> is empty')
        elif isinstance(target, str):
            self.target = [target]
        else:
            self.target = target
        self.mol = [Chem.MolFromSmarts(x) for x in self.target]
        if any([x is None for x in self.mol]):
            raise RuntimeError('At least one <target> is invalid SMARTS')
        self.__authors__ = ['Stephen Wu']

    def featurize(self, x):
        if self.input_type == 'smiles':
            x_ = x
            x = Chem.MolFromSmiles(x)
            if x is None:
                raise ValueError('can not convert Mol from SMILES %s' % x_)
        if self.input_type == 'any':
            if not isinstance(x, Chem.rdchem.Mol):
                x_ = x
                x = Chem.MolFromSmiles(x)
                if x is None:
                    raise ValueError('can not convert Mol from SMILES %s' % x_)
        
        return np.array([x.HasSubstructMatch(z) for z in self.mol])

    @property
    def feature_labels(self):
        return self.target
    

In [6]:
# testing if it works

fea_frag = Frag_binary(target=['c1ccccc1','C=O'], on_errors='nan', return_type='df')
fea_frag.transform(smis)


Unnamed: 0,c1ccccc1,C=O
0,False,True
1,False,True
2,False,True
3,False,False
4,False,True
5,True,False
6,False,False
7,False,True
8,False,False
9,True,True


In [7]:
# BaseDescriptor that combines the two featurizers we made

from xenonpy.descriptor.base import BaseDescriptor

class custom_binary(BaseDescriptor):
    """
    Collect custom made binary descriptors of organic molecules.
    """

    def __init__(self,
                 n_jobs=-1,
                 *,
                 featurizers='all',
                 on_errors='raise',
                 return_type='any'):
        """
        Parameters
        ----------
        n_jobs: int
            The number of jobs to run in parallel for both fit and predict.
            Can be -1 or # of cpus. Set -1 to use all cpu cores (default).
        featurizers: list[str] or 'all'
            Featurizer(s) that will be used. Default is 'all'.
        on_errors: string
            How to handle exceptions in feature calculations. Can be 'nan', 'keep', 'raise'.
            When 'nan', return a column with ``np.nan``.
            The length of the column corresponds to the number of feature labels.
            When 'keep', return a column with exception objects.
            The default is 'raise', which will raise the exception.
        """

        super().__init__(featurizers=featurizers)
        self.n_jobs = n_jobs

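        # Both featurizers are assigned to the same attribute name on purpose:
        # as the output of the next cell shows, both are kept and their columns
        # appear side by side in the result.
        self.mol = MolWeight_binary(target=(-np.inf,200), on_errors=on_errors, return_type=return_type)
        self.mol = Frag_binary(target=['c1ccccc1','C=O'], on_errors=on_errors, return_type=return_type)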
        

In [8]:
# testing if it works

fea_custom = custom_binary(on_errors='nan', return_type='df')
fea_custom.transform(smis)


Unnamed: 0,MolW_bin,c1ccccc1,C=O
0,False,False,True
1,False,False,True
2,True,False,True
3,True,False,False
4,True,False,True
5,False,True,False
6,True,False,False
7,True,False,True
8,False,False,False
9,False,True,True


Second, we will make a BaseLogLikelihood to handle the binary featurizers.

In [9]:
from typing import Union
import numpy as np
import pandas as pd
from xenonpy.descriptor.base import BaseDescriptor, BaseFeaturizer
from xenonpy.inverse.base import BaseLogLikelihood

class BinaryLogLikelihood(BaseLogLikelihood):
    def __init__(self, descriptor: Union[BaseFeaturizer, BaseDescriptor], *, log_0=-1000.0):
        """
        Binary log-likelihood.

        Parameters
        ----------
        descriptor: BaseFeaturizer or BaseDescriptor
            Assumed to always return 1 or 0 (True or False); NaN is treated as 0.
        log_0: float
            Finite value used in place of log(0) = -inf.
        """
        self._log_0 = log_0
        
        if not isinstance(descriptor, (BaseFeaturizer, BaseDescriptor)):
            raise TypeError('<descriptor> must be a subclass of <BaseFeaturizer> or <BaseDescriptor>')
        self._descriptor = descriptor
        self._descriptor.on_errors = 'nan'

    def predict(self, x, **kwargs):
        return self._descriptor.transform(x, return_type='df')

    # log_likelihood returns a dataframe of log-likelihood values of each property & sample
    def log_likelihood(self, x, *, log_0=None):

        if log_0 is None:
            log_0 = self._log_0
            
        ll = self.predict(x)
        check = ll == 1
        ll[~check] = log_0
        ll[check] = 0

        return ll

In [10]:
# testing if it works

prd_mdl = BinaryLogLikelihood(descriptor=fea_custom)
prd_mdl(smis)


Unnamed: 0,MolW_bin,c1ccccc1,C=O
0,-1000.0,-1000.0,0.0
1,-1000.0,-1000.0,0.0
2,0.0,-1000.0,0.0
3,0.0,-1000.0,-1000.0
4,0.0,-1000.0,0.0
5,-1000.0,0.0,-1000.0
6,0.0,-1000.0,-1000.0
7,0.0,-1000.0,0.0
8,-1000.0,-1000.0,-1000.0
9,-1000.0,0.0,0.0


Now, we have succeeded in making a log-likelihood class that returns 0 when the desired conditions are met (as determined by the featurizers) and -1000.0 otherwise. The molecules that meet all of our conditions are those whose log-likelihood values sum to 0:

In [15]:
tmp = prd_mdl(smis).sum(axis=1) == 0
[x for i, x in enumerate(smis) if tmp[i]]


['C1=CC(=C(C(=C1)O)O)C(=O)O', 'C1=CC(=C(C(=C1)O)O)CCC(=O)O']
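
A note on why a finite `log_0` can be preferable to negative infinity. The sketch below assumes, hypothetically, that a downstream sampler turns log-likelihoods into normalized weights by exponentiation. Whether the iQSPR loop does exactly this is an assumption here, and the helper `normalized_weights` is illustrative only. Under that assumption, a population in which every candidate fails the filter still gets usable (uniform) weights with `log_0 = -1000.0`, whereas `-inf` everywhere produces NaN.

```python
import math


def normalized_weights(log_likelihoods):
    """Convert log-likelihoods to weights that sum to 1.

    The maximum is subtracted before exponentiating (log-sum-exp style)
    so that large negative values do not all underflow to zero.
    """
    peak = max(log_likelihoods)
    unnormalized = [math.exp(ll - peak) for ll in log_likelihoods]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


print(normalized_weights([0.0, -1000.0]))            # [1.0, 0.0]
print(normalized_weights([-1000.0, -1000.0]))        # [0.5, 0.5]
print(normalized_weights([-math.inf, -math.inf]))    # [nan, nan]
```

The first case shows that -1000.0 is already small enough to give a failing molecule zero weight next to a passing one, so little is lost by avoiding `-inf`.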