Cleaning up a structure file is a common task. One of the most popular approaches is simply to remove the records which do not meet specific criteria from a larger file. The challenge is to do this without disturbing the remaining records or the general record structure. We would also like to avoid reading and re-writing the records, because a round trip through a reader and writer tends to change record details in unexpected ways.

For this exercise, we want to delete all structures from a test file which
<ol>
<li>Contain more than one molecular fragment, assuming this indicates a salt, mixture, etc.
<li>Do not have an overall formal charge of zero (assuming again that this would indicate a salt)
<li>Possess stereogenic centers without defined stereochemistry
</ol>

First, let's write a test file as SDF and SMILES. The choice of these formats is no accident. Clean record deletion is supported by the toolkit for formats which are either
<ol>
<li>Simple text files, where records are simply concatenated. SMILES and SDF are examples of this.
<li>Files where the format I/O module implements special functions for record deletion. Examples are the BDB and CBS accelerated query file formats (see tutorial on fast file scans).
</ol>
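
To see why concatenated text formats make clean deletion easy, here is a minimal plain-Python sketch that does not use the toolkit. It assumes that every SD-file record ends with a line consisting of `$$$$`. The helper names `split_sdf_records` and `drop_records` are made up for this illustration, and the record bodies are placeholders rather than valid molblocks.

```python
def split_sdf_records(text):
    """Split SD-file text into verbatim records, each closed by its '$$$$' line."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.rstrip("\r\n") == "$$$$":
            records.append("".join(current))
            current = []
    if current:
        # Trailing text without a terminator is kept as a final record.
        records.append("".join(current))
    return records


def drop_records(text, unwanted):
    """Return the text without the 1-based record numbers listed in `unwanted`."""
    kept = [record
            for number, record in enumerate(split_sdf_records(text), start=1)
            if number not in unwanted]
    return "".join(kept)


# Three dummy records (placeholder bodies, not real molblocks).
sample = "rec1 line\n$$$$\nrec2 line\n$$$$\nrec3 line\n$$$$\n"
cleaned = drop_records(sample, {2})
print(cleaned)
```

Because the records are only cut and rejoined as text, the surviving records remain byte-identical to the originals.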
    
If neither of these conditions is met, it gets messier. This is also true if a supported format is not stored in its plain form: compressed SD files and SMILES files in a ZIP archive cannot be handled directly.
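
One plausible workaround for compressed input, which is not taken from the original tutorial, is to unpack the data to a plain file first and clean that file. The sketch below uses only the Python standard library and assumes gzip compression. The file names and the helper `decompress_to_plain` are hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_to_plain(gz_path, plain_path):
    """Write the decompressed content of a gzip file to a plain file."""
    with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Self-contained demonstration with a tiny SMILES file in a scratch directory.
original = b"c1ncccc1\nCC=CC\n"
with tempfile.TemporaryDirectory() as workdir:
    packed = Path(workdir) / "demo.smi.gz"   # hypothetical file names
    plain = Path(workdir) / "demo.smi"
    packed.write_bytes(gzip.compress(original))
    decompress_to_plain(packed, plain)
    restored = plain.read_bytes()

print(restored == original)
```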

In [1]:
# Create a test dataset. The passing molecules are the first (plain pyridine), the fourth (E-2-butene)
# and the last (alanine with defined stereochemistry).
# Record 2 contains two fragments, record 3 has an unbalanced charge, record 5 has undefined
# bond stereochemistry and record 6 has undefined atom stereochemistry.
d=Dataset('c1ncccc1','c1[nH+]cccc1.[Cl-]','c1[nH+]cccc1','C/C=C/C','CC=CC','CC(N)C(=O)O','C[C@H](N)C(=O)O')
# Default write of a SMILES file
Molfile.Write('test.smi',d)
# For the SD file, we complicate things by writing the records with a reduced implicit hydrogen set.
# The 'strip' write mode removes hydrogens which are not usually displayed
# and which are not significant for stereochemistry, etc.
mf=Molfile('test.sdf','w',{'hydrogens':'strip'})
mf.write(d)
mf.close()

1

How can we test for the conditions?

Fragment count and overall charge are easy. Both are simple checks of built-in property values:

In [2]:
for e in d:
    print(f'Frags OK: {e.E_NMOLECULES==1} Charge OK: {e.E_CHARGE==0}')

Frags OK: True Charge OK: True
Frags OK: False Charge OK: True
Frags OK: True Charge OK: False
Frags OK: True Charge OK: True
Frags OK: True Charge OK: True
Frags OK: True Charge OK: True
Frags OK: True Charge OK: True
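
For SMILES input, these two results can be roughly cross-checked without the toolkit. The following is a crude text-level heuristic, not the way the toolkit computes `E_NMOLECULES` and `E_CHARGE`. It assumes that every dot separates two fragments (which is wrong if a ring-closure bond spans the dot) and that all charges are written inside bracket atoms. The function names are made up for this sketch.

```python
import re

BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
CHARGE = re.compile(r"(\++|-+)(\d*)")


def fragment_count(smiles):
    """Count dot-separated parts of a SMILES string (heuristic)."""
    return smiles.count(".") + 1


def formal_charge(smiles):
    """Sum the charges written in bracket atoms, e.g. [nH+], [Cl-], [Fe+2]."""
    total = 0
    for content in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(content)
        if match is None:
            continue
        signs, digits = match.groups()
        sign = 1 if signs[0] == "+" else -1
        # '+2' uses the number, '++' uses the count of sign characters.
        total += sign * (int(digits) if digits else len(signs))
    return total


smiles_list = ['c1ncccc1', 'c1[nH+]cccc1.[Cl-]', 'c1[nH+]cccc1', 'C/C=C/C',
               'CC=CC', 'CC(N)C(=O)O', 'C[C@H](N)C(=O)O']
for smi in smiles_list:
    print(f'Frags OK: {fragment_count(smi) == 1} Charge OK: {formal_charge(smi) == 0}')
```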


The stereo check is a little more complicated. Let's first do it manually on the atom and bond levels:

In [3]:
def astereocheck(e):
    for a in e.atoms():
        if a.A_STEREOGENIC=='yes' and a.A_LABEL_STEREO=='undef':
            return False
    return True

def bstereocheck(e):
    for b in e.bonds():
        if b.B_STEREOGENIC=='yes' and b.B_LABEL_STEREO=='undef':
            return False
    return True     

The stereogenicity and stereo properties have enumerated values and are presented to the Python interface as strings. The possible values can be queried:

In [4]:
print(Prop.Get('A_STEREOGENIC','enum'))
print(Prop.Get('B_STEREOGENIC','enum'))
print(Prop.Get('A_LABEL_STEREO','enum'))
print(Prop.Get('B_LABEL_STEREO','enum'))

no:maybe:checkno:yes,checkyes:ringct
no:maybe:checkno:yes,checkyes:ringct
M,-=-1:undef=0:P,+=1:U,C=2:Z,N=3:X=4
M,MI,-=-1:undef=0:P,PL,+=1


The stereogenicity values you are likely to see in normal molecules are _no_ and _yes_.
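
The enumeration strings printed above can be taken apart with a few lines of plain Python. The sketch below rests on an assumed reading of the string layout, not on documented toolkit behavior: colons separate entries, commas separate aliases of the same entry, `=` introduces an explicit integer value, and entries without an explicit value continue counting from the previous one, starting at zero. The function name `parse_enum_spec` is hypothetical.

```python
def parse_enum_spec(spec):
    """Map every name and alias in an enum description string to an integer.

    Assumed layout: 'name,alias=value:name:...' (see the lead-in for caveats).
    """
    mapping = {}
    next_value = 0
    for entry in spec.split(":"):
        names, sep, explicit = entry.partition("=")
        value = int(explicit) if sep else next_value
        for name in names.split(","):
            mapping[name] = value
        next_value = value + 1
    return mapping


bond_labels = parse_enum_spec('M,MI,-=-1:undef=0:P,PL,+=1')
stereogenic = parse_enum_spec('no:maybe:checkno:yes,checkyes:ringct')
print(bond_labels)
print(stereogenic)
```

For the tests in this tutorial only the string names matter, so the integer values derived here are purely illustrative.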

With these functions, we can test the structures:

In [5]:
for e in d:
    print(f'Atomstereo OK: {astereocheck(e)} Bondstereo OK: {bstereocheck(e)}')

Atomstereo OK: True Bondstereo OK: True
Atomstereo OK: True Bondstereo OK: True
Atomstereo OK: True Bondstereo OK: True
Atomstereo OK: True Bondstereo OK: True
Atomstereo OK: True Bondstereo OK: False
Atomstereo OK: False Bondstereo OK: True
Atomstereo OK: True Bondstereo OK: True


There is also a direct property check, although the relevant values are fields of a more complex property, E_STEREO_COUNT. Its data type is an integer vector. We can access the relevant fields either by index or via the toolkit's built-in property field access. Since the value is a plain tuple on the Python level, named access with standard Python attribute syntax is not possible. This is, however, supported for compound properties.

In [6]:
print(Prop.Get('E_STEREO_COUNT','fields'))
for e in d:
    print(f'Stereo OK: {e.E_STEREO_COUNT[2]==0 and e.E_STEREO_COUNT[5]==0}')
    print(f'Stereo OK: {e.get("E_STEREO_COUNT(aundefined)")==0 and e.get("E_STEREO_COUNT(bundefined)")==0}')

(('apossible', 'int'), ('adefined', 'int'), ('aundefined', 'int'), ('bpossible', 'int'), ('bdefined', 'int'), ('bundefined', 'int'))
Stereo OK: True
Stereo OK: True
Stereo OK: True
Stereo OK: True
Stereo OK: True
Stereo OK: True
Stereo OK: True
Stereo OK: True
Stereo OK: False
Stereo OK: False
Stereo OK: False
Stereo OK: False
Stereo OK: True
Stereo OK: True
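
If named access on the Python side is preferred, the field list shown in the output above can be turned into a `namedtuple` wrapper. This is a plain-Python convenience sketch, not a toolkit feature. The class name `StereoCount` and the sample counts are invented for illustration.

```python
from collections import namedtuple

# Field description as printed by Prop.Get('E_STEREO_COUNT','fields') above.
fields = (('apossible', 'int'), ('adefined', 'int'), ('aundefined', 'int'),
          ('bpossible', 'int'), ('bdefined', 'int'), ('bundefined', 'int'))

# Hypothetical wrapper class built from the field names.
StereoCount = namedtuple('StereoCount', [name for name, _type in fields])

# Made-up property value: one possible atom stereocenter, left undefined.
raw_value = (1, 0, 1, 0, 0, 0)
counts = StereoCount(*raw_value)

stereo_ok = counts.aundefined == 0 and counts.bundefined == 0
print(f'Stereo OK: {stereo_ok}')
```

The positions of `aundefined` and `bundefined` in the wrapper are the same indices 2 and 5 used in the cell above.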


One way to clean a file is to copy the records that pass the tests to a new simple text file. The toolkit has a method for verbatim record output, which avoids reading and re-writing the data. So one possible cleanup looks like this:

In [7]:
mf_in=Molfile('test.smi')
f_out=open('filtered.smi','w')
cnt=0

def write_filtered(e):
    global cnt
    if (e.E_NMOLECULES!=1):
        return
    if (e.E_CHARGE!=0):
        return
    if (e.get("E_STEREO_COUNT(aundefined)")!=0 or e.get("E_STEREO_COUNT(bundefined)")!=0):
        return
    cnt += 1
    mf_in.copy(f_out,startrecord=-1)

mf_in.loop(write_filtered)
mf_in.close()
f_out.close()
print('passed %d records'%cnt)

passed 3 records


The copy() method copies records verbatim, from record start to record end, either to another output channel or into a string. A negative startrecord value is an offset relative to the current record of the input file. Since the structure has already been read for testing, we need to step back one record for the copy operation.
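
The idea of stepping back and copying verbatim can be mimicked in plain Python for a line-based file. This is only a conceptual sketch under the assumption of one record per line. It is not the toolkit's implementation of copy(), and the function name and the simple dot test are placeholders.

```python
import io


def copy_passing_lines(src, dst, keep):
    """Copy the raw bytes of every line record for which keep(text) is true."""
    copied = 0
    while True:
        start = src.tell()        # offset where the next record begins
        raw = src.readline()      # "read" the record in order to test it
        if not raw:
            break
        end = src.tell()
        if keep(raw.decode("utf-8").strip()):
            src.seek(start)       # step back to the start of the tested record
            dst.write(src.read(end - start))
            copied += 1
    return copied


# Mixed line endings and a trailing name show that bytes pass through unchanged.
source = io.BytesIO(b"c1ncccc1 pyridine\r\nc1[nH+]cccc1.[Cl-]\nCC=CC\n")
target = io.BytesIO()
copied = copy_passing_lines(source, target, lambda text: "." not in text)
print(copied, target.getvalue())
```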

Still, this is a lot of explicitly scripted testing. Is there a way to streamline it, and maybe to do it all in place without a duplicate file? Yes, there is. First, write the structure property test as a toolkit scan query. The syntax was originally designed to be convenient for processing with Tcl.

In [8]:
query='or {E_NMOLECULES != 1} {E_CHARGE != 0} {E_STEREO_COUNT(aundefined) != 0} {E_STEREO_COUNT(bundefined) != 0}'
mf=Molfile('test.sdf','r',{'hydrogens':'add'})
delrecs = mf.scan(query,mode='recordlist')
print(delrecs)
mf.close()

[2, 3, 5, 6]


1

The record list matches our expectations. Can we apply it directly to a file?

In [9]:
from shutil import copyfile
copyfile('test.sdf','filtered.sdf')
# We need to open the file for updating. Simple read access is insufficient. 
mf=Molfile('filtered.sdf','u')
mf.delete(delrecs)
print(mf.count())
mf.close()

3


1
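
For comparison, here is what an in-place deletion by record list could look like in plain Python for a one-record-per-line file such as SMILES. It is a sketch of the general idea only and says nothing about how the toolkit's delete() works internally. The helper name and the scratch file name are hypothetical.

```python
import tempfile
from pathlib import Path


def delete_line_records(path, record_numbers):
    """Remove the given 1-based line records from a text file, in place."""
    unwanted = set(record_numbers)
    with open(path, "r+b") as fh:
        lines = fh.readlines()
        kept = [line for number, line in enumerate(lines, start=1)
                if number not in unwanted]
        fh.seek(0)
        fh.writelines(kept)   # surviving records are written back unchanged
        fh.truncate()         # cut off the leftover tail of the old content
    return len(kept)


smiles_list = ['c1ncccc1', 'c1[nH+]cccc1.[Cl-]', 'c1[nH+]cccc1', 'C/C=C/C',
               'CC=CC', 'CC(N)C(=O)O', 'C[C@H](N)C(=O)O']
with tempfile.TemporaryDirectory() as workdir:
    scratch = Path(workdir) / "scratch.smi"   # hypothetical file name
    scratch.write_text("\n".join(smiles_list) + "\n")
    remaining = delete_line_records(scratch, [2, 3, 5, 6])
    content = scratch.read_text()

print(remaining)
```

This simple version holds the whole file in memory, which is acceptable for small test files but not for large collections.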

And this can be done in an even terser fashion, with a special scan mode.

In [10]:
from shutil import copyfile
copyfile('test.sdf','filtered2.sdf')
mf=Molfile('filtered2.sdf','u',{'hydrogens':'add'})
mf.scan(query,mode='delete')
print(mf.count())
mf.close()

3


1