OK, you are no longer satisfied with holding your structure data in huge SD files, or optimized query files (see tutorial), but want to go full database?

Let's build a structure database on MySQL from the eMolecules sample set, and make it searchable with the Cactvs cartridge.

Typical structure databases are built around a central table that holds a standard structure representation plus additional information. The structure source column is then used to build fast query tables around it. We will follow this generic approach and first create a core table with an SDF record and the SDF fields.

This tutorial assumes you have a MySQL database server running, with the Cactvs cartridge installed (see the _sql_ subdirectory of normal distributions), and that you have the rights to create a database and tables. We are not using explicit database user IDs and passwords here, but these can easily be added (see the documentation on the _dbase_ command).

Create the database:

In [48]:
# Whatever your database uses to connect. IP connections are also possible.
socket='/var/lib/mysql/mysql.sock'
db=Dbase('dbtype','mysql','socket',socket)

Remove any old database and create a new one:

In [9]:
db.exec('DROP DATABASE IF EXISTS emolecules')
db.exec('CREATE DATABASE emolecules')

True

Create a table with the SD record and the SD fields. We are not loading the
MySQL database directly row by row. Instead, we collect 10K rows in a toolkit table, which is then
loaded as a batch in a block transaction. Our script table is automatically used as the template for
the creation of a corresponding database table.

In [10]:
mf=Molfile('../CommonData/emolecules_sample_100000.sdf.gz')
# Peek at the SD field set
mf.peek()
print(mf.fields)
# Optimize data types - we know these are ints
Prop.Set('E_*EMOL_VERSION_ID*','datatype','int')
Prop.Set('E_*EMOL_PARENT_ID*','datatype','int')
Prop.Set('E_*EMOL_LINK*','datatype','url')
t=Table()
# for convenience, use the standard SD record property. We could also use a simple string field
t.addcol('E_SDF_STRING','SDrecord')
# Add the SD data fields as additional columns
for f in mf.fields:
    t.addcol(f,f.originalname)
print(t.colnames)
# Prevent automatic computation of the SQL column field length. Since we write in blocks,
# the size automatically determined from the first block may be too small for later blocks.
t.setcol('EMOL_LINK','fieldlength',255)
# We know this is a unique value
t.setcol('EMOL_VERSION_ID','dbflags','primarykey')

(E_*EMOL_VERSION_ID*, E_*EMOL_PARENT_ID*, E_*EMOL_LINK*)
('SDrecord', 'EMOL_VERSION_ID', 'EMOL_PARENT_ID', 'EMOL_LINK')


table5

OK, we have the table. Now let's fill it block by block, and upload.

In [11]:
# This will run for about 3 or 4 minutes
import time
firstblock=True
blocksize=10000
nrows=0
tstart = time.time()
while True:
    try:
        # We want the SD record without any interpretation, so we just copy it out block by block
        sdfblob=mf.copy()
        # Create a structure object by string decoding. This is an easy way to get the field data
        e=Ens(sdfblob)
        t.addrow(celldata=(sdfblob,e.EMOL_VERSION_ID,e.EMOL_PARENT_ID,e.EMOL_LINK))
        e.delete()
        nrows+=1
        if t.nrows>=blocksize:
            if firstblock:
                # Store table in database table 'coredata', with automatic definition
                t.write('mysql://localhost/emolecules/coredata?socket='+socket)
                firstblock=False
            else:
                # Append to existing database table
                t.write('mysql://localhost/emolecules/coredata?socket='+socket,None,'mode','a')
            t.delrows('all')
    except Exception:
        # The copy function fails at EOF. For now, forego more detailed error analysis.
        # Catching Exception (not a bare except) keeps Ctrl-C / kernel interrupts working.
        break
if t.nrows>0:
    if firstblock:
        t.write('mysql://localhost/emolecules/coredata?socket='+socket)
    else:
        t.write('mysql://localhost/emolecules/coredata?socket='+socket,None,'mode','a')
tstop = time.time()
print('wrote %d data rows'%nrows)
print('execution time %d secs'%(tstop-tstart))
mf.close()
t.delete()

wrote 100000 data rows
execution time 214 secs


1
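The block-wise loading pattern used above can be tried out without a MySQL server or the toolkit. The following is a minimal sketch only: it swaps in Python's built-in `sqlite3` module for MySQL, uses synthetic stand-in records instead of eMolecules data, and the helper name `load_in_blocks` is hypothetical. It shows the same idea of collecting a block of rows and writing each block in one transaction.

```python
import sqlite3
from itertools import islice


def load_in_blocks(conn, rows, blocksize=10000):
    """Insert rows block by block, one transaction per block.

    Returns the number of rows written.
    """
    it = iter(rows)
    written = 0
    while True:
        # Pull up to 'blocksize' rows from the source
        block = list(islice(it, blocksize))
        if not block:
            break
        # 'with conn' commits the block as one transaction,
        # or rolls it back if the insert raises
        with conn:
            conn.executemany(
                "INSERT INTO coredata (EMOL_VERSION_ID, SDrecord) VALUES (?, ?)",
                block,
            )
        written += len(block)
    return written


# In-memory stand-in for the real database (assumption: SQLite instead of MySQL)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coredata (EMOL_VERSION_ID INTEGER PRIMARY KEY, SDrecord TEXT)"
)

# Synthetic placeholder records, not real structure data
rows = ((i, "record %d" % i) for i in range(1, 25001))
nwritten = load_in_blocks(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM coredata").fetchone()[0]
print("wrote %d rows, table holds %d" % (nwritten, count))
```

With a block size of 10000 and 25000 rows, the last block is only partially filled, which corresponds to the final flush after the loop in the notebook cell above.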

Now for the definition of the parallel structure query table. There is a standard set of columns. The only item which needs customization is the reference to the main table. The table definition closely follows the sample definition, found in the sql/ directory of standard toolkit installations, that makes a ChEMBL database searchable.

In [12]:
db.database='emolecules'
db.exec('DROP TABLE IF EXISTS compound_query')
db.exec('''
create table compound_query (
    screen binary(252) not null
        comment 'filled with binary property E_QUERY_SCREEN, 244 bytes screen bits plus 4 bytes header plus 4 bytes set bit count',
    molecule blob not null comment 'filled with binary property E_MINIMOL',
    superscreen binary(252) not null
        comment 'filled with binary property E_NO_HYDROGEN_QUERY_SCREEN, 244 bytes screen bits plus 4 bytes header plus 4 bytes set bit count',
    simscreen binary(120) not null
        comment 'filled with binary property E_SCREEN, 112 bytes screen bits plus 4 bytes header plus 4 bytes set bit count',
    EMOL_VERSION_ID int not null primary key,
    atoms int not null comment 'atom count',
    heavyatoms int not null comment 'heavy atom count',
    weight float not null comment 'molecular weight',
    simplehash char(16) not null comment 'filled with string property E_HASHY',
    stereohash char(16) not null comment 'filled with string property E_HASHSY',
    isotopehash char(16) not null comment 'filled with string property E_HASHIY',
    isotopestereohash char(16) not null comment 'filled with string property E_HASHISY',
    formula varbinary(238) not null comment 'filled with binary property E_ELEMENT_COUNT, max length 2*(oganesson118+1)',
    index (simplehash),
    index (stereohash),
    index (isotopehash),
    index (isotopestereohash),
    constraint foreign key (EMOL_VERSION_ID) references coredata(EMOL_VERSION_ID) on delete cascade on update cascade
);
''')

True
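The fixed column widths in the table definition follow directly from the sizes stated in the column comments. As a quick sanity check, the arithmetic can be reproduced in a few lines. The constant and function names here are made up for illustration, and the numbers are taken from the comments only.

```python
# Sizes as stated in the column comments of the table definition
HEADER_BYTES = 4      # header in front of the screen bits
BITCOUNT_BYTES = 4    # count of set bits


def screen_column_width(screen_bytes):
    """Total binary column width for a screen with the given number of bit bytes."""
    return screen_bytes + HEADER_BYTES + BITCOUNT_BYTES


# screen and superscreen: 244 bytes of screen bits
query_screen_width = screen_column_width(244)
# simscreen: 112 bytes of screen bits
sim_screen_width = screen_column_width(112)
# formula: the comment gives the maximum length as 2*(118+1), 118 being oganesson
formula_max_length = 2 * (118 + 1)

print(query_screen_width, sim_screen_width, formula_max_length)
```

The three values match the `binary(252)`, `binary(120)` and `varbinary(238)` declarations.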

The new table can be filled from within the database with the Cactvs cartridge functionality.

In [13]:
# This runs for 25 mins or so
tstart = time.time()
db.exec('''
insert into compound_query(screen,molecule,superscreen,simscreen,EMOL_VERSION_ID,atoms,heavyatoms,weight,
    simplehash,stereohash,isotopehash,isotopestereohash,formula)
    select
        ens_blob_property(SDrecord,'E_QUERY_SCREEN'),
        ens_blob_property(SDrecord,'E_MINIMOL'),
        ens_blob_property(SDrecord,'E_NO_HYDROGEN_QUERY_SCREEN'),
        ens_blob_property(SDrecord,'E_SCREEN'),
        EMOL_VERSION_ID,
        ens_long_property(SDrecord,'E_NATOMS'),
        ens_long_property(SDrecord,'E_HEAVY_ATOM_COUNT'),
        ens_double_property(SDrecord,'E_WEIGHT'),
        ens_string_property(SDrecord,'E_HASHY'),
        ens_string_property(SDrecord,'E_HASHSY'),
        ens_string_property(SDrecord,'E_HASHIY'),
        ens_string_property(SDrecord,'E_HASHISY'),
        ens_blob_property(SDrecord,'E_ELEMENT_COUNT')
        from coredata;

''')
tstop = time.time()
print('execution time %d secs'%(tstop-tstart))

execution time 1506 secs


Now we can issue queries using cartridge functionality.

In [18]:
# No, we do not need to set the database name anew for each statement. This is just to allow
# us to continue if we lost the database connection and want to resume execution after re-establishing the
# connection via the very first code block above
db.database='emolecules'
print(db.colquery('''
select EMOL_VERSION_ID from compound_query where 
    match_substructure(screen,molecule,"c(C)1nccnc1C!")>0 
    order by EMOL_VERSION_ID
'''))

(27345, 165205, 165803, 165811, 165933, 167347, 167353, 170317, 170395, 173923, 176787, 208823, 210519, 213625, 215515, 223365, 253803, 253811, 253815, 253817, 253835, 253839, 253841, 253851, 256763, 270881, 287054, 287056, 287058, 293206, 294164, 294974, 295066, 295200, 295330, 295474, 299722, 301048, 301050, 301550, 301552, 380988, 387228)
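The `match_substructure` call takes both the `screen` and the `molecule` column. This suggests the usual two-stage scheme of a cheap bit test followed by a full match on the surviving candidates. The sketch below illustrates only the generic bit-screen idea, with made-up bit patterns and a hypothetical helper `passes_screen`. It is not the cartridge's implementation, and the actual layout of the screen data is not described in this tutorial.

```python
def passes_screen(query_bits, target_bits):
    """True if every bit set in the query is also set in the target.

    A structure that lacks a feature bit required by the query cannot
    contain the query as a substructure, so it can be skipped early.
    """
    return (query_bits & target_bits) == query_bits


# Made-up screens keyed by made-up record IDs (illustration only)
targets = {
    101: 0b101101,
    102: 0b001001,
    103: 0b111111,
}
query = 0b001101

# Only these candidates would need the expensive full match
candidates = sorted(k for k, bits in targets.items() if passes_screen(query, bits))
print(candidates)
```

A screen hit is only a necessary condition, so candidates that pass would still need the full structure match.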


And we can process the results into different display formats. Get a full result table to play with:

In [73]:
db.database='emolecules'
t=db.tablequery('''
select coredata.EMOL_VERSION_ID as ID,EMOL_LINK as Link,SDrecord from coredata,compound_query where
   coredata.EMOL_VERSION_ID = compound_query.EMOL_VERSION_ID and
   match_substructure(screen,molecule,"c(C)1nccnc1C!")>0
   order by coredata.EMOL_VERSION_ID limit 10
''')
print(t.nrows)
print(t.colnames)

10
('ID', 'Link', 'SDrecord')


Let's massage this table into something human-readable:

In [74]:
# Change the column datatype from string to url for HTML output
t.colset('Link','datatype','url','headerformat','+center')
# Add an image column
t.addcol('E_SVG_IMAGE',name='Structure')
t.colset('Structure','headerformat','+center')
# Links open in new tab
t.linktarget = '_blank'
# Global image generation parameter adjustment
Prop.Setparam('E_SVG_IMAGE',{'frame':False,'asymbol':'compact'})

# Define a table row processing function
def edittable(t,row,rowtuple,objtuple):
    # Copy ID text as link text to Link column
    t.setcell(row,'Link','linktext',rowtuple['ID'])
    # Decode structure for image generation
    e=Ens(rowtuple['SDrecord'])
    # Fill in image
    t.setcell(row,'Structure',e.E_SVG_IMAGE)
    # We do not need the structure object any longer
    e.delete()

# Loop over the table, calling a function for every row. Row data is passed as a dictionary.
# In the function, we copy the version ID data as link text for the retrieval URL, and 
# create an SVG image from the structure data, which we decode from the SD record data.
t.dictloop(edittable)
# The version information was saved as link text in the Link column, so the ID column is no
# longer needed. The same holds for the SD record, which now exists as an image.
t.delcols('ID','SDrecord')
# Write as HTML table to the temporary directory (the None filename resolves to a temp file name)
filename = t.write(None,'html',{"colblocksize":3})

3
4
4
4
4


RuntimeError: illegal column range index "'ID'"

Show the HTML block as part of the Jupyter notebook

In [54]:
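# IPython.display is the public import location (IPython.core.display is deprecated for this)
from IPython.display import HTML
# Pass the path explicitly as a file name so it is not mistaken for HTML content
HTML(filename=filename)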

Link,Structure,Link,Structure,Link,Structure
27345,,165205,,165803,
165811,,165933,,167347,
167353,,170317,,170395,
173923,,,,,
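With a `colblocksize` of 3, three Link/Structure pairs are placed side by side, so the ten hits fill four table rows, the last one only partially. The arrangement can be sketched in plain Python. The helper `layout_blocks` is hypothetical and not part of the toolkit, and the IDs are the first ten hits from the query output above.

```python
def layout_blocks(items, colblocksize):
    """Split a flat list of items into rows holding up to colblocksize items each."""
    return [items[i:i + colblocksize] for i in range(0, len(items), colblocksize)]


# First ten hit IDs from the substructure query
ids = [27345, 165205, 165803, 165811, 165933,
       167347, 167353, 170317, 170395, 173923]

rows = layout_blocks(ids, 3)
for r in rows:
    print(r)
```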


Cleanup...

In [46]:
Molfile.Close('all')
Table.Delete('all')
Dbase.Close('all')

1