Chemical JSON format #1137
Comments
A few years ago I added support for JSON formats to Open Babel. The two example formats I implemented were the ChemDoodle JSON format and the JSON output of the PubChem PUG REST API. There is also the OpenChemistry Chemical JSON project.
I don't think there's enough value in simply replacing the storage format itself. Yes, it's slightly easier to parse JSON than the row/column-based SDF format, but that by itself isn't sufficient. I think it's also really important to define the scope of this (else things like query values start to creep in). One of the things that would be a real value-add is coming to some consensus on defining a minimally complete representation distinct from computed properties, so as to minimize possible inconsistencies. For example, the treatment of stereochemistry in SDF/MOLBLOCK can be inconsistent between the calculated parity value (R/S) and the atomic coordinates with wedged bonds. Another thing I'd like, but may get flamed on, is to avoid any kind of explicit ordering on atom indices, since the underlying graph structure and its associated properties should be invariant under isomorphism. That is, things that depend on a particular index should be labelled explicitly (e.g. atom mappings). Practically this means that we shouldn't guarantee an iteration order over atoms/bonds.
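A tiny sketch (in Python, with made-up field names; this is not a proposed spec) of what "explicit labels instead of guaranteed ordering" could look like in practice:

```python
import json
import random

# Hypothetical document: the "atoms" and "map" field names are assumptions.
# Each atom carries an explicit label, so no meaning attaches to list order.
doc = '''{"atoms": [
    {"map": 3, "element": "O"},
    {"map": 1, "element": "C"},
    {"map": 2, "element": "N"}
]}'''

atoms = json.loads(doc)["atoms"]
random.shuffle(atoms)  # a conforming reader must tolerate any ordering

# Anything that needs an index looks atoms up by their explicit label instead.
by_label = {a["map"]: a["element"] for a in atoms}
print(by_label[1], by_label[2], by_label[3])  # -> C N O
```

The point of the shuffle: if round-tripping through a writer that permutes the list changes nothing for readers, the format has no implicit ordering guarantee to violate.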
We should take a look at things like JSON-LD, HAL, and Collection+JSON to see whether they might be helpful for creating proper media types and/or making the molecule format directly available to Web APIs. The chemicaljson link mentioned by Matt already mentions this, too. This might be a starting point: https://sookocheff.com/post/api/on-choosing-a-hypermedia-format/, which concludes: […]
@proteneer: what's motivating the desire to avoid atom indices? I'm pretty sure that they make the file more (human-)readable. If we treat the indices as a convenience feature for the input format, but not something that's guaranteed to be preserved on parsing the file, does that help?
@greglandrum - my point was simply that the spec itself (which is separate from any concrete JSON implementation) should not guarantee consistent ordering. So basically, as you mentioned, one way to do this is at the implementation's serialization level (i.e. serialization and deserialization may permute the ordering). As an example, there's a particular format (which I won't mention here) that prefers to put explicit hydrogens at the end of the molblock for "convenience's" sake. This is great until the inevitable molblock violating this guarantee comes along and everything breaks. Note that I'm fine with an implementation that actually uses a list of atoms. I do agree with you that it's far more accessible to read, even if we run the risk of implementers assuming consistent ordering.
There should be a column for atom index. Users are free to ignore it. You don't have to write the atoms out in that order, but if you do, people might be able to use simple stupid tools like diff to quickly compare two molecule files. This may not sound very useful to a chemist, but when I run a batch job over 25K ligands, diff'ing the ins and outs and flagging only the ones that changed -- or didn't, depending -- for a closer look is a very useful feature trivially coded in a one-line shell post-script. There should also be a column for atom label, because I don't know of any algorithm that can label atoms C-alpha, H-beta-21, etc. for the two dozen molecules that use those. Every piece of code here has the atom tables for the "common" residues, each with its own typos and who knows what. We wouldn't have to do that if our exchange formats didn't throw away protons, atom labels and indexes, and everything else that is "obvious to a chemist".
A separate issue is that JSON itself is not a streaming format. Valid JSON has to be a single string that gets loaded into RAM in order to be parsed into a single "JavaScript object". Consider the size of the string describing a hundred "best models" for a moderately-sized polymer.
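For what it's worth, line-delimited JSON ("JSON Lines") is one common workaround: each record is a complete JSON document on its own line, so a reader never has to hold the whole stream in memory. A minimal Python sketch (the "model"/"coords" keys are made up for illustration):

```python
import io
import json

# Two self-contained JSON documents, one per line. A real file of a hundred
# "best models" would simply have a hundred such lines.
stream = io.StringIO(
    '{"model": 1, "coords": [[0.0, 0.0, 0.0]]}\n'
    '{"model": 2, "coords": [[0.1, 0.0, 0.0]]}\n'
)

models = []
for line in stream:              # memory use is per record, not per file
    models.append(json.loads(line))

print(len(models), models[1]["model"])  # -> 2 2
```

The trade-off is that the file as a whole is no longer a single valid JSON document, so generic JSON tooling sees records, not the collection.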
Thanks for the mention. I am happy to answer any questions about SciData when/if you get to look at it.

Stuart

On Oct 31, 2016, at 9:14 AM, Matt Swain wrote:

> On the topic of JSON-LD, there is SciData (http://stuchalk.github.io/scidata/), recently published by @stuchalk, which seems like it would be worth looking at, even if the scope is slightly different from what is relevant to RDKit.
@dmaziuk Never accommodate users who use "stupid tools" - that is their own responsibility. A big advantage of JSON is that there are well-tested parsers for basically any language and environment (even for Unix shells, if you really need it). And there are also JSON streaming solutions for basically all important languages (conceptually there is no big difference between parsing XML and JSON).
Uh-huh. Well, I'll stick to formats that let me use tools that actually work. It's a good thing by now I can write a format converter with my eyes closed and one hand tied behind my back.
For me, the key thing about the format is that it supports multiple conformers of the same molecule efficiently. That's what kicked the discussion off in the original rdkit-discuss thread. I would imagine that means one block defining the chemistry, and then multiple sets of co-ordinates for the conformations. If 2D co-ordinates could be labelled as distinct from 3D ones, that would be helpful, though it might create problems in the RDKit molecule object.
There is an advantage to storing a table of atoms & bonds as delimited text: you can load it in Excel. Do not underestimate the power of Excel. (And other stupid tools.) If you define the data structure, you can write it out as a Protocol Buffer definition and dump it into binary. Or, e.g., as a Document Type Definition and dump it into XML. It's only a matter of picking up the appropriate library and feeding it your data structure in the way it understands.
+1 to @DavidACosgrove's general comments about multi-conformer support, and an additional note: in my experience, reading conformers efficiently does come down to reading the coordinates efficiently (once chemistry perception is out of the way), and that means reading binary. Luckily, we don't need to come up with a new format to handle binary once we decide on the JSON structure: MsgPack is a 1-to-1 encoding from text JSON to binary, with support for as many languages as support JSON: http://msgpack.org/index.html The RCSB is going down this exact same route for macro-molecule representation as well. That format focuses heavily on compressing large macro-molecules for efficient transmission, so I doubt we want to use it for small molecules, but I could be wrong. An .mmtf reader would be a useful addition to RDKit regardless. @dmaziuk Do Protocol Buffers have a 1-to-1 mapping to JSON like MsgPack? I am unfamiliar with the pros and cons of each.
Protocol Buffers 3 does indeed have a JSON encoding in addition to the binary one.

Yutong Zhao
Protobuf is the schema, aka DTD, plus translator. AFAICT MsgPack just packs the bytes and lets the reader sort them out. IME people who didn't sit through Algorithms and Data Structures 101 tend to view the lack of the schema as a feature, whereas Comp. Sci. types call it a bug. A table of coordinates would be a few bytes smaller in a binary format than in CSV: no comma delimiters, but the overhead is minimal. The CSV, OTOH, can be directly loaded into a database, edited with […]
This is a good discussion, but I'm afraid that we are heading a bit off into the weeds here. I think it would be more productive to figure out what information we need to capture and then to think about the technology (format) that we need to store that information. I suspect that we will actually end up with multiple formats in order to be able to balance robustness, portability, and performance.
@dmaziuk: The reason for favouring a binary format for these purposes is not size, it's speed. With a binary format, numerical contents can be read directly into a float or int; with any ASCII format, something that ultimately calls atof will have to be used, which imposes a significant overhead on reading. I think you may be mis-counting the size difference, however. An int in binary format will normally be 4 bytes whatever the value of the integer being stored; in ASCII it can be anywhere between 1 and 10 bytes. @greglandrum has a point, however - let's first decide what should be in the file!
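As a rough illustration of the two points above (using Python's stdlib `struct` module; the coordinate values are arbitrary and chosen to be exactly representable so the comparison is clean):

```python
import struct

coords = [12.25, -3.5, 0.125]

# Text form: variable width, and reading it back requires string parsing
# (float() here, atof in C).
text = ",".join(repr(c) for c in coords)
parsed_text = [float(f) for f in text.split(",")]

# Binary form: fixed 8 bytes per double, read back with no string parsing.
packed = struct.pack("<3d", *coords)
parsed_bin = list(struct.unpack("<3d", packed))

print(len(packed))                          # -> 24 (always 3 * 8 bytes)
print(parsed_text == parsed_bin == coords)  # -> True
```

Note the binary record is *larger* here than the 16-character text form; as the comment says, the argument for binary is the fixed width and the absence of atof on the read path, not size.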
IME speed has never been a practical problem. By the time it starts biting you, there are three more next generations of hardware out there and your computer is long overdue for an upgrade. (We're unzip/untar'ing text files on 3-7 year old hardware fast enough to saturate the SATA bus and hang the machine, and I have to configure […])

I think one of the things missing from Greg's requirements is intended audience: who is going to use the format, and for what purpose? And also: why does RDKit need another format? JSON is the web's darling du jour, now that XML has settled into its niche and we all moved on, but it's only really good for what it was intended for: sending small snippets of JavaScript directly into the browser. If the intended audience is not the browser, RDKit is not JavaScript, and the data is not small...
The other thing is you can spin the math either way: you're not going to represent "ALA", "CA", etc. in binary any more efficiently than in ASCII/UTF-8. "12.3" takes up 4 bytes in UTF-8 and 8 bytes in double-precision IEEE 754. Plus the round-off error: if you really want to do it right, you want to send a "significant digits" integer alongside, so that your users can tell whether 12.000019287547965 is actually 12 or 12.000. If you send it as text you can off-load the decision to the user: they can stare at "12.000" and try to figure out whether it is actually accurate to the 3rd digit, or the programmer just printed it as "%7.3f" because the numbers line up pretty that way.
Related discussion at alchemistry.org: https://github.com/alchemistry/fileformat
Hi, I am one of the MMTF developers. One of the things we are thinking about is ways to flexibly add more metadata to the format.
Why a new format? My answer to this would be: […]
@dmaziuk "...but it's only really good for what it was intended for: sending small snippets of JavaScript directly into the browser." Hmm, I tend to disagree - a lot of web services use JSON as an exchange format for large amounts of data nowadays, and have you come into contact with the NoSQL world, with things like Lucene, Solr, and Elasticsearch, which all pretty strongly rely on or support JSON? JSON is also natively understood by JavaScript, which has a growing relevance on the server side of web services, and it is almost natively understood by Python (the JavaScript guys actually stole the Python dictionary data type when they developed JSON).
@dmaziuk is absolutely right: being a bit more explicit about what we want to accomplish with the format, as well as who the intended users are, is a good idea. I will put some more meat on this later, but I'm primarily looking for an efficient and flexible format for storing and exchanging data about small molecules. It should be both machine- and human-readable (or at least have an easy way to get a human-readable form) and support optional toolkit-dependent information (like ring information, aromaticity, etc.) that can be ignored (or not) by other toolkits. I'm really not looking to create the one-format-to-rule-them-all, and my focus at the moment is almost entirely on having something for the RDKit, though I want to be very sure that it's easily usable by other toolkits as well. My biases on this one: […]
Please consider defining a data model first, and then a data format as an implementation of this data model. I understand that the focus of this discussion is on JSON, for good technical reasons. But the technical requirements for data formats vary: one person needs JSON, another one needs XML, a third one needs HDF5. There will always be many formats for the same kind of data because of technical imperatives. And that means format conversion, which we all love to do, right? Format conversion is actually not much of a problem if it's lossless in both directions, i.e. if the conversion happens between two formats that represent the same data. And that common abstract definition is the data model. Think of it as a high-level format description. For more details, see this article. You might also want to look at my MOSAIC data model/format for computational chemistry, and read the paper that explains the rationale behind its design. You might be able to actually use MOSAIC by adding a JSON implementation. Or extend MOSAIC to your needs. But the most important aspect of MOSAIC is the two-level design as a data model with multiple implementations.
Wow, I don't need to come back and flesh out what I was thinking too much; @khinsen just said a lot of it for me, and better than I probably would have. Restating, hopefully accurately, using a somewhat different vocabulary: we should really be defining a schema that describes the information we're trying to capture, and then worry about details of the physical representation (i.e. JSON, protobuf, msgpack, etc.).
@greglandrum Exactly. In my experience, the best approach to defining a data model is a hierarchical one, just like for program design. At the highest level, you may want to describe a molecule as a graph, for example, and decide which attributes you want to attach to vertices and edges. Next, you could define how to represent that graph plus its attributes in terms of more basic data structures such as arrays of strings, numbers, etc. The last step is the concrete data format. |
@dmaziuk Your example concerning numbers is a nice illustration of what should be defined in a data model, and why it is important to have one. At the data model level, it matters if you want to represent a measured or computed value with an attached precision, or a raw floating-point value from a computation. You probably don't want to off-load that decision to the user, but even if that's what you want, this choice is part of your data model. If you start from the other end, e.g. the efficiency of representation, you will probably end up defining a format that is impossible to convert to anything else without losing information or, worse, having to make up information. BTW, if you need to represent raw floating-point numbers in a text-based format, e.g. for continuing a computation at a precise state saved in a file, a decimal representation is a sure recipe for having to worry about round-off errors. A byte sequence in IEEE format is error-free and very portable, it's just not human-readable. As a compromise, you can consider floating-point notation in base 8 or 16, which permits error-free conversion to and from IEEE. |
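Python happens to expose exactly this base-16 notation via `float.hex()`/`float.fromhex()`, which makes the point easy to demonstrate:

```python
# The classic value that a short decimal rendering cannot capture exactly.
x = 0.1 + 0.2

# Hex-float text is an exact, human-typable encoding of an IEEE double:
# the round trip through text never introduces rounding error.
as_hex = x.hex()                    # e.g. '0x1.3333333333334p-2'
assert float.fromhex(as_hex) == x   # bit-exact round trip

# A short decimal rendering, by contrast, lands on a *different* double.
assert float(f"{x:.1f}") != x

print(as_hex)
```

So a text-based format that needs to restart computations from an exact saved state could carry hex floats, while still offering a rounded decimal field for human eyes.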
To try and keep it on the topic of data and not format: […]
@greglandrum, how generalizable is this requirement? Is this as simple as the Tripos atom name field, i.e. a fixed-size string? Or something that can hold any arbitrary key-value data? Hopefully the latter, and I would generalize it to both the molecule and the bonds. Something like the following to serialize RDKit properties:

```json
{
  "_Name": "CorpID",
  "foo": "bar",
  "atoms": [{"partial charge": 1.23, "force": [0.1, 0.2, 0.3], ...}, ...],
  "bonds": [{"highlighted": true, ...}]
}
```

Being able to add arbitrary properties on the molecule, atoms, and bonds would be very powerful. And it matches RDKit's property system, since I think targeting just RDKit is fine for now as well.
@coleb: I intended to cover that with: […]
@greglandrum good, very cool. :-) So what is "can include atom labels" then? How is that different? |
Ah, right. That is, in my mind, the equivalent of the "CA" or "CB" in a PDB file.
@khinsen not sure what you mean by IEEE being error-free: as I recall the entire first chapter of our Sci.Comp. 201 textbook was about error control. @greglandrum My vote would be for segmented data model with an atom/bond table and a completely separate coordinate table, and so on. There has to be a core section that is mandatory (and once you define it and people start using it, it'll be very hard to change), conformers are optional; etc. You can tar/zip them and call the resulting archive .rdk (RPM and DEB packages, among others, are that). Or concatenate them in one file with section delimiters. On the end-user side IME number crunching typically involves tables: matrices and such, and pulling out subsets works well with tables e.g. loaded into sqlite. Table-based is good, implicit column headers (numbers) -- not so much, but if I had to choose between that and JSON list of maps (rows), I'd probably go for numbered columns. |
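A minimal sketch of what such a segmented document might look like (all section and field names here are invented for illustration, not a proposal): a mandatory chemistry core, plus optional sections a reader may ignore.

```python
import json

# Hypothetical segmented layout: "core" is mandatory, "conformers" optional.
doc = {
    "core": {  # mandatory, and hard to change once people rely on it
        "atoms": [{"element": "O"}, {"element": "H"}, {"element": "H"}],
        "bonds": [[0, 1], [0, 2]],
    },
    "conformers": [  # optional: one coordinate table per conformer
        [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
    ],
}

def validate(d):
    """A reader insists only on the core; optional sections pass through."""
    return "core" in d and "atoms" in d["core"] and "bonds" in d["core"]

roundtripped = json.loads(json.dumps(doc))
print(validate(roundtripped))  # -> True
```

The same segmentation would work equally well as separate files concatenated with delimiters or bundled into a tar/zip archive, as suggested above.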
For me having a schema where properties (their names and what data they hold) are explicitly defined seems more and more important. Having fields with arbitrary (though typed as text/float/...) user data is a use case but for interoperability different consumers of the format need to "discover" the properties to actually use them. Properties can be optional with a required core to allow for slim files. |
@dmaziuk It's the transmission (encoding/decoding) of floating-point numbers that is error-free if you use a binary representation. Computations are a different story. |
Which binary encoding? IEEE binary encoding will turn 0.3 into 0.30000000000000004. Transmission errors: noise, bit flips, etc. affect unicode binary bits exactly the same way as ieee binary bits. Forgive me for having difficulties with the meaning of "error free" in this context. |
@dmaziuk Ouch, there are too many distinct meanings of "binary" in this context! I am thinking of the IEEE binary formats, which are by far the most used ones. Error-free conversion from and to text representations is possible only for (1) raw byte dumps, or (2) a base-2/8/16 representation. Your example proves my point: you can't convert decimal "0.3" to IEEE binary float formats without error. |
A request: the lack of threading in these comment threads makes it difficult enough to track long discussions, let's please try to stay on topic here and not continue the discussion about binary vs text (or other details of what the eventual physical format may be). |
@khinsen no.
@greglandrum the relevant point is whether you want to add the "num significant digits" field to every floating-point field in your data model. |
I've not seen a good summary of requirements so far. I'd like to see included: user-specifiable structure-level properties per structure. Some properties might be built-in, perhaps by using reserved keywords to specify them; examples: formal charge (on an atom), partial charge (on an atom), bond order (on a bond), and so on. These could include properties that are always required to be present, as well as properties that are sufficiently commonly used that standard names would be desirable. Since this whole discussion started out on the rdkit-discuss list as a way to store conformers (not just multiple molecules), it would be good if there were a way to take advantage of any storage savings that might be possible for a sequence of conformations. I'm not sure that's a requirement, though. In certain situations, there might be associated guarantees as well. For example, a molecule known to be a conformer beyond the first one in a sequence of molecules might share all properties (ct, atom, bond) specified for the leading conformer in the sequence, unless overridden in the later conformer. So any conformer is in effect specified by difference from the first conformer in the sequence.
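The conformer-by-difference idea can be sketched in a few lines of Python (the field names are illustrative only, and this only handles flat, top-level overrides):

```python
def expand_conformers(base, deltas):
    """Yield full conformer records: each delta overrides fields of the base."""
    for delta in deltas:
        record = dict(base)   # start from the leading conformer's properties
        record.update(delta)  # later conformer overrides only what it stores
        yield record

# Leading conformer carries everything; later ones store only differences.
base = {"energy": -10.0, "coords": [[0.0, 0.0, 0.0]], "charge": 0}
deltas = [{},                                              # conformer 1 = base
          {"energy": -9.7, "coords": [[0.1, 0.0, 0.0]]}]   # conformer 2

confs = list(expand_conformers(base, deltas))
print(confs[0]["energy"], confs[1]["energy"], confs[1]["charge"])
# -> -10.0 -9.7 0
```

The storage saving comes from shared properties (like `charge` here) appearing once rather than once per conformer.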
PDB chem comp (ligand) model includes a list of structure-level properties as well as tables of atoms and bonds (with properties). One of the reasons they (and we) use STAR is because it's about the only format that lets you combine tables and key-value pairs in a reasonable fashion. (Don't get me started on shortcomings of STAR.) JSON does not have a built-in table data type. |
In JSON you could emulate a table with an array of strings, each of which is the row of a CSV, the first of which would be a row of headers. It could get a bit more elaborate to facilitate recognition and parsing, but it is probably workable. User code would have to supply a convenient API.

-P.
... or a list of lists, or a { "head" : [ ...], "body" : [[...], ...] } -- that's my point: there is no one standard way that everybody understands.
Mhh, why would I want to have another table-based file format where white spaces and tab/line locations "encode" the semantics?
> there is no one standard way that everybody understands
That's correct, but there need not be. Only those who maintain the code to read and write the format need to understand it.
Any arbitrary JSON is obscure unless you understand the semantics of the fields. You can still pass it on as a JSON and get the contents.
How to interpret the contents is another matter. Once a JSON schema is established and documented, user code can use it, regardless of how the semantics are defined.
I'm aware that this proposal requires user code to be able (for example) to return a table row as a dictionary of name-value pairs, where the names come from the header. That is a level of parsing that JSON users would usually expect to get directly from a JSON API, and it can't be done here. But at a higher level, wrappers could be written to do so in the context of a class that would be designed to support the format. Yet at the JSON level, it could still be passed around in a pure JSON context, which is the main argument for sticking with JSON.
-P.
For clarity, we're not talking about "another table-based format". The format is JSON. The ability to encode tables would be done at the semantic level.
So to translate your question, "Why would you want tables at all?"
My own answer is that they're not absolutely required, but are useful.
When dealing with multiple objects of the same type, like atoms or bonds with their properties, a strictly hierarchical schema would generally encode each property of each atom with a name-value pair. The name often takes up more space than the value. A table allows a single statement of the list of property names and then allows each atom's properties to be specified as a list of corresponding values.
The API, of course, would continue to provide options such as said dictionary to return atom properties as name-value pairs, "as if" the data had been transmitted hierarchically.
-P.
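A minimal Python sketch of this column-oriented encoding, plus the thin wrapper that hands each row back as name-value pairs (the "head"/"body" field names echo the earlier comment and are otherwise made up):

```python
# Property names stated once, then one row of values per atom.
table = {
    "head": ["element", "charge", "x"],
    "body": [["C", 0, 0.00],
             ["O", -1, 1.23]],
}

# The wrapper restores the hierarchical view: each row as a name->value dict,
# "as if" the data had been transmitted hierarchically.
rows = [dict(zip(table["head"], row)) for row in table["body"]]
print(rows[1]["element"], rows[1]["charge"])  # -> O -1
```

The space saving is exactly the one described above: each property name appears once in `head` instead of once per atom.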
Well, I agree that the key (name) may take more space than the value in some cases (XML has the same problem, even worse), but I don't see this as a really big problem.

However, if you add a kind of "uber-semantic" to the JSON format, you add a lot of complexity: a lot of work is required for specifying, implementing, and maintaining appropriate code for reading and writing such an uber-format (and if your plan is to support more than one programming language, this multiplies the effort). In all likelihood you will also kill a lot of the flexibility JSON offers (e.g. extending the format without breaking previous versions), and you will probably lock out the usage of all the new JSON schema languages that have been developed recently or are under development.

On the other hand, I am not sure what you really gain: it might be more space-efficient (okay, not too big an argument anymore); and maybe, but only maybe, you can write/read it by hand a bit more easily.

Markus :-)
It can be if you're storing multiple conformers for a larger molecule. Coupled with JSON's requirement to read the whole string in memory at once, it has a potential to be... suboptimal. |
Closing this because there's now (and has been for a while) an implementation of CommonChem and an RDKit-specific extension of that in rdMolInterchange: http://rdkit.org/docs/source/rdkit.Chem.rdMolInterchange.html
Discussion Document
It'd be great to have a chemical JSON format in the RDKit. We're collecting ideas here.
Please include ideas and/or pointers to other attempts at this in the comments below. I will integrate them up here.
Limitations
Features that won't be in the first version, but that might come
Requirements