Chemical JSON format #1137
Comments
A few years ago I added support for JSON formats to Open Babel. The two example formats I implemented were the ChemDoodle JSON format and the JSON output of the PubChem PUG REST API. There is also the OpenChemistry Chemical JSON project.
I don't think there's enough value in simply replacing the storage format itself. Yes, it's slightly easier to parse JSON than the row/column-based SDF format, but that by itself isn't sufficient. I think it's also really important to define the scope of this (else things like query values start to creep in). One of the things that would be a real value-add is coming to some consensus on defining a minimally complete representation distinct from computed properties, so as to minimize possible inconsistencies. For example, the treatment of stereochemistry in SDF/MOLBLOCK can be inconsistent between the calculated parity value (R/S) and the atomic coordinates with wedged bonds. Another thing I'd like, but may get flamed on, is to avoid any kind of explicit ordering on atom indices, since the underlying graph structure and its associated properties should be invariant under isomorphism. That is, things that depend on a particular index should be labelled explicitly (e.g. atom mappings). Practically this means that we shouldn't guarantee an iteration order over atoms/bonds.
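A tiny sketch (in Python, with made-up field names; this is not a proposed spec) of what "explicit labels instead of guaranteed ordering" could look like in practice:

```python
import json
import random

# Hypothetical document: the "atoms" and "map" field names are assumptions.
# Each atom carries an explicit label, so no meaning attaches to list order.
doc = '''{"atoms": [
    {"map": 3, "element": "O"},
    {"map": 1, "element": "C"},
    {"map": 2, "element": "N"}
]}'''

atoms = json.loads(doc)["atoms"]
random.shuffle(atoms)  # a conforming reader must tolerate any ordering

# Anything that needs an index looks atoms up by their explicit label instead.
by_label = {a["map"]: a["element"] for a in atoms}
print(by_label[1], by_label[2], by_label[3])  # -> C N O
```

The point of the shuffle: if round-tripping through a writer that permutes the list changes nothing for readers, the format has no implicit ordering guarantee to violate.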
We should take a look at things like JSON-LD, HAL, and Collection+JSON to see whether they might be helpful for creating proper media types and/or making the molecule format directly available to Web APIs. The chemicaljson link mentioned by Matt already mentions this, too. This might be a starting point: https://sookocheff.com/post/api/on-choosing-a-hypermedia-format/, which concludes: […]
@proteneer: what's motivating the desire to avoid atom indices? I'm pretty sure that they make the file more (human-)readable. If we treat the indices as a convenience feature for the input format, but not something that's guaranteed to be preserved on parsing the file, does that help?
@greglandrum - my point was simply that the spec itself (which is separate from any concrete JSON implementation) should not guarantee consistent ordering. So basically, as you mentioned, one way to do this is at the implementation's serialization level (i.e. serialization and deserialization may permute the ordering). As an example, there's a particular format (which I won't mention here) that prefers to put explicit hydrogens at the end of the molblock for "convenience's" sake. This is great until the inevitable molblock violating this guarantee comes along and everything breaks. Note that I'm fine with an implementation that actually uses a list of atoms. I do agree with you that it's far more accessible to read, even if we run the risk of implementers assuming consistent ordering.
There should be a column for atom index. Users are free to ignore it. You don't have to write the atoms out in that order, but if you do, people might be able to use simple stupid tools like diff to quickly compare two molecule files. This may not sound very useful to a chemist, but when I run a batch job over 25K ligands, diff'ing the ins and outs and flagging only the ones that changed -- or didn't, depending -- for a closer look is a very useful feature trivially coded in a one-line shell post-script. There should also be a column for atom label, because I don't know of any algorithm that can label atoms C-alpha, H-beta-21, etc. for the two dozen molecules that use those. Every piece of code here has the atom tables for the "common" residues, each with its own typos and who knows what. We wouldn't have to do that if our exchange formats didn't throw away protons, atom labels and indexes, and everything else that is "obvious to a chemist".
A separate issue is that JSON itself is not a streaming format. Valid JSON has to be a single string that gets loaded into RAM in order to be parsed into a single "JavaScript object". Consider the size of the string describing a hundred "best models" for a moderately-sized polymer.
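For what it's worth, line-delimited JSON ("JSON Lines") is one common workaround: each record is a complete JSON document on its own line, so a reader never has to hold the whole stream in memory. A minimal Python sketch (the "model"/"coords" keys are made up for illustration):

```python
import io
import json

# Two self-contained JSON documents, one per line. A real file of a hundred
# "best models" would simply have a hundred such lines.
stream = io.StringIO(
    '{"model": 1, "coords": [[0.0, 0.0, 0.0]]}\n'
    '{"model": 2, "coords": [[0.1, 0.0, 0.0]]}\n'
)

models = []
for line in stream:              # memory use is per record, not per file
    models.append(json.loads(line))

print(len(models), models[1]["model"])  # -> 2 2
```

The trade-off is that the file as a whole is no longer a single valid JSON document, so generic JSON tooling sees records, not the collection.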
Thanks for the mention. I am happy to answer any questions about SciData when/if you get to look at it.

Stuart

On Oct 31, 2016, at 9:14 AM, Matt Swain wrote:

> On the topic of JSON-LD, there is SciData (http://stuchalk.github.io/scidata/), recently published by @stuchalk, which seems like it would be worth looking at, even if the scope is slightly different from what is relevant to RDKit.
@dmaziuk Never accommodate users who use "stupid tools" - that is their own responsibility. A big advantage of JSON is that there are well-tested parsers for basically any language and environment (even for Unix shells, if you really need it). And there are also JSON streaming solutions for basically all important languages (conceptually there is no big difference between parsing XML and JSON).
Uh-huh. Well, I'll stick to formats that let me use tools that actually work. It's a good thing by now I can write a format converter with my eyes closed and one hand tied behind my back.
For me, the key thing about the format is that it supports multiple conformers of the same molecule efficiently. That's what kicked the discussion off in the original rdkit-discuss thread. I would imagine that means one block defining the chemistry, and then multiple sets of co-ordinates for the conformations. If 2D co-ordinates could be labelled as distinct from 3D ones, that would be helpful, though it might create problems in the RDKit molecule object.
There is an advantage to storing a table of atoms & bonds as delimited text: you can load it in Excel. Do not underestimate the power of Excel. (And other stupid tools.) If you define the data structure, you can write it out as a Protocol Buffer definition and dump it into binary. Or, e.g., as a Document Type Definition and dump it into XML. It's only a matter of picking up the appropriate library and feeding it your data structure in the way it understands.
+1 to @DavidACosgrove's general comments about multi-conformer support, and an additional note: in my experience, reading conformers efficiently does come down to reading the coordinates efficiently (once chemistry perception is out of the way), and that means reading binary. Luckily, we don't need to come up with a new format to handle binary once we decide on the JSON structure: MsgPack is a 1-to-1 encoding from text JSON to binary, with support for as many languages as support JSON: http://msgpack.org/index.html The RCSB is going down this exact same route for macro-molecule representation as well. That format focuses heavily on compressing large macro-molecules for efficient transmission, so I doubt we want to use it for small molecules, but I could be wrong. An .mmtf reader would be a useful addition to RDKit regardless. @dmaziuk Do Protocol Buffers have a 1-to-1 mapping to JSON like MsgPack? I am unfamiliar with the pros and cons of each.
Protocol Buffers 3 does indeed have a JSON encoding in addition to the binary one.

Yutong Zhao
Protobuf is the schema, aka DTD, plus translator. AFAICT MsgPack just packs the bytes and lets the reader sort them out. IME people who didn't sit through Algorithms and Data Structures 101 tend to view the lack of the schema as a feature, whereas Comp. Sci. types call it a bug. A table of coordinates would be a few bytes smaller in a binary format than in CSV: no comma delimiters, but the overhead is minimal. The CSV, OTOH, can be directly loaded into a database, edited with […]
This is a good discussion, but I'm afraid that we are heading a bit off into the weeds here. I think it would be more productive to figure out what information we need to capture and then to think about the technology (format) that we need to store that information. I suspect that we will actually end up with multiple formats in order to be able to balance robustness, portability, and performance.
@dmaziuk: The reason for favouring a binary format for these purposes is not size, it's speed. With a binary format, numerical contents can be read directly into a float or int; with any ASCII format, something that ultimately calls atof will have to be used, which imposes a significant overhead on reading. I think you may be mis-counting the size difference, however. An int in binary format will normally be 4 bytes whatever the value of the integer being stored; in ASCII it can be anywhere between 1 and 10 bytes. @greglandrum has a point, however - let's first decide what should be in the file!
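As a rough illustration of the two points above (using Python's stdlib `struct` module; the coordinate values are arbitrary and chosen to be exactly representable so the comparison is clean):

```python
import struct

coords = [12.25, -3.5, 0.125]

# Text form: variable width, and reading it back requires string parsing
# (float() here, atof in C).
text = ",".join(repr(c) for c in coords)
parsed_text = [float(f) for f in text.split(",")]

# Binary form: fixed 8 bytes per double, read back with no string parsing.
packed = struct.pack("<3d", *coords)
parsed_bin = list(struct.unpack("<3d", packed))

print(len(packed))                          # -> 24 (always 3 * 8 bytes)
print(parsed_text == parsed_bin == coords)  # -> True
```

Note the binary record is *larger* here than the 16-character text form; as the comment says, the argument for binary is the fixed width and the absence of atof on the read path, not size.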
IME speed has never been a practical problem. By the time it starts biting you, there are three more next generations of hardware out there and your computer is long overdue for an upgrade. (We're unzip/untar'ing text files on 3-7 year old hardware fast enough to saturate the SATA bus and hang the machine, and I have to configure […])

I think one of the things missing from Greg's requirements is intended audience: who is going to use the format, and for what purpose? And also: why does RDKit need another format? JSON is the web's darling du jour, now that XML has settled into its niche and we all moved on, but it's only really good for what it was intended for: sending small snippets of JavaScript directly into the browser. If the intended audience is not the browser, RDKit is not JavaScript, and the data is not small...
The other thing is you can spin the math either way: you're not going to represent "ALA", "CA", etc. in binary any more efficiently than in ASCII/UTF-8. "12.3" takes up 4 bytes in UTF-8 and 8 bytes in double-precision IEEE 754. Plus the round-off error: if you really want to do it right, you want to send a "significant digits" integer alongside, so that your users can tell whether 12.000019287547965 is actually 12 or 12.000. If you send it as text you can off-load the decision to the user: they can stare at "12.000" and try to figure out whether it is actually accurate to the 3rd digit, or the programmer just printed it as "%7.3f" because the numbers line up pretty that way.
Related discussion at alchemistry.org: https://github.com/alchemistry/fileformat
Hi, I am one of the MMTF developers. One of the things we are thinking about is ways to flexibly add more metadata to the format.
Why a new format? My answer to this would be: […]
@dmaziuk "...but it's only really good for what it was intended for: sending small snippets of JavaScript directly into the browser." Hmm, I tend to disagree - a lot of web services use JSON as an exchange format for large amounts of data nowadays, and have you come into contact with the NoSQL world, with things like Lucene, Solr, and Elasticsearch, which all pretty strongly rely on or support JSON? JSON is also natively understood by JavaScript, which has a growing relevance on the server side of web services, and it is almost natively understood by Python (the JavaScript guys actually stole the Python dictionary data type when they developed JSON).
@dmaziuk is absolutely right: being a bit more explicit about what we want to accomplish with the format, as well as who the intended users are, is a good idea. I will put some more meat on this later, but I'm primarily looking for an efficient and flexible format for storing and exchanging data about small molecules. It should be both machine- and human-readable (or at least have an easy way to get a human-readable form) and support optional toolkit-dependent information (like ring information, aromaticity, etc.) that can be ignored (or not) by other toolkits. I'm really not looking to create the one-format-to-rule-them-all, and my focus at the moment is almost entirely on having something for the RDKit, though I want to be very sure that it's easily usable by other toolkits as well. My biases on this one: […]
Please consider defining a data model first, and then a data format as an implementation of this data model. I understand that the focus of this discussion is on JSON, for good technical reasons. But the technical requirements for data formats vary: one person needs JSON, another one needs XML, a third one needs HDF5. There will always be many formats for the same kind of data because of technical imperatives. And that means format conversion, which we all love to do, right? Format conversion is actually not much of a problem if it's lossless in both directions, i.e. if the conversion happens between two formats that represent the same data. And that common abstract definition is the data model. Think of it as a high-level format description. For more details, see this article. You might also want to look at my MOSAIC data model/format for computational chemistry, and read the paper that explains the rationale behind its design. You might be able to actually use MOSAIC by adding a JSON implementation. Or extend MOSAIC to your needs. But the most important aspect of MOSAIC is the two-level design as a data model with multiple implementations.
Wow, I don't need to come back and flesh out what I was thinking too much; @khinsen just said a lot of it for me, and better than I probably would have. Restating, hopefully accurately, using a somewhat different vocabulary: we should really be defining a schema that describes the information we're trying to capture, and then worry about details of the physical representation (i.e. JSON, protobuf, msgpack, etc.).
@greglandrum Exactly. In my experience, the best approach to defining a data model is a hierarchical one, just like for program design. At the highest level, you may want to describe a molecule as a graph, for example, and decide which attributes you want to attach to vertices and edges. Next, you could define how to represent that graph plus its attributes in terms of more basic data structures such as arrays of strings, numbers, etc. The last step is the concrete data format. |
@dmaziuk Your example concerning numbers is a nice illustration of what should be defined in a data model, and why it is important to have one. At the data model level, it matters if you want to represent a measured or computed value with an attached precision, or a raw floating-point value from a computation. You probably don't want to off-load that decision to the user, but even if that's what you want, this choice is part of your data model. If you start from the other end, e.g. the efficiency of representation, you will probably end up defining a format that is impossible to convert to anything else without losing information or, worse, having to make up information. BTW, if you need to represent raw floating-point numbers in a text-based format, e.g. for continuing a computation at a precise state saved in a file, a decimal representation is a sure recipe for having to worry about round-off errors. A byte sequence in IEEE format is error-free and very portable, it's just not human-readable. As a compromise, you can consider floating-point notation in base 8 or 16, which permits error-free conversion to and from IEEE. |
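Python happens to expose exactly this base-16 notation via `float.hex()`/`float.fromhex()`, which makes the point easy to demonstrate:

```python
# The classic value that a short decimal rendering cannot capture exactly.
x = 0.1 + 0.2

# Hex-float text is an exact, human-typable encoding of an IEEE double:
# the round trip through text never introduces rounding error.
as_hex = x.hex()                    # e.g. '0x1.3333333333334p-2'
assert float.fromhex(as_hex) == x   # bit-exact round trip

# A short decimal rendering, by contrast, lands on a *different* double.
assert float(f"{x:.1f}") != x

print(as_hex)
```

So a text-based format that needs to restart computations from an exact saved state could carry hex floats, while still offering a rounded decimal field for human eyes.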
To try and keep it on the topic of data and not format: […]
@greglandrum, how generalizable is this requirement? Is this as simple as the Tripos atom name field, i.e. a fixed-size string? Or something that can hold any arbitrary key-value data? Hopefully the latter, and I would generalize it to both the molecule and the bonds. Something like the following to serialize RDKit properties:

```json
{
  "_Name": "CorpID",
  "foo": "bar",
  "atoms": [{"partial charge": 1.23, "force": [0.1, 0.2, 0.3], ...}, ...],
  "bonds": [{"highlighted": true, ...}]
}
```

Being able to add arbitrary properties on the molecule, atoms, and bonds would be very powerful. And it matches RDKit's property system, since I think targeting just RDKit is fine for now as well.
@coleb: I intended to cover that with: […]
@greglandrum good, very cool. :-) So what is "can include atom labels" then? How is that different? |
Ah, right. That is, in my mind, the equivalent of the "CA" or "CB" in a PDB file.
@khinsen not sure what you mean by IEEE being error-free: as I recall the entire first chapter of our Sci.Comp. 201 textbook was about error control. @greglandrum My vote would be for segmented data model with an atom/bond table and a completely separate coordinate table, and so on. There has to be a core section that is mandatory (and once you define it and people start using it, it'll be very hard to change), conformers are optional; etc. You can tar/zip them and call the resulting archive .rdk (RPM and DEB packages, among others, are that). Or concatenate them in one file with section delimiters. On the end-user side IME number crunching typically involves tables: matrices and such, and pulling out subsets works well with tables e.g. loaded into sqlite. Table-based is good, implicit column headers (numbers) -- not so much, but if I had to choose between that and JSON list of maps (rows), I'd probably go for numbered columns. |
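A minimal sketch of what such a segmented document might look like (all section and field names here are invented for illustration, not a proposal): a mandatory chemistry core, plus optional sections a reader may ignore.

```python
import json

# Hypothetical segmented layout: "core" is mandatory, "conformers" optional.
doc = {
    "core": {  # mandatory, and hard to change once people rely on it
        "atoms": [{"element": "O"}, {"element": "H"}, {"element": "H"}],
        "bonds": [[0, 1], [0, 2]],
    },
    "conformers": [  # optional: one coordinate table per conformer
        [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
    ],
}

def validate(d):
    """A reader insists only on the core; optional sections pass through."""
    return "core" in d and "atoms" in d["core"] and "bonds" in d["core"]

roundtripped = json.loads(json.dumps(doc))
print(validate(roundtripped))  # -> True
```

The same segmentation would work equally well as separate files concatenated with delimiters or bundled into a tar/zip archive, as suggested above.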
For me having a schema where properties (their names and what data they hold) are explicitly defined seems more and more important. Having fields with arbitrary (though typed as text/float/...) user data is a use case but for interoperability different consumers of the format need to "discover" the properties to actually use them. Properties can be optional with a required core to allow for slim files. |
@dmaziuk It's the transmission (encoding/decoding) of floating-point numbers that is error-free if you use a binary representation. Computations are a different story. |
Which binary encoding? IEEE binary encoding will turn 0.3 into 0.30000000000000004. Transmission errors: noise, bit flips, etc. affect unicode binary bits exactly the same way as ieee binary bits. Forgive me for having difficulties with the meaning of "error free" in this context. |
@dmaziuk Ouch, there are too many distinct meanings of "binary" in this context! I am thinking of the IEEE binary formats, which are by far the most used ones. Error-free conversion from and to text representations is possible only for (1) raw byte dumps, or (2) a base-2/8/16 representation. Your example proves my point: you can't convert decimal "0.3" to IEEE binary float formats without error. |
A request: the lack of threading in these comment threads makes it difficult enough to track long discussions, let's please try to stay on topic here and not continue the discussion about binary vs text (or other details of what the eventual physical format may be). |
@khinsen no.
@greglandrum the relevant point is whether you want to add the "num significant digits" field to every floating-point field in your data model. |
I've not seen a good summary of requirements so far. I'd like to see included: user-specifiable structure-level properties per structure. Some properties might be built-in, perhaps by using reserved keywords to specify them; examples: formal charge (on an atom), partial charge (on an atom), bond order (on a bond), and so on. These could include properties that are always required to be present, as well as properties that are sufficiently commonly used that standard names would be desirable. Since this whole discussion started out on the rdkit-discuss list as a way to store conformers (not just multiple molecules), it would be good if there were a way to take advantage of any storage savings that might be possible for a sequence of conformations. I'm not sure that's a requirement, though. In certain situations, there might be associated guarantees as well. For example, a molecule known to be a conformer beyond the first one in a sequence of molecules might share all properties (ct, atom, bond) specified for the leading conformer in the sequence, unless overridden in the later conformer. So any conformer is in effect specified by difference from the first conformer in the sequence.
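The conformer-by-difference idea can be sketched in a few lines of Python (the field names are illustrative only, and this only handles flat, top-level overrides):

```python
def expand_conformers(base, deltas):
    """Yield full conformer records: each delta overrides fields of the base."""
    for delta in deltas:
        record = dict(base)   # start from the leading conformer's properties
        record.update(delta)  # later conformer overrides only what it stores
        yield record

# Leading conformer carries everything; later ones store only differences.
base = {"energy": -10.0, "coords": [[0.0, 0.0, 0.0]], "charge": 0}
deltas = [{},                                              # conformer 1 = base
          {"energy": -9.7, "coords": [[0.1, 0.0, 0.0]]}]   # conformer 2

confs = list(expand_conformers(base, deltas))
print(confs[0]["energy"], confs[1]["energy"], confs[1]["charge"])
# -> -10.0 -9.7 0
```

The storage saving comes from shared properties (like `charge` here) appearing once rather than once per conformer.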
PDB chem comp (ligand) model includes a list of structure-level properties as well as tables of atoms and bonds (with properties). One of the reasons they (and we) use STAR is because it's about the only format that lets you combine tables and key-value pairs in a reasonable fashion. (Don't get me started on shortcomings of STAR.) JSON does not have a built-in table data type. |
In JSON you could emulate a table with an array of strings, each of which is the row of a CSV, the first of which would be a row of headers. It could get a bit more elaborate to facilitate recognition and parsing, but it is probably workable. User code would have to supply a convenient API.

-P.
... or a list of lists, or a { "head" : [ ...], "body" : [[...], ...] } -- that's my point: there is no one standard way that everybody understands.
Mhh, why would I want to have another table-based file format where white spaces and tab/line locations "encode" the semantics?
> there is no one standard way that everybody understands
That's correct, but there need not be. Only those who maintain the code to read and write the format need to understand it.
Any arbitrary JSON is obscure unless you understand the semantics of the fields. You can still pass it on as a JSON and get the contents.
How to interpret the contents is another matter. Once a JSON schema is established and documented, user code can use it, regardless of how the semantics are defined.
I'm aware that this proposal requires user code to be able (for example) to return a table row as a dictionary of name-value pairs, where the names come from the header. That is a level of parsing that JSON users would usually expect to get directly from a JSON API, and it can't be done here. But at a higher level, wrappers could be written to do so in the context of a class that would be designed to support the format. Yet at the JSON level, it could still be passed around in a pure JSON context, which is the main argument for sticking with JSON.
-P.
For clarity, we're not talking about "another table-based format". The format is JSON. The ability to encode tables would be done at the semantic level.
So to translate your question, "Why would you want tables at all?"
My own answer is that they're not absolutely required, but are useful.
When dealing with multiple objects of the same type, like atoms or bonds with their properties, a strictly hierarchical schema would generally encode each property of each atom with a name-value pair. The name often takes up more space than the value. A table allows a single statement of the list of property names and then allows each atom's properties to be specified as a list of corresponding values.
The API, of course, would continue to provide options such as said dictionary to return atom properties as name-value pairs, "as if" the data had been transmitted hierarchically.
-P.
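A minimal Python sketch of this column-oriented encoding, plus the thin wrapper that hands each row back as name-value pairs (the "head"/"body" field names echo the earlier comment and are otherwise made up):

```python
# Property names stated once, then one row of values per atom.
table = {
    "head": ["element", "charge", "x"],
    "body": [["C", 0, 0.00],
             ["O", -1, 1.23]],
}

# The wrapper restores the hierarchical view: each row as a name->value dict,
# "as if" the data had been transmitted hierarchically.
rows = [dict(zip(table["head"], row)) for row in table["body"]]
print(rows[1]["element"], rows[1]["charge"])  # -> O -1
```

The space saving is exactly the one described above: each property name appears once in `head` instead of once per atom.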
Well, I agree that the key (name) may take more space than the value in some cases (XML has the same problem, even worse), but I don't see this as a really big problem.

However, if you add a kind of "uber-semantic" to the JSON format, you add a lot of complexity: a lot of work is required for specifying, implementing, and maintaining appropriate code for reading and writing such an uber-format (and if your plan is to support more than one programming language, this multiplies the effort). In all likelihood you will also kill a lot of the flexibility JSON offers (e.g. extending the format without breaking previous versions), and you will probably lock out the usage of all the new JSON schema languages that have been developed recently or are under development.

On the other hand, I am not sure what you really gain: it might be more space-efficient (okay, not too big an argument anymore); and maybe, but only maybe, you can write/read it by hand a bit more easily.

Markus :-)
It can be if you're storing multiple conformers for a larger molecule. Coupled with JSON's requirement to read the whole string in memory at once, it has a potential to be... suboptimal. |
Closing this because there's now (and has been for a while) an implementation of CommonChem and an RDKit-specific extension of that in rdMolInterchange: http://rdkit.org/docs/source/rdkit.Chem.rdMolInterchange.html
Discussion Document
It'd be great to have a chemical JSON format in the RDKit. We're collecting ideas here.
Please include ideas and/or pointers to other attempts at this in the comments below. I will integrate them up here.
Limitations
Features that won't be in the first version, but that might come
Requirements