Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add aromaticity field + clear up some discrepancies #35

Merged
merged 25 commits into from
Oct 13, 2018

Conversation

danpf
Copy link
Contributor

@danpf danpf commented Aug 31, 2018

This is a first pass at adding the aromaticity field.

Please post any suggestions you might have!

This is meant to clear up:
#34
#33

spec.md Outdated
@@ -1057,6 +1091,7 @@ The `singleLetterCode` is the IUPAC single letter code for [protein](https://dx.
"formalChargeList": [ 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2 ],
"bondOrderList": [ 1, 1, 2 ],
"bondAromaticity": [ 0, 0, 1 ],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing List suffix?

spec.md Outdated
@@ -388,6 +399,7 @@ The following table lists all top level fields, including their [type](#types) a
| [groupList](#grouplist) | [Array](#types) | Y |
| [bondAtomList](#bondatomlist) | [Binary](#types) | |
| [bondOrderList](#bondorderlist) | [Binary](#types) | |
| [bondAromaticityrList](#bondaromaticitylist)| [Binary](#types) | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo (extra r)

spec.md Outdated
@@ -337,7 +346,8 @@ First create a `Array` to hold values that are referable by indices. In the foll
"elementList": [ "N", "C", "C", "O", "C", "C", "O", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4, 6, 5, 7, 5 ],
"bondOrderList": [ 1, 1, 2, 1, 1, 2, 1 ]
"bondOrderList": [ 1, 1, 2, 1, 1, 2, 1 ],
"bondAromaticity": [ 0, 0, 1, 0, 0, 1, 1 ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing List suffix?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change the example to reflect the phenylring in PHE, so the aromaticity example makes sense.

spec.md Outdated
@@ -347,7 +357,8 @@ First create a `Array` to hold values that are referable by indices. In the foll
"elementList": [ "N", "C", "C", "O", "C", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4 ],
"bondOrderList": [ 1, 1, 2, 1, 1 ]
"bondOrderList": [ 1, 1, 2, 1, 1 ],
"bondAromaticity": [ 0, 0, 1, 0, 0 ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing List suffix?

spec.md Outdated
@@ -1057,6 +1091,7 @@ The `singleLetterCode` is the IUPAC single letter code for [protein](https://dx.
"formalChargeList": [ 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2 ],
"bondOrderList": [ 1, 1, 2 ],
"bondAromaticity": [ 0, 0, 1 ],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing List suffix?

spec.md Outdated
@@ -1066,7 +1101,8 @@ The `singleLetterCode` is the IUPAC single letter code for [protein](https://dx.
"elementList": [ "N", "C", "C", "O", "C", "C", "O", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4, 6, 5, 7, 5 ],
"bondOrderList": [ 1, 1, 2, 1, 1, 2, 1 ]
"bondOrderList": [ 1, 1, 2, 1, 1, 2, 1 ],
"bondAromaticity": [ 0, 0, 1, 0, 0, 1, 1 ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing List suffix?

spec.md Outdated
@@ -1076,7 +1112,8 @@ The `singleLetterCode` is the IUPAC single letter code for [protein](https://dx.
"elementList": [ "N", "C", "C", "O", "C", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4 ],
"bondOrderList": [ 1, 1, 2, 1, 1 ]
"bondOrderList": [ 1, 1, 2, 1, 1 ],
"bondAromaticity": [ 0, 0, 1, 0, 0 ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing List suffix?

spec.md Outdated

*Type*: [Binary](#types) data that decodes into an array of 8-bit signed integers.

*Description*: Array of bond orders for bonds in `bondAtomList`. Must be values between 1 and 4, defining single, double, triple, and quadruple bonds.
*Description*: Array of bond orders for bonds in `bondAtomList`. Must be values between 0 and 4, defining unknown, single, double, triple, and quadruple bonds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about -1 being the unknown value? We also use -1 for undefined secondary structure.
Motivation: One day we might discuss adding zero-order bonds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for consistency, so a -1 would be a good option.

@gtauriello
Copy link

If I understand you correctly, all 3 bond...List fields must be given if bonds are present?

In that case we would need to increment the version to 2.0 since 1.0-compliant files would not be valid anymore. Alternatively, a default behavior for the case of missing bondAromaticityList could be defined (e.g. list of same length as bondOrderList with all values being 0)...

@danpf
Copy link
Contributor Author

danpf commented Aug 31, 2018

@speleo3 Thank you for the comments! I think I fixed all of them. Sorry I missed so much in my first pass.

@pwrose & @speleo3 I updated -1 to be the new unknown bondOrder

@gtauriello I suppose that's what's up for discussion. I tried be as specific as possible for this initial draft, and if everyone else thinks differently then we can change it!

In my opinion it will be easier for everyone involved if all 3 fields exist when bonds are present. Otherwise I can see applications having io code that has a lot of branch points that need to be built to adhere to the spec.

But that's just an opinion, and maybe some people that have deeper integration with mmtf in their applications could speak about some pros/cons from their ends.

I guess this is V2 of the PR now, I made some updates with some more notes/examples. The most important part to me is to make sure that applications don't have to have to account for a bunch of different corner cases when doing io. But you can let me know if I overreached.

Additionally:
Maintainers can make updates as well if you're so inclined to. But I'd be happy to make the changes otherwise.

spec.md Outdated
@@ -1062,7 +1062,7 @@ The mmCIF format allows for so-called micro-heterogeneity on the group-level. Fo
| elementList | [Array](#types) | Array of elements, 0 to 3 character [Strings](#types) | Y |
| bondAtomList | [Array](#types) | Array of bonded atom indices, [Integers](#types) | |
| bondOrderList | [Array](#types) | Array of bond orders as [Integers](#types) between 1 and 4 | |
| bondAromaticityList | [Array](#types) | Array of bond aromaticity as [Integers](#types) between 0 and 4 | |
| bondAromaticityList | [Array](#types) | Array of bond aromaticity as [Integers](#types) -1, 1,2,3 and 3 | |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this read: Array of bond aromaticity as Integers 0 or 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

@arose arose self-requested a review August 31, 2018 23:50
@gtauriello
Copy link

@danpf Sounds good to me. Thanks for the update. I agree that application developers shouldn't have to do many if-branches.

For the decoders, on the other hand, we may still need/want(?) to support reading of 1.0-compliant files where we then need some default behavior to deal with missing aromaticities flags. Once the file is decoded, the user won't have to do any branching anymore, since the decoder can guarantee that all 3 fields have full data. But unless we drop backwards compatibility in our decoders, this implies having a default behavior for decoders to deal with the field missing from the file...

@speleo3
Copy link
Contributor

speleo3 commented Sep 3, 2018

-1 for requiring bondAromaticityList if bondAtomList exists. I don't mind the extra if-branch.

@josemduarte
Copy link
Member

Agree with @speleo3 . I'd keep both order and aromaticity optional. There can be applications that have no idea about the bond types at all.

@arose
Copy link
Contributor

arose commented Sep 4, 2018

I am unsure if the run-length encoding of bondAromaticityList is worth it. In the case where nothing is aromatic, sure, but then you can leave it out altogether. For the case of proteins the "runs" can be quite short, i.e. from one aromatic side chain to another. For secondary structure data run-length encoding was generally not worth it and bondAromaticityList seems similar. @danpf did you calculate if it is worth it?

@danpf
Copy link
Contributor Author

danpf commented Sep 14, 2018

@arose
Since the main bondAromaticityList which would be encoded applies to all inter-group-bonds I did:

num_ig_bonds = pdb.map(lambda t: len(t[1].bond_atom_list)/2).reduce(lambda a,b: a + b)

which yeilds: 126575357.0 which should mean that un-encoded it would amount to be:
126575357.0*8/8/1000/1000 = 126.6mb
.126/9.8 = 0.012 so it would be an increase of 1% on the database size about?

I don't really mind either way, but if we're not going to encode bondAromaticityList it doesn't really make sense to encode bondOrderList which i think would have a similar impact on database size.

but I would expect the vast majority of bonds in bondAromaticityList to be peptide bonds between amino acid groups so i think run-length encoding would probably cut that number down a lot

@danpf
Copy link
Contributor Author

danpf commented Sep 14, 2018

Also, I was just thinking... Maybe we shouldn't call it bondAromaticityList but instead bondResonanceList Since aromaticity implies a cyclic functional group. then all peptide bonds could be represented as a resonating single bond.

@danpf
Copy link
Contributor Author

danpf commented Sep 14, 2018

I made some changes to fit what was discussed. let me know your thoughts!

@arose
Copy link
Contributor

arose commented Oct 13, 2018

but I would expect the vast majority of bonds in bondAromaticityList to be peptide bonds between amino acid groups so i think run-length encoding would probably cut that number down a lot

right, ok then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants