Is your feature request related to a problem? Please describe.
flex and bison have supported C++ lexers and parsers for a while now, and they've matured to the point where we shouldn't have to do things like set up input strings, use unions to define token types, and handle cleanup after parsing failures. We should use the available C++ abstractions to make the SMILES parser more readable and maintainable.
Describe the solution you'd like
I think the following should be achievable:
* more informative error messages when parsing fails
* reduce the need to anticipate bad inputs in order to prevent memory leaks
This refactors the SMILES parsing procedure to separate the parsing and
ROMol construction procedures. Given the intricate nature of the mol
construction procedure, a 1D list of mol events made the most sense as
the parser's output. This allowed me to ensure that the order in which
atoms/bonds are created and properties are set follows that from the
previous implementation.
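
To illustrate the idea of a flat event list that is replayed in order, here is a minimal sketch. All names (`MolEvent`, `ToyMol`, `buildMol`, `exampleEvents`) are hypothetical and are not the types used in this PR; the events for `C[O-]` are hand-written rather than produced by a real parser.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event record; the PR's actual event types are not shown here.
struct MolEvent {
  enum class Kind { AddAtom, AddBond, SetCharge };
  Kind kind;
  std::string symbol;  // AddAtom: element symbol
  std::size_t first;   // AddBond: begin atom index; SetCharge: atom index
  std::size_t second;  // AddBond: end atom index
  int value;           // AddBond: bond order; SetCharge: formal charge
};

struct ToyBond {
  std::size_t begin;
  std::size_t end;
  int order;
};

// Toy stand-in for the molecule being constructed.
struct ToyMol {
  std::vector<std::string> atoms;
  std::vector<int> charges;
  std::vector<ToyBond> bonds;
};

// Replays the events strictly in list order, so the creation order of
// atoms, bonds, and properties is decided entirely by the parser output.
ToyMol buildMol(const std::vector<MolEvent>& events) {
  ToyMol mol;
  for (const MolEvent& ev : events) {
    switch (ev.kind) {
      case MolEvent::Kind::AddAtom:
        mol.atoms.push_back(ev.symbol);
        mol.charges.push_back(0);
        break;
      case MolEvent::Kind::SetCharge:
        mol.charges.at(ev.first) = ev.value;  // throws on a bad atom index
        break;
      case MolEvent::Kind::AddBond:
        mol.bonds.push_back(ToyBond{ev.first, ev.second, ev.value});
        break;
    }
  }
  return mol;
}

// Hand-written events that a parser might emit for "C[O-]" (an assumption).
std::vector<MolEvent> exampleEvents() {
  using K = MolEvent::Kind;
  return {
      MolEvent{K::AddAtom, "C", 0, 0, 0},
      MolEvent{K::AddAtom, "O", 0, 0, 0},
      MolEvent{K::SetCharge, "", 1, 0, -1},
      MolEvent{K::AddBond, "", 0, 1, 1},
  };
}
```

Because the builder only walks the list front to back, reordering construction steps means reordering events, not touching the grammar. With C++17, a `std::variant` of per-event structs would likely be a tidier representation than the tagged struct used here.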
I also added a new project to External/flex (source:
https://github.com/westes/flex) because C++
scanners generated by flex require the FlexLexer.h header. We need this
to be present whether users have flex installed or not, and this was the
easiest way to achieve that.
Summary of changes:
* Updated parsing-related error messages to point to the bad
token and to include more informative messages, e.g.
```
[16:26:38] SMILES Parse Error: check for mistakes around position 13
COc(c1)cccc1C#
-------------^
syntax error
```
```
[16:27:32] SMILES Parse Error: check for mistakes around position 1
[Bg]
-^
unsupported atom symbol
```
* Removed manual memory management from the ROMol construction
procedure by using RAII classes (this doesn't apply to ring closures)
* This also removes bad interactions between consecutive SMILES
parses, which previously required resetting global state after a
bad input. See refactored SmilesParse::test.cpp::testFail
* Fixes bugs like:
  * preventing hydrogens with defined chiralities
  * allowing branch atoms of the form `C1(.C1)`
  * allowing ring bonds like `C1.C%01`
  * restricting formal charges to -15 <= N <= 15
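
The error layout and the RAII point above can be sketched together. This is an illustration under assumptions, not the code in this PR: `ToyAtom`, `formatParseError`, and `makeChargedAtom` are hypothetical names, the reason string "formal charge out of range" is invented, and the position is assumed to be the 0-based offset of the offending character (which matches the two sample messages). The logger timestamp is left out.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an atom under construction.
struct ToyAtom {
  std::string symbol;
  int formalCharge;
};

// Lays out a message like the samples above: header, input, caret, reason.
// `pos` is assumed to be the 0-based offset of the offending character.
std::string formatParseError(const std::string& input, std::size_t pos,
                             const std::string& reason) {
  return "SMILES Parse Error: check for mistakes around position " +
         std::to_string(pos) + "\n" + input + "\n" +
         std::string(pos, '-') + "^\n" + reason;
}

// Builds a charged atom and rejects charges outside -15 <= N <= 15.
// The unique_ptr owns the atom from the moment it is allocated, so the
// throw below releases it automatically; no manual cleanup path is needed.
std::unique_ptr<ToyAtom> makeChargedAtom(const std::string& input,
                                         std::size_t pos,
                                         const std::string& symbol,
                                         int charge) {
  std::unique_ptr<ToyAtom> atom(new ToyAtom{symbol, 0});
  if (charge < -15 || charge > 15) {
    // The reason text is made up for this sketch.
    throw std::runtime_error(
        formatParseError(input, pos, "formal charge out of range"));
  }
  atom->formalCharge = charge;
  return atom;
}
```

A caller would wrap `makeChargedAtom` in a `try` block and log `e.what()`. The point of the design is that the failure path and the success path share the same ownership rules, so a new kind of bad input does not need its own cleanup code.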
An important concern for this change is performance, so I ran all of the
tests in SmilesParse::test.cpp 1000 times and found that this change
increases the runtime by about 5%, i.e. from about 82 ms to 86 ms.
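
For anyone who wants to repeat this kind of comparison, a minimal timing sketch follows. It is not the harness used for the numbers above (that is not described here); `timeRepeatedMs` and `relativeIncrease` are hypothetical helpers, and the workload is passed in as a callable.

```cpp
#include <chrono>
#include <functional>

// Runs `work` `repeats` times and returns the total wall time in milliseconds.
double timeRepeatedMs(const std::function<void()>& work, int repeats) {
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) {
    work();
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Relative runtime change between two measurements, as a fraction.
// For 82 ms -> 86 ms this is (86 - 82) / 82, i.e. a little under 5%.
double relativeIncrease(double beforeMs, double afterMs) {
  return (afterMs - beforeMs) / beforeMs;
}
```

Timings at this scale are noisy, so it is worth running the measurement several times on each branch before comparing.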