Change the parser library #176

Open
Bernardo-MG opened this issue Aug 3, 2017 · 2 comments

@Bernardo-MG
Collaborator

The main problem for the project is the parsing library, which works well for small projects but not for parsing huge files.

This article contains a list of Python parsers:
https://tomassetti.me/parsing-in-python/#parserGenerators

The library should support a BNF grammar, which should be easy to create from the CWR specification.

Note that the list includes ANTLR, which does support these grammars.

I also have experience with Ply, but it does not seem like a good option for complex grammars.

These projects can be useful as references, as they are my own tests with parsers:
https://github.com/Bernardo-MG/dice-notation-java
https://github.com/Bernardo-MG/dice-notation-python

Based on all this, I think the best course of action would be:

  • Defining a BNF grammar
  • Using ANTLR to generate a Python parser
  • Combining the parser with the project

This would require reworking the project, and probably dropping much of the current code.

Bernardo-MG self-assigned this Sep 5, 2017
@Bernardo-MG
Collaborator Author

Preparing ANTLR grammar:

https://github.com/Bernardo-MG/cwr-grammar

This can be used to generate a Python parser. Validation rules should be applied to this parser.

@Bernardo-MG
Collaborator Author

Bernardo-MG commented Sep 15, 2017

Checked the current version of the ANTLR grammar against the test files. It parses all of them except the 230MB file. After increasing memory, that file is parsed in close to 3 minutes, but some errors are found.

Currently the ANTLR grammar splits the file into transactions and records, but each record is left as the full line, unprocessed.
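To make the shape of that split concrete, here is a minimal plain-Python sketch. It is not the ANTLR grammar and not the generated parser, only an illustration of grouping raw lines into transactions while leaving each record as the unprocessed line. It assumes the record type is the first three characters of a line; the `TRANSACTION_HEADERS` and `CONTROL_RECORDS` sets and the `iter_transactions` name are hypothetical and incomplete, and the sample lines are invented placeholders, not real CWR data.

```python
from typing import Iterable, Iterator, List

# Assumption: the first three characters of each line hold the record type.
RECORD_TYPE_WIDTH = 3

# Hypothetical, incomplete set of record types that open a transaction.
TRANSACTION_HEADERS = {"NWR", "REV", "AGR"}

# Hypothetical set of control records that sit outside any transaction.
CONTROL_RECORDS = {"HDR", "GRH", "GRT", "TRL"}


def iter_transactions(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each transaction as a list of raw, unprocessed record lines."""
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        record_type = line[:RECORD_TYPE_WIDTH]
        starts_new = record_type in TRANSACTION_HEADERS
        is_control = record_type in CONTROL_RECORDS
        # A transaction header or a control record closes the open transaction.
        if (starts_new or is_control) and current:
            yield current
            current = []
        if is_control:
            continue  # control records are not part of any transaction
        # Detail lines with no preceding header still form a group here;
        # a real parser would report them as errors instead.
        current.append(line)
    if current:
        yield current


# Invented placeholder lines, only to show the grouping.
sample = [
    "HDR placeholder",
    "GRH placeholder",
    "NWR first work",
    "SPU publisher of first work",
    "NWR second work",
    "GRT placeholder",
    "TRL placeholder",
]
transactions = list(iter_transactions(sample))
```

Because the function is a generator over any iterable of lines, an open file object can be passed directly, so only one transaction is held in memory at a time. Whether the same approach would help with the 230MB file in the generated parser is untested here.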

The next step is generating the Python parser and adding it to the project.
