Add table detection tests and a basic table detector for guessing table regions #53
… simple spreadsheet algorithm
Fantastic, @mcharters! Thanks a LOT. Some comments:
Sure, I'll add a test for the command line app. My tests don't test the full "truth" of the ICDAR data - only table detection. Table extraction/verification is a whole other can of worms. Dunno if you want to keep #51 around until a full test suite is implemented?
You're right, spoke too soon. @melisabok, it's up to you if you want to use @mcharters's code as a starting point or keep working on the Scala project.
I think we can keep building on what @mcharters did. It doesn't need to be done in Scala; we can continue in Java. You can commit it here or in the new repo, either is fine with me.
Why not move @mcharters's very nice work? If you want, you can commit the code related to the test suite here (replacing the Scala code): https://github.com/tabulapdf/icdar-testsuite, and then we can improve the tests to cover the full 'truth'. What do you think?
@melisabok from briefly reading over some of the papers @jazzido linked to, it seems to me that there are two steps to getting tables from PDFs: first detecting table regions, and then extracting the data from those regions. I thought it would be useful to keep them logically separate, so that you could mix and match if you wanted.
That's what I'm doing in the code that runs when you pass -g on the command line: the SpreadsheetDetectionAlgorithm finds the tables, and then the BasicExtractionAlgorithm is used to get data from the page region. The extract method of the SpreadsheetExtractionAlgorithm (SEA) never gets called. Maybe the opposite should be true: the logic in the SEA that relates to table detection could be moved entirely to SpreadsheetDetectionAlgorithm and then called from the SEA. But I'm just getting to know the code, so I'm not sure if that's feasible right now.
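To make the "detect first, then extract" split concrete, here is a minimal sketch. It does not use the project's real classes: `Region`, `Detector`, `Extractor`, `GapDetector` and `SplitExtractor` are all hypothetical names, and the "page" is just a list of text lines rather than a PDF page. The only point is that either step can be swapped without touching the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DetectThenExtract {

    /** Hypothetical region: an inclusive range of line indexes on a toy "page". */
    static final class Region {
        final int firstLine;
        final int lastLine;

        Region(int firstLine, int lastLine) {
            this.firstLine = firstLine;
            this.lastLine = lastLine;
        }
    }

    /** Step 1: guess where the tables are. */
    interface Detector {
        List<Region> detect(List<String> page);
    }

    /** Step 2: pull rows of cells out of one detected region. */
    interface Extractor {
        List<List<String>> extract(List<String> page, Region region);
    }

    /** Toy detector: a table is a run of consecutive lines containing a gap of 2+ spaces. */
    static final class GapDetector implements Detector {
        @Override
        public List<Region> detect(List<String> page) {
            List<Region> regions = new ArrayList<>();
            int start = -1; // index where the current run began, or -1 if none is open
            for (int i = 0; i < page.size(); i++) {
                boolean tabular = page.get(i).trim().contains("  ");
                if (tabular && start < 0) {
                    start = i;
                }
                if (!tabular && start >= 0) {
                    regions.add(new Region(start, i - 1));
                    start = -1;
                }
            }
            if (start >= 0) {
                regions.add(new Region(start, page.size() - 1));
            }
            return regions;
        }
    }

    /** Toy extractor: split each line of the region on runs of 2+ spaces. */
    static final class SplitExtractor implements Extractor {
        @Override
        public List<List<String>> extract(List<String> page, Region region) {
            List<List<String>> rows = new ArrayList<>();
            for (int i = region.firstLine; i <= region.lastLine; i++) {
                rows.add(Arrays.asList(page.get(i).trim().split(" {2,}")));
            }
            return rows;
        }
    }

    /** Any detector can be paired with any extractor. */
    static List<List<List<String>>> run(Detector detector, Extractor extractor, List<String> page) {
        List<List<List<String>>> tables = new ArrayList<>();
        for (Region region : detector.detect(page)) {
            tables.add(extractor.extract(page, region));
        }
        return tables;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "Quarterly report",
                "Name   Q1   Q2",
                "Ann    3    4",
                "Thanks for reading");
        // prints [[[Name, Q1, Q2], [Ann, 3, 4]]]
        System.out.println(run(new GapDetector(), new SplitExtractor(), page));
    }
}
```

With this shape, a test suite that only checks table detection can call the detector on its own and compare the returned regions against ground truth, without involving extraction at all.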
…d tables - this results in a few more tests passing
Hi @mcharters, quick question: how do I run the ICDAR tests?
Hi @jazzido, I tagged the tests with @Ignore because so many of them fail. I didn't know if having a bunch of failing tests would mess up integration testing. To run them just comment out the @Ignore tag and then
Good call. In any case, expecting 0% failures would be unrealistic for the ICDAR dataset, so I guess we should not
Sounds like a plan. I suppose to start we could track a baseline of failures to make sure changes don’t make more failures happen and alert us when things improve.
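A baseline check along those lines could look roughly like this. It is only a sketch: the class name `BaselineCheck` and the document names are made up, and it assumes the failing tests are available as plain sets of identifiers (how they get collected from the test run is left open).

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class BaselineCheck {

    /** Failures that are in the current run but not in the recorded baseline. */
    static Set<String> regressions(Set<String> baseline, Set<String> current) {
        Set<String> result = new TreeSet<>(current);
        result.removeAll(baseline);
        return result;
    }

    /** Baseline failures that no longer fail in the current run. */
    static Set<String> improvements(Set<String> baseline, Set<String> current) {
        Set<String> result = new TreeSet<>(baseline);
        result.removeAll(current);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical identifiers, not real ICDAR file names.
        Set<String> baseline = new TreeSet<>(Arrays.asList("doc-a", "doc-b"));
        Set<String> current = new TreeSet<>(Arrays.asList("doc-b", "doc-c"));

        // prints New failures: [doc-c]
        System.out.println("New failures: " + regressions(baseline, current));
        // prints Fixed since baseline: [doc-a]
        System.out.println("Fixed since baseline: " + improvements(baseline, current));
    }
}
```

A build would fail only when `regressions` is non-empty, and the baseline would be updated whenever `improvements` is non-empty.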
I opened #54 to track development.
The changes on this branch do a bunch of stuff:
This should provide a basis for people to start contributing and evaluating different table detection algorithms