Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add table detection tests and a basic table detector for guessing table regions #53

Merged
merged 11 commits into from Feb 2, 2016

Conversation

mcharters
Copy link
Contributor

The changes on this branch do a bunch of stuff:

  • Creates a new "detectors" package in tabula for table detection algorithms
  • Implements a SpreadsheetDetectionAlgorithm there to replicate the simple table detection from tabula (web) and tabula-extractor
  • The command line app now uses the new detector when the -g argument is passed (basic fix for guess option is not working in tabula-java #49)
  • Adds ICDAR 2013 ground truth documents for testing purposes (I know there was a new project set up mentioned in Write tests for the ICDAR 2013 groundtruth dataset #51 but maybe we can migrate the tests over there later?)
  • Adds (currently ignored) tests for testing table detection algorithms (2 out of 67 tests currently pass!)

This should provide a basis for people to start contributing and evaluating different table detection algorithms

@jazzido
Copy link
Contributor

jazzido commented Dec 15, 2015

Fantastic, @mcharters! Thanks a LOT. Some comments:

@mcharters
Copy link
Contributor Author

Sure, I'll add a test for the command line app.

My tests don't test the full "truth" of the IDCAR data - only table detection. Table extraction/verification is a whole other can of worms. Dunno if you want to keep #51 around until a full test suite is implemented?

@jazzido
Copy link
Contributor

jazzido commented Dec 15, 2015

My tests don't test the full "truth" of the IDCAR data - only table detection. Table extraction/verification is a whole other can of worms. Dunno if you want to keep #51 around until all a full test suite is implemented?

You're right, spoke too soon. @melisabok, it's up to you if you want to use @mcharters's code as a starting point or keep working on the Scala project.

@melisabok
Copy link
Contributor

I think we can keep working in what @mcharters did, it is not necessary to do it in scala we can continue with java. You can commit it here or in the new repo, is the same for me.

@melisabok
Copy link
Contributor

Why don't move the SpreadsheetDetectionAlgorithm behavior into the SpreadsheetExtractionAlgorithm extract method. And the extract method decide if it has to detect or not the tables.
I noticed that the extract method is calling to findCells too, so we could refactor it and call it just once. What do you think?

@mcharters very nice work! if you want you can commit the code related to the test suite here(replacing the scala code): https://github.com/tabulapdf/icdar-testsuite, and then we can improve the tests to get the full 'truth', What do you think?

@mcharters
Copy link
Contributor Author

@melisabok from briefly reading over some of the papers @jazzido linked to, it seems to me like there are two steps to getting tables from PDFs - first detecting table regions and then extracting the data from those regions. I thought it would be useful to keep them logically separate, that way you could mix and match if you wanted.

In fact, if you look at the code that gets run when you pass -g on the command line, I'm doing just that: the SpreadsheetDetectionAlgorithm finds the tables and then the BasicExtractionAlgorithm is used to get data from the page region - the extract method of the SEA never gets called.

In fact maybe the opposite should be true: the logic in the SpreadsheetExtractionAlgorithm that relates to table detection should maybe get moved entirely to SpreadsheetDetectionAlgorithm and then get called from the SEA. But I'm just getting to know the code so I'm not sure if that's feasible right now.

Matt Charters added 2 commits December 17, 2015 12:32
@jazzido
Copy link
Contributor

jazzido commented Dec 20, 2015

Hi @mcharters, quick question: How to run the ICDAR tests? mvn test seems to skip them over.

@mcharters
Copy link
Contributor Author

Hi @jazzido, I tagged the tests with @ignore because so many of them were failures. I didn't know if having a bunch of failing tests would mess up integration testing.

To run them just comment out the @ignore tag and then mvn test will work. Note that there's also a bunch of debug noise in the tests currently that I was using to make sure my test implementation was right. :)

@jazzido
Copy link
Contributor

jazzido commented Dec 21, 2015

Good call. In any case, expecting 0% failures would be unrealistic for the ICDAR dataset, so I guess we should not assert() things. Instead, we should emit some kind of report that we can use to track progress/regressions with the detection algorithms.

@mcharters
Copy link
Contributor Author

Sounds like a plan. I suppose to start we could track a baseline of failures to make sure changes don’t make more failures happen and alert us when things improve.

On Dec 21, 2015, at 10:54 AM, Manuel Aristarán notifications@github.com wrote:

Good call. In any case, expecting 0% failures would be unrealistic for the ICDAR dataset, so I guess we should not assert() things. Instead, we should emit some kind of report that we can use to track progress/regressions with the detection algorithms.


Reply to this email directly or view it on GitHub #53 (comment).

@jazzido
Copy link
Contributor

jazzido commented Dec 21, 2015

I opened #54 to track development.

@jazzido jazzido merged commit 282404d into tabulapdf:master Feb 2, 2016
@mcharters mcharters deleted the add-table-detection-tests branch February 2, 2016 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants