Add table detection tests and a basic table detector for guessing table regions #53

mcharters · 2015-12-15T20:59:27Z

The changes on this branch do a bunch of stuff:

Creates a new "detectors" package in tabula for table detection algorithms
Implements a SpreadsheetDetectionAlgorithm there to replicate the simple table detection from tabula (web) and tabula-extractor
The command line app now uses the new detector when the -g argument is passed (a basic fix for "guess option is not working in tabula-java", #49)
Adds ICDAR 2013 ground truth documents for testing purposes (I know a new project was set up, as mentioned in "Write tests for the ICDAR 2013 groundtruth dataset", #51, but maybe we can migrate the tests over there later?)
Adds (currently ignored) tests for testing table detection algorithms (2 out of 67 tests currently pass!)

This should provide a basis for people to start contributing and evaluating different table detection algorithms
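To give a rough idea of what a detection test can check, here is a minimal sketch that scores detected regions against ground-truth regions by area overlap. This is an assumption for illustration: the thread doesn't say how the tests on this branch compare regions, and RegionOverlapSketch, overlapRatio, countMatched and the threshold parameter are hypothetical names, not tabula code.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of tabula-java: scores detected table regions
// against ground-truth regions by how much of their area overlaps.
final class RegionOverlapSketch {

    // Intersection area divided by union area, in the range [0, 1].
    static double overlapRatio(Rectangle2D a, Rectangle2D b) {
        Rectangle2D inter = a.createIntersection(b);
        // For disjoint rectangles the result has a negative width or height.
        if (inter.getWidth() <= 0 || inter.getHeight() <= 0) {
            return 0.0;
        }
        double interArea = inter.getWidth() * inter.getHeight();
        double unionArea = a.getWidth() * a.getHeight()
                + b.getWidth() * b.getHeight() - interArea;
        return unionArea <= 0 ? 0.0 : interArea / unionArea;
    }

    // Counts ground-truth regions that at least one detected region covers
    // with an overlap ratio of at least the given threshold.
    static int countMatched(List<Rectangle2D> truth, List<Rectangle2D> detected,
                            double threshold) {
        int matched = 0;
        for (Rectangle2D t : truth) {
            for (Rectangle2D d : detected) {
                if (overlapRatio(t, d) >= threshold) {
                    matched++;
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up coordinates, for illustration only.
        Rectangle2D truth = new Rectangle2D.Double(0, 0, 10, 10);
        Rectangle2D detected = new Rectangle2D.Double(5, 0, 10, 10);
        System.out.println(overlapRatio(truth, detected));
        System.out.println(countMatched(Arrays.asList(truth), Arrays.asList(detected), 0.3));
    }
}
```

A ratio like this keeps the check independent of any particular detection algorithm, so different detectors can be scored the same way.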

jazzido · 2015-12-15T21:18:57Z

Fantastic, @mcharters! Thanks a LOT. Some comments:

I guess we can consider that your contribution closes "Write tests for the ICDAR 2013 groundtruth dataset" (#51). @melisabok, unless you really want to keep working on the Scala version, there's no need to do it.
Can you add a test for the -g flag in TestCommandLineApp.java?
At some point, it would be awesome to have a report that shows how the detection accuracy evolves over time, à la PyPy Speed Center — I'll try to give it a shot over the holiday break.

mcharters · 2015-12-15T21:23:18Z

Sure, I'll add a test for the command line app.

My tests don't test the full "truth" of the ICDAR data - only table detection. Table extraction/verification is a whole other can of worms. Dunno if you want to keep #51 around until a full test suite is implemented?

jazzido · 2015-12-15T21:27:45Z

My tests don't test the full "truth" of the ICDAR data - only table detection. Table extraction/verification is a whole other can of worms. Dunno if you want to keep #51 around until a full test suite is implemented?

You're right, spoke too soon. @melisabok, it's up to you if you want to use @mcharters's code as a starting point or keep working on the Scala project.

melisabok · 2015-12-15T21:32:01Z

I think we can keep working on what @mcharters did. It isn't necessary to do it in Scala; we can continue with Java. You can commit it here or in the new repo, it's the same to me.

melisabok · 2015-12-15T23:51:36Z

Why not move the SpreadsheetDetectionAlgorithm behavior into the SpreadsheetExtractionAlgorithm extract method, and let the extract method decide whether it has to detect the tables or not?
I noticed that the extract method also calls findCells, so we could refactor it and call it just once. What do you think?

@mcharters very nice work! If you want, you can commit the code related to the test suite here (replacing the Scala code): https://github.com/tabulapdf/icdar-testsuite, and then we can improve the tests to get the full 'truth'. What do you think?

mcharters · 2015-12-16T01:46:21Z

@melisabok from briefly reading over some of the papers @jazzido linked to, it seems to me like there are two steps to getting tables from PDFs - first detecting table regions and then extracting the data from those regions. I thought it would be useful to keep them logically separate, that way you could mix and match if you wanted.

In fact, if you look at the code that gets run when you pass -g on the command line, I'm doing just that: the SpreadsheetDetectionAlgorithm finds the tables and then the BasicExtractionAlgorithm is used to get data from the page region - the extract method of the SEA never gets called.

In fact maybe the opposite should be true: the logic in the SpreadsheetExtractionAlgorithm that relates to table detection should maybe get moved entirely to SpreadsheetDetectionAlgorithm and then get called from the SEA. But I'm just getting to know the code so I'm not sure if that's feasible right now.
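To make the "mix and match" idea concrete, here's a minimal sketch of keeping the two steps behind separate interfaces. All names in it (DetectThenExtractSketch, Detector, Extractor, detectThenExtract, commaLines, splitRow) are hypothetical and for illustration only; they are not the tabula classes, and the toy "page" is just a list of text lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration only; none of these names come from tabula-java.
final class DetectThenExtractSketch {

    // Step 1: find candidate table regions on a page.
    interface Detector<P, R> {
        List<R> detect(P page);
    }

    // Step 2: pull the data out of one region of a page.
    interface Extractor<P, R, T> {
        T extract(P page, R region);
    }

    // Runs any detector together with any extractor; neither knows about the other.
    static <P, R, T> List<T> detectThenExtract(P page, Detector<P, R> detector,
                                               Extractor<P, R, T> extractor) {
        List<T> tables = new ArrayList<>();
        for (R region : detector.detect(page)) {
            tables.add(extractor.extract(page, region));
        }
        return tables;
    }

    // Toy detector: a "region" is the index of a line that contains a comma.
    static List<Integer> commaLines(List<String> page) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < page.size(); i++) {
            if (page.get(i).contains(",")) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Toy extractor: splits the line at the given index into cells.
    static List<String> splitRow(List<String> page, Integer lineIndex) {
        return Arrays.asList(page.get(lineIndex).split(","));
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("Title", "a,b,c", "d,e,f");
        System.out.println(detectThenExtract(page,
                DetectThenExtractSketch::commaLines,
                DetectThenExtractSketch::splitRow));
    }
}
```

Swapping in a different detector or extractor only changes the arguments passed to detectThenExtract, which is the separation I'm after.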

jazzido · 2015-12-20T20:17:07Z

Hi @mcharters, quick question: how do I run the ICDAR tests? mvn test seems to skip them.

mcharters · 2015-12-21T15:50:30Z

Hi @jazzido, I tagged the tests with @Ignore because so many of them were failing. I didn't know if having a bunch of failing tests would mess up integration testing.

To run them, just comment out the @Ignore annotation and then mvn test will work. Note that there's also a bunch of debug noise in the tests currently that I was using to make sure my test implementation was right. :)

jazzido · 2015-12-21T15:54:07Z

Good call. In any case, expecting 0% failures would be unrealistic for the ICDAR dataset, so I guess we should not assert() things. Instead, we should emit some kind of report that we can use to track progress/regressions with the detection algorithms.

mcharters · 2015-12-21T16:00:15Z

Sounds like a plan. I suppose to start we could track a baseline of failures to make sure changes don’t make more failures happen and alert us when things improve.
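Something like this is what I have in mind for the baseline check: a minimal sketch, assuming each test run yields the set of documents whose detection test failed. FailureBaselineSketch, regressions and improvements are hypothetical names and the document names are made up; how the baseline would actually be stored is left open.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of tabula-java: compares the documents whose
// detection test fails now against a stored baseline of known failures.
final class FailureBaselineSketch {

    // Failing now but not in the baseline: a change made things worse.
    static Set<String> regressions(Set<String> baseline, Set<String> current) {
        Set<String> result = new TreeSet<>(current);
        result.removeAll(baseline);
        return result;
    }

    // In the baseline but passing now: the baseline can be tightened.
    static Set<String> improvements(Set<String> baseline, Set<String> current) {
        Set<String> result = new TreeSet<>(baseline);
        result.removeAll(current);
        return result;
    }

    public static void main(String[] args) {
        // Made-up document names, for illustration only.
        Set<String> baseline = new TreeSet<>(Arrays.asList("doc-a", "doc-b", "doc-c"));
        Set<String> current = new TreeSet<>(Arrays.asList("doc-b", "doc-c", "doc-d"));
        System.out.println("regressions:  " + regressions(baseline, current));
        System.out.println("improvements: " + improvements(baseline, current));
    }
}
```

A build would then only need to fail when the regressions set is non-empty, while the improvements set tells us when to update the baseline.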

jazzido · 2015-12-21T16:08:44Z

I opened #54 to track development.

Matt Charters added 8 commits December 14, 2015 14:52

Add icdar competition dataset to test resources

1219e6c

Create a test that reads in the icdar xml data

38456e9

Create a new package for table detection

0728d9c

Comments

f95d6cb

Add a test to compare ground truth regions to regions detected by the simple spreadsheet algorithm

1ca54ef

Ignore table detection tests until better algorithms are created

af9508a

Ignore IntelliJ project files

e3804e8

Use the new spreadsheet table detector for the -g command line option

03b83cb

mcharters mentioned this pull request Dec 15, 2015

Write tests for the ICDAR 2013 groundtruth dataset #51

Open

Add a test for the -g command line option

729acd3

Matt Charters added 2 commits December 17, 2015 12:32

Test case bug: only keep track of pages that actually contain detected tables - this results in a few more tests passing

9123689

Reuse variable

282404d

jazzido mentioned this pull request Dec 21, 2015

Table detection tests dashboard #54

Open

jazzido merged commit 282404d into tabulapdf:master Feb 2, 2016

mcharters deleted the add-table-detection-tests branch February 2, 2016 20:35
