Implement Bentley-Ottmann algorithm for detecting grids #16

jazzido · 2013-07-18T21:09:55Z

If we detect that a set of horizontal and vertical lines form a grid, we can construct the set of cells using Bentley-Ottmann.

See:

jazzido · 2013-11-20T22:27:47Z

When ready, we should replace @jeremybmerrill 's TableGuesser.find_tables with Bentley-Ottman.

jazzido · 2013-11-23T03:04:23Z

Instead of implementing Bentley, let's do it naively (O(n^2)) and see what happens

jazzido · 2013-11-25T19:56:23Z

See this algorithm in Anssi Nurminen's master thesis:

array foundRectangles;
//All crossing-points have been sorted from up to down,
//and left to right in ascending order
Loop for each crossing-point:
{
   topLeft = NextCrossingPoint();
   //Fetch all points on the same vertical and horizontal
   //line with current crossing point
   array x_points = CrossingPointsDirectlyBelow( topLeft);
   array y_points = CrossingPointsDirectlyToTheRight( topLeft);
   Loop for each point x_point in x_points:
{
                               //Skip to next crossing-point
if( NOT EdgeExistsBetween( topLeft, x_point)) next crossing-
                                                   point;
Loop for each point y_point in y_points:
{
   if( NOT EdgeExistsBetween( topLeft, y_point)) next crossing-
                                                      point;
   //Hypothetical bottom right point of rectangle
   btmRight = Point( y_point.x(), x_point.y());
   if( CrossingPointExists( btmRight) AND
       EdgeExistsBetween( x_point, btmRight) AND
       EdgeExistsBetween( y_point, btmRight))
{
} }
//Rectangle is confirmed to have 4 sides
foundRectangles.append( Rectangle( topLeft, btmRight);
//Each crossing point can be the top left corner
//of only a single rectangle
next crossing-point;
} }

jeremybmerrill · 2013-11-26T17:26:46Z

cf this too http://www.drdobbs.com/database/the-maximal-rectangle-problem/184410529 and "convex hulls"

Whatever I/we write that more efficiently transforms a collection of lines into (a) the tables and (b) their constituent cells should take the place of TableGuesser.find_tables and Spreadsheet.new

jeremybmerrill · 2013-11-26T17:29:58Z

http://cgm.cs.mcgill.ca/~athens/cs507/Projects/2003/DanielSud/

jeremybmerrill · 2013-11-26T23:31:29Z

I would note that it might be better to detect cells (i.e. minimal rectangles) first, then piece together a Spreadsheet from that. We could then more elegantly deal with weirdly shaped tables.

Started a little bit implementing the nurminen algorithm in rectalgo branch; feel free to work on it if you want, don't if you don't.

jazzido · 2013-11-27T04:41:53Z

awesome

jeremybmerrill · 2013-12-03T03:25:44Z

Nurminen algo implemented in rectalgo and merged into spreadsheets, which is now up to date with text-extractor-refactor. I need to check if it, in fact, copes better with weirdly shaped tables, which I'll try to do tomorrow at work. If it does, then we can close this.

jazzido · 2013-12-03T03:53:58Z

WHOOOOOOO!

Manuel Aristarán
http://jazzido.com

On Tue, Dec 3, 2013 at 12:25 AM, Jeremy B. Merrill <notifications@github.com

wrote:

Nurminen algo implemented. I need to check if it, in fact, copes better
with weirdly shaped tables, which I'll try to do tomorrow at work. If it
does, then we can close this.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-29680702
.

jeremybmerrill · 2013-12-03T16:44:28Z

Oh, I realized I still have to write the algorithm to transform cells (minimal rectangles) into Spreadsheets (maximal rectangles or just Areas ). Do you know of any smart implementations of this anywhere? Otherwise I'll just write something naive that hopefully works.

jeremybmerrill · 2013-12-03T17:10:14Z

Probably gonna implement this one: http://stackoverflow.com/questions/13746284/merging-multiple-adjacent-rectangles-into-one-polygon

but may also refer to: http://www.cs.mcgill.ca/~cs644/Godfried/2005/Fall/sdroui9/p4_algorithm.html

jeremybmerrill · 2013-12-03T23:26:33Z

hascells branch works on the China PDF I sent in email last week. That branch finds cells first, then builds a Spreadsheet out of that. Note that this is just a proof of concept and needs a lot of efficiency work and getting rid of silly hacks (especially around nearly intersecting lines).

See more details in the commit: ccbf671

jeremybmerrill · 2013-12-03T23:38:04Z

Oh, and rectalgo is obsolete and can be deleted. It's merged into spreadsheets

jazzido · 2013-12-04T02:06:57Z

Awesome, Jeremy. Thanks!

Can you write a short test method when you have the chance? (I'd like to figure out how to implement the spreadsheet extractor in the UI workflow)

jeremybmerrill · 2013-12-04T03:06:40Z

Yeah, totally. I'll write the test tomorrow when I get to work. I think it'll work pretty well everywhere, but my tests so far are pretty limited.

As far as actually using it is concerned, most of the work is done for free. There's efficiency gains to be had (e.g. lazy table-extraction, etc.), but the simplest script looks just like this:

pdf_file_path = "whatever.pdf"
out = open("out.csv", 'w')
extractor = Tabula::Extraction::SpreadsheetExtractor.new(pdf_file_path, :all )
extractor.extract.each do |pdf_page, spreadsheet|
  out << spreadsheet.to_csv
  out << "\n\n"
end
out.close

jazzido · 2013-12-04T03:33:16Z

BTW, O'Rourke's algorithm is beautiful.

jeremybmerrill · 2013-12-04T03:39:43Z

that McGill demo, in particular, is pretty cool.

today was one of those days when I had Ruby, Java, JRuby and Python docs open all at once trying to convert the Python example into (J)Ruby.

As I said, the implementation is a little hairy because of the nearly-intersecting-lines problem, but, hey, it works for now.

jazzido · 2013-12-04T03:40:45Z

today was one of those days when I had Ruby, Java, JRuby and Python docs open all at once trying to convert the Python example into (J)Ruby.

Aren't those the best days? ;)

jazzido · 2013-12-04T04:08:09Z

Idea: now that we're getting serious with computational geometry algos, we should start to consider snap rounding for lines.

jeremybmerrill · 2013-12-04T04:11:47Z

This'd fix #38?

On Tue, Dec 3, 2013 at 11:08 PM, Manuel Aristarán
notifications@github.comwrote:

Idea: now that we're getting serious with computational geometry algos, we
should start to consider snap roundinghttp://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.220for lines.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-29777555
.

jazzido · 2013-12-04T04:15:56Z

It might, not sure.

Also, if we're going to snap-round lines we need to determine the size of the "pixel" first. If I'm not mistaken, an upper bound to that value would be the (average?) witdh of the lines present in our area of interest. We would need to track that value, we don't do it now.

jeremybmerrill · 2013-12-07T00:01:16Z

Tests written in: d94ae3f

bin/tabula modified to use only SpreadsheetExtractor in: 7f1f000

jazzido · 2013-12-07T00:04:35Z

bin/tabula modified to use only SpreadsheetExtractor in: 7f1f0007f1f000

So SpreadsheetExtractor also works for non-ruled tables?

jeremybmerrill · 2013-12-07T00:06:24Z

Nope. That would be a good reason to retain the old stuff in bin/tabula.... :P

jazzido · 2013-12-07T00:08:11Z

LOL.

Ok, so I guess we should have a ---spreadsheet flag that forces using the new extractor. And a way of detecting spreadsheet-y tables too.

jeremybmerrill · 2013-12-07T00:08:38Z

yeah

working on that now

jeremybmerrill · 2013-12-07T00:18:42Z

better now... fe88cb4

didn't write the heuristic. I haven't profiled the spreadsheets stuff; if it's quick enough, we can just do all/most of that work, see if there are a handful of Spreadsheet objects that were detected, and if so, use the SpreadsheetExtractor method.

jazzido · 2013-12-07T00:50:57Z

I'm going to merge spreadsheets into text-extractor-refactor and delete it....let's keep working on —or branching— the latter.

What do you think?

jeremybmerrill · 2013-12-07T05:14:24Z

Fine by me. Only objection is that text-extractor-refactor is a pain in the ass to type (all those left-hand letters!)... maybe a diff name?

jazzido · 2013-12-08T16:25:53Z

Aight, new merge candidate is now pre07.

Closing this one.

jeremybmerrill mentioned this issue Oct 11, 2013

Put multi-line cell content into a single cell #23

Open

jazzido mentioned this issue Dec 4, 2013

Store line width in Rulings #40

Open

jazzido closed this as completed Dec 8, 2013

jeremybmerrill mentioned this issue Dec 8, 2013

heuristic to differentiate spreadsheet/normal algorithm use #43

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Bentley-Ottmann algorithm for detecting grids #16

Implement Bentley-Ottmann algorithm for detecting grids #16

jazzido commented Jul 18, 2013

jazzido commented Nov 20, 2013

jazzido commented Nov 23, 2013

jazzido commented Nov 25, 2013

jeremybmerrill commented Nov 26, 2013

jeremybmerrill commented Nov 26, 2013

jeremybmerrill commented Nov 26, 2013

jazzido commented Nov 27, 2013

jeremybmerrill commented Dec 3, 2013

jazzido commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 4, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 4, 2013

jazzido commented Dec 4, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 4, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 8, 2013

Implement Bentley-Ottmann algorithm for detecting grids #16

Implement Bentley-Ottmann algorithm for detecting grids #16

Comments

jazzido commented Jul 18, 2013

jazzido commented Nov 20, 2013

jazzido commented Nov 23, 2013

jazzido commented Nov 25, 2013

jeremybmerrill commented Nov 26, 2013

jeremybmerrill commented Nov 26, 2013

jeremybmerrill commented Nov 26, 2013

jazzido commented Nov 27, 2013

jeremybmerrill commented Dec 3, 2013

jazzido commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jeremybmerrill commented Dec 3, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 4, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 4, 2013

jazzido commented Dec 4, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 4, 2013

jazzido commented Dec 4, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 7, 2013

jeremybmerrill commented Dec 7, 2013

jazzido commented Dec 8, 2013