Row with data at top and bottom of different cells become more than one row of text #412

rkiddy · 2021-03-31T23:18:22Z

I am trying to interpret files like:
https://www.sccgov.org/sites/proc/DoingBusinesswiththeCounty/Documents/Contracts%20Report%20for%20Month%20of%20November%202019.pdf

Here is a small screenshot:
https://opencalaccess.org/img/Screen_Shot_2021-03-31_at_4.00.22_PM.png

See the 4th and 8th row in the pic.

If you have a pdf with:

-----------------------------------------------
|                 |     AAAAA   |    BBBBBB    |
|                 |             |              |
|   CCCCC         |             |              |
-----------------------------------------------

This is one row with three cells. I would like to get:

CCCCC tab AAAAA tab BBBBBB eol

What I get is actually:

"" tab AAAAA tab BBBBBB eol
CCCCC tab tab eol

I have forked and will check out the source and see about finding a minimal test case. And I will try to determine whether this is a duplicate bug or not.

cheers - ray

rkiddy · 2021-04-01T01:55:55Z

I think that this may be an error on my part. I was using --stream and I think I should have been using --lattice for these reports. Will confirm.

rkiddy · 2021-04-01T18:07:48Z

Even with --lattice there is a problem.

See the pic of one row at https://opencalaccess.org/img/Screen_Shot_2021-04-01_at_10.53.54_AM.png

This is of the form:

-------------------------------------------------
|          |           |   AAAA     |   BBBBB   |
|          |           |   CCCCC    |           |
|  DDDDD   |   EEEEEE  |   FFFFFF   |           |
-------------------------------------------------

There are different ways this could be interpreted. What happens, though, is:

DDDDD tab EEEEEE tab AAAA tab eol
CCCCC eol
FFFFFF tab BBBBB eol

As I say, there are different ways this could be interpreted, but this does not make sense.

Whatever comes out, it should be possible to determine that AAAA, CCCCC, and FFFFFF are in the same column and that BBBBB is in its own column. The output here shows neither.

rkiddy · 2021-04-13T03:04:57Z

Well, now I am more confused. I am looking at the TestSpreadsheetExtractor class, and specifically the testSpreadsheetExtractionIssue656 method.

It appears that

--------------------
| AAAAA   |        |
| BBBBB   | CCCCC  |
--------------------

becomes:

AAAAA eol
BBBBB tab CCCCC eol

There seems to be no way to differentiate the eol that is inside the column header from the eol that is at the end of the line.

How is that supposed to work? One would never be able to tell which part of the output goes into which column. It seems wrong.

rkiddy · 2021-04-13T07:20:11Z

Pilot error. I was reading the file in with the java.util.Scanner class, which breaks on both \n and \r. So, I did not see that the "in cell" values were only separated by a \r and the lines were separated by a \n. Very good.
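The difference can be checked with a small standalone sketch. It assumes, as described in this comment, that cells are separated by tabs, that a break inside a cell is a bare \r, and that each row ends with \n. The sample string is made up, `CellBreakDemo`, `scannerLines`, and `parseRows` are hypothetical names, and quoted cells are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class CellBreakDemo {
    // Scanner.nextLine() treats a lone '\r' as a line terminator,
    // so an in-cell break looks the same as the end of a row.
    static List<String> scannerLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNextLine()) {
                lines.add(sc.nextLine());
            }
        }
        return lines;
    }

    // Split rows on '\n' only and cells on '\t', so a '\r' stays inside its cell.
    // Assumption: no quoting or escaping in the input.
    static List<List<String>> parseRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\n")) {
            // The -1 limit keeps trailing empty cells.
            rows.add(Arrays.asList(line.split("\t", -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "AAAAA" over "BBBBB" in one cell, "CCCCC" in the next.
        String sample = "AAAAA\rBBBBB\tCCCCC\n";
        System.out.println("Scanner lines: " + scannerLines(sample).size());
        System.out.println("Rows split on \\n: " + parseRows(sample).size());
        for (List<String> row : parseRows(sample)) {
            for (String cell : row) {
                // Show the in-cell break as " / " so it is visible on one line.
                System.out.println("[" + cell.replace("\r", " / ") + "]");
            }
        }
    }
}
```

With this sample, the Scanner sees two lines while splitting on \n alone yields one row of two cells, the first of which still contains the \r.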

If this could have been documented somehow, that would have been great, but oh well.

jeremybmerrill · 2021-04-14T13:37:56Z

@rkiddy Glad you figured it out! If there's documentation we're missing, would you be willing to suggest what to write and where to put it (i.e. where you first sought the answer you eventually learned the hard way) and submit that as a PR? It'd be a big help and -- as you can perhaps imagine, as a maintainer, I'm pretty bad at writing documentation for software whose quirks, nuances and assumptions are burned into my brain.

rkiddy · 2021-05-01T00:16:40Z

Do you have a place for documentation, other than your README?

I can create a doc folder and put something into it, unless you get to it first.

jeremybmerrill · 2021-05-04T21:39:16Z

A code comment might be best, or someplace where you were originally looking for an answer to this question. (Might be the readme; there's no other place for documentation that I'm aware of.)

rkiddy · 2021-05-04T22:15:12Z

If you do not like having a doc folder, I am not sure what to suggest. The wiki could be a good place but it points to the install site and the code. The install site does not look like a better place to have documentation, especially if it does not allow other contributors to help. Can I try adding to the wiki? This will be done via a pull request, so any change can be looked at and/or backed out.

jeremybmerrill · 2021-05-06T13:31:48Z

sure! sounds good to me.

rkiddy closed this as completed Apr 13, 2021
