Row with data at top and bottom of different cells become more than one row of text #412

Closed
rkiddy opened this issue Mar 31, 2021 · 9 comments


@rkiddy

rkiddy commented Mar 31, 2021

I am trying to interpret files like:
https://www.sccgov.org/sites/proc/DoingBusinesswiththeCounty/Documents/Contracts%20Report%20for%20Month%20of%20November%202019.pdf

Here is a small screenshot:
https://opencalaccess.org/img/Screen_Shot_2021-03-31_at_4.00.22_PM.png

See the 4th and 8th row in the pic.

If you have a pdf with:

-----------------------------------------------
|                 |     AAAAA   |    BBBBBB    |
|                 |             |              |
|   CCCCC         |             |              |
-----------------------------------------------

This is one row with three cells. I would like to get:

CCCCC tab AAAAA tab BBBBBB eol

What I get is actually:

"" tab AAAAA tab BBBBBB eol
CCCCC tab tab eol

I have forked and will check out the source and see about finding a minimal test case. And I will try to determine whether this is a duplicate bug or not.

cheers - ray

@rkiddy
Author

rkiddy commented Apr 1, 2021

I think that this may be an error on my part. I was using --stream and I think I should have been using --lattice for these reports. Will confirm.

@rkiddy
Author

rkiddy commented Apr 1, 2021

Even with --lattice there is a problem.

See the pic of one row at https://opencalaccess.org/img/Screen_Shot_2021-04-01_at_10.53.54_AM.png

This is of the form:

-------------------------------------------------
|          |           |   AAAA     |   BBBBB   |
|          |           |   CCCCC    |           |
|  DDDDD   |   EEEEEE  |   FFFFFF   |           |
-------------------------------------------------

There are different ways this could be interpreted. What happens, though, is:

DDDDD tab EEEEEE tab AAAA tab eol
CCCCC eol
FFFFFF tab BBBBB eol

As I say, there are different ways this could be interpreted, but this does not make sense.

Whatever comes out, it should be determinable that AAAA, CCCCC, and FFFFFF are in the same column and that BBBBB is in its own column. The data coming out here shows neither.
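To make the mismatch concrete, here is a minimal sketch that counts tab-separated fields on each output line. The class and method names (`FieldCountCheck`, `fieldCounts`, `isRectangular`) are hypothetical and not part of tabula-java. It assumes the output has already been read into one string per line, with `tab` standing for a literal `\t`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tabula-java.
class FieldCountCheck {
    // Number of tab-separated fields on each line; the -1 limit keeps
    // trailing empty fields, so "a\tb\t" counts as three fields.
    static List<Integer> fieldCounts(List<String> lines) {
        List<Integer> counts = new ArrayList<>();
        for (String line : lines) {
            counts.add(line.split("\t", -1).length);
        }
        return counts;
    }

    // A table is rectangular if every line has the same number of fields.
    static boolean isRectangular(List<String> lines) {
        return fieldCounts(lines).stream().distinct().count() <= 1;
    }
}
```

For the three lines shown above, `fieldCounts` gives 4, 1, and 2, so `isRectangular` returns false. That is the sense in which the columns cannot be recovered from the output as read.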

@rkiddy
Author

rkiddy commented Apr 13, 2021

Well, now I am more confused. I am looking at the TestSpreadsheetExtractor class, and specifically the testSpreadsheetExtractionIssue656 method.

It appears that

--------------------
| AAAAA   |        |
| BBBBB   | CCCCC  |
--------------------

becomes:

AAAAA eol
BBBBB tab CCCCC eol

There seems to be no way to differentiate the eol that is inside the column header from the eol that is at the end of the line.

How is that supposed to work? One would never be able to tell which part of the output goes into which column. It seems wrong.

@rkiddy
Author

rkiddy commented Apr 13, 2021

Pilot error. I was reading the file in with the java.util.Scanner class, which breaks on both \n and \r. So, I did not see that the "in cell" values were only separated by a \r and the lines were separated by a \n. Very good.

If this could have been documented somehow, that would have been great, but oh well.
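For anyone who hits the same thing, here is a minimal sketch of the difference. It assumes, as described above, that rows end with `\n`, cells are separated by `\t`, and lines inside a cell are separated by a lone `\r`. The class and method names are hypothetical and not part of tabula-java, and the sample string is made up to match the two-cell header example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of tabula-java.
class CellLineBreaks {
    // Scanner.nextLine() treats a lone \r as a line terminator, so a
    // multi-line cell is split into what look like separate rows.
    static List<String> readWithScanner(String output) {
        List<String> lines = new ArrayList<>();
        try (Scanner sc = new Scanner(output)) {
            while (sc.hasNextLine()) {
                lines.add(sc.nextLine());
            }
        }
        return lines;
    }

    // Split rows on \n only, then cells on \t, leaving \r inside cells.
    // Empty lines are skipped, which is an assumption of this sketch.
    static List<String[]> readRows(String output) {
        List<String[]> rows = new ArrayList<>();
        for (String row : output.split("\n")) {
            if (!row.isEmpty()) {
                rows.add(row.split("\t", -1));
            }
        }
        return rows;
    }

    // Lines of text inside a single cell.
    static String[] linesInCell(String cell) {
        return cell.split("\r", -1);
    }
}
```

With the input `"AAAAA\rBBBBB\tCCCCC\n"`, `readWithScanner` returns two lines, while `readRows` returns one row of two cells and `linesInCell` on the first cell returns `AAAAA` and `BBBBB`. `BufferedReader.readLine()` also treats a lone `\r` as a line end, so it has the same pitfall.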

@rkiddy rkiddy closed this as completed Apr 13, 2021
@jeremybmerrill
Member

@rkiddy Glad you figured it out! If there's documentation we're missing, would you be willing to suggest what to write and where to put it (i.e. where you first sought the answer you eventually learned the hard way) and submit that as a PR? It'd be a big help and -- as you can perhaps imagine, as a maintainer, I'm pretty bad at writing documentation for software whose quirks, nuances and assumptions are burned into my brain.

@rkiddy
Author

rkiddy commented May 1, 2021

Do you have a place for documentation, other than your README?

I can create a doc folder and put something into it, unless you get to it first.

@jeremybmerrill
Member

A code comment might be best, or someplace where you were originally looking for an answer to this question. (Might be the README; there's no other place for documentation that I'm aware of.)

@rkiddy
Author

rkiddy commented May 4, 2021

If you do not like having a doc folder, I am not sure what to suggest. The wiki could be a good place but it points to the install site and the code. The install site does not look like a better place to have documentation, especially if it does not allow other contributors to help. Can I try adding to the wiki? This will be done via a pull request, so any change can be looked at and/or backed out.

@jeremybmerrill
Member

sure! sounds good to me.
