Row with data at top and bottom of different cells become more than one row of text #412
Comments
I think that this may be an error on my part. I was using --stream and I think I should have been using --lattice for these reports. Will confirm.
Even with --lattice there is a problem. See the pic of one row at https://opencalaccess.org/img/Screen_Shot_2021-04-01_at_10.53.54_AM.png This is of the form:
There are different ways this could be interpreted. What actually comes out, though, is:
DDDDD tab EEEEEE tab AAAA tab eol
As I say, there are different ways this could be interpreted, but this output does not make sense. Whatever comes out, it should be possible to determine that AAAA, CCCCC, and FFFFF are in the same column and that BBBBB is in its own column, and the output here shows neither.
Well, now I am more confused. I am looking at the TestSpreadsheetExtractor class, and specifically the testSpreadsheetExtractionIssue656 method. It appears that
becomes:
AAAAA eol
There seems to be no way to differentiate the eol that is inside the column header from the eol that is at the end of the line. How is that supposed to work? One would never be able to tell which part of the output goes into which column. It seems wrong.
Pilot error. I was reading the file in with the java.util.Scanner class, which breaks on both \n and \r. So I did not see that the "in cell" values were separated only by a \r while the lines were separated by a \n. Very good. If this could have been documented somehow, that would have been great, but oh well.
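For anyone who hits the same thing, here is a minimal sketch of the difference. It assumes, as described in the comment above, that values inside one cell are separated by \r and rows by \n. The sample string, the class name RowSplitDemo, and both helper methods are made up for illustration and are not part of tabula-java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; not part of tabula-java.
class RowSplitDemo {

    // Reads "lines" with Scanner.nextLine(), which treats a lone \r
    // as a line terminator, so a multi-line cell is split apart.
    static List<String> splitWithScanner(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    // Splits rows on \n only, then cells on tab, so any \r stays
    // inside the cell it belongs to.
    static List<String[]> splitRows(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String row : text.split("\n")) {
            rows.add(row.split("\t", -1)); // -1 keeps trailing empty cells
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample: two rows, first cell of row one holds two lines.
        String sample = "AAAAA\rCCCCC\tBBBBB\nDDDDD\tEEEEE\n";
        System.out.println("Scanner lines: " + splitWithScanner(sample).size());
        System.out.println("Rows split on \\n: " + splitRows(sample).size());
    }
}
```

With the Scanner approach the in-cell break is indistinguishable from a row break, which is the confusion described above. Splitting on \n explicitly keeps the \r inside the cell, where it can be replaced with a space or kept as needed.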
@rkiddy Glad you figured it out! If there's documentation we're missing, would you be willing to suggest what to write and where to put it (i.e. where you first sought the answer you eventually learned the hard way) and submit that as a PR? It'd be a big help and -- as you can perhaps imagine, as a maintainer, I'm pretty bad at writing documentation for software whose quirks, nuances and assumptions are burned into my brain.
Do you have a place for documentation, other than your README? I can create a doc folder and put something into it, unless you get to it first.
A code comment might be best, or wherever you were originally looking for an answer to this question. (Might be the README; there's no other place for documentation that I'm aware of.)
If you do not like having a doc folder, I am not sure what to suggest. The wiki could be a good place but it points to the install site and the code. The install site does not look like a better place to have documentation, especially if it does not allow other contributors to help. Can I try adding to the wiki? This will be done via a pull request, so any change can be looked at and/or backed out.
Sure! Sounds good to me.
I am trying to interpret files like:
https://www.sccgov.org/sites/proc/DoingBusinesswiththeCounty/Documents/Contracts%20Report%20for%20Month%20of%20November%202019.pdf
Here is a small screenshot:
https://opencalaccess.org/img/Screen_Shot_2021-03-31_at_4.00.22_PM.png
See the 4th and 8th row in the pic.
If you have a pdf with:
This is one row with three cells. I would like to get:
CCCCC tab AAAAA tab BBBBBB eol
What I get is actually:
"" tab AAAAA tab BBBBBB eol
CCCCC tab tab eol
I have forked and will check out the source and see about finding a minimal test case. And I will try to determine whether this is a duplicate bug or not.
cheers - ray