Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Just re-saving XLSX file without making changes causes cell count mismatch #14

Open
sschuberth opened this issue Mar 14, 2019 · 24 comments

Comments

@sschuberth
Copy link
Contributor

Sorry for this support-type of question. I have saved an XLSX file as the expected result and the sameWorkbook() assertion passes whem comparing against a programmatically created workbook. However, if I just open and re-safe the XLSX file (without making any changes) with Excel on Windows, I get

java.lang.AssertionError: 
Expected: entire workbook to be equal
     but: got <2> cell(s) on row <1> expected <6> sheet "Summary",
          got <2> cell(s) on row <2> expected <6> sheet "Summary",
          got <2> cell(s) on row <5> expected <6> sheet "Summary",
          got <2> cell(s) on row <6> expected <6> sheet "Summary",
          got <2> cell(s) on row <7> expected <6> sheet "Summary",
          got <2> cell(s) on row <8> expected <6> sheet "Summary"

So somehow the number of (non-physical) cells seems to have changed. Any idea why, and how to deal with it? I wanted to update the XLSX file with new test expectations, but now I cannot due to the above issue which always make the test fail.

@sschuberth
Copy link
Contributor Author

sschuberth commented Mar 14, 2019

Looks like is has something to do with merged cells in my case: At the example of row 1, there is cell A1, and cells B1 to F1 are merged into one "visual cell".

@sschuberth
Copy link
Contributor Author

sschuberth commented Mar 14, 2019

Indeed, in the original file generated via Apache POI 3.17, xl/worksheets/sheet1.xml has

    <row r="1">
      <c r="A1" s="2" t="s">
        <v>0</v>
      </c>
      <c r="B1" s="2" t="s">
        <v>1</v>
      </c>
    </row>

But after resaving with Excel it has

    <row r="1" spans="1:6">
      <c r="A1" s="2" t="s">
        <v>0</v>
      </c>
      <c r="B1" s="5" t="s">
        <v>1</v>
      </c>
      <c r="C1" s="6"/>
      <c r="D1" s="6"/>
      <c r="E1" s="6"/>
      <c r="F1" s="6"/>
    </row>

So it looks like Apache POI is not writing out cell spans correctly 😢 Edit: Or phrased more correctly, Apache POI does not seem to make getLastCellNum() return the correct number in case of this new (?) format of defining cell spans.

sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Mar 14, 2019
See tobyweston/simple-excel#14 for some
information.

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@here.com>
sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Mar 14, 2019
See tobyweston/simple-excel#14 for some
information.

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@here.com>
sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Mar 14, 2019
See tobyweston/simple-excel#14 for some
information.

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@here.com>
@tobyweston
Copy link
Owner

Awesome spot. Thanks for posting and apologies for not looking sooner.

So I'm clear, is this an issue with POI over simple-excel? Anything you think we should do in simple-excel now? Think we can get a (java) test in simple-excel to at least demonstrate the issue? I'm never confident in upgrading POI dependency but that might be worth considering?

@sschuberth
Copy link
Contributor Author

I believe by now this is an issue in POI rather than in simple-excel, yes. The question is, can simple-excel do anything to work around it? Something like "if there is a cell count mismatch and merged cells are involved do X to double-check", but me not knowing the POI API very well I have not idea what that X could be.

Anyway, it would certainly be interesting to see whether POI 4.0.1 would fix the issue.

@tobyweston
Copy link
Owner

From memory, POI was limited with merged cells. I may be wrong though. I'll try to take a look but a JUnit test demonstrating it would be super helpful (wink wink)

@sschuberth
Copy link
Contributor Author

I'll try to come up with a test, @tobyweston. As a small preparation for that, see #15 for some minor fixes.

@tobyweston
Copy link
Owner

Thanks. Perhaps the first thing to look at is a general dependency update in the pom.xml. We're only poi 3.17 from memory and it looks like 4.1.0 is the latest and greatest (https://poi.apache.org/download.html). Not sure what else would need looking at.

I'd also like to split the matcher stuff from the general poi wrapper. If no one finds my DSL/wrapper around poi useful, I'd be happy to remove it and leave this as a Hamcrest Matcher library only... any thoughts?

@sschuberth
Copy link
Contributor Author

Actually, my plan was to first try to come up with a test that reproduces my issue and fails initially. And then see what needs to be done to fix the issue; maybe upgrading to POI 4.1.0 would fix it. However, I'm currently running into some strange issue where a test passes that should actually fail. I'll probably create a draft PR to demonstrate the issue and ask for your help there.

@sschuberth
Copy link
Contributor Author

Sorry for the delay, @tobyweston. I've now created PR #16 to demonstrate how a test about merged cells succeeds that I'd expect to fail. Could you please have a look at the PR and comment on it?

@tobyweston
Copy link
Owner

See my comment in #16 (comment) for a potential fix

tobyweston added a commit that referenced this issue Oct 28, 2019
…orkbook as they seem equivalent (with the simple-excel version offering a little more around exception handling)
tobyweston added a commit that referenced this issue Oct 28, 2019
tobyweston added a commit that referenced this issue Oct 28, 2019
…than a literal comparison of merged cells)
@tobyweston
Copy link
Owner

tobyweston commented Oct 28, 2019

Tracking on a branch merged_regions (which has @sschuberth PR merged in #16).

tobyweston added a commit that referenced this issue Oct 28, 2019
…. Cell contents aren't used just the cell addresses within the region. Renamed the count based matcher for clarity.
tobyweston added a commit that referenced this issue Oct 28, 2019
@tobyweston
Copy link
Owner

Hmmm, rereading your comments. I've been focusing on including merged regions in the sheet comparisons and realise this probably doesn't address the original issue. I'll carry on (simple-excel will now fail when comparing sheets when two sheets have differing merged regions, regardless of content within those regions).

Let me know if that helps / hinders you.

Have you tried raising an issue with Apache POI?

@sschuberth
Copy link
Contributor Author

I'm sorry if my comments were setting you on the wrong track, but differences in the way merged regions are specified really were the only difference I could obverse.

No, I haven't raised the issue upstream with Apache POI, mainly because I wasn't sure enough whether it really is an upstream issue, and I did come up with an MWE yet.

@tobyweston
Copy link
Owner

No worries, I’ll have a closer look and see what I can do.

Does this fix help you though?

@sschuberth sschuberth changed the title Just re-saving XLSX fie without making changes causes cell count mismatch Just re-saving XLSX file without making changes causes cell count mismatch Oct 29, 2019
@sschuberth
Copy link
Contributor Author

Does this fix help you though?

Unfortunately not. I just tried simple-excel 1.2 with Apache POI 4.1.0 / 4.1.1, but I still get

java.lang.AssertionError: 
Expected: entire workbook to be equal
     but: got <2> cell(s) on row <1> expected <6> sheet "Summary",
          got <2> cell(s) on row <2> expected <6> sheet "Summary",
          got <2> cell(s) on row <5> expected <6> sheet "Summary",
          got <2> cell(s) on row <6> expected <6> sheet "Summary",
          got <2> cell(s) on row <7> expected <6> sheet "Summary",
          got <2> cell(s) on row <8> expected <6> sheet "Summary"

for Excel sheets that are actually the same, but where one has been re-saved with Excel.

@sschuberth
Copy link
Contributor Author

If you want to reproduce this, you can checkout the excel-poi branch of https://github.com/heremaps/oss-review-toolkit and then

  1. cp reporter/src/funTest/assets/file-counter-expected-scan-report.actual.xlsx reporter/src/funTest/assets/file-counter-expected-scan-report.xlsx
  2. ./gradlew :reporter:funTest --tests com.here.ort.reporter.reporters.ExcelReporterTest

See how the test succeeds. Then open file-counter-expected-scan-report.xlsx in Excel an re-safe it (e.g. by making a dummy change like removing a character and adding it again). Then again run ./gradlew :reporter:funTest --tests com.here.ort.reporter.reporters.ExcelReporterTest and see how the test fails.

@tobyweston
Copy link
Owner

tobyweston commented Nov 1, 2019

Thanks,

just tried simple-excel 1.2 with Apache POI 4.1.0 / 4.1.1

The tweaks to merged cells is on 1.3 which isn’t released yet. It currently resides in the merged_refions branch. At least we know POI 4.x doesn’t help.

I’ll look in more detail at your case and see if I can help before I actually release 1.3. If the merged regions are an authogonal issue, I may not include that “fix” (as I’m not sure how I feel about it).

@sschuberth
Copy link
Contributor Author

sschuberth commented Nov 1, 2019

The mismatch occurs in CellNumberMatcher.matchesSafely() for the expected row being

<xml-fragment r="1" spans="1:6" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" xmlns:main="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <main:c r="A1" s="2" t="s">
    <main:v>0</main:v>
  </main:c>
  <main:c r="B1" s="5" t="s">
    <main:v>1</main:v>
  </main:c>
  <main:c r="C1" s="6"/>
  <main:c r="D1" s="6"/>
  <main:c r="E1" s="6"/>
  <main:c r="F1" s="6"/>
</xml-fragment>

(for the above 6 cells are reported despite the spans="1:6" attribute) and the actual row being

<xml-fragment r="1" xmlns:main="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <main:c r="A1" t="s" s="2">
    <main:v>0</main:v>
  </main:c>
  <main:c r="B1" t="s" s="2">
    <main:v>1</main:v>
  </main:c>
</xml-fragment>

(for the above 2 cells are reported). Which is essentially the same that I had found out earlier by looking at the unzipped contents of the xlsx files.

@sschuberth
Copy link
Contributor Author

sschuberth commented Nov 1, 2019

It currently resides in the merged_refions branch.

Ah, right, sorry, will try that branch later.

@sschuberth
Copy link
Contributor Author

So I've tried your merged_regions branch and the output is already much more telling:

com.here.ort.reporter.reporters.ExcelReporterTest > com.here.ort.reporter.reporters.ExcelReporterTest > executionError FAILED
    java.lang.AssertionError:
    Expected: entire workbook to be equal
         but: got "B1:F1, B2:F2, B5:F5, B6:F6, B7:F7, B8:F8" as a merged region(s) in sheet "Summary" expected "B8:F8, B1:F1, B2:F2, B5:F5, B6:F6, B7:F7"

Note that the merged region actually are the same, they are just not reported in the same order! That is, in the expected output for some reason B8:F8 does not go last. If you could sort the merged regions before comparison, I believe we would be set 😄

@tobyweston
Copy link
Owner

Cool easily done!

I was also thinking that if we need some custom behaviour, like a workaround to some of the matching, we could enable it with a flag so it doesn’t affect anyone unless they opt in...

sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Oct 10, 2020
The simple-excel project has not been updated for about a year despite
having issues [1] [2], so get rid of it in favor of using kotest matchers
directly on the sheet properties.

[1] tobyweston/simple-excel#14
[2] tobyweston/simple-excel#19

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@bosch.io>
sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Oct 12, 2020
The simple-excel project has not been updated for about a year despite
having issues [1] [2], so get rid of it in favor of using kotest matchers
directly on the sheet properties.

[1] tobyweston/simple-excel#14
[2] tobyweston/simple-excel#19

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@bosch.io>
sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Oct 12, 2020
The simple-excel project has not been updated for about a year despite
having issues [1] [2], so get rid of it in favor of using kotest matchers
directly on the sheet properties.

[1] tobyweston/simple-excel#14
[2] tobyweston/simple-excel#19

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@bosch.io>
sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Oct 12, 2020
The simple-excel project has not been updated for about a year despite
having issues [1] [2], so get rid of it in favor of using kotest matchers
directly on the sheet properties.

[1] tobyweston/simple-excel#14
[2] tobyweston/simple-excel#19

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@bosch.io>
sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Oct 12, 2020
The simple-excel project has not been updated for about a year despite
having issues [1] [2], so get rid of it in favor of using kotest matchers
directly on the sheet properties.

[1] tobyweston/simple-excel#14
[2] tobyweston/simple-excel#19

Signed-off-by: Sebastian Schuberth <sebastian.schuberth@bosch.io>
@sschuberth
Copy link
Contributor Author

sschuberth commented Oct 18, 2020

FYI, I was finally able to solve this by writing my own code which "de-duplicates" cells that are actually part of merged cells before comparing cell contents, see

https://github.com/oss-review-toolkit/ort/blob/40ca07dec80d9b2c49e220975906d6953cd9256a/reporter/src/funTest/kotlin/reporters/ExcelReporterFunTest.kt#L105-L110

@tobyweston
Copy link
Owner

Awesome but I feel bad we didn't resolve this in simple-excel. We were close no? I've re-read the trail but am fuzzy about the details. If I ever get some time (unlikely atm) I can remind myself and try and merge the merged_regions branch.

@sschuberth
Copy link
Contributor Author

TBH, in retrospect I don't believe the merged_regions branch was on the right track. The solution to the problem was not to compare the merged regions themselves (as these are identical in both original and re-saved workbooks), but to consider the merged regions when iterating over the cells.

Here's an example: Let's assume there's a merged region at B1:D1. In the original workbook generated by Apache POI itself, the cell iterator for the first row would return A1, B1, C1, D1, E1 etc. I.e. it returns the individual cells of a merged region. When re-saving the workbook created by Apache POI in Excel, without doing any modifications, then the cell iterator suddenly returns A1, B1, E1 etc. So cells of a merged region other than the first cell are skipped.

That's why I'm changing the cell iterator to always skip cells of a merged region other than the first cell of a merged region.

In any case, I'd consider this to be a bug in Apache POI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants