xslt-util/calstable/xpl and com.xmlcalabash conversion errors #41

Open
sentientmachine opened this issue Dec 27, 2018 · 8 comments

@sentientmachine

sentientmachine commented Dec 27, 2018

Bug Report:
My OS: Linux Gentoo Base System release 2.24.1.12 64 bit PC desktop
Java: 1.8.0_66
Shell: bash 4.3.42 (x86_64-pc-linux-gnu)
Install: cd /home/el/bin; git clone https://github.com/transpect/docx2tex --recursive
The input docx has a few unicode shenanigans, but nothing too out of band: http://www.filedropper.com/examplefail
Run your code: cd /home/el/bin/docx2tex; ./d2t ExampleFail.docx
Failure .log File: http://www.filedropper.com/examplefaild2t

What I expected: I expected some kind of output file ExampleFail.tex output containing latex code.

Quarantining the bug, proving the bug isn't on my side:

  1. Use libreoffice version 5.2.3.3 -writer to create a new empty .docx document containing the ascii text asdf.

  2. Save the above file as Untitled.docx using format Microsoft Word 2007-2013 XML (.docx) format.

  3. Openoffice -writer produces this Untitled.docx: http://www.filedropper.com/untitled_22

  4. Run the code: cd /home/el/bin/docx2tex; ./d2t Untitled.docx

  5. docx2tex works as expected: the contents of Untitled.tex render with pdflatex to a similar-looking pdf.

The problem is in the table layouts.

@gimsieke
Contributor

This must be the infamous Open Source Entitlement hitting us finally.
Thanks for reporting, we might eventually look into the issue, despite your impolite manners.

@sentientmachine
Author

sentientmachine commented Dec 27, 2018

Ha, sorry for being rude. But my beard length going down the hall entitles me to Level 4 open source entitlements when the wind blows from the east on Tuesdays.

Workaround 1 helps isolate the input bug:

  1. Create a new empty Libreoffice .docx document.
  2. Open the ExampleFail.docx that produced the error above, do a Select-all, Copy, and paste into a new file Untitled2.docx
  3. Run the code: cd /home/el/bin/docx2tex; ./d2t Untitled2.docx
  4. A .tex output is successfully produced.

A libreoffice select-all, copy and paste performs some kind of normalization operation on the faulty .docx nested table object without destroying the variation in the varying rows and columns.

@gimsieke
Contributor

OpenOffice or LibreOffice might create OOXML (docx) structures in a legal yet unexpected way. The tool should (in the sense of: “we should make it so”, not in the sense of: “it should already be Ok”) convert tables saved by recent versions of LibreOffice correctly provided they are valid OOXML, so I think we will fix this soon.

@sentientmachine
Author

sentientmachine commented Dec 27, 2018

I've reproduced the error closer to the source. This screenshot tells the story:

https://ibb.co/vQjwS35

The conversion of "CALS tables" to latex tables fails because it doesn't handle variation in the number of columns or rows.

The conversion error is asserted here: https://github.com/transpect/xslt-util/blob/74bb4f7d3c15b8649a71dfc55dae085ab6dfd38e/calstable/xsl/normalize.xsl

So now I can create an SSCCE using microsoft word, linux libreoffice and docx2tex thustly:

  1. In Windows, make a new empty Microsoft Word document.
  2. Choose table -> insert table, accept default 2x2 table.
  3. In the 2x2 table, join the upper two cells together horizontally.
  4. Save it as whatever.docx
  5. Run the code in linux: cd /home/el/bin/docx2tex; ./d2t whatever.docx
  6. You get the errors as described in the first post.

Workaround 2:

docx2tex can't handle Microsoft Word tables with an inconsistent number of columns and rows. If you must use them, a cleansing operation is to copy and paste those tables using libreoffice -writer into a fresh libreoffice document with docx format. Then all is well.

This .docx is a minimal document to illuminate the problem; it's just an empty word document with a table containing an inconsistent number of rows: http://www.filedropper.com/ssccefordocx2tex

Microsoft Word has an option to join cells of a table horizontally on a row-by-row basis, whereas libreoffice doesn't seem to allow me to do so. However, I can copy and paste such tables and the distinctions aren't destroyed; the copy/paste cleanses them. So maybe you can program in an auto-cleanse xsl.

@gimsieke
Contributor

Thanks for the repro. I don’t think it’s related to merged cells per se. It occurs when there are merged cells within nested tables. Investigating…
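
To check which of these features a given .docx actually contains, the following is a minimal sketch using only the Python standard library. It assumes the usual OOXML layout (a zip archive with `word/document.xml`, `w:tbl` for tables, `w:gridSpan` for horizontal merges, `w:vMerge` for vertical merges). The function names `table_stats` and `docx_table_stats` are made up for this illustration and are not part of docx2tex.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name for a w:* element or attribute."""
    return "{%s}%s" % (W, tag)


def table_stats(root):
    """Count tables, nested tables, and merged cells below an XML element."""
    tables = list(root.iter(q("tbl")))
    # A table is nested if it is a descendant of another table.
    nested_ids = set()
    for tbl in tables:
        for inner in tbl.iter(q("tbl")):
            if inner is not tbl:  # iter() also yields the element itself
                nested_ids.add(id(inner))
    # Horizontal merges: w:gridSpan with w:val greater than 1.
    col_spans = sum(
        1 for g in root.iter(q("gridSpan")) if int(g.get(q("val"), "1")) > 1
    )
    # Vertical merges: w:vMerge with w:val="restart" starts a merged run.
    row_spans = sum(
        1 for v in root.iter(q("vMerge")) if v.get(q("val")) == "restart"
    )
    return {
        "tables": len(tables),
        "nested": len(nested_ids),
        "col_spans": col_spans,
        "row_spans": row_spans,
    }


def docx_table_stats(path):
    """Hypothetical helper: read word/document.xml from a .docx and count."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    return table_stats(root)
```

Calling `docx_table_stats("whatever.docx")` on a failing input would show whether nested tables are present alongside the merged cells; a non-zero `nested` count points at the situation described in this comment.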

@gimsieke
Contributor

The error doesn’t occur if I revert to transpect/xslt-util@271dd78. So there seems to be a regression. It is caused by another fix that improved other aspects of CALS table normalization and that is not covered by any test yet, apparently.
With the old version, the LaTeX code that was generated didn’t compile though. This seems to be related to the table nesting, too, but at a later stage.
I will try to fix the calstable bug, and if the problem with the generated LaTeX code then persists, @mkraetke needs to look into this.
We will not commit to a time frame for a fix.

@gimsieke
Contributor

I was able to resolve the first error (I have not pushed the commit yet). However, there are more fundamental reasons why both your sample files don’t compile.

The default mode of operation for docx2tex is to resolve embedded tables, that is, to add more columns and rows to the containing table so that the embedded table becomes part of the containing table. The outer table’s rows and columns will turn into merged cells. But this only works if the embedded table occupies a full cell of the containing table, with no paragraphs and/or other embedded tables in the same cell. sscce_for_docx2tex.docx cannot be processed because it violates this condition. We’ll probably add a message to the log file that explains this restriction. It’s unlikely that we will be able to fix this.
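
The restriction can be checked on an input file before running the conversion. The sketch below is an illustration only, not the check docx2tex performs: it flags table cells whose embedded table shares the cell with a non-empty paragraph or with another embedded table. It assumes that empty paragraphs next to the embedded table can be ignored (Word typically writes one after a table inside a cell); whether docx2tex treats them that way is an assumption. The name `unresolvable_cells` is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name for a w:* element."""
    return "{%s}%s" % (W, tag)


def has_text(p):
    """True if a w:p element contains any non-whitespace text."""
    return any((t.text or "").strip() for t in p.iter(q("t")))


def unresolvable_cells(root):
    """Return the w:tc elements whose embedded table does not fill the cell alone.

    Assumption (not confirmed by the source): empty paragraphs are harmless.
    """
    bad = []
    for tc in root.iter(q("tc")):
        inner_tables = tc.findall(q("tbl"))  # direct children only
        if not inner_tables:
            continue  # no embedded table, nothing to resolve
        text_paras = [p for p in tc.findall(q("p")) if has_text(p)]
        if len(inner_tables) > 1 or text_paras:
            bad.append(tc)
    return bad
```

An empty result suggests the embedded tables meet the stated condition; any returned cell is a candidate for manual cleanup in the .docx before conversion.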

The alternative to resolving embedded tables is to keep them nested (there’s an option for this that is currently not exposed in w2t). But the LaTeX code that we currently generate for this case creates extraneous \begin{table}/\end{table} around the embedded tabularx environments. I think it should just put them in curly braces instead if they are embedded. This is a thing that I will eventually look into with @mkraetke.

The other sample file, ExampleFail.docx, failed for other reasons. One is that definition list environments don’t seem to be supported yet in cells. I think they need to be wrapped in a \parbox. There were other errors related to generated \FontAwesome and \privateuse macros. Again, @mkraetke and I might occasionally look into these things.

Since these errors don’t affect our daily production lines (that produce hundreds of thousands of pages per year), we are unlikely to look at them with high priority. However, we are constantly trying to improve the tool, and your examples are certainly helpful since they are demanding in terms of table nesting requirements, despite their small size.

I also tried the workaround, pasting the document into an empty LibreOffice document. But the embedded tables came out the same way. I’m using LO 6.0.0.3; maybe LO 5 flattened the tables while LO 6 keeps them nested. So this workaround is not working for me.

Let me stress again that colspans and rowspans are not the issue. They are supported in principle. The main problem is nested tables, but other issues have also surfaced that are related to special characters and definition lists.

@sentientmachine
Author

sentientmachine commented Dec 28, 2018

Thanks for the quick turnaround, that sounds right. The above workarounds handled my cases and I can tweak the input .docx to remove the bad table. Maybe a better error message would help future users realize the limitation quicker, without having to trial and error input files. Maybe even a flag to aggressively string-join-flatten horizontal and string-join-flatten-vertically the offending table.

I'd prefer a best-attempt result .tex file even if the nested subtable was not exactly represented, because my intention was to tweak the .tex file as needed to clean it up anyway.
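
As a rough idea of what such a string-join flattening could look like, here is a hypothetical sketch. It is not an existing docx2tex option; the names `flatten_table` and `cell_text` and the separators are invented for illustration. It reduces a `w:tbl` element, including any tables nested inside it, to a single string.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name for a w:* element."""
    return "{%s}%s" % (W, tag)


def cell_text(tc, cell_sep, row_sep):
    """Join the paragraph text of a cell; nested tables are flattened in place."""
    parts = []
    for child in tc:
        if child.tag == q("p"):
            text = "".join(t.text or "" for t in child.iter(q("t")))
            if text:
                parts.append(text)
        elif child.tag == q("tbl"):
            parts.append(flatten_table(child, cell_sep, row_sep))
    return " ".join(parts)


def flatten_table(tbl, cell_sep=", ", row_sep="; "):
    """Flatten a table to text: cells joined horizontally, rows vertically."""
    rows = []
    for tr in tbl.findall(q("tr")):
        cells = [cell_text(tc, cell_sep, row_sep) for tc in tr.findall(q("tc"))]
        rows.append(cell_sep.join(cells))
    return row_sep.join(rows)
```

The flattened string loses the table layout, so it would only serve as a placeholder to be fixed by hand in the resulting .tex file.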

Looks like the escape from Microsoft Island is not as easy as it sounds. Not a big surprise. 👍
