Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LABEL records in xls 1) exist and 2) have length stored as 16-bit value; fix… #312

Merged
merged 4 commits into from Mar 31, 2017
Merged

LABEL records in xls 1) exist and 2) have length stored as 16-bit value; fix… #312

merged 4 commits into from Mar 31, 2017

Conversation

jennybc
Copy link
Member

@jennybc jennybc commented Mar 30, 2017

…es #309

Here is an alternative to #310 that is much more surgical and preserves my hard-earned clean report from memcheck!

Restating the context

In #309, we have an xls BIFF5, a very old format (but with recent data, so apparently some tool is still writing this!). It uses the LABEL record type to hold strings, which is an ancient practice and, I thought, no longer occurring. I was wrong.

Problem 1: readxl had stopped reading LABEL cells. Hadley had covered this cell type, but I'd let it go missing in one place in my revisions. I've added it back and such cells are now loaded again.

Problem 2: The fix re: using longer integers to index into the shared string table (#293) actually breaks the string extraction for LABEL records. All of this is libxls problems/bugs btw. The problem is that label needs to be a pointer to uint32_t for parsing modern BIFF8 sheets / LABELSST records and a pointer to uint16_t for sheets like the one in #309. This only matters for advancing label past the string length, so it points at the beginning of the string.

This solution just moves label backwards one byte inside if(pWB->is5ver) {...} to correct for the fact that label++; goes too far.

@jennybc
Copy link
Member Author

jennybc commented Mar 30, 2017

The previous thread holds the relevant info on the structure of LABEL records in BIFF5 xls files: #310 (comment).

@jimhester
Copy link
Contributor

There are a number of issues here, first I think the declaration of xls_getfcell needs to change to extern BYTE* xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,BYTE *label); and move the byte pointer the correct distance for each case, rather than trying shoe-horn a DWORD pointer in there. Note even before this PR (a41f61b) I get a error when running with sanitizers because you are trying to read a dereference a mis-aligned DWORD.

xlstool.c:641:44: runtime error: load of misaligned address 0x6020000c2df6 for type 'DWORD', which requires 4 byte alignment

Regardless as far as I can tell with your new test this is failing before the code even gets to xls_getfcell(). In particular src/endian.c:115 is causing a buffer overflow. If you look at 5.8.1 BOF Records Written by Excel from the open office documentation you can see the structure for BIFF5 records is different than BIFF8 records. You will either need to condition in xlsConvertBiff() on the BIFF version for the last two fields or define a new structure with the proper layout for BIFF5.

Below is a patch for these changes you are welcome to use which passes all tests tests without any memory errors. One thing you may want to check is the non is5ver cases to ensure the label pointer is pointing to the start of the string for utf8_decode() and unicode_decode().

commit eb48e3c6fe57273b326023ed75ca0505bc2dbb63
Author: James Hester <james.hester@rstudio.com>
Date:   Thu Mar 30 19:07:25 2017 +0000

    Support BIFF5 records as well as BIFF8

    Changes the API of xls_getfcell to use BYTE* rather than DWORD* as the records
    are not aligned properly for DWORD*, then manually moves the address rather
    than relying on pointer arithmetic.

diff --git a/src/endian.c b/src/endian.c
index a0ff923..38cdb59 100755
--- a/src/endian.c
+++ b/src/endian.c
@@ -112,9 +112,11 @@ void xlsConvertBiff(BIFF *b)
     b->type = xlsShortVal(b->type);
     b->id_make = xlsShortVal(b->id_make);
     b->year = xlsShortVal(b->year);
+    if (b->ver == 0x600) {
     b->flags = xlsIntVal(b->flags);
     b->min_ver = xlsIntVal(b->min_ver);
 }
+}

 void xlsConvertWindow(WIND1 *w)
 {
diff --git a/src/libxls/xlstool.h b/src/libxls/xlstool.h
index b701f73..1b8c80c 100644
--- a/src/libxls/xlstool.h
+++ b/src/libxls/xlstool.h
@@ -48,6 +48,6 @@ extern void xls_showCell(struct st_cell_data* cell);
 extern void xls_showFont(struct st_font_data* font);
 extern void xls_showXF(XF8* xf);
 extern void xls_showFormat(struct st_format_data* format);
-extern BYTE* xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,DWORD *label);
+extern BYTE* xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,BYTE *label);
 extern char* xls_getCSS(xlsWorkBook* pWB);
 extern void xls_showBOF(BOF* bof);
diff --git a/src/xls.c b/src/xls.c
index 5ba40d5..212a2a6 100644
--- a/src/xls.c
+++ b/src/xls.c
@@ -530,7 +530,7 @@ struct st_cell_data *xls_addCell(xlsWorkSheet* pWS,BOF* bof,BYTE* buf)
         break;
     case XLS_RECORD_LABELSST:
     case XLS_RECORD_LABEL:
-               cell->str=xls_getfcell(pWS->workbook,cell,(DWORD_UA *)&((LABEL*)buf)->value);
+               cell->str=xls_getfcell(pWS->workbook,cell,(BYTE*)&((LABEL*)buf)->value);
                sscanf((char *)cell->str, "%d", &cell->l);
                sscanf((char *)cell->str, "%lf", &cell->d);
                break;
diff --git a/src/xlstool.c b/src/xlstool.c
index 7d4a27e..7030ec1 100644
--- a/src/xlstool.c
+++ b/src/xlstool.c
@@ -626,7 +626,7 @@ void xls_showXF(XF8* xf)
     printf("GroundColor: 0x%x\n",xf->groundcolor);
 }

-BYTE *xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,DWORD *label)
+BYTE *xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,BYTE *label)
 {
     struct st_xf_data *xf;
        WORD    len;
@@ -646,7 +646,7 @@ BYTE *xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,DWORD *label)
         break;
     case XLS_RECORD_LABEL:
                len = xlsShortVal(*label);
-        label++;
+    label += 2;
                if(pWB->is5ver) {
                  // readxl patch
                  //
@@ -662,15 +662,14 @@ BYTE *xls_getfcell(xlsWorkBook* pWB,struct st_cell_data* cell,DWORD *label)
                  //
                  // this line moves the label pointer back one byte, to the beginning
                  // of the string
-                 label = (DWORD *)((WORD *)label - 1);
                        asprintf(&ret,"%.*s", len, (char *)label);
                        //printf("Found BIFF5 string of len=%d \"%s\"\n", len, ret);
                } else
                if ((*(BYTE *)label & 0x01) == 0) {
-                       ret = (char *)utf8_decode((BYTE *)label + 1, len, pWB->charset);
+                               ret = (char *)utf8_decode((BYTE *)label, len, pWB->charset);
                } else {
                        size_t newlen;
-                   ret = (char *)unicode_decode((BYTE *)label + 1, len*2, &newlen, pWB->charset);
+                               ret = (char *)unicode_decode((BYTE *)label, len*2, &newlen, pWB->charset);
                }
         break;
     case XLS_RECORD_RK:

@@ -322,9 +322,10 @@ class XlsWorkSheet {
cell->id == 0x205 ||
// cell holds either Boolean or error:
// 0x205 --> 517 BoolErr (section 2.4.24) p216
cell->id == 0x0FD
cell->id == 0x0FD || cell->id == 0x204
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would really be more useful as self-documenting constants rather than trying to keep the comments up to date.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, all these well-named constants already exist in libxls, so I can just include the appropriate header and get them from there.

@jennybc
Copy link
Member Author

jennybc commented Mar 30, 2017

Thanks @jimhester for the thorough analysis and fix!

I will incorporate it. This seems like another thing that should go back upstream to libxls. In fact, maybe I can enlist the maintainer to sort out the xlsConvertBiff() problem in the manner they prefer.

One thing you may want to check is the non is5ver cases to ensure the label pointer is pointing to the start of the string for utf8_decode() and unicode_decode().

This did occur to me, with the previous solution that was less contained. But I don't have any BIFF8 sheets that use the LABEL record type. I've never even seen a LABEL record before this issue and thought they were extinct. Which I still believe to be true, but it would still be nice to handle them properly.

@jennybc
Copy link
Member Author

jennybc commented Mar 30, 2017

@jimhester If it is still easy for you to do so, would you do your checks with sanitizers on 0587f00? That is the commit just prior to the change re: using 32-bit integers to index the string table vs 16. I'm curious if this is all fallout of that change or if there were pre-existing problems and this sheet just exposes them.

@jennybc
Copy link
Member Author

jennybc commented Mar 30, 2017

I think the label + 1 has to be retained inside utf8_decode() and unicode_decode(). Here is better info on XLUnicodeString structures from yet another document (the one that is the most reliable for BIFF8):

p896 of http://interoperability.blob.core.windows.net/files/MS-XLS/[MS-XLS].pdf

screen shot 2017-03-30 at 3 25 11 pm

as per @jimhester

Changes the API of xls_getfcell to use BYTE* rather than DWORD* as the records are not aligned properly for DWORD*, then manually moves the address rather than relying on pointer arithmetic.
@jennybc
Copy link
Member Author

jennybc commented Mar 30, 2017

@jimhester

Below is a patch for these changes you are welcome to use which passes all tests tests without any memory errors.

I implemented this ac13976 and pass all tests, but still get a very great deal of memory errors ("Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 8"), all associated with this line:

asprintf(&ret,"%.*s", len, (char *)label);

And I do not understand why I'm getting these memory complaints?!?

@jennybc
Copy link
Member Author

jennybc commented Mar 31, 2017

I still get the errors about uninitialized values when I pre-allocate memory to accept the string (ret above), which I think confirms they come from reading label. The only thing I can think of is that previously label was a pointer to WORD_UA or DWORD_UA. Could this somehow be about alignment? That doesn't really make sense to me though.

@jennybc
Copy link
Member Author

jennybc commented Mar 31, 2017

@jimhester and I have determined my memory errors are false positives peculiar to valgrind on Mac OS (e.g., Valgrind reports errors for a very simple C program). For example, he cannot reproduce them in a ASAN enabled docker container or with the hub sanitizers or valgrind. I'm writing the necessary suppression file so I won't see them in future and will soon merge this fix.

@jennybc jennybc changed the title LABEL records in xls 1) exist and 2) are stored as 16-bit values; fix… LABEL records in xls 1) exist and 2) have length stored as 16-bit value; fix… Mar 31, 2017
@jennybc jennybc merged commit 357e473 into tidyverse:master Mar 31, 2017
@jennybc
Copy link
Member Author

jennybc commented Mar 31, 2017

@jimhester I submitted your fix as a patch to libxls: https://sourceforge.net/p/libxls/patches/13/

@jennybc jennybc deleted the label-record-type-take2 branch March 31, 2017 23:09
jennybc added a commit that referenced this pull request Apr 12, 2017
* Overwrite our libxls "fork" with current svn "HEAD"

svn2github/libxls@bb84eb9

* Don't clutter this operation with xls2csv.c

* No, do not define DEBUG_DRAWINGS

* Suppress print debugging re: XLS_RECORD_MULRK in xls_addCell()

* Reinstate the fix made in PR #273

Rescues read of last column in xls written by some 3rd party tools

* Reinstate the fixes made in PR #312

* Replace #ifdef AIX with #if defined(_AIX) || defined(__sun)

* _WIN32 handling, don't warn for check re: endian-ness

* Bring print statement into line with code

* Changes re: "flexible array members"

* Whitespace the way libxls likes it

* Match va_start() and va_end(); see #195

* Reinstate asprintf() fix from #246

* Re-mask illegal functions for CMD check

* Possible refactor of cran.h

* Undefine assert before redefining

* Quiet more warnings

These are more harmless than they seem on the surface, because there is lots of code in these files that is inside `#ifdef 0 ... `#endif` blocks and the like.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants