Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 string in level 5 mat file fails to load or is corrupted #164

Closed
russelburgess opened this issue Mar 9, 2021 · 7 comments
Closed
Labels

Comments

@russelburgess
Copy link

I am having issues with UTF-8 strings causing mat files to fail to load. Below is a test case that writes a character array named "x" containing the string "test°test" to test.mat. Until R2020a this loaded in matlab, though with some corruption (a couple of random bytes on the end of the string), in R2020b this fails with "Error using load / Cannot read file D:...\test.mat."

#include <stdio.h>
#include <string.h>
#include "matio.h"

int main()
{
  mat_t *matfp;
  matvar_t *matvar;

  char data[] = "test\u00B0test"; // "test" + degree symbol + "test"
  size_t dims[2];

  matfp = Mat_CreateVer("test.mat", "Level 5 mat file UTF8 test", MAT_FT_MAT5);
  if (matfp == NULL) {
    printf("MAT - Error creating file.\n");
    return 1;
  }

  dims[0] = 1;
  dims[1] = strlen(data); // Returns the byte length of data (10) not the number of UTF-8 codes (9)
  matvar = Mat_VarCreate("x", MAT_C_CHAR, MAT_T_UTF8, 2, dims, (char *)data, MAT_F_DONT_COPY_DATA);
  if (matvar == NULL) {
    printf("MAT - DATA variable creation failed.\n");
    return 1;
  }

  Mat_VarWrite(matfp, matvar, MAT_COMPRESSION_NONE);
  Mat_VarFree(matvar);

  Mat_Close(matfp);

  return 0;
}

It would appear that the issue is caused by the dims[] being used for both the dimensions of the character array and for the number of bytes of data. That is, matlab expects the dimensions of the string "test°test" to be 1 x 9 (counting "°" as one), but the string occupies 10 bytes in the mat file, and matio writes the same number for both.

This can be seen by inspecting the hex dump of the file produced ("test.mat'):

```
4C 65 76 65 6C 20 35 20 6D 61 74 20 66 69 6C 65
20 55 54 46 38 20 74 65 73 74 00 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 00 01 49 4D
0E 00 00 00 40 00 00 00 06 00 00 00 08 00 00 00
04 00 00 00 00 00 00 00 05 00 00 00 08 00 00 00
01 00 00 00 0A 00 00 00 01 00 01 00 78 00 00 00
10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65
73 74 00 00 00 00 00 00
```

The first bolded byte is specifying the second dimension of the array; the second bolded byte is specifying the number of bytes of data in the element. Both of these originate in dims[1].

By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.

@tbeu
Copy link
Owner

tbeu commented Mar 9, 2021

Thanks for reporting. Especially the changed behavior in MATLAB is interesting.
But, isn't this an application issue actually? If you replace the call of strlen by something like

// See https://stackoverflow.com/a/32936928/8520615
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

you get the expected length (on Linux). By the way, running strlen with MSVC (on Win) also gives the expected length. Thus, there definetely is a platform dependency here, which I have not yet figured out.

See also https://godbolt.org/z/T6GMb5 for some experimenting.

@tbeu
Copy link
Owner

tbeu commented Mar 9, 2021

Ah, I understand better now. Array length and number of bytes must be set differently. Let me see what I can do.

tbeu added a commit that referenced this issue Mar 9, 2021
@tbeu
Copy link
Owner

tbeu commented Mar 9, 2021

By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.

It would not in R2019b where there is no proper UTF-8 support for MAT files. See also #79. It seems The MathWorks recently has improved UTF-8 support in MATLAB. Can you check if the files UTF8-Counterexample.zip and UTF32-Counterexample.zip (taken from blog.omega-prime.co.uk) can be successfully loaded in MATLAB R2020a and R2020b?

By the way, running strlen with MSVC (on Win) also gives the expected length. Thus, there definetely is a platform dependency here, which I have not yet figured out.

For MSVC I need to set char data[] = "test\xC2\xB0test"; // "test" + degree symbol + "test" to get the expected behaviour.

I also tried to address the issue by 836f8dd (still WIP). Nevertheless, you need to adapt the dimension in the application code, for example dims[1] = count_utf8_code_points(data);.

@russelburgess
Copy link
Author

Ah, can't believe I missed that previous issue #! Neither of UTF8-Counterexample.zip and UTF32-Counterexample.zip loaded in R2020a or R2020b, both fail with the same error as before "Error using load / Cannot read file D:...\UTF8-Counterexample.mat." Presumably MATLAB still can't read code points > 2 bytes then.

However using commit 836f8dd and count_utf8_code_points() does work (for the example string I gave, loads in both R2020a and R2020b). I should have mentioned before that I'm working in msys2, gcc 9.3.0, under Windows 10 64-bit.

I wasn't aware strlen() was platform dependent so that's good to know. FWIW for me strlen("test\u00B0test") returns:

  • 10 from msys2 / gcc 9.3.0
  • 9 from Visual Studio 16.9.1 (MSVC version 1928)

But... strlen("test\xC2\xB0test") returns:

  • 10 from msys2
  • 10 from Visual Studio

Which is...confusing. As best as I can tell MSVC is turning "\u00B0" into literally just the byte B0, whereas gcc is turning it into C2B0.

It seems like that commit will fix the issue as much as can be fixed on matio's end, so I can close the issue unless there are any other tests in R2020a/b you'd like me to run?

@tbeu
Copy link
Owner

tbeu commented Mar 10, 2021

Thanks for confirmation and further analysis. Please leave this issue open for now as I need to run the testsuite through latest MATLAB. I will close it once it is properly addressed and commited to master. Afterwards I will release a new version.

@russelburgess
Copy link
Author

Awesome, thanks for the prompt response to all this!

tbeu added a commit that referenced this issue Mar 11, 2021
tbeu added a commit that referenced this issue Mar 14, 2021
tbeu added a commit that referenced this issue Mar 14, 2021
tbeu added a commit that referenced this issue Mar 20, 2021
tbeu added a commit that referenced this issue Mar 21, 2021
@tbeu
Copy link
Owner

tbeu commented Mar 21, 2021

It's now fixed on master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants