UTF-8 string in level 5 mat file fails to load or is corrupted #164

russelburgess · 2021-03-09T03:36:30Z

I am having issues with UTF-8 strings causing mat files to fail to load. Below is a test case that writes a character array named "x" containing the string "test°test" to test.mat. Until R2020a this loaded in matlab, though with some corruption (a couple of random bytes on the end of the string), in R2020b this fails with "Error using load / Cannot read file D:...\test.mat."

#include <stdio.h>
#include <string.h>
#include "matio.h"

int main()
{
  mat_t *matfp;
  matvar_t *matvar;

  char data[] = "test\u00B0test"; // "test" + degree symbol + "test"
  size_t dims[2];

  matfp = Mat_CreateVer("test.mat", "Level 5 mat file UTF8 test", MAT_FT_MAT5);
  if (matfp == NULL) {
    printf("MAT - Error creating file.\n");
    return 1;
  }

  dims[0] = 1;
  dims[1] = strlen(data); // Returns the byte length of data (10) not the number of UTF-8 codes (9)
  matvar = Mat_VarCreate("x", MAT_C_CHAR, MAT_T_UTF8, 2, dims, (char *)data, MAT_F_DONT_COPY_DATA);
  if (matvar == NULL) {
    printf("MAT - DATA variable creation failed.\n");
    return 1;
  }

  Mat_VarWrite(matfp, matvar, MAT_COMPRESSION_NONE);
  Mat_VarFree(matvar);

  Mat_Close(matfp);

  return 0;
}

It would appear that the issue is caused by the dims[] being used for both the dimensions of the character array and for the number of bytes of data. That is, matlab expects the dimensions of the string "test°test" to be 1 x 9 (counting "°" as one), but the string occupies 10 bytes in the mat file, and matio writes the same number for both.

This can be seen by inspecting the hex dump of the file produced ("test.mat'):

```
4C 65 76 65 6C 20 35 20 6D 61 74 20 66 69 6C 65
20 55 54 46 38 20 74 65 73 74 00 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 00 01 49 4D
0E 00 00 00 40 00 00 00 06 00 00 00 08 00 00 00
04 00 00 00 00 00 00 00 05 00 00 00 08 00 00 00
01 00 00 00 0A 00 00 00 01 00 01 00 78 00 00 00
10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65
73 74 00 00 00 00 00 00
```

The first bolded byte is specifying the second dimension of the array; the second bolded byte is specifying the number of bytes of data in the element. Both of these originate in dims[1].

By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.

tbeu · 2021-03-09T19:35:03Z

Thanks for reporting. Especially the changed behavior in MATLAB is interesting.
But, isn't this an application issue actually? If you replace the call of strlen by something like

// See https://stackoverflow.com/a/32936928/8520615
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

you get the expected length (on Linux). By the way, running strlen with MSVC (on Win) also gives the expected length. Thus, there definetely is a platform dependency here, which I have not yet figured out.

See also https://godbolt.org/z/T6GMb5 for some experimenting.

tbeu · 2021-03-09T19:45:43Z

Ah, I understand better now. Array length and number of bytes must be set differently. Let me see what I can do.

As reported by #79 and #164

tbeu · 2021-03-09T21:38:44Z

By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.

It would not in R2019b where there is no proper UTF-8 support for MAT files. See also #79. It seems The MathWorks recently has improved UTF-8 support in MATLAB. Can you check if the files UTF8-Counterexample.zip and UTF32-Counterexample.zip (taken from blog.omega-prime.co.uk) can be successfully loaded in MATLAB R2020a and R2020b?

By the way, running strlen with MSVC (on Win) also gives the expected length. Thus, there definetely is a platform dependency here, which I have not yet figured out.

For MSVC I need to set char data[] = "test\xC2\xB0test"; // "test" + degree symbol + "test" to get the expected behaviour.

I also tried to address the issue by 836f8dd (still WIP). Nevertheless, you need to adapt the dimension in the application code, for example dims[1] = count_utf8_code_points(data);.

russelburgess · 2021-03-09T22:53:20Z

Ah, can't believe I missed that previous issue #! Neither of UTF8-Counterexample.zip and UTF32-Counterexample.zip loaded in R2020a or R2020b, both fail with the same error as before "Error using load / Cannot read file D:...\UTF8-Counterexample.mat." Presumably MATLAB still can't read code points > 2 bytes then.

However using commit 836f8dd and count_utf8_code_points() does work (for the example string I gave, loads in both R2020a and R2020b). I should have mentioned before that I'm working in msys2, gcc 9.3.0, under Windows 10 64-bit.

I wasn't aware strlen() was platform dependent so that's good to know. FWIW for me strlen("test\u00B0test") returns:

10 from msys2 / gcc 9.3.0
9 from Visual Studio 16.9.1 (MSVC version 1928)

But... strlen("test\xC2\xB0test") returns:

10 from msys2
10 from Visual Studio

Which is...confusing. As best as I can tell MSVC is turning "\u00B0" into literally just the byte B0, whereas gcc is turning it into C2B0.

It seems like that commit will fix the issue as much as can be fixed on matio's end, so I can close the issue unless there are any other tests in R2020a/b you'd like me to run?

tbeu · 2021-03-10T07:05:33Z

Thanks for confirmation and further analysis. Please leave this issue open for now as I need to run the testsuite through latest MATLAB. I will close it once it is properly addressed and commited to master. Afterwards I will release a new version.

russelburgess · 2021-03-11T18:59:34Z

Awesome, thanks for the prompt response to all this!

As reported by #79 and #164

tbeu · 2021-03-21T18:34:40Z

It's now fixed on master branch.

tbeu added a commit that referenced this issue Mar 9, 2021

Fix writing of UTF-8 encoded character array to v5 MAT file

836f8dd

As reported by #79 and #164

tbeu added a commit that referenced this issue Mar 11, 2021

Fix I/O of UTF-8 encoded character array from/to v5 MAT file

254091a

As reported by #79 and #164

tbeu added a commit that referenced this issue Mar 14, 2021

Fix I/O of UTF-8 encoded character array from/to v5 MAT file

912d4e4

As reported by #79 and #164

tbeu added a commit that referenced this issue Mar 14, 2021

Fix I/O of UTF-8 encoded character array from/to v5 MAT file

32bd89e

As reported by #79 and #164

tbeu added a commit that referenced this issue Mar 20, 2021

Fix I/O of UTF-8 encoded character array from/to v5 MAT file

6672e3f

As reported by #79 and #164

tbeu added a commit that referenced this issue Mar 21, 2021

Fix I/O of UTF-8 encoded character array from/to v5 MAT file

67a922f

As reported by #79 and #164

tbeu closed this as completed Mar 21, 2021

tbeu added the bug label Mar 21, 2021

This was referenced Mar 23, 2021

Using strlen to detect the number of bytes breaks the usage of char arrays #165

Closed

telemetryDeviceDumper: problems in some .mat files saved robotology/robometry#107

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 string in level 5 mat file fails to load or is corrupted #164

UTF-8 string in level 5 mat file fails to load or is corrupted #164

russelburgess commented Mar 9, 2021

tbeu commented Mar 9, 2021

tbeu commented Mar 9, 2021

tbeu commented Mar 9, 2021

russelburgess commented Mar 9, 2021

tbeu commented Mar 10, 2021

russelburgess commented Mar 11, 2021

tbeu commented Mar 21, 2021

UTF-8 string in level 5 mat file fails to load or is corrupted #164

UTF-8 string in level 5 mat file fails to load or is corrupted #164

Comments

russelburgess commented Mar 9, 2021

tbeu commented Mar 9, 2021

tbeu commented Mar 9, 2021

tbeu commented Mar 9, 2021

russelburgess commented Mar 9, 2021

tbeu commented Mar 10, 2021

russelburgess commented Mar 11, 2021

tbeu commented Mar 21, 2021