-
-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF-8 string in level 5 mat file fails to load or is corrupted #164
Comments
Thanks for reporting. Especially the changed behavior in MATLAB is interesting. // See https://stackoverflow.com/a/32936928/8520615
size_t count_utf8_code_points(const char *s) {
size_t count = 0;
while (*s) {
count += (*s++ & 0xC0) != 0x80;
}
return count;
} you get the expected length (on Linux). By the way, running See also https://godbolt.org/z/T6GMb5 for some experimenting. |
Ah, I understand better now. Array length and number of bytes must be set differently. Let me see what I can do. |
It would not in R2019b where there is no proper UTF-8 support for MAT files. See also #79. It seems The MathWorks recently has improved UTF-8 support in MATLAB. Can you check if the files UTF8-Counterexample.zip and UTF32-Counterexample.zip (taken from blog.omega-prime.co.uk) can be successfully loaded in MATLAB R2020a and R2020b?
For MSVC I need to set I also tried to address the issue by 836f8dd (still WIP). Nevertheless, you need to adapt the dimension in the application code, for example |
Ah, can't believe I missed that previous issue #! Neither of UTF8-Counterexample.zip and UTF32-Counterexample.zip loaded in R2020a or R2020b, both fail with the same error as before "Error using load / Cannot read file D:...\UTF8-Counterexample.mat." Presumably MATLAB still can't read code points > 2 bytes then. However using commit 836f8dd and count_utf8_code_points() does work (for the example string I gave, loads in both R2020a and R2020b). I should have mentioned before that I'm working in msys2, gcc 9.3.0, under Windows 10 64-bit. I wasn't aware strlen() was platform dependent so that's good to know. FWIW for me strlen("test\u00B0test") returns:
But... strlen("test\xC2\xB0test") returns:
Which is...confusing. As best as I can tell MSVC is turning "\u00B0" into literally just the byte B0, whereas gcc is turning it into C2B0. It seems like that commit will fix the issue as much as can be fixed on matio's end, so I can close the issue unless there are any other tests in R2020a/b you'd like me to run? |
Thanks for confirmation and further analysis. Please leave this issue open for now as I need to run the testsuite through latest MATLAB. I will close it once it is properly addressed and commited to master. Afterwards I will release a new version. |
Awesome, thanks for the prompt response to all this! |
It's now fixed on master branch. |
I am having issues with UTF-8 strings causing mat files to fail to load. Below is a test case that writes a character array named "x" containing the string "test°test" to test.mat. Until R2020a this loaded in matlab, though with some corruption (a couple of random bytes on the end of the string), in R2020b this fails with "Error using load / Cannot read file D:...\test.mat."
It would appear that the issue is caused by the dims[] being used for both the dimensions of the character array and for the number of bytes of data. That is, matlab expects the dimensions of the string "test°test" to be 1 x 9 (counting "°" as one), but the string occupies 10 bytes in the mat file, and matio writes the same number for both.
This can be seen by inspecting the hex dump of the file produced ("test.mat'):
The first bolded byte is specifying the second dimension of the array; the second bolded byte is specifying the number of bytes of data in the element. Both of these originate in dims[1].
By changing that first byte from 0A to 09, R2020a and R2020b both load the file correctly.
The text was updated successfully, but these errors were encountered: