UTF-16 support #34

Closed
bavay opened this issue Apr 6, 2016 · 9 comments

Comments

@bavay

bavay commented Apr 6, 2016

Currently, matio does not support reading UTF-16 strings. Unfortunately, reasonably recent Windows computers use UTF-16, and when MATLAB writes a .mat file, it defaults to UTF-16 for strings containing non-ASCII characters.

This means that with the current version of matio, trying to read a .mat file created by MATLAB on Windows that contains non-ASCII characters (like the '°' for temperatures or some German accented letters) produces lots of error messages ("Character data not supported type") and fails to read the offending strings (which is even more problematic when these strings should contain the units). As I am stuck with some files with such properties (they come from an operational weather forecast toolchain and must be forwarded into another operational toolchain, so I absolutely cannot ask for anything to be changed), my only hope is to have some support implemented in matio.

As I see it, there are a few options:

  • relying on something like POSIX "iconv", but this requires an external dependency and is not fully portable;
  • relying on a piece of C++ that could do it (most of the code examples use C++ because most of the required elements are available in the standard library), but again, this introduces external dependencies since the code would no longer be pure C;
  • I found a contribution to libxml2 that adds UTF-16 -> UTF-8 conversion (see UTF16BEToUTF8() in https://opensource.apple.com/source/libxml2/libxml2-7/libxml2/encoding.c). I was able to extract this function and adapt it to matio so that it now successfully reads my .mat files.

Of course, you might have other ideas or prefer some other options!
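
For reference, a dependency-free conversion in plain C is fairly small. The following is only a minimal sketch of the idea, not the libxml2 function and not matio code: the name utf16_to_utf8 and its signature are hypothetical, and the input is assumed to be UTF-16 code units already in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of matio or libxml2): convert UTF-16 code
 * units, already in host byte order, to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if the output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t n_units,
                     unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    if (out_cap == 0)
        return (size_t)-1;

    while (i < n_units) {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF && i < n_units &&
            in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            /* Valid surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            /* Unpaired surrogate: substitute U+FFFD. */
            cp = 0xFFFD;
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need + 1 > out_cap) /* keep room for the trailing NUL */
            return (size_t)-1;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

With this sketch, the two code units 0x00B0 0x0043 ('°C') become the three bytes 0xC2 0xB0 0x43. A buffer of 3 * n_units + 1 bytes is always large enough, since one code unit never expands to more than three UTF-8 bytes.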

@tbeu
Owner

tbeu commented Apr 6, 2016

Can you please attach a MAT-file, created by MATLAB, with a UTF-16 char array? Thanks.

@bavay
Author

bavay commented Apr 6, 2016

I had to zip it in order to attach it...
test-UTF16.zip

@tbeu
Owner

tbeu commented Apr 6, 2016

Indeed, both functions ReadCharData and ReadCompressedCharData are missing the case for MAT_T_UTF16.

Can you check whether 12b7d40 already solves the problem?

@bavay
Author

bavay commented Apr 6, 2016

This is halfway there... The string is now read without any error message, but the non-ASCII characters are still garbled (and I have no clue what they are encoded as).
For example, for the units: '�C'

@tbeu
Owner

tbeu commented Apr 6, 2016

It should be little-endian Unicode on Windows (which is little-endian). Thus, it should be exactly what you want.

From matfile_format.pdf:

The UTF-16 and UTF-32 encodings are in the byte order specified by the
Endian Indicator. UTF-8 is byte order neutral.
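
As an illustration of that rule, here is a minimal sketch of assembling 16-bit code units from the raw bytes in a stated byte order. The function name and the little_endian flag are hypothetical: the flag stands in for whatever the reader derived from the file's Endian Indicator, and this is not how matio necessarily implements it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not matio API): assemble UTF-16 code units from raw
 * bytes in the given byte order. `little_endian` is an assumed flag derived
 * from the MAT-file's Endian Indicator. A trailing odd byte is ignored.
 * Returns the number of code units written to `units`. */
size_t utf16_bytes_to_units(const unsigned char *bytes, size_t n_bytes,
                            int little_endian, uint16_t *units)
{
    size_t n = n_bytes / 2;

    for (size_t k = 0; k < n; k++) {
        /* Pick the low and high byte according to the byte order. */
        unsigned lo = bytes[2 * k + (little_endian ? 0 : 1)];
        unsigned hi = bytes[2 * k + (little_endian ? 1 : 0)];
        units[k] = (uint16_t)((hi << 8) | lo);
    }
    return n;
}
```

Because the bytes are combined explicitly, the result does not depend on the byte order of the machine doing the reading. For a little-endian file, the bytes B0 00 43 00 yield the code units 0x00B0 and 0x0043.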

@tbeu
Owner

tbeu commented Apr 6, 2016

If you apply e0d8f44 to matdump, then matdump -d test-UTF16.mat stat.dunit > dunit.txt correctly shows °C in Latin-1 encoding in dunit.txt. It's really strange.

Or try test_mat readvar test-UTF16.mat stat > stat.txt with same result for the dunit field.

@bavay
Author

bavay commented Apr 6, 2016

OK, now this is clear... I'm sorry, my terminal was configured for UTF-8, so it was garbling the Latin-1 output. So your commit does work: it properly reads UTF-16 and outputs ISO-8859-1. Thanks a lot for your very quick reply and commit!
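
To make the terminal confusion concrete: in ISO-8859-1, '°' is the single byte 0xB0, which on its own is not a valid UTF-8 sequence, so a UTF-8 terminal shows a replacement character. A caller who wants UTF-8 output could re-encode the Latin-1 bytes; the following is a minimal sketch with a hypothetical function name, not something matio provides.

```c
#include <stddef.h>

/* Hypothetical helper (not matio API): re-encode ISO-8859-1 bytes as a
 * NUL-terminated UTF-8 string. Every Latin-1 byte maps to the Unicode code
 * point of the same value, so bytes >= 0x80 become two UTF-8 bytes.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, size_t out_cap)
{
    size_t o = 0;

    if (out_cap == 0)
        return (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        size_t need = in[i] < 0x80 ? 1 : 2;
        if (o + need + 1 > out_cap) /* keep room for the trailing NUL */
            return (size_t)-1;

        if (need == 1) {
            out[o++] = in[i];
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

Here the Latin-1 bytes 0xB0 0x43 become 0xC2 0xB0 0x43, which a UTF-8 terminal renders as '°C'.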

@tbeu
Owner

tbeu commented Apr 6, 2016

What is still strange, though, is that the UTF-16 string is not Unicode-encoded but ISO-8859-1. MATLAB is simply not consistent on this; see e.g. http://blog.omega-prime.co.uk/?p=150.

@tbeu
Owner

tbeu commented Apr 6, 2016

I'd like to see a MAT-file where the UTF-16 character array does not reduce to an 8-bit encoding.

tbeu added a commit that referenced this issue Apr 11, 2016
tbeu added a commit that referenced this issue Apr 11, 2016
tbeu added a commit that referenced this issue Apr 11, 2016
@tbeu tbeu closed this as completed Apr 11, 2016
papadop pushed a commit to papadop/matio that referenced this issue Nov 29, 2017
papadop pushed a commit to papadop/matio that referenced this issue Nov 29, 2017
papadop pushed a commit to papadop/matio that referenced this issue Nov 29, 2017