New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootloader: resolve unicode issues in temporary path. #2767

Closed
wants to merge 2 commits into
base: develop
from

Conversation

Projects
None yet
3 participants
@ghost

ghost commented Aug 23, 2017

No description provided.

xoviat added some commits Aug 23, 2017

xoviat
bootloader: use short path to avoid unicode issues
The temporary path was converted to ANSI, which cannot be correctly handled with international names. Using the short path avoids these issues.

Closes #2754.
xoviat
bootloader: slight optimization
I was unsure originally whether wchar_buffer could be the source and target, but apparently it can.
@htgoebel

This comment has been minimized.

Member

htgoebel commented Aug 23, 2017

many thanks for this pull-request. DO you have any idea how we can test this in CI?

@htgoebel htgoebel added the bootloader label Aug 23, 2017

@htgoebel htgoebel added this to the PyInstaller 3.3 milestone Aug 23, 2017

@ghost

This comment has been minimized.

ghost commented Aug 23, 2017

I assume we can use a unicode path. There might be other issues that this was masking. I'm not going to do everyone's work for them.

@codewarrior0

This comment has been minimized.

Member

codewarrior0 commented Aug 24, 2017

Wrong fix for #2754. Short path names should only be used for Python 2, since Python 3 offers the option of encoding paths as UTF-16 instead of ANSI. Should instead look at why pyi_win32_utf8_to_mbs_sfn didn't return c:\Users\TET~1\AppData\Local\Temp.

@ghost

This comment has been minimized.

ghost commented Aug 24, 2017

The fix works though. I'm not going to spend my time debugging pyi_win32_utf8_to_mbs_sfn.

@ghost

This comment has been minimized.

ghost commented Aug 24, 2017

Either this is going in 3.3 or you are going to submit a PR or it won't be fixed.

@htgoebel htgoebel removed this from the PyInstaller 3.3 milestone Aug 24, 2017

@htgoebel

This comment has been minimized.

Member

htgoebel commented Aug 24, 2017

I will not depend 3.3 on this one.

@ghost ghost closed this Aug 24, 2017

@ghost ghost deleted the bootloader branch Aug 24, 2017

@ghost

This comment has been minimized.

ghost commented Aug 24, 2017

Probably for the best as it will push more people to use Python ,3.

@brian-a-clark

This comment has been minimized.

brian-a-clark commented Aug 28, 2017

I appreciate the effort here, but I tried your fix myself and it doesn't work for my particular unicode character. While I'm sure there are other unicode characters for which it doesn't work as well, my issue was, strangely, not a general unicode bug.

It could be a windows bug, I suppose. It seems if I do a GetTempPathW() for a user with the name 'tést', for example, the existing code works fine, as the OS returns "C:\Users\TST~1...". Doing the same for a user with the name 'teşt' yields "C:\Users\teşt...". When I explicitly ask for GetShortPathNameW() with "C:\Users\teşt", I just get the same string back.

'dir /x' shows the short path as c:\Users\TET~1 as expected.

I'll keep investigating and let you know if I figure anything out.

@codewarrior0

This comment has been minimized.

Member

codewarrior0 commented Aug 28, 2017

It's a Windows feature. In your Region Control Panel, you have an option for "Language for Non-Unicode Programs" - this selects which ANSI Code Page is used for non-Unicode Windows APIs and filenames, which means any characters in your codepage are allowed to exist in short path names. It looks like your codepage contains the S-with-cedilla but not the E-with-acute-accent. (This setting also controls which code page is used when using the mbcs text encoding in Python.)

@brian-a-clark

This comment has been minimized.

brian-a-clark commented Aug 28, 2017

It's set to English (United States). I have difficulty believe that that ansi code page would have 'ş' in it at all. This character is used in very few languages (see [https://en.wikipedia.org/wiki/%C5%9E])

@codewarrior0

This comment has been minimized.

Member

codewarrior0 commented Aug 28, 2017

How bizarre.

I did some testing and it looks like I was either mistaken or misinformed. I created folders with filenames less than 6 characters long containing each character from U+0080 to U+0200 along with the character's codepoint and checked which ones return shortened paths from GetShortPathName and which ones are identical.

All code points up to U+00FF return a shortened path. (Note that the English code page, Latin-1, is identical to the first 256 Unicode code points.) Less than half of the code points from U+0100 to U+0200 returned a shortened path. Here is the list of code points that did return a shortened path. S-with-cedilla (U+015E) is not among them.

['0x110', '0x132', '0x133', '0x138', '0x13f', '0x140', '0x149', '0x14a', '0x14b', '0x152', '0x153', '0x160', '0x161', '0x178', '0x17d', '0x17e', '0x17f', '0x181', '0x182', '0x183', '0x184', '0x185', '0x186', '0x187', '0x188', '0x189', '0x18a', '0x18b', '0x18c', '0x18d', '0x18e', '0x18f', '0x190', '0x191', '0x192', '0x193', '0x194', '0x195', '0x196', '0x198', '0x199', '0x19b', '0x19c', '0x19d', '0x19e', '0x1a2', '0x1a3', '0x1a4', '0x1a5', '0x1a6', '0x1a7', '0x1a8', '0x1a9', '0x1aa', '0x1ac', '0x1ad', '0x1b1', '0x1b2', '0x1b3', '0x1b4', '0x1b5', '0x1b7', '0x1b8', '0x1b9', '0x1ba', '0x1bb', '0x1bc', '0x1bd', '0x1be', '0x1bf', '0x1c0', '0x1c1', '0x1c2', '0x1c4', '0x1c5', '0x1c6', '0x1c7', '0x1c8', '0x1c9', '0x1ca', '0x1cb', '0x1cc', '0x1dd', '0x1e0', '0x1e1', '0x1e2', '0x1e3', '0x1ee', '0x1ef', '0x1f1', '0x1f2', '0x1f3', '0x1f4', '0x1f5', '0x1f6', '0x1f7', '0x1f8', '0x1f9', '0x1fa', '0x1fb', '0x1fc', '0x1fd', '0x1fe', '0x1ff']

I'm not sure what the pattern is, here.

@brian-a-clark

This comment has been minimized.

brian-a-clark commented Aug 28, 2017

I believe what you are talking about would be responsible for the portion of the issue where 'c:\Users\teşt' is turned into 'c:\Users\test' (i.e., because the 'ş' is not present on the ansi code page). But that's already after the point where the shortname is incorrectly generated, which I have to imagine would be unrelated to the codepage (but maybe not).

@codewarrior0

This comment has been minimized.

Member

codewarrior0 commented Aug 28, 2017

Actually, I'm referring to how GetShortPathNameW('teşt') returns teşt, but GetShortPathNameW('tést') returns TST~1.

@brian-a-clark

This comment has been minimized.

brian-a-clark commented Aug 28, 2017

yeah, sorry, my last comment was intended to come before your comment preceding it. I seem to be seeing the comments on a delay.

@ghost

This comment has been minimized.

ghost commented Aug 28, 2017

After some investigating, it seems that short paths are not guaranteed to be ansi-only or even to exist. However, this may be more likely to work:

LPSTR *get_c_path(wchar *cpy) {
    wlength = GetShortPathNameW(cpy,0,0);
    LPWSTR shortp = (LPWSTR)calloc(wlength,sizeof(WCHAR));
    GetShortPathNameW(cpy,shortp,wlength);
    clength = WideCharToMultiByte(CP_OEMCP, WC_NO_BEST_FIT_CHARS, shortp, wlength, 0, 0, 0, 0);
    LPSTR cpath = (LPSTR)calloc(clength,sizeof(CHAR));
    WideCharToMultiByte(CP_OEMCP, WC_NO_BEST_FIT_CHARS, shortp, wlength, cpath, clength, 0, 0);

    return &cpath;
}

Note: I have not checked this.

@codewarrior0

This comment has been minimized.

Member

codewarrior0 commented Aug 28, 2017

That's exactly what is done by the function you declined to debug.

@ghost

This comment has been minimized.

ghost commented Aug 28, 2017

Well then we may need to label #2754 as wontfix unless anyone else has other ideas?

@codewarrior0

This comment has been minimized.

Member

codewarrior0 commented Aug 28, 2017

That would be fair, since a normal Python 2.7 installation won't even start if it is installed inside a so named folder. I had hoped that GetShortPathNameW would at least allow PyI apps to start in those folders, but even that doesn't seem to work in all cases (for mysterious reasons).

The absolutely infurating part about this is being unable to write a unit-test, since the subprocess module on Python 2.7 only uses the ANSI Windows APIs, so there's no way to set the path to the executable or any of the environment variables to a string that can't be encoded in the current codepage, which is exactly what I want to test.

@htgoebel

This comment has been minimized.

Member

htgoebel commented Aug 31, 2017

Thanks a lot for all this debugging. Very obscure issue :-\

So we close this pull-request and #2754. Can one of you please at a note to #2754 which I can use as a "known limitation" text for the change-log, describing the problem briefly. I'm heading the release 3.3 then.

This issue was closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment