filename encoding issues #17
Comments
I think that fixing this requires the use of Windows' wide character filename APIs. du uses opendir/readdir/closedir to handle directory traversal. mingw has non-standard wide character versions of these that could be used. Once du has a filename it passes it to stat. busybox-w32 has its own stat replacement (stolen from git) which uses GetFileAttributesExA. That would have to be replaced with GetFileAttributesExW. I suspect that supporting wide character filenames throughout busybox-w32 would be a significant amount of work. |
Wide char APIs will be great, but perhaps there is a simpler way? Although it is not unicode, but all national characters should be supported. |
I've made a couple of changes that should improve support for non-ASCII characters in ls and vi. |
Thank you, it's better now: there are no question marks. E.g. for file named Update |
I know almost nothing about Windows codepages but I thought there might be problems. I'd like to get this sorted, though: cyrillic support is important to us at the company I work for. I need to investigate further. |
There are several ways:
|
I've attempted to intercept I/O operations to the console and convert characters. I definitely haven't caught all I/O and nor have I tested it very thoroughly, but the latest binary might show an improvement. |
I have not checked all applets, but I will when you tell me that they are ready. |
ec386ad But text output should not be converted. |
This is also seen in the "find" applet when walking across a search - this is more of a systemic issue across all file name access by busybox applets. Also, is issue #5 related to this? |
I wonder why Edit: hardly a consolation, but some rationale actually exists. |
A patch adding |
I would like to try working on adding full unicode support. I've done this in other applications, with the approach mainly being:
I've done most of the above elsewhere, including wrapping In this project, however, I found it harder to identify the boundary between "internal" and "OS access", especially when it comes to applet's So basically, I'd appreciate some help on which Comments? |
For reference, @dscho maintained a fork of this repo with unicode support for git-for-windows - https://github.com/git-for-windows/busybox-w32 . Last I tested it (not recently) I couldn't find issues related to unicode - everything seemed to work, but I didn't perform intensive tests. If someone wants to try it out, this download for instance includes It seems to be (last) rebased on top of 096aee2 (2017-12), and the fork hasn't been updated since 2018-01-30. This seems to be the diff applied on top of upstream at the time: https://github.com/git-for-windows/busybox-w32/compare/096aee2b..0b3cdd76 The differences seem mainly:
Overall, the diff is big but not huge. I assume that since then some of it might now conflict and/or be more similar than it was in the past. Maybe some combined effort can be made to start merging parts of it back upstream (i.e. here)? @dscho are you still interested in busybox-w32 for GFW? It seems new releases do include it, so I'd assume you'd prefer to be closer to upstream than your current fork? If yes, and if @rmyorston is interested as well, would you be willing to coordinate sub-patches which are acceptable upstream? |
I am interested, but short of time. For that reason, the Git for Windows fork is seriously out of date. Last time I checked, the rebased version failed to build in the very early steps already. The latest state should be in some branch in my personal fork (can't check right now, I'm on my phone). |
Sufficiently interested to have kept this issue open for nearly seven years as a reminder. Make of that what you will. Observations:
|
Yup, figured that out already.
Yup, though it does use console API to read input with that EURO thingy. This will be handled later, first priority IMO will be to get non-interactive scripts working fully. So far I got to/from UTF-8 at the process boundary (argv/env on entry, cmd/argv/env on spawn) - which already allows a lot of things to work very nicely. However, general unicode paths (stats, fopen, cd, readdir, etc) are still not implemented.
Hmm.. I indeed thought of just making it a replacement, but I can do build time config as well. Stay tuned. |
Are you sure? I just tried busybox 1.27.2 on Ubuntu, and I could paste unicode strings, and moving the cursor seemed to progress correctly (over codepoints rather than individual bytes). EDIT - I also just tried it with this file https://salsa.debian.org/printing-team/cups/raw/debian/master/cups/utf8demo.txt and it seemed to render and move the cursor mostly correctly - though not perfectly. |
Yes, I'm sure. I fired up an Ubuntu 18.04 virtual machine to check, but no, it's just as borken as on Fedora. The utf8demo.txt file has no trailing white space but on lines with multi-byte characters it's possible to move the cursor beyond the visible characters. Or position the cursor on a multi-byte character and use 'a' to append text after it: the character is split. |
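The character splitting described above is what happens when a cursor moves by bytes instead of by characters. As a minimal sketch (not the busybox vi code; the names `utf8_next` and `utf8_prev` are hypothetical and the input is assumed to be well-formed UTF-8), cursor movement can skip continuation bytes so that it always lands on the first byte of a character:

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx. */
static int is_utf8_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Return the offset of the first byte of the character that follows the
 * one starting at pos. Stops at len (the end of the buffer). */
size_t utf8_next(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    if (pos < len)
        pos++;
    while (pos < len && is_utf8_continuation((unsigned char)s[pos]))
        pos++;
    return pos;
}

/* Return the offset of the first byte of the character that precedes pos.
 * Stops at offset 0. */
size_t utf8_prev(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    if (pos > 0)
        pos--;
    while (pos > 0 && is_utf8_continuation((unsigned char)s[pos]))
        pos--;
    return pos;
}
```

With the 4-byte string "a", "è" (0xC3 0xA8), "z", stepping forward from offset 1 goes to offset 3 rather than into the middle of "è", so an append after the character inserts at a character boundary.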
Just sharing some information for now. While working on UTF-8 wrappers for Windows APIs, I found out that starting with Windows 10 1903 (May 2019), an application can specify in its manifest that it wants the active codepage to be UTF8. This means that the ANSI APIs (like Note that currently it affects the value from GetACP(), but it doesn't change the console input/output codepages. See https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page I've experimented with this, and from limited testing it does make busybox able to at least list file names with unicode paths with correct sizes etc. The printout still depends on the GetConsoleOutputCP() value, but if that is set to UTF8 as well then the files do print correctly on screen (I also did write code to use the W console output APIs when not using the manifest - seems to work). When these two (manifest for ACP and console output codepage) are set and in effect, the only remaining issue appears to be console input - which should be relatively simple to solve with the W APIs. I did try to set Console(Input)CP to UTF8, but this seems to make busybox not echo keypresses. I didn't try to dig deeper for now. Additionally, the Windows settings for "non unicode programs" seem to affect only console input and output CP (ACP remains the same - seems to have the same effect as when using Now, this manifest, when applicable, seems to solve all the hard problems for busybox, but it's not compatible with windows XP (busybox refuses to run), and seems to have no effect (as expected) on win7. I'm assuming busybox does want to keep supporting xp and win7/8 and 10 earlier than 1903, so for those to support unicode, wrappers are required, but it's useful to keep in mind that with recent windows the hard part of the wrappers is not required. |
@avih, thanks for the update. I think it's important that the standard binary of busybox-w32 should Just Work on any version of Windows and without any configuration by the user. |
Sure, except that it already doesn't when it comes to unicode support. So while it would be great to have unicode support on earlier windows versions (and that is still my intention, and how I started this work - by successfully converting argv/env on entry and spawn), I now think we should incrementally enhance busybox as follows:
So I'm trying to specify the wrapping/emulation control parameters:
If the stars align, e.g. when strings are UTF8 and ConsoleOutput is UTF8 and native console VT works, then it basically bypasses all of the output emulation/wrapping completely. Etc. |
It would also be nice if it detected the environment where it was launched and reacted accordingly to produce correct output.
|
The only relevant environment is the system code page and the code pages of the windows-console in which busybox runs, which should be detected and respected.
I don't quite understand how this is relevant. If you suggest that busybox should detect automatically the encoding of an output from some program (e.g. if you run inside busybox busybox has However, if you only want to run inside busybox |
(Also, I did not drop my intention to add full unicode support to busybox-w32, but i'm sidetracked by unrelated things since about my last comment on this subject. I do want to get back to it at some stage, and all my preliminary testing suggested it's entirely possible). |
For reference, that's the code I got for complete environment support (arguments, variables). It may or may not apply cleanly to current busybox master. I didn't try. diff --git a/Config.in b/Config.in
index d18f3dac5..adc21d86a 100644
--- a/Config.in
+++ b/Config.in
@@ -467,6 +467,16 @@ config FEATURE_EURO
requires the OEM code page to be 858. If the OEM code page of
the console is 850 when BusyBox starts it's changed to 858.
+config FEATURE_MINGW_UTF8
+ bool "Support unicode using internal UTF-8 translation (WIP)"
+ default y
+ depends on PLATFORM_MINGW32
+ help
+ Unicode arguments and environment variables are translated to UTF-8
+ on entry, and back to Windows-Unicode when executing a program.
+ For best results, [shell] scripts with non-English content should
+ be encoded in UTF-8. Overrides FEATURE_EURO where applicable.
+
config FEATURE_EXTRA_FILE_DATA
bool "Read additional file metadata (2.1 kb)"
default y
diff --git a/configs/mingw32_defconfig b/configs/mingw32_defconfig
index d1dec129f..aa68fd33f 100644
--- a/configs/mingw32_defconfig
+++ b/configs/mingw32_defconfig
@@ -53,6 +53,7 @@ CONFIG_FEATURE_ICON=y
# CONFIG_FEATURE_ICON_STERM is not set
CONFIG_FEATURE_ICON_ALL=y
CONFIG_FEATURE_EURO=y
+CONFIG_FEATURE_MINGW_UTF8=y
CONFIG_FEATURE_EXTRA_FILE_DATA=y
CONFIG_FEATURE_READLINK2=y
diff --git a/configs/mingw64_defconfig b/configs/mingw64_defconfig
index bdaafaa86..5dc257188 100644
--- a/configs/mingw64_defconfig
+++ b/configs/mingw64_defconfig
@@ -53,6 +53,7 @@ CONFIG_FEATURE_ICON=y
# CONFIG_FEATURE_ICON_STERM is not set
CONFIG_FEATURE_ICON_ALL=y
CONFIG_FEATURE_EURO=y
+CONFIG_FEATURE_MINGW_UTF8=y
CONFIG_FEATURE_EXTRA_FILE_DATA=y
CONFIG_FEATURE_READLINK2=y
diff --git a/include/mingw.h b/include/mingw.h
index a67b161c7..70f4567e0 100644
--- a/include/mingw.h
+++ b/include/mingw.h
@@ -348,6 +348,22 @@ int mingw_fstat(int fd, struct mingw_stat *buf);
#define stat mingw_stat
#define fstat mingw_fstat
+#if ENABLE_FEATURE_MINGW_UTF8
+// By convention foo_U is the windows API fooW/_wfoo but with UTF-8 interface,
+// and mu_bar is a utility
+
+// allocate a null-terminated conversion-result for null-terminated input.
+char *mu_utf8(const wchar_t *ws);
+wchar_t *mu_wide(const char *u8);
+
+// allocate a null-terminated array of null-terminated converted strings.
+// if maxn < 0: up to a null input string, else up to maxn or a null input string
+char **mu_utf8_vec(wchar_t *const *wvec, int maxn);
+wchar_t **mu_wide_vec(char *const *uvec, int maxn);
+
+#endif // ENABLE_FEATURE_MINGW_UTF8
+
+
/*
* sys/sysmacros.h
*/
diff --git a/libbb/appletlib.c b/libbb/appletlib.c
index d2f98567e..3bb7e7ba4 100644
--- a/libbb/appletlib.c
+++ b/libbb/appletlib.c
@@ -1190,6 +1190,38 @@ get_script_content(unsigned n UNUSED_PARAM)
#endif /* defined(SINGLE_APPLET_MAIN) */
+#if ENABLE_PLATFORM_MINGW32 && ENABLE_FEATURE_MINGW_UTF8
+static void mu_utf8_set_env(void)
+{
+ wchar_t *envw0 = GetEnvironmentStringsW(), *envw = envw0, *p;
+ char *eu;
+
+ for (; envw && *envw; envw += wcslen(envw) + 1) {
+ for (p = envw; *p && *p < 0x80; p++)
+ /* no-op */;
+ if (!*p)
+ continue; // name and value are ascii-7
+ if (!(eu = mu_utf8(envw)))
+ continue; // nothing to do on error, just skip
+
+ // replace the (OEM) char* entry with UTF-8 of the wchar_t* entry.
+ // If the OEM name bytes-sequence is different than in UTF-8, then
+ // the OEM name will remain and we'll just add a new entry with the
+ // UTF-8 name (and value).
+ // In an OEM string all byte values are valid, and putenv and getenv
+ // allow any (obviously except '\0' and '=') even on winXP, so using it
+ // to hold UTF-8 works fully, and `environ' is UTF-8 from now on.
+ // _wgetenv, however, will not translate our UTF-8 to wchar_t, and
+ // similarly _wputenv will not end with UTF-8 in `environ', so the
+ // wide variants should not be used from now on in this environment.
+
+ putenv(eu);
+ free(eu); // arg got copied (unlike posix spec, but like glibc)
+ }
+
+ FreeEnvironmentStringsW(envw0);
+}
+#endif // ENABLE_PLATFORM_MINGW32 && ENABLE_FEATURE_MINGW_UTF8
#if ENABLE_BUILD_LIBBUSYBOX
int lbb_main(char **argv)
@@ -1244,6 +1276,19 @@ int main(int argc UNUSED_PARAM, char **argv)
}
#endif
#if ENABLE_PLATFORM_MINGW32
+
+#if ENABLE_FEATURE_MINGW_UTF8
+ {
+ int n;
+ wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &n);
+ char **uargv = wargv ? mu_utf8_vec(wargv, n) : 0;
+ if (uargv)
+ argv = uargv; // leaked on exit. FIXME: conflicts with BB_MMU?
+
+ mu_utf8_set_env();
+ }
+#endif // ENABLE_FEATURE_MINGW_UTF8
+
/* detect if we're running an interpreted script */
if (argv[0][1] == ':' && argv[0][2] == '/') {
switch (argv[0][0]) {
diff --git a/shell/ash.c b/shell/ash.c
index d35ae027f..8c3c367a5 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -15036,7 +15036,8 @@ init(void)
bs_to_slash(end+1);
}
- /* check for invalid characters in name */
+ /* check for invalid characters in name. busybox ash is_name
+ * limit name chars to [_[:alnum:]] in ASCII-7 (no UTF-8) */
for (start = *envp;start < end;start++) {
if (!isdigit(*start) && !isalpha(*start) && *start != '_') {
break;
diff --git a/win32/mingw.c b/win32/mingw.c
index faa9f2b57..5a701a039 100644
--- a/win32/mingw.c
+++ b/win32/mingw.c
@@ -165,6 +165,99 @@ int err_win_to_posix(void)
static int zero_fd = -1;
static int rand_fd = -1;
+#if ENABLE_FEATURE_MINGW_UTF8
+// All functions which use mu_utf8_count/mu_wide_count to first check the
+// expected size assume the conversion will work with same or bigger space.
+
+// positive (not 0) on success. result is in destination units and includes the
+// terminating null if it's within the input count range. nws/nu8 can be -1 to
+// indicate that the input is null-terminated (and the result includes it).
+#define mu_utf8_count(ws_src, nws) \
+ WideCharToMultiByte(CP_UTF8, 0, (ws_src), (nws), 0, 0, 0, 0)
+#define mu_wide_count(u8_src, nu8) \
+ MultiByteToWideChar(CP_UTF8, 0, (u8_src), (nu8), 0, 0)
+
+// performs a conversion, trimmed (without null) if dest size is insufficient
+// TODO: if input size is given, does it convert beyond input \0 if size allows?
+#define mu_utf8_raw(ws_src, nws, u8_dst, nu8) \
+ WideCharToMultiByte(CP_UTF8, 0, (ws_src), (nws), (u8_dst), (nu8), 0, 0)
+#define mu_wide_raw(u8_src, nu8, ws_dst, nws) \
+ MultiByteToWideChar(CP_UTF8, 0, (u8_src), (nu8), (ws_dst), (nws))
+
+
+// dies on OOM, returns NULL on other errors.
+char *mu_utf8(const wchar_t *ws)
+{
+ char *u8 = 0;
+ int n = mu_utf8_count(ws, -1);
+ if (n > 0) {
+ u8 = xmalloc(sizeof(char) * n);
+ mu_utf8_raw(ws, -1, u8, n);
+ }
+ return u8;
+}
+
+// dies on OOM, returns NULL on other errors.
+wchar_t *mu_wide(const char *u8)
+{
+ wchar_t *ws = 0;
+ int n = mu_wide_count(u8, -1);
+ if (n > 0) {
+ ws = xmalloc(sizeof(wchar_t) * n);
+ mu_wide_raw(u8, -1, ws, n);
+ }
+ return ws;
+}
+
+// continuous allocation for the pointers (incl. final NULL) and the strings.
+// if maxn >= 0 then up to maxn or NULL - whichever comes first. final NULL is
+// always added at the result array. NULL input vector is the same as empty.
+// dies on OOM, returns NULL if conversion failed otherwise.
+char **mu_utf8_vec(wchar_t *const *wvec, int maxn)
+{
+ size_t usize, n, i;
+ char **uvec, *uarg;
+
+ for (usize = 0, n = 0; wvec && wvec[n] && (maxn < 0 || n < maxn); n++) {
+ int count = mu_utf8_count(wvec[n], -1);
+ if (count <= 0)
+ return NULL;
+ usize += count;
+ }
+
+ uvec = xmalloc((n+1) * sizeof(char *) + usize * sizeof(char));
+ for (i = 0, uarg = (void *)(uvec + n + 1); i < n; i++) {
+ uvec[i] = uarg;
+ uarg += mu_utf8_raw(wvec[i], -1, uarg, usize);
+ }
+ uvec[i] = NULL;
+
+ return uvec;
+}
+
+wchar_t **mu_wide_vec(char *const *uvec, int maxn)
+{
+ size_t wsize, n, i;
+ wchar_t **wvec, *warg;
+
+ for (wsize = 0, n = 0; uvec && uvec[n] && (maxn < 0 || n < maxn); n++) {
+ int count = mu_wide_count(uvec[n], -1);
+ if (count <= 0)
+ return NULL;
+ wsize += count;
+ }
+
+ wvec = xmalloc((n+1) * sizeof(wchar_t *) + wsize * sizeof(wchar_t));
+ for (i = 0, warg = (void *)(wvec + n + 1); i < n; i++) {
+ wvec[i] = warg;
+ warg += mu_wide_raw(uvec[i], -1, warg, wsize);
+ }
+ wvec[i] = NULL;
+
+ return wvec;
+}
+#endif // ENABLE_FEATURE_MINGW_UTF8
+
/*
* Determine if 'filename' corresponds to one of the supported
* device files. Constants for these are defined as an enum
diff --git a/win32/process.c b/win32/process.c
index ac63a9c58..0b5ae4acf 100644
--- a/win32/process.c
+++ b/win32/process.c
@@ -3,6 +3,35 @@
#include <psapi.h>
#include "lazyload.h"
+
+#if ENABLE_FEATURE_MINGW_UTF8
+#ifdef spawnve
+#undef spawnve
+#endif
+#define spawnve spawnve_U
+
+static intptr_t spawnve_U(int mode,
+ const char *cmd, char *const *argv, char *const *env)
+{
+ intptr_t ret = -1;
+ wchar_t *wcmd = mu_wide(cmd),
+ **wargv = mu_wide_vec(argv, -1),
+ **wenv = mu_wide_vec(env, -1);
+
+ if (!wcmd || !wargv || !wenv)
+ errno = EINVAL;
+ else
+ ret = _wspawnve(mode, wcmd,(const wchar_t *const *)wargv,
+ (const wchar_t *const *)wenv);
+
+ free(wenv);
+ free(wargv);
+ free(wcmd);
+
+ return ret;
+}
+#endif // ENABLE_FEATURE_MINGW_UTF8
+
pid_t waitpid(pid_t pid, int *status, int options)
#if ENABLE_TIME
{
I have more WIP code which covers the console IO as well which generally works, and I didn't start working on changing the file APIs to UTF8 (well, I did, but then found out about the manifest "automatic" UTF-8 ANSI APIs, and started experimenting with that instead, and then got busy with other things). EDIT: |
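For readers unfamiliar with the layout that `mu_utf8_set_env` in the patch walks: `GetEnvironmentStringsW` returns consecutive NUL-terminated `NAME=value` entries, with one extra NUL ending the whole block. The following is a portable sketch of the same walk and the same "is this entry pure 7-bit ASCII" check, without any Windows calls. The function name `count_non_ascii_entries` is made up for illustration and is not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Count the entries of a double-NUL-terminated block (L"A=1\0B=2\0" plus the
 * implicit final NUL) that contain at least one character outside 7-bit
 * ASCII. These are the entries the patch would convert and re-add as UTF-8. */
size_t count_non_ascii_entries(const wchar_t *block)
{
    size_t count = 0;
    const wchar_t *entry, *p;

    assert(block != NULL);
    /* Each entry ends with a NUL; an empty entry marks the end of the block. */
    for (entry = block; *entry; entry += wcslen(entry) + 1) {
        for (p = entry; *p && *p < 0x80; p++)
            ; /* skip over plain ASCII characters */
        if (*p)
            count++; /* stopped early, so a non-ASCII character was found */
    }
    return count;
}
```

Entries that are entirely ASCII have identical bytes in UTF-8 and in any OEM/ANSI code page, which is why the patch can skip them.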
Quick update:
[1] I think the manifest used to also work as an external file (busybox.exe.manifest at the same dir as busybox.exe), but currently I can only make it work if embedding it at the binary, like so:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
<assemblyIdentity type="win32" name="any-name-e-g-busybox-utf8" version="6.0.0.0"/>
<application>
<windowsSettings>
<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
Test cases with some files with unicode names at the current dir, executed from
mt.exe: |
So, few things:
So I wrote my own small C command line utility to handle resources in PE files - perc (pre-built windows binaries are available), which can be used like this to attach a UTF-8 manifest: perc -a utf8.manifest -t MANIFEST -i 1 busybox.exe Back to the main point now. While not impossible to add support, it's been nearly 10 years since this issue was opened, and busybox-w32 still doesn't support Unicode. But adding unicode is not trivial, and I'm guessing that if no one added it so far - while Windows 7 was still supported, it's much less likely to happen now when support for anything earlier than Windows 10 is quickly being dropped. I'm not at all suggesting that busybox-w32 should drop support for earlier windows. On the contrary, I think it's great it supports older versions. So I think the best approach would be to keep supporting earlier Windows users like today, but still add unicode support on Windows 10+ using the UTF-8 manifest file. I'm still using this method to add support to my copy of busybox-w32, and have been doing so for at least a year now, and it still works great as far as I can tell. I think it could be beneficial to add to the release files also a Windows 10+ executable which has such UTF-8 manifest embedded. The natural method would be at build time, though |
Support for previous versions of Windows isn't an issue, I think. If the strings are kept internally as UTF-8, then when an API call is needed they can be kept as UTF-8 or converted to ANSI or UNICODE on demand, depending on the needs of the API and the Windows version. Adding support for UTF-8 or long file names through a manifest is just a shortcut that will cause more issues later. Also, in my opinion, starting to have separate executables for different Windows versions just kills portability and increases complexity. |
The fact that it's both a very useful feature and one which has not yet been implemented suggests that it's not a trivial task (@dscho did it for the mini git busybox-w32 package, but it was not merged back; I started working on it and then left it).
How did you come up with that? That's the least risky way to add unicode support to existing Windows applications, but it only works on Windows 10 or later. All other methods require much more extensive and risky changes to an application.
There are already at least two executables for each release - 32 and 64, and IIRC there were glob choices too, and here I suggest adding a single additional win10-64-unicode binary. Yes, people need to think which version they should use, but if you need unicode support, because you have some files with non-english names, then you would gladly choose a version which supports it in shell scripts - if one was available. It might be useful to also limit this version (also using the manifest) to Windows 10 or later, so that it refuses to run on win7/8 instead of someone incorrectly assuming that if it runs then it also supports unicode (it refuses to run on XP just by using the UTF8 manifest, but on 7/8 it will run, just without unicode support). |
The 32-bit version of BusyBox (that I use) also works perfectly on 64-bit Windows, so it is universal from 32-bit Win XP to 64-bit Win 11. |
Sure, I agree it would be great, but it won't happen by itself, and it hasn't happened in the 10 years since this issue was opened. I don't intend to continue working on it, and there are no other efforts I'm aware of. Do you intend to work on it? Who will do this work? Meanwhile, a non negligible amount of users which use Windows 10 - some could guess even the majority of busybox-w32 users, could have unicode support working with trivial effort and effectively zero risk to the codebase for at least some years now. Why not give them that? |
Commit 830e2cf adds a build-time option (disabled by default) to include the UTF-8 manifest. |
When I use chcp 65001, pressing ctrl+alt+e doesn't insert the euro symbol. Does it work with the manifest? |
@avih |
I don't know, I don't have a Euro symbol on my KB. The main goal of adding the manifest is that shell scripts and utilities can work with unicode file names out of the box, especially when the name was NOT typed manually at the busybox shell prompt:
Additionally, if the console code page is set to UTF-8 (by invoking Keyboard input at the shell prompt (or other utilities, like Do note that Unicode typing doesn't work without the manifest, so even if the manifest doesn't improve it, it's probably not a huge regression, if at all. Specifically for the Euro symbol, busybox-w32 does have some special handling of the Euro symbol, so it's possible that this code doesn't play nicely with the manifest, but I don't know that for a fact. A quick way to test would be to disable the special handling of this symbol in busybox-w32, but I haven't tried that either. It should be possible to modify the KB input code to handle unicode, and I did try that too back when I was trying to add native unicode support, and it did mostly work. However, this will be an actual change at the source code, and no longer "only add the manifest", but it might be worth exploring too.
On windows 7/8 the downside IS that it's ignored, because if someone downloads this version, and it has "unicode" at the name, then even if it also says "Windows10", they might think that if it runs on their system then unicode must work too, while not realizing that unicode will not work. Also, with the manifest it no longer runs on XP. These are the only downsides I can think of, but there could be others which I'm not aware of.
What is "by default"? replacing the w64 binary? Personally I don't think it should replace it. Also, keep in mind that you don't actually need to build a new version to include the manifest. You can also download any older version and attach the manifest to the old binary - and it would become a unicode version. See my previous comments which explain how to do that. (or you can download a unicode version and then remove the manifest yourself, e.g. with
I don't know. It can be described as experimental, and some people might opt to try it anyway. In my copies I've been adding the manifest for more than a year now, and I don't think I've encountered an issue which related to unicode text, or other issues as a result of adding the manifest. |
@avih |
I don't know. Try it out and report back? But generally speaking, I'd imagine that pasting at the busybox shell prompt should be the same as typing. Also, busybox can change the console input locale to UTF8, and this might help, but for this you do need to patch the source code and rebuild. This is probably also worth exploring, as it should be a tiny and trivial change at the source code (possibly only enabled when building with the manifest). |
I have tried with chcp 65001 and you are right, it breaks even typing (è is on the Italian keyboard). |
There's a new prerelease with support for UTF8 (busybox_pre64u.exe).
Many thanks to @avih for working on this. |
Bug:
Edit: I have tried also chcp 852 with the same result. |
In addition, pasting ❤️ with both chcp 858 and chcp 65001 causes 2 characters to be pasted, but backspace can remove only 1 of them. |
It looks and edits correctly only with
That's a limitation of the Windows console. You should try the Windows terminal - https://github.com/microsoft/terminal which is more capable in displaying and managing unicode text. At the windows console you're limited by the console font and other factors. It should still work correctly, e.g. if you paste into busybox-w32 shell You could also try that at the We can't fix that. |
I can't get too excited about things not working in CP 858. There's simply no way to represent emojis or CJK characters there. However, even with CP 65001 in the Windows terminal there's an issue with Red Heart. Apparently it consists of U+2764 Heavy Black Heart and U+FE0F Variation Selector-16. The line editing doesn't quite understand that. |
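To make the Red Heart case concrete: in UTF-8, U+2764 is the bytes E2 9D A4 and U+FE0F is EF B8 8F, so the single visible glyph is 6 bytes and 2 code points. Below is a rough sketch of how an editor could count "editing units" by not counting variation selectors separately. This is an illustration only, not what busybox's line editing does; the function names are hypothetical, the input is assumed to be well-formed UTF-8, and real grapheme segmentation has many more rules than this.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in well-formed UTF-8 by counting the bytes that are
 * not continuation bytes (10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Return 1 if the bytes at s encode a variation selector U+FE00..U+FE0F,
 * i.e. EF B8 80 .. EF B8 8F, else 0. Safe on NUL-terminated input because
 * the comparisons short-circuit at the terminator. */
int utf8_is_variation_selector(const char *s)
{
    const unsigned char *u = (const unsigned char *)s;

    assert(s != NULL);
    return u[0] == 0xEF && u[1] == 0xB8 && u[2] >= 0x80 && u[2] <= 0x8F;
}

/* Count code points, but treat a variation selector as part of the
 * preceding character instead of as a character of its own. */
size_t utf8_editing_units(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            continue; /* continuation byte */
        if (utf8_is_variation_selector(s))
            continue; /* modifies the previous character */
        n++;
    }
    return n;
}
```

For the Red Heart sequence, `utf8_codepoints` gives 2 while `utf8_editing_units` gives 1, which is the count a single backspace would need to work with.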
@avih I don't use Terminal for the simple fact that it isn't bundled with Windows (my scripts are portable and simple to use so they shouldn't require anything that isn't just a double click with the mouse). Edit: cmd.exe has the same problem with ❤️ divided in 2 characters but it counts them correctly so backspace works fine. |
Partly correct. The input is currently console-cp agnostic, but the output still depends on the console CP. We could use the W APIs to write (wherever The console itself (e.g. cmd.exe shell) supports typing, display any chars regardless of the console CP.
Right, that's a limitation of the (upstream) busybox unicode support. Also, as mentioned at the main unicode commit, the windows terminal doesn't display combining chars well, and so editing is also hurt. Unicode editing needs the stars to align between the program (busybox-w32) and the terminal/console which displays it. They can't coordinate what can and can't be displayed correctly, so the app hopes the stuff it prints is displayed correctly. If either the app or the console mess up something, things break. Also, the busybox (upstream) unicode support is incomplete, especially when it comes to combining chars and editing, and when it comes to emojis and other codepoints above U+FFFF. So this one is broken both at the busybox side and the terminal side.
That's fine, but you should be aware that the console is limited. busybox-w32 does the right thing as much as it can, but it happens to work more correctly on the windows terminal, and less correctly on the windows console.
Sure. That would be great, but we only enable the upstream busybox unicode support, we didn't rewrite the editing system, and we don't intend to do that. So we inherit whatever upstream supports or doesn't support, and we're also limited by the terminal/console itself. Both of those are beyond the control of busybox-w32. |
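As background on why code points above U+FFFF are a separate case on Windows: `wchar_t` is 16 bits there, so such code points occupy two UTF-16 units (a surrogate pair), and anything that assumes one unit per character miscounts them. A small portable sketch of the standard UTF-16 encoding rule follows; the function name `utf16_encode` is hypothetical and this is not code from busybox:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Returns the number of 16-bit
 * units written to out (1 or 2), or 0 if cp is not a valid scalar value
 * (above U+10FFFF, or inside the surrogate range U+D800..U+DFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp; /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000; /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+2764 encodes as the single unit 0x2764, while U+1F600 encodes as the pair 0xD83D 0xDE00.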
Once we get to the release notes, I think the above should be mentioned, as well as:
Regardless of the release notes, it's worth keeping in mind that all of the unicode-related work so far (as well as the notes above) would stay relevant also if we move to proper unicode (W APIs) without the manifest, with the common paradigm of "W APIs at the edges, UTF8 everywhere else". I'm guessing that would mostly need to use the W versions of |
Just a quote for a possible UTF-8 support on Windows 7 in the future:
|
@dscho the latest prerelease ( Do you have any setup to test the suitability of this version for git-for-windows? Would you be able to test it? If the suitability includes mapping If that's still not enough, and the suitability also includes mapping (these commits are rebased occasionally, typically without changes, but should be always accessible near the top of this branch as long as I keep them (no intent to drop it) https://github.com/avih/busybox-w32/commits/avih ) We might still add generic unicode support similar to your fork at some later stage, but for now we only support unicode via the UTF8 manifest which only has an effect on win 10+. |
Not really. I do have something that could be adapted to such a setup: https://github.com/dscho/git/blob/busybox/.github/workflows/main.yml (this is the workflow I used to try to identify the spots where using BusyBox to run Git's test suite should be faster than MSYS2's Bash but simply isn't, sometimes it's even slower).
I am woefully short on time, and shifted my focus away from BusyBox, so: unfortunately, I won't be able to test it. |
Thanks. I'm assuming you're still interested in also providing a busybox variant of git-for-windows if it's not too much effort, and that you prefer upstream busybox-w32 over maintaining a downstream fork. If those assumptions are correct, what would be the steps to get there? I guess I could, as a first step, replace the |
This is difficult to answer. I am obviously interested in less maintenance burden, at the same time I cannot rely on a project that does not even enable regular CI builds.
The best idea would be to adapt (read: cherry-pick) the Personally, I would do this in two steps:
|
du skips files having foreign chars in the filename: