Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filename encoding issues #17

Open
eadmaster opened this issue Oct 21, 2013 · 52 comments
Open

filename encoding issues #17

eadmaster opened this issue Oct 21, 2013 · 52 comments

Comments

@eadmaster
Copy link

du skips files having foreign chars in the filename:

> du -a TouhouKoumajouDensetsuScarletSymphony > stdout.txt
du: TouhouKoumajouDensetsuScarletSymphony/?????1.00.pdf: No such file or directory
du: TouhouKoumajouDensetsuScarletSymphony/????? ??????.exe: No such file or directory
@rmyorston
Copy link
Owner

I think that fixing this requires the use of Windows' wide character filename APIs.

du uses opendir/readdir/closedir to handle directory traversal. mingw has non-standard wide character versions of these that could be used. Once du has a filename is passes it to stat. busybox-w32 has it's own stat replacement (stolen from git) which uses GetFileAttributesExA. That would have to be replaced with GetFileAttributesExW.

I suspect that supporting wide character filenames throughout busybox-w32 would be a significant amount of work.

@johnd0e
Copy link

johnd0e commented Mar 17, 2014

Wide char APIs will be great, but perhaps there is a simpler way?
Most old not unicode-aware utils are still able to deal properly with characters of OEM codepage.

Although it is not unicode, but all national characters should be supported.

@rmyorston
Copy link
Owner

I've made a couple of changes that should improve support for non-ASCII characters in ls and vi.

@johnd0e
Copy link

johnd0e commented Mar 18, 2014

Thank you, it's better now: there are no question marks.
But names are listed in wrong codepage: console codepage is OEM (866 in case of cyrillic), but ls applet output is ANSI (1251 in case of cyrillic).

E.g. for file named кириллица.txt ls output is ъшЁшыышЎр.txt

Update
Just cheched vi applet, text encoding is fine
But filename encoding (in status bar) is wrong

@rmyorston
Copy link
Owner

I know almost nothing about Windows codepages but I thought there might be problems. I'd like to get this sorted, though: cyrillic support is important to us at the company I work for.

I need to investigate further.

@johnd0e
Copy link

johnd0e commented Mar 18, 2014

There are several ways:

@rmyorston
Copy link
Owner

I've attempted to intercept I/O operations to the console and convert characters. I definitely haven't caught all I/O and nor have I tested it very thoroughly, but the latest binary might show an improvement.

@johnd0e
Copy link

johnd0e commented Mar 19, 2014

  • ls seems OK now
  • vi and grep show filenames correctly, but IMO you should not touch text of edited file. E.g. compare the behavior of cmd.exe commands edit, type. I think it is not good approach: to intercept all I/O to the console.
  • I've checked some other applets, such as cat, grep and see that console output is same as before, and I think it's right: we need to convert only system messages and filenames.
  • Many applets still need to be fixed, e.g. dirname.

I have not checked all applets, but I will when you tell me that they are ready.

@johnd0e
Copy link

johnd0e commented Mar 19, 2014

ec386ad
Previously there were no problems if specify cyrillic filename in grep cmdline.
Now it is broken

Just rechecked, and all seems OK with filenames, sorry.

But text output should not be converted.
E.g. you can try grep not specifying FILE, so it read data from console input: busybox.exe grep -e .*
When typing cyrillic - you'll see wrong output, overconverted.

@dgtlrift
Copy link

This is also seen in the "find" applet when walking across a search - this is more of a systemic issue across all file name access by busybox applets.

Also, is issue #5 related to this?

@sergeevabc
Copy link

sergeevabc commented Mar 9, 2020

I wonder why -b is stripped in this version of du?
du -b file.txt is way shorter than stat -c "%s %n" file.txt

Edit: hardly a consolation, but some rationale actually exists.

@rmyorston
Copy link
Owner

A patch adding -b was submitted to the upstream mailing list last month.

@avih
Copy link
Contributor

avih commented Aug 1, 2020

I would like to try working on adding full unicode support. I've done this in other applications, with the approach mainly being:

  • All internal handling assume UTF-8 (plain char *) - this requires the least modifications all around compared to *nix implementations.
  • All OS APIs which take or return strings are wrappped to expose a UTF-8 interface but internally use wchar_t with the W Windows API variants, performing conversion as required.
  • The first conversion happens at main - converting argv and possibly envp to UTF-8 (environ should be considered as well).
  • Functions which read/write an fd need to identify whether or not the fd is a console (I believe busybox-w32 already does that), and on such case internally convert and use W console API. If fd is not a console then don't perform any conversion - this means external (files, pipes) text content is considered and assumed UTF-8.
  • When executing an external application argv (which is UTF-8 at the calling code) should be converted to wchar_t (I've not done this one before).

I've done most of the above elsewhere, including wrapping main, printf, fopen, opendir (and DIRENT), etc.

In this project, however, I found it harder to identify the boundary between "internal" and "OS access", especially when it comes to applet's main - if it's executed as a normal program then it would have a windows Unicode arguments vector accessible via CommandLineToArgvW(GetCommandLineW(), &n) (which should be converted to UTF-8), but if it's "called" from another applet then maybe its argv is already UTF-8 and the Unicode arguments are not available.

So basically, I'd appreciate some help on which main is invoked by windows and has Unicode arguments available, and which don't because they're invoked internally and their arguments are already UTF-8.

Comments?

@avih
Copy link
Contributor

avih commented Aug 1, 2020

For reference, @dscho maintained a fork of this repo with unicode support for git-for-windows - https://github.com/git-for-windows/busybox-w32 . Last I tested it (not recently) I couldn't find issues related to unicode - everything seemed to work, but I didn't perform intensive tests.

If someone wants to try it out, this download for instance includes busybox.exe 32bit from that fork (at mingw32/bin/busybox.exe): https://github.com/git-for-windows/git/releases/download/v2.28.0.windows.1/MinGit-2.28.0-busybox-32-bit.zip

It seems to be (last) rebased on top of 096aee2 (2017-12), and the fork hasn't been updates since 2018-01-30. This seems to be the diff applied on top of upstream at the time: https://github.com/git-for-windows/busybox-w32/compare/096aee2b..0b3cdd76

The differences seem mainly:

  • at libbb/appletlib.c at main: argv and environment conversion to UTF-8.
  • UTF-8 file-name wrappers - fopen, diropen, readdir, chdir, etc.
  • "enhanced" path handling - I think mainly make /c/foo and c:/foo equivalent, and UNC handling - as far as I can tell unrelated to unicode.
  • Some process and signal handling changes - as far as I can tell (mostly?) unrelated to unicode (though I'd think it should include conversion of UTF-8 argv to wchar_t).

Overall, the diff is big but not huge. I assume that since then some of it might now conflict and/or be more similar than it was in the past.

Maybe some combined effort can be made to start merging parts of it back upstream (i.e. here)?
Maybe start with just the unicode support?

@dscho are you still interested in busybox-w32 for GFW? it seems new release do include it, so I'd assume you'd prefer to be closer to upstream than your current fork? If yes, and if @rmyorston is intetested as well, would you be willing to coordinate sub-patches which are acceptable upstream?

@dscho
Copy link
Contributor

dscho commented Aug 2, 2020

I am interested, but short of time. For that reason, the Git for Windows fork is seriously out of date. Last time I checked, the rebased version failed to build in the very early steps already. The latest state should be in some branch in my personal fork (can't check right now, I'm on my phone).

@rmyorston
Copy link
Owner

Sufficiently interested to have kept this issue open for nearly seven years as a reminder. Make of that what you will.

Observations:

  • BusyBox vi doesn't support UTF-8, only single-byte character sets.
  • busybox-w32 doesn't use the console API to write characters to the console.
  • There's only one main to worry about: the applet_mains are always called after the one true main.
  • My preference is for UTF-8 support to be a compile-time alternative to the current approach (as opposed to a complement or an out-and-out replacement).

@avih
Copy link
Contributor

avih commented Aug 3, 2020

There's only one main to worry about: the applet_mains are always called after the one true main.

Yup, figured that out already.

BusyBox vi doesn't support UTF-8, only single-byte character sets.
busybox-w32 doesn't use the console API to write characters to the console

Yup, though it does use console API to read input with that EURO thingy. This will be handled later, first priority IMO will be to get non-interactive scripts working fully.

So far I got to/from UTF-8 at the process boundary (argv/env on entry, cmd/argv/env on spawn) - which already allows a lot of things to work very nicely.

However, general unicode paths (stats, fopen, cd, readdir, etc) are still not implemented.

My preference is for UTF-8 support to be a compile-time alternative to the current approach (as opposed to a complement or an out-and-out replacement).

Hmm.. I indeed thought of just make it a replacement, but I can do build time config as well. Stay tuned.

@avih
Copy link
Contributor

avih commented Aug 3, 2020

BusyBox vi doesn't support UTF-8, only single-byte character sets.

Are you sure? I just tried busybox 1.27.2 on Ubuntu, and I could paste unicode strings, and moving the cursor seemed to progress correctly (over codepoints rather than individual bytes).

EDIT - I also just tried it with this file https://salsa.debian.org/printing-team/cups/raw/debian/master/cups/utf8demo.txt and it seemed to render and move the cursor mostly correctly - though not perfectly.

@rmyorston
Copy link
Owner

Yes, I'm sure. I fired up an Ubuntu 18.04 virtual machine to check, but no, it's just as borken as on Fedora.

The utf8demo.txt file has no trailing white space but on lines with multi-byte characters it's possible to move the cursor beyond the visible characters. Or position the cursor on a multi-byte character and use 'a' to append text after it: the character is split.

@avih
Copy link
Contributor

avih commented Aug 6, 2020

Just sharing some information for now.

While working on UTF-8 wrappers for Windows APIs, I found out that starting with Windows 10 1903 (May 2019), an application can specify at its manifest that it wants the active codepage to be UTF8. This means that the ANSI APIs (like CreateProcessA, but also fopen) become UTF8.

Note that currently it affects the value from GetACP(), but it doesn't change the console input/output codepages.

See https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

I've experimented with this, and from little testing it does make busybox able to at least list file names with unicode paths with correct sizes etc.

The printout still depends on GetConsoleOutputCP() value, but if that is set to UTF8 as well then the files do print correctly on screen (I also did write code to use the W console output APIs when not using the manifest - seem to work).

When these two (manifest for ACP and console output codepage) set and in effect, the only remaining issue appears to be console input - which should be relatively simple to solve with the W APIs. I did try to set Console(Input)CP to UTF8, but this seem to make busybox not echo keypresses. I didn't try to dig deeper for now.

Additionally, The windows settings for "non unicode programs" seem to affect only console input and output CP (ACP remains the same - seems to have the same effect as when using chcp from cli), but windows 10 1904 and later has a new option for "Beta: Use Unicode UTF-8 for worldwide language support". I didn't try it out. I'm guessing its effect is same as the manifest described above and/or changing console input and output CP to CP_UTF8.

Now, this manifest, when applicable, seems to solve all the hard problems for busybox, but it's not compatible with windows XP (busybox refuses to run), and seems to have no effect (as expected) on win7.

I'm assuming busybox does want to keep supporting xp and win7/8 and 10 earlier than 1903, so for those to support unicode wrappers are required, but it's useful to keep in mind that with recent windows the hard part of the wrappers is not required.

@rmyorston
Copy link
Owner

@avih, thanks for the update.

I think it's important that the standard binary of busybox-w32 should Just Work on any version of Windows and without any configuration by the user.

@avih
Copy link
Contributor

avih commented Aug 6, 2020

should Just Work

Sure, except that it already doesn't when it comes to unicode support.

So while it would be great to have unicode support on earlier windows versions (and that is still my intention, and how I started this work - by successfully converting argv/env on entry and spawn), I now think we should incrementally enhance busybox as follows:

  1. Make it work on recent windows when ACP and console out CP is UTF8 - basically only add support to console input with the W APIs because everything else is already UTF8 (I don't have a patch for this).
  2. Add support for console output with the W APIs if everything else is already UTF8 (I do have a patch for this which seems working).
  3. Add the system W wrappers for the "hard part" (process boundary - I already have a patch, and other system APIs like files etc which I didn't touch yet).

So I'm trying to specify the wrapping/emulation control parameters:

  1. Whether or not strings are UTF8:
  • When ACP is UTF8 and "system" W wrappers are not used.
  • When the W wrappers are used regardless of ACP.
  1. Whether or not ConsoleOutputCP is UTF8.
  2. Whether or not Console(Input)CP is UTF8 - TBD.
  3. Whether or not the console supports VT emulation natively.

If the stars align, e.g. when strings are UTF8 and ConsoleOutput is UTF8 and native console VT works, then it basically completely bypass all of the output emulation/wrapping. Etc.

@ale5000-git
Copy link

It would also be nice if it detect the environment where it was launched and react accordingly to have correct output.

cmd.exe /A internal command output for pipe and files is ANSI
cmd.exe /U internal command output for pipe and files is UNICODE

@avih
Copy link
Contributor

avih commented Aug 27, 2020

It would also be nice if it detect the environment where it was launched and react accordingly to have correct output.

The only relevant environment is the system code page and the code pages of the windows-console in which busybox runs, which should be detected and respected.

cmd.exe /A internal command output for pipe and files is ANSI
cmd.exe /U internal command output for pipe and files is UNICODE

I don't quite understand how this is relevant.

If you suggest that busybox should detect automatically the encoding of an output from some program (e.g. if you run inside busybox cmd.exe /U /C dir | less), then I disagree.

busybox has iconv, and if you knowingly generate output in UTF-16LE, then you should knowingly convert it to an acceptable encoding yourself, e.g. cmd.exe /U /C dir | iconv -f UTF-16LE -t UTF-8 | less.

However, if you only want to run inside busybox cmd.exe /U and then interact with it yourself, then busybox-win32 is not involved. cmd.exe interacts with the console on its own, and it should behave inside busybox exactly as it woud outside of it (but I didn't test it).

@avih
Copy link
Contributor

avih commented Aug 27, 2020

(Also, I did not drop my intention to add full unicode support to busybox-w32, but i'm sidetracked by unrelated things since about my last comment on this subject. I do want to get back to it at some stage, and all my preliminary testing suggested it's entirely possible).

@avih
Copy link
Contributor

avih commented Aug 27, 2020

For reference, that's the code I got for complete environment support (arguments, variables). It may or may not apply cleanly to current busybox master. I didn't try.

diff --git a/Config.in b/Config.in
index d18f3dac5..adc21d86a 100644
--- a/Config.in
+++ b/Config.in
@@ -467,6 +467,16 @@ config FEATURE_EURO
 	requires the OEM code page to be 858.  If the OEM code page of
 	the console is 850 when BusyBox starts it's changed to 858.
 
+config FEATURE_MINGW_UTF8
+	bool "Support unicode using internal UTF-8 translation (WIP)"
+	default y
+	depends on PLATFORM_MINGW32
+	help
+	Unicode arguments and environment variables are translated to UTF-8
+	on entry, and back to Windows-Unicode when executing a program.
+	For best results, [shell] scripts with non-English content should
+	be encoded in UTF-8. Overrides FEATURE_EURO where applicable.
+
 config FEATURE_EXTRA_FILE_DATA
 	bool "Read additional file metadata (2.1 kb)"
 	default y
diff --git a/configs/mingw32_defconfig b/configs/mingw32_defconfig
index d1dec129f..aa68fd33f 100644
--- a/configs/mingw32_defconfig
+++ b/configs/mingw32_defconfig
@@ -53,6 +53,7 @@ CONFIG_FEATURE_ICON=y
 # CONFIG_FEATURE_ICON_STERM is not set
 CONFIG_FEATURE_ICON_ALL=y
 CONFIG_FEATURE_EURO=y
+CONFIG_FEATURE_MINGW_UTF8=y
 CONFIG_FEATURE_EXTRA_FILE_DATA=y
 CONFIG_FEATURE_READLINK2=y
 
diff --git a/configs/mingw64_defconfig b/configs/mingw64_defconfig
index bdaafaa86..5dc257188 100644
--- a/configs/mingw64_defconfig
+++ b/configs/mingw64_defconfig
@@ -53,6 +53,7 @@ CONFIG_FEATURE_ICON=y
 # CONFIG_FEATURE_ICON_STERM is not set
 CONFIG_FEATURE_ICON_ALL=y
 CONFIG_FEATURE_EURO=y
+CONFIG_FEATURE_MINGW_UTF8=y
 CONFIG_FEATURE_EXTRA_FILE_DATA=y
 CONFIG_FEATURE_READLINK2=y
 
diff --git a/include/mingw.h b/include/mingw.h
index a67b161c7..70f4567e0 100644
--- a/include/mingw.h
+++ b/include/mingw.h
@@ -348,6 +348,22 @@ int mingw_fstat(int fd, struct mingw_stat *buf);
 #define stat mingw_stat
 #define fstat mingw_fstat
 
+#if ENABLE_FEATURE_MINGW_UTF8
+// By convention foo_U is the windows API fooW/_wfoo but with UTF-8 interface,
+// and mu_bar is a utility
+
+// allocate a null-terminated conversion-result for null-terminated input.
+char *mu_utf8(const wchar_t *ws);
+wchar_t *mu_wide(const char *u8);
+
+// allocate a null-terminated array of null-terminated converted strings.
+// if maxn < 0: up to a null input string, else up to maxn or a null input string
+char **mu_utf8_vec(wchar_t *const *wvec, int maxn);
+wchar_t **mu_wide_vec(char *const *uvec, int maxn);
+
+#endif  // ENABLE_FEATURE MINGW_UTF8
+
+
 /*
  * sys/sysmacros.h
  */
diff --git a/libbb/appletlib.c b/libbb/appletlib.c
index d2f98567e..3bb7e7ba4 100644
--- a/libbb/appletlib.c
+++ b/libbb/appletlib.c
@@ -1190,6 +1190,38 @@ get_script_content(unsigned n UNUSED_PARAM)
 
 #endif /* defined(SINGLE_APPLET_MAIN) */
 
+#if ENABLE_PLATFORM_MINGW32 && ENABLE_FEATURE_MINGW_UTF8
+static void mu_utf8_set_env(void)
+{
+	wchar_t *envw0 = GetEnvironmentStringsW(), *envw = envw0, *p;
+	char *eu;
+
+	for (; envw && *envw; envw += wcslen(envw) + 1) {
+		for (p = envw; *p && *p < 0x80; p++)
+			/* no-op */;
+		if (!*p)
+			continue;  // name and value are ascii-7
+		if (!(eu = mu_utf8(envw)))
+			continue;  // nothing to do on error, just skip
+
+		// replace the (OEM) char* entry with UTF-8 of the wchar_t* entry.
+		//   If the OEM name bytes-sequence is different than in UTF-8, then
+		// the OEM name will remain and we'll just add a new entry with the
+		// UTF-8 name (and value).
+		//   In an OEM string all byte values are valid, and putenv and getenv
+		// allow any (obviously except '\0' and '=') even on winXP, so using it
+		// to hold UTF-8 works fully, and `environ' is UTF-8 from now on.
+		//   _wgetenv, however, will not translate our UTF-8 to wchar_t, and
+		// simiparly _wputenv will not end with UTF-8 in `environ', so the
+		// wide variants should not be used from now on in this environment.
+
+		putenv(eu);
+		free(eu);  // arg got copied (unlike posix spec, but like glibc)
+	}
+
+	FreeEnvironmentStringsW(envw0);
+}
+#endif  // ENABLE_PLATFORM_MINGW32 && ENABLE_FEATURE_MINGW_UTF8
 
 #if ENABLE_BUILD_LIBBUSYBOX
 int lbb_main(char **argv)
@@ -1244,6 +1276,19 @@ int main(int argc UNUSED_PARAM, char **argv)
 	}
 #endif
 #if ENABLE_PLATFORM_MINGW32
+
+#if ENABLE_FEATURE_MINGW_UTF8
+	{
+		int n;
+		wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &n);
+		char **uargv = wargv ? mu_utf8_vec(wargv, n) : 0;
+		if (uargv)
+			argv = uargv;  // leaked on exit. FIXME: conflicts with BB_MMU?
+
+		mu_utf8_set_env();
+	}
+#endif  // ENABLE_FEATURE_MINGW_UTF8
+
 	/* detect if we're running an interpreted script */
 	if (argv[0][1] == ':' && argv[0][2] == '/') {
 		switch (argv[0][0]) {
diff --git a/shell/ash.c b/shell/ash.c
index d35ae027f..8c3c367a5 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -15036,7 +15036,8 @@ init(void)
 					bs_to_slash(end+1);
 				}
 
-				/* check for invalid characters in name */
+				/* check for invalid characters in name. busybox ash is_name
+				 * limit name chars to [_[:alnum:]] in ASCII-7 (no UTF-8) */
 				for (start = *envp;start < end;start++) {
 					if (!isdigit(*start) && !isalpha(*start) && *start != '_') {
 						break;
diff --git a/win32/mingw.c b/win32/mingw.c
index faa9f2b57..5a701a039 100644
--- a/win32/mingw.c
+++ b/win32/mingw.c
@@ -165,6 +165,99 @@ int err_win_to_posix(void)
 static int zero_fd = -1;
 static int rand_fd = -1;
 
+#if ENABLE_FEATURE_MINGW_UTF8
+// All functions which use mu_utf8_count/mu_wide_count to first check the
+// expected size assume the conversion will work with same or bigger space.
+
+// positive (not 0) on success. result is in destination units and includes the
+// terminating null if it's within the input count range. nws/nu8 can be -1 to
+// indicate that the input is null-terminated (and the result includes it).
+#define mu_utf8_count(ws_src, nws) \
+	WideCharToMultiByte(CP_UTF8, 0, (ws_src), (nws), 0, 0, 0, 0)
+#define mu_wide_count(u8_src, nu8) \
+	MultiByteToWideChar(CP_UTF8, 0, (u8_src), (nu8), 0, 0)
+
+// performs a conversion, trimmed (without null) if dest size is insufficient
+// TODO: if input size is given, does it convert beyond input \0 if size allows?
+#define mu_utf8_raw(ws_src, nws, u8_dst, nu8) \
+	WideCharToMultiByte(CP_UTF8, 0, (ws_src), (nws), (u8_dst), (nu8), 0, 0)
+#define mu_wide_raw(u8_src, nu8, ws_dst, nws) \
+	MultiByteToWideChar(CP_UTF8, 0, (u8_src), (nu8), (ws_dst), (nws))
+
+
+// dies on OOM, returns NULL on other errors.
+char *mu_utf8(const wchar_t *ws)
+{
+	char *u8 = 0;
+	int n = mu_utf8_count(ws, -1);
+	if (n > 0) {
+		u8 = xmalloc(sizeof(char) * n);
+		mu_utf8_raw(ws, -1, u8, n);
+	}
+	return u8;
+}
+
+// dies on OOM, returns NULL on other errors.
+wchar_t *mu_wide(const char *u8)
+{
+	wchar_t *ws = 0;
+	int n = mu_wide_count(u8, -1);
+	if (n > 0) {
+		ws = xmalloc(sizeof(wchar_t) * n);
+		mu_wide_raw(u8, -1, ws, n);
+	}
+	return ws;
+}
+
+// continuous allocation for the pointers (incl. final NULL) and the strings.
+// if maxn > 0 then up to maxn or NULL - whichever comes first. final NULL is
+// always added at the result array. NUll input vector is the same as empty.
+// dies on OOM, returns NULL if conversion failed otherwise.
+char **mu_utf8_vec(wchar_t *const *wvec, int maxn)
+{
+	size_t usize, n, i;
+	char **uvec, *uarg;
+
+	for (usize = 0, n = 0; wvec && wvec[n] && (maxn < 0 || n < maxn); n++) {
+		int count = mu_utf8_count(wvec[n], -1);
+		if (count <= 0)
+			return NULL;
+		usize += count;
+	}
+
+	uvec = xmalloc((n+1) * sizeof(char *) + usize * sizeof(char));
+	for (i = 0, uarg = (void *)(uvec + n + 1); i < n; i++) {
+		uvec[i] = uarg;
+		uarg += mu_utf8_raw(wvec[i], -1, uarg, usize);
+	}
+	uvec[i] = NULL;
+
+	return uvec;
+}
+
+wchar_t **mu_wide_vec(char *const *uvec, int maxn)
+{
+	size_t wsize, n, i;
+	wchar_t **wvec, *warg;
+
+	for (wsize = 0, n = 0; uvec && uvec[n] && (maxn < 0 || n < maxn); n++) {
+		int count = mu_wide_count(uvec[n], -1);
+		if (count <= 0)
+			return NULL;
+		wsize += count;
+	}
+
+	wvec = xmalloc((n+1) * sizeof(wchar_t *) + wsize * sizeof(wchar_t));
+	for (i = 0, warg = (void *)(wvec + n + 1); i < n; i++) {
+		wvec[i] = warg;
+		warg += mu_wide_raw(uvec[i], -1, warg, wsize);
+	}
+	wvec[i] = NULL;
+
+	return wvec;
+}
+#endif  // ENABLE_FEATURE_MINGW_UTF8
+
 /*
  * Determine if 'filename' corresponds to one of the supported
  * device files.  Constants for these are defined as an enum
diff --git a/win32/process.c b/win32/process.c
index ac63a9c58..0b5ae4acf 100644
--- a/win32/process.c
+++ b/win32/process.c
@@ -3,6 +3,35 @@
 #include <psapi.h>
 #include "lazyload.h"
 
+
+#if ENABLE_FEATURE_MINGW_UTF8
+#ifdef spawnve
+#undef spawnve
+#endif
+#define spawnve spawnve_U
+
+static intptr_t spawnve_U(int mode,
+		const char *cmd, char *const *argv, char *const *env)
+{
+	intptr_t ret = -1;
+	wchar_t *wcmd = mu_wide(cmd),
+	        **wargv = mu_wide_vec(argv, -1),
+	        **wenv = mu_wide_vec(env, -1);
+
+	if (!wcmd || !wargv || !wenv)
+		errno = EINVAL;
+	else
+		ret = _wspawnve(mode, wcmd,(const wchar_t *const *)wargv,
+		                           (const wchar_t *const *)wenv);
+
+	free(wenv);
+	free(wargv);
+	free(wcmd);
+
+	return ret;
+}
+#endif  // ENABLE_FEATURE_MINGW_UTF8
+
 pid_t waitpid(pid_t pid, int *status, int options)
 #if ENABLE_TIME
 {

I have more WIP code which covers the console IO as well which generally works, and I didn't start working on changing the file APIs to UTF8 (well, I did, but then found out about the manifest "automatic" UTF-8 ANSI APIs, and started experimenting with that instead, and then got busy with other things).

EDIT:
I also just noticed a bug. wvec[n] && (maxn < 0 || n < maxn) should be (maxn < 0 || n < maxn) && wvec[n] or else it could try to read one item out of bounds (and same at the analogous conversion the other way). By luck it's not an issue in practice though, because the maxn value is only used with the output of CommandLineToArgvW, and the value just after the array pointers is NULL in new windows versions, and the begining of the actual strings on XP - so it's in-bounds of the allocation, and it would still sop correctly on such case regardless of what the pointer value is.

@avih
Copy link
Contributor

avih commented Oct 26, 2022

Quick update:

  • Obvously I didn't continue working on that patch, and most probably will not continue.
  • The git-for-windows busybox-w32 fork appears to have diverged quite a bit, but still merges upstream changes occasionally (from this repo). Last merge as of now seems to be "7d1c7d833 ash,hush: use HOME for tab completion and prompts 2022-06-26 Ron Yorston"
  • The manifest method still works[1] with newly built busybox.exe

[1] I think the manifest used to also work as an external file (busybox.exe.manifest at the same dir as busybox.exe), but currently I can only make it work if embedding it at the binary, like so:

  • Build/get busybox.exe
  • save this as utf8.manifest :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="any-name-e-g-busybox-utf8" version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
  • then: mt.exe -manifest utf8.manifest -outputresource:busybox.exe;#1 which will attach this manifest to the busybox.exe binary. (mt.exe is from the windows SDK. I'm attaching here a version which came with VS2015).

Test cases with some files with unicode names at the current dir, executed from cmd.exe prompt:

  • busybox.exe ls -l prints ls: ./name ?? ??????: No such file or directory without the manifest, but lists the file correctly (with correct size) with the manifest. If running chcp 65001 to change the console output codepage to UTF-8 then it will also be listed correctly visually (assuming the font has the glyphs for this name).
  • busybox.exe tar cf foo.tar * prints a similar error for this file without the manifest (and the unicode file is not at the tar file) but creates a seemingly correct tar archive, including unicode files, with the manifest.

mt.exe:
mt.exe--win-sdk-vs2015.zip

@avih
Copy link
Contributor

avih commented Jun 19, 2023

  • then: mt.exe -manifest utf8.manifest -outputresource:busybox.exe;#1 which will attach this manifest to the busybox.exe binary. (mt.exe is from the windows SDK...

So, few things:

mt.exe is closed source and probably non-distributable on its own, and the only open source alternative I found which can be used from the command line (rcedit) requires the Visual Studio compiler because it used C++ ATL. However, it does provide pre-built binaries which do work.

So I wrote my own small C command line utility to handle resources in PE files - perc (pre-built windows binaries are available), which can be used like this to attach a UTF-8 manifest:

perc -a utf8.manifest -t MANIFEST -i 1 busybox.exe

Back to the main point now.

While not impossible to add support, it's been nearly 10 years since this issue was opened, and busybox-w32 still doesn't support Unicode.

But adding unicode is not trivial, and I'm guessing that if no one added it so far - while Windows 7 was still supported, it's much less likely to happen now when support for anything earlier than Windows 10 is quickly being dropped.

I'm not at all suggesting that busybox-w32 should drop support for earlier windows. On the contrary, I think it's great it supports older versions.

So I think the best approach would be to keep supportting earlier Windows users like today, but still add unicode support on Windows 10+ using the UTF-8 manifrest file.

I'm still using this method to add support to my copy of busybox-w32, and have been doing so for at least a year now, and it still works great as far as I can tell.

I think it could be beneficial to add to the release files also a Windows 10+ executable which has such UTF-8 manifest embedded. The natural method would be at build time, though perc could be used instead post-build if the release files are prepared on Windows.

@ale5000-git
Copy link

ale5000-git commented Jun 19, 2023

The support for previous version of Windows isn't an issue I think.

If the strings are kept internally in variables as UTF-8, then when an API it is needed they can be kept as UTF-8 or converted to ANSI or UNICODE on demand depending of the need of the API and the Windows version.

Adding support to UTF-8 or long file names through manifest is just a shortcut that will cause more issue later.

Also in my opinion starting to have separate executables for Windows versions just kill portability and increase complexity.

@avih
Copy link
Contributor

avih commented Jun 19, 2023

If the strings are kept internally in variables as UTF-8...

The fact is that it's both a very useful feature, and also a feature which was not yet implemented suggests that it's not a trivial task (@ dscho did for the mini git busybox-w32 package, but it was not merged back, I started working on it and then left it).

Adding support to UTF-8 or long file names through manifest is just a shortcut that will cause more issue later.

How did you come up with that? That's the least risky way to add unicode support to existing Windows applications, but it only works on Windows 10 or later.

All other methods require much more extensive and risky changes to an application.

Also in my opinion starting to have separate executables for Windows versions just kill portability and increase complexity.

There's already at least two executable for each release - 32 and 64, and IIRC there were glob choices too, and here I suggest adding an single additional win10-64-unicode binary.

Yes, people need to think which version they should use, but if you need unicode support, because you have some files with non-english names, then you would gladly choose a version which supports it in shell scripts - if one was available.

It might be useful to also limit this version (also using the manifest) to Windows 10 or later, so that it refuses to run on win7/8 instead of someone incorrectly assuming that if it runs then it also supports unicode (it refuses to run on XP just by using the UTF8 manifest, but on 7/8 it will run, just without unicode support).

@ale5000-git
Copy link

ale5000-git commented Jun 19, 2023

There's already at least two executable for each release - 32 and 64, and IIRC there were glob choices too, and here I suggest adding an single additional win10-64-unicode binary.

The 32-bit version of BusyBox (that I use) does works perfectly also on 64-bit Windows, so it is universal from 32-bit Win XP to 64-bit Win 11.
And it would be nice to have UTF-8 also on Win XP.
I know the code is not trivial but it isn't impossible, even if it take another 10 years I hope to have it someday.

@avih
Copy link
Contributor

avih commented Jun 19, 2023

even if it take another 10 years I hope to have it someday.

Sure, I agree it would be great, but it won't happen by itself, and it hasn't happened in the 10 years since this issue was opened.

I don't intend to continue working on it, and there are no other efforts I'm aware of.

Do you intend to work on it? Who will do this work?

Meanwhile, a non negligible amount of users which use Windows 10 - some could guess even the majority of busybox-w32 users, could have unicode support working with trivial effort and effectively zero risk to the codebase for at least some years now.

Why not give them that?

@rmyorston
Copy link
Owner

rmyorston commented Jun 20, 2023

Commit 830e2cf adds a build-time option (disabled by default) to include the UTF-8 manifest.

@ale5000-git
Copy link

ale5000-git commented Jun 20, 2023

When I use chcp 65001, then pressing crtl+alt+e doesn't insert the euro symbol, does it works with the manifest?

@ale5000-git
Copy link

@avih
If there aren't any downsides and on old versions of Windows the manifest is just ignored then it could also be nice to have it by default but regressions should be checked extensively in my opinion.

@avih
Copy link
Contributor

avih commented Jun 20, 2023

When I use chcp 65001, then pressing crtl+alt+e doesn't insert the euro symbol, does it works with the manifest?

I don't know, I don't have a Euro symbol on my KB.

The main goal of adding the manifest is that shell scripts and utilities can work with unicode file names out of the box, especially when the name was NOT typed manyally at the busybox shell prompt:

  • Names which the shell encounters, like when expanding globs in scripts or command promp, or in command-substitution of output with unicode names, e.g. x=$(ls -1) (and then use $x at the shell or with some other busybox utility).
  • Names which utilities encounter, like creating an archive using tar of a directoty which contains unicode names, or when extracting such archive, or listing a directory with unicode names using ls.
  • Names which are passed to busybox from cmd.exe prompt or from a batch file, e.g. busybox.exe ls -l <unicode-name>.

Additionally, if the console code page is set to UTF-8 (by invoking chcp 65001), then unicode text will be printed correctly, e.g. x=$(ls); echo "$x".

Keyboard input at the shell prompt (or other utilities, like vi) is a different matter, and I haven't tried entering unicode text at the shell prompt, because my input is typically in English.

Do note that Unicode typing doesn't work without the manifest, so even if the manifest doesn't improve it, it's probably not a huge regression, if at all.

Specifically for the Euro symbol, busybox-w32 does have some special handling of the Euro symbol, so it's possible that this code doesn't play nicely with the manifest, but I don't know that for a fact. A quick way to test would be to disable the special handling of this symbol in busybox-w32, but I haven't tried that either.

It should be possible to modify the KB input code to handle unicode, and I did try that too back when I was trying to add native unicode support, and it did mostly work. However, this will be an actual change at the source code, and no longer "only add the manifest", but it might be worth exploring too.

If there aren't any downsides and on old versions of Windows the manifest is just ignored

On windows 7/8 the downside IS that it's ignored, because if someone downloads this version, and it has "unicode" at the name, then even if it also says "Windows10", they might think that if it runs on their system then unicode must work too, while not realizing that unicode will not work.

Also, with the manifest it no longer runs on XP.

These are the only downsides I can think of, but there could be others which I'm not aware of.

it could also be nice to have it by default

What is "by default"? replacing the w64 binary? Personally I don't think it should replace it.

Also, keep in mind that you don't actually need to build a new version to include the manifest. You can also download any older version and attach the manifest to the old binary - and it would become a unicode version. See my previous comments which explain how to do that.

(or you can download a unicode version and then remove the manifest yourself, e.g. with perc which I linked above, and it will become the non-unicode version).

but regressions should be checked extensively

I don't know. It can be described as experimental, and some people might opt to try it anyway.

In my copies I've been adding the manifest for more than a year now, and I don't think I've encountered an issue which related to unicode text, or other issues as a result of adding the manifest.

@ale5000-git
Copy link

ale5000-git commented Jun 20, 2023

@avih
So if you copy abcdè and paste it inside busybox ash with UTF-8 manifest does it works?
(note that it works with normal busybox and my default codepage 850)

@avih
Copy link
Contributor

avih commented Jun 20, 2023

So if you copy abcdè and paste it inside busybox ash with UTF-8 manifest does it works?

I don't know. Try it out and report back?

But generally speaking, I'd imagine that pasting at the busybox shell prompt should be the same as typing.

Also, busybox can change the console input locale to UTF8, and this might help, but for this you do need to patch the source code and rebuild. This is probably also worth exploring, as it should be a tiny and trivial change at the source code (possibly only enabled when building with the manifest).

@ale5000-git
Copy link

I have tried with chcp 65001 and you are right, it break even typing (è is on Italian keyboard).
In my opinion at least typing with chcp 65001 should be fixed.

@rmyorston
Copy link
Owner

There's a new prerelease with support for UTF8 (busybox_pre64u.exe).

  • There's only a 64-bit release.
  • It's a prerelease so it might not work :-(
  • Full support for UTF8 only works on Windows 10 from release 1903 and Windows 11. There's no advantage in running this binary on earlier versions.
  • To check if UTF8 support is available run busybox | head -1 and look for Unicode on.
  • Run chcp 65001 to change the console code page to UTF8.
  • Go wild creating files with emojis in their names (font permitting): touch lovely❤️

Many thanks to @avih for working on this.

@ale5000-git
Copy link

ale5000-git commented Jul 25, 2023

@rmyorston

Bug:

  • Using chcp 858
  • Starting an ash shell with the latest busybox_pre64u.exe
  • Pasting this; 參與维基百科
  • You see this: ?????? (and this is expected)
  • Now delete it with backspace, it delete the text but it also delete the ~ $ at the start of the line

Edit: I have tried also chcp 852 with the same result.

@ale5000-git
Copy link

In addition, pasting ❤️ with both chcp 858 and chcp 65001 cause the pasting of 2 characters but backspace can remove only 1 of them.

@avih
Copy link
Contributor

avih commented Jul 25, 2023

  • Using chcp 858

It looks and edits correctly only with chcp 65001 - just like the instructions say.

pasting ❤️ with both chcp 858 and chcp 65001 cause the pasting of 2 characters

That's a limitation of the Windows console.

You should try the Windows terminal - https://github.com/microsoft/terminal which is more capable in displaying and managing unicode text.

At the windows console you're limited by the console font and other factors. It should still work correctly, e.g. if you paste into busybox-w32 shell touch ❤️ and then check that dir in Windows explorer, but it will look bad and editing would be wonky at the windows console.

You could also try that at the cmd.exe console (without busybox-w32) - try pasting ❤️ into some command - the command will work but it will look wrong and edit wrong.

We can't fix that.

@rmyorston
Copy link
Owner

I can't get too excited about things not working in CP 858. There's simply no way to represent emojis or CJK characters there.

However, even with CP 65001 in the Windows terminal there's an issue with Red Heart. Apparently it consists of U+2764 Heavy Black Heart and U+FE0F Variation Selector-16.

The line editing doesn't quite understand that.

@ale5000-git
Copy link

ale5000-git commented Jul 25, 2023

@avih
I just report bugs to prevent being unnoticed; I really appreciate your work.
I think that the line editing code should understand the length of what is pasted/typed even if it can't display it correctly but it is just my opinion.

I don't use Terminal for the simple fact that it isn't bundled with Windows (my scripts are portable and simple to use so they shouldn't require anything that isn't just a double click with the mouse).
I don't use PowerShell because is really slow on some pc and I also don't like it.

Edit: cmd.exe have the same problem with ❤️ divided in 2 characters but it count them corrently so backspace works fine.

@avih
Copy link
Contributor

avih commented Jul 25, 2023

I can't get too excited about things not working in CP 858. There's simply no way to represent emojis or CJK characters there.

Partly correct. The input is currently console-cp agnostic, but the output still depends on the console CP. We could use the W APIs to write (wherever charToCon is used) and then it won't depend on the console CP anymore. It might not be super trivial, but I think it should be possible.

The console itself (e.g. cmd.exe shell) supports typing, display any chars regardless of the console CP.

The line editing doesn't quite understand that.

Right, that's a limitation of the (upstream) busybox unicode support. Also, as mentioned at the main unicode commit, the windows terminal doesn't display combining chars well, and so editing is also hurt.

Unicode editing needs the stars to align between the program (busybox-w32) and the terminal/console which displays it.

They can't coordinate what can and can't be displayed correctly, so the app hopes the stuff it prints is displayed correctly. If either the app or the console mess up something, things break.

Also, the busybox (upstream) unicode support is incomplete, especially when it comes to combining chars and editing, and when it comes to emojis and other codepoints above U+FFFF.

So this one is broken both at the busybox side and the terminal side.

I don't use Terminal

That's fine, but you should be aware that the console is limited. busybox-w32 does the right thing as much as it can, but it happens to work more correctly on the windows terminal, and less correctly on the windows console.

I think that the line editing code should understand the length of what is pasted/typed even if it can't display it correctly

Sure. That would be great, but we only enable the upstream busybox unicode support, we didn't rewrite the editing system, and we don't intend to do that.

So we inherit whatever upstream supports or doesn't support, and we're also limited by the terminal/console itself.

Both of those are beyond the control of busybox-w32.

@avih
Copy link
Contributor

avih commented Jul 25, 2023

  • There's only a 64-bit release.

  • It's a prerelease so it might not work :-(

  • Full support for UTF8 only works on Windows 10 from release 1903 and Windows 11. There's no advantage in running this binary on earlier versions.

  • To check if UTF8 support is available run busybox | head -1 and look for Unicode on.

  • Run chcp 65001 to change the console code page to UTF8.

Once we get to the release notes, I think the above should be mentioned, as well as:

  • Unicode arguments to the busybox binary or utilities from outside of busybox (e.g. from cmd.exe, or drag-and-drop from explorer, erc) should work correctly.
  • chcp 65001 can be executed so that Unicode is displayed correctly, but otherwise has no effect on the outcome - like modifying the correct file names etc.
  • Unicode output - like directory listing with ls - is UTF8.
  • Literal unicode strings in scripts should be UTF8.
  • Interactive unicode input and editing at the shell prompt is limited:
    • The Windows Terminal works better than the windows console.
    • Typing or pasting Unicode should work correctly.
    • Editing (cursor movement, delete, etc) can be buggy, especially when it involves double-width (two terminal cells) characters and/or when it involves combining chars.

Regardless of the release notes, it's worth keeping in mind that all of the unicode-related work so far (as well as the notes above) would stay relevant also if we move to proper unicode (W APIs) without the manifest, with the common paradigm of "W APIs at the edges, UTF8 everywhere else".

I'm guessing that would mostly need to use the W versions of open, fopen, FIndFilesW, CreateProcessW and other spawn variants, and diropen (et al). It should be possible, probably not huge amount of work, and was already done at the git-for-windows fork of busybox-w32.

@ale5000-git
Copy link

ale5000-git commented Jul 26, 2023

Just a quote for a possible UTF-8 support on Windows 7 in the future:

Prior to Windows 8, output using codepage 65001 is also broken.
It prints a trail of garbage text in proportion to the number of non-ASCII characters.
In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 codes written to the screen buffer instead of the number of UTF-8 bytes.
This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes.
This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port.
Older versions of Windows can use ConEmu or ANSICON to work around this bug.

@avih
Copy link
Contributor

avih commented Jul 29, 2023

@dscho the latest prerelease (64u variant) should now have full unicode support on win10 1803+.

Do you have any setup to test the suitability of this version for git-for-windows? Would you be able to test it?

If the suitability includes mapping /c/... paths to c:/..., can you test whether this handles your test cases correctly? avih@dd31197

If that's still not enough, and the suitability also includes mapping /c paths to c:/, then this does that avih@87d43d7

(these commits are rebased occasionally, typically without changes, but should be always accessible near the top of this branch as long as I keep them (no intent to drop it) https://github.com/avih/busybox-w32/commits/avih )

We might still add generic unicode support similar to your fork at some later strage, but for now we only support unicode via the UTF8 manifest which only has an effect on win 10+.

@dscho
Copy link
Contributor

dscho commented Aug 11, 2023

Do you have any setup to test the suitability of this version for git-for-windows?

Not really. I do have something that could be adapted to such a setup: https://github.com/dscho/git/blob/busybox/.github/workflows/main.yml (this is the workflow I used to try to identify the spots where using BusyBox to run Git's test suite should be faster than MSYS2's Bash but simply isn't, sometimes it's even slower).

Would you be able to test it?

I am woefully short on time, and shifted my focus away from BusyBox, so: unfortunately, I won't be able to test it.

@avih
Copy link
Contributor

avih commented Aug 11, 2023

I am woefully short on time, and shifted my focus away from BusyBox, so: unfortunately, I won't be able to test it.

Thanks.

I'm assuming you're still interested in also providing a busybox variant of git-for-windows if it's not too much effort, and that you prefer upstream busybox-w32 over maintaining a downstream fork.

If those assumptions are correct, what would be the steps to get there?

I guess I could, as a first step, replace the busybox.exe binary at the busybox-git-for-windows package (which is what I use mainly) with a current upstream unicode variant, and see if stuff seems to work... though running a test suite should be much better...

@dscho
Copy link
Contributor

dscho commented Aug 14, 2023

I'm assuming you're still interested in also providing a busybox variant of git-for-windows if it's not too much effort, and that you prefer upstream busybox-w32 over maintaining a downstream fork.

This is difficult to answer. I am obviously interested in less maintenance burden, at the same time I cannot rely on a project that does not even enable regular CI builds.

I guess I could, as a first step, replace the busybox.exe binary at the busybox-git-for-windows package (which is what I use mainly) with a current upstream unicode variant, and see if stuff seems to work... though running a test suite should be much better...

The best idea would be to adapt (read: cherry-pick) the .github/workflows/main.yml changes in dscho/git@git-for-windows:main...busybox with the GitHub workflow from git-for-windows@8ac1e94 cherry-picked onto a newer upstream busybox-w32 version and get it to run Git's test suite.

Personally, I would do this in two steps:

  1. get the busybox-w32 build going in a busybox-w32 fork
  2. apply a modified version of dscho/git@4e837b1 to pick up the build artifacts from step 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants