dir() routine mangles macOS unicode filenames #2528

Closed
Marcool04 opened this issue Dec 8, 2018 · 2 comments

@Marcool04

The Problem

When dir() lists a folder on macOS that contains files with non-ASCII characters in their names, the filenames that perl6 returns are mangled.

Expected Behavior

dir() should return the correct UTF-8 characters.

Actual Behavior

Characters are rendered as escaped byte sequences, for example é => e􏿽xCC􏿽x81

Steps to Reproduce

$ mkdir test
$ cd test
$ touch 'é'
$ perl6 -e 'say dir.gist'
("é".IO)
$ rm é
$ touch temp
#### rename "temp" to "é" in Finder #######
$ perl6 -e 'say dir.gist'
("e􏿽xCC􏿽x81".IO)

Environment

  • Operating system:
uname -a
Darwin [HOST] 18.2.0 Darwin Kernel Version 18.2.0: Fri Oct  5 19:41:49 PDT 2018; root:xnu-4903.221.2~2/RELEASE_X86_64 x86_64
  • Compiler version (perl6 -v):
perl6 -v
This is Rakudo Star version 2018.01 built on MoarVM version 2018.01
implementing Perl 6.c.

Pleased to help if any further information is needed.
Best regards,
Mark.

@jnthn
Member

jnthn commented Dec 9, 2018

A Perl 6 Str is stored in NFG (Normalization Form Grapheme). By contrast, filenames are in general:

  • Of an unknown encoding
  • Of an unknown normalization

Therefore, whenever a filename is encountered that is not UTF-8 in NFC, synthetics are used to represent what was received from the filesystem. This means that the filenames can be reliably reproduced when handed back to the OS (e.g. using open). At the same time, it allows the programmer to pretend they are strings at least for common manipulation tasks.
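
To make the normalization point concrete, here is a minimal sketch (not from the thread) of why a plain UTF-8 round trip cannot preserve such a filename. The variable names are illustrative, and the byte values are the ones shown in the report above for a decomposed "é".

```perl6
my $composed   = "\x[E9]";      # precomposed é, a single codepoint
my $decomposed = "e\x[301]";    # "e" followed by COMBINING ACUTE ACCENT

# Both literals end up as the same grapheme once held in a Str.
say $composed eq $decomposed;   # True
say $decomposed.chars;          # 1

# The bytes a decomposed filename carries: "e", then U+0301 encoded as CC 81.
my $raw = Buf.new(0x65, 0xCC, 0x81);

# Decoding as ordinary UTF-8 normalizes the text, and encoding emits NFC,
# so the result is the two bytes C3 A9 rather than the original 65 CC 81.
my $roundtrip = $raw.decode('utf8').encode('utf8');
say $roundtrip.list;
```

Because the original byte sequence is lost in this round trip, a name read this way could no longer be handed back to the OS unchanged, which is the gap the synthetics are there to close.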

If wishing to recover the original bytes, turn the filename into a string and encode the string using .encode('utf8-c8').
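
As a rough sketch of that recovery step, assuming the same decomposed bytes as in the report (the Buf below stands in for what the filesystem would return and is not taken from the thread):

```perl6
# Bytes as a filesystem might hand them back: "e" followed by U+0301 (CC 81).
my $raw = Buf.new(0x65, 0xCC, 0x81);

# Decoding with utf8-c8 keeps input that is not NFC UTF-8 as synthetics
# instead of normalizing it away.
my $name = $raw.decode('utf8-c8');

# Encoding with the same codec is meant to give back the bytes that came in.
my $bytes = $name.encode('utf8-c8');
say $bytes.list;
```

The point of the design is that the string can still be manipulated like any other Str, while the exact on-disk name stays recoverable.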

jnthn closed this as completed Dec 9, 2018
@Marcool04
Author

Fantastic! This works as expected:

say $file.Str.encode('utf8-c8').decode('utf8');
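
Applied to a whole directory listing, the same one-liner might look like the sketch below. The loop is an illustration rather than something posted in the thread, and it assumes every name is valid UTF-8 once its bytes are recovered.

```perl6
# List the current directory and print each entry in display-ready form.
for dir() -> $file {
    # Recover the original bytes, then decode them as ordinary UTF-8.
    say $file.Str.encode('utf8-c8').decode('utf8');
}
```

The decoded string is only suitable for display or comparison. To open the file, keep using the value that dir() returned, since that is the one that maps back to the exact name on disk. For names that are not valid UTF-8 at all, the final decode would be expected to fail.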

Thank you very much for taking the time to provide a detailed explanation.
Keep up the great work on rakudo!
Regards,
Mark.
