The design of encoding for restricted characters is counter-intuitive #7456

Open
URenko opened this issue Nov 26, 2023 · 1 comment
URenko commented Nov 26, 2023

The associated forum post URL from https://forum.rclone.org

https://forum.rclone.org/t/cant-access-file-with-in-the-name/43068

What is the problem you are having with rclone?

Can't access a file with ‛ (U+201B, SINGLE HIGH-REVERSED-9 QUOTATION MARK, the character rclone uses to escape restricted characters) in its name.

What is your rclone version (output from rclone version)

rclone v1.64.2
- os/version: debian 12.2 (64 bit)
- os/kernel: 6.1.0-13-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.3
- go/linking: static
- go/tags: none

Which OS you are using and how many bits (e.g. Windows 7, 64 bit)

Linux, 64bit

Which cloud storage system are you using? (e.g. Google Drive)

Local, OneDrive, and others. I use local as the example here.

The command you were trying to run (e.g. rclone copy /tmp remote:tmp)

echo 'here is the content' > '‛'
rclone cat ‛
rclone cat ‛‛
rclone cat ‛‛‛
rclone cat ‛‛‛‛

A log from the command with the -vv flag (e.g. output from rclone -vv copy /tmp remote:tmp)

rclone cat ‛ -vv
<7>DEBUG : rclone: Version "v1.64.2" starting with parameters ["rclone" "cat" "‛" "-vv"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "‛"
<7>DEBUG : Using config file from "/home/<myusername>/.config/rclone/rclone.conf"
<7>DEBUG : fs cache: renaming cache item "‛" to be canonical "/tmp/test/‛‛"
<3>ERROR : : error listing: directory not found
<7>DEBUG : 4 go routines active
Failed to cat with 2 errors: last error was: directory not found

After reading the code, I understand that the story is like this:

  • For human input, rclone treats the path as already encoded with the "Standard" encoding, which is EncodeZero | EncodeSlash | EncodeCtl | EncodeDel | EncodeDot.
  • rclone then decodes the path and re-encodes it with the backend's encoding (FromStandardPath), and uses the re-encoded path to access the backend.
  • encoding = None, which corresponds to EncodeZero, does NOT mean no encoding. It actually means encoding NUL (0x00) → ␀. Therefore, if we follow the design faithfully, there is no way to tell rclone not to encode at all.

However, FromStandardPath contains a short-circuit:

  • If the target encoding is equal to "Standard", FromStandardPath does nothing and simply returns the input path unchanged.
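
The flow described above can be modelled in a few lines. This is only a toy sketch in Python, not rclone's Go code: the names `toy_encode`, `toy_decode` and `toy_from_standard_path` are hypothetical, and the model assumes just two rules, namely that ‛ is doubled on encode (as the debug log shows) and that NUL maps to ␀. All other restricted characters are ignored.

```python
QUOTE = "\u201b"       # ‛, the escape character discussed in this issue
NUL_SYMBOL = "\u2400"  # ␀, assumed replacement for NUL (0x00)

# Toy stand-ins for the encoding masks named in the issue.
STANDARD = frozenset({"Zero", "Slash", "Ctl", "Del", "Dot"})
NONE = frozenset({"Zero"})  # per the issue, "None" still means EncodeZero


def toy_encode(name: str) -> str:
    """Double every quote character and replace NUL with its symbol."""
    out = []
    for ch in name:
        if ch == QUOTE:
            out.append(QUOTE + QUOTE)
        elif ch == "\x00":
            out.append(NUL_SYMBOL)
        else:
            out.append(ch)
    return "".join(out)


def toy_decode(name: str) -> str:
    """Inverse of toy_encode; a lone quote is passed through unchanged."""
    out = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == QUOTE and i + 1 < len(name) and name[i + 1] == QUOTE:
            out.append(QUOTE)
            i += 2
        elif ch == NUL_SYMBOL:
            out.append("\x00")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def toy_from_standard_path(path: str, backend_encoding: frozenset) -> str:
    """Decode from "Standard", re-encode for the backend, with the short-circuit."""
    if backend_encoding == STANDARD:
        return path  # short-circuit: the input is returned untouched
    return toy_encode(toy_decode(path))


# In this toy model, no number of quotes on the command line reaches a file named "‛".
for n in range(1, 5):
    print(n, len(toy_from_standard_path(QUOTE * n, NONE)))  # lengths: 2, 2, 4, 4

# With the "Standard" mask the path passes through unchanged.
print(toy_from_standard_path(QUOTE, STANDARD) == QUOTE)  # True
```

Under these assumptions, the decode-then-encode round trip always produces an even number of quotes, so a file named with a single ‛ is only reachable through the short-circuit.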

Therefore, one trick that solves my problem is to use "Standard" as the backend's encoding:

$ rclone cat ‛ --local-encoding None,Slash,Ctl,Del,Dot
here is the content

I think this issue reflects that the current design needs improvement, as currently:

  • None actually encodes NUL (0x00) → ␀ and ‛ → ‛‛
  • None,Slash,Ctl,Del,Dot behaves as if encoding were disabled

which is counter-intuitive.


ncw commented Nov 27, 2023

I think the handling of ‛ is wrong in the current design - see #6098 for an explanation.

I think this issue reflects that the current design needs improvement, as currently:

  • None actually encodes NUL (0x00) → ␀ and ‛ → ‛‛
  • None,Slash,Ctl,Del,Dot behaves as if encoding were disabled

That is interesting and probably explains some of the confusion that this topic generates. Some backends use the Standard encoding directly so have been skipping some encoding whereas others don't.

PS If I was doing character encodings from scratch again, I wouldn't choose the wide letters as these are used widely in CJK languages which I didn't know at the time I chose them.
