max_len with sanitize_filename behaves incorrectly with multi-byte unicode chars #47

Open
7x11x13 opened this issue Jun 18, 2024 · 0 comments · May be fixed by #48

7x11x13 commented Jun 18, 2024

Code to reproduce:

from pathvalidate import sanitize_filename, validate_filename

filename = "図彌見視御未味尾微身実箕論學識我遠不及他段齉籲颧饕掱麒麟魑魅魍魉麤𪚥龘 爨馕龘龘憂鬱龘國龘əəəəəəəəə иߤ-кߎ߹𝒴-𝓃߫߯ р✁ту✁ть.mp3"

print("length:", len(filename))
print("bytes:", len(filename.encode()))
sanitized = sanitize_filename(filename, max_len=255)
print("sanitized length:", len(sanitized))
print("sanitized bytes:", len(sanitized.encode()))
validate_filename(sanitized, max_len=255)

gives output:

length: 114
bytes: 270
sanitized length: 114
sanitized bytes: 270
Traceback (most recent call last):
  ...
pathvalidate.error.ValidationError: [PV1101] found an invalid string length: filename is too long: expected<=255 bytes, actual=270 bytes, platform=universal, fs_encoding=utf-8, byte_count=270
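
The output above shows the mismatch: the string is 114 characters long but 270 bytes in UTF-8, and the validator counts bytes. A minimal sketch of byte-aware truncation follows. The helper `truncate_to_bytes` is hypothetical and written only for illustration; it is not pathvalidate's implementation and not necessarily what #48 does. It assumes UTF-8 and does not try to preserve the file extension.

```python
def truncate_to_bytes(name: str, max_bytes: int, encoding: str = "utf-8") -> str:
    """Hypothetical helper: shorten `name` so its encoded form fits in `max_bytes`.

    Assumes a UTF-8-style encoding where cutting the byte string can only
    damage the final character.
    """
    encoded = name.encode(encoding)
    if len(encoded) <= max_bytes:
        return name
    # Slice at the byte limit, then let errors="ignore" drop a trailing
    # character that was cut in the middle of its byte sequence.
    return encoded[:max_bytes].decode(encoding, errors="ignore")


# Each "龘" takes 3 bytes in UTF-8, so character count and byte count diverge.
name = "龘" * 100 + ".mp3"
short = truncate_to_bytes(name, 255)

print("length:", len(name), "bytes:", len(name.encode("utf-8")))
print("truncated length:", len(short), "bytes:", len(short.encode("utf-8")))
```

Truncating by character count (`name[:255]`) would leave this example unchanged, since it has fewer than 255 characters, which matches the behavior reported here.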

7x11x13 linked a pull request on Jun 18, 2024 that will close this issue
7x11x13 changed the title from "max_len with filename_sanitize behaves incorrectly with multi-byte unicode chars" to "max_len with sanitize_filename behaves incorrectly with multi-byte unicode chars" on Jun 18, 2024