Diacritic characters are truncated #20

Open
andry81 opened this issue Jul 28, 2022 · 10 comments

Comments


andry81 commented Jul 28, 2022

As noted here: https://stackoverflow.com/questions/39365489/how-do-you-keep-diacritics-in-shortcut-paths

The WScript.Shell implementation does not preserve diacritic characters in a Unicode string for the TargetPath shortcut property. This module has the same issue:

c:\1.txt.lnk -> c:\ööö\1.txt

>pylnk p c:\1.txt.lnk _link_info._path
c:\ooo\1.txt

But the WorkingDirectory property is not affected:

>pylnk p c:\1.txt.lnk _work_dir
c:\ööö

I compared this with the https://github.com/Matmaus/LnkParse3 implementation, and it returns more reliable results:

>lnkparse c:\1.txt.lnk
...

   LINK INFO:
      Link info flags: 1
      Local base path: C:\ooo\1.txt
      Common path suffix:
      Local base unicode: C:\ööö\1.txt
      Common path suffix unicode: .\ööö\1.txt6C:\ööödz
...

   DATA
      Relative path: .\ööö\1.txt
      Working directory: C:\ööö

Whereas pylnk3 does not:

>pylnk3 p c:\1.txt.lnk _link_info.local_base_path
C:\ooo\1.txt

I've tried to change the code:

#DEFAULT_CHARSET = 'cp1251'
DEFAULT_CHARSET = 'utf-8'

But it still returns the truncated variant. It seems the app reads only one property field (ANSI) instead of two (ANSI + Unicode), as LnkParse3 does.
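To illustrate the two-field idea, here is a minimal sketch that prefers the Unicode copy of the local base path when the LinkInfo header is large enough to carry one. It assumes the field layout as I understand it from the MS-SHLLINK description (seven 32-bit header fields, then an optional Unicode offset at 0x1C when the header size is at least 0x24). The names `read_local_base_path` and `build_demo_link_info` are hypothetical and this is not pylnk3's actual code; the demo blob is synthetic, not taken from a real shortcut.

```python
import struct


def read_local_base_path(link_info: bytes, ansi_encoding: str = "cp1252") -> str:
    """Return the local base path, preferring the Unicode copy when present."""
    # Header: size, header size, flags, volume ID offset, ANSI path offset,
    # network link offset, common suffix offset (layout is an assumption).
    _size, header_size, _flags, _vol, ansi_off, _net, _suffix = struct.unpack_from(
        "<7I", link_info, 0
    )
    if header_size >= 0x24:
        (unicode_off,) = struct.unpack_from("<I", link_info, 0x1C)
        if unicode_off:
            # UTF-16LE string terminated by a two-byte NUL.
            end = unicode_off
            while end + 2 <= len(link_info) and link_info[end:end + 2] != b"\x00\x00":
                end += 2
            return link_info[unicode_off:end].decode("utf-16-le")
    # Fall back to the ANSI copy, terminated by a single NUL byte.
    end = link_info.index(b"\x00", ansi_off)
    return link_info[ansi_off:end].decode(ansi_encoding)


def build_demo_link_info(ansi: bytes, wide: str) -> bytes:
    """Build a synthetic LinkInfo-like blob holding both copies of the path."""
    header_size = 0x24
    ansi_z = ansi + b"\x00"
    wide_z = wide.encode("utf-16-le") + b"\x00\x00"
    ansi_off = header_size
    unicode_off = ansi_off + len(ansi_z)
    total = unicode_off + len(wide_z)
    # Volume ID and network offsets are left at 0 in this demo.
    header = struct.pack("<9I", total, header_size, 1, 0, ansi_off, 0, 0, unicode_off, 0)
    return header + ansi_z + wide_z


blob = build_demo_link_info(b"C:\\ooo\\1.txt", "C:\\ööö\\1.txt")
print(read_local_base_path(blob))
```

With both copies present, the sketch returns the Unicode one, which mirrors the "Local base unicode" line in the lnkparse output above.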

Owner

strayge commented Jul 30, 2022

Hi, thanks for the interesting issue.

The LinkInfo structure contains only one field with the path.
It looks like it can be encoded as UTF-8.

Can you check the diacritic_characters branch with a possible fix?

Author

andry81 commented Jul 30, 2022

Can you check the diacritic_characters branch with a possible fix?

c:\Work\OpenSource\pylnk\diacritic_characters>c:\python\x64\310\python
Python 3.10.1 (heads/3.10.1-win7:830a41fd9d, Dec 12 2021, 11:29:02) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pylnk3
>>> lnk = pylnk3.Lnk('d:\\1.txt.lnk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 1504, in __init__
    self._parse_lnk_file(f)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 1555, in _parse_lnk_file
    self._link_info = LinkInfo(lnk, unicode=self.link_flags.IsUnicode)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 994, in __init__
    self._parse_path_elements(lnk)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 1026, in _parse_path_elements
    self.local_base_path = read_cstring(lnk, encoding=self.encoding)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 186, in read_cstring
    return s.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 3: invalid start byte

I just created the d:\ööö directory and the d:\ööö\1.txt file, then used Ctrl+C and the Windows Explorer context menu to paste it as a shortcut at d:\1.txt.lnk.
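The decode failure in the traceback can be reproduced without a .lnk file. This is a small sketch that assumes the path was stored as single-byte characters, with each ö stored as the byte 0xf6 named in the error message; the `raw` bytes and the helper `try_decode` are illustrative, not taken from the shortcut itself.

```python
# Assumed bytes of the stored path: each "ö" is the single byte 0xf6.
raw = b"d:\\\xf6\xf6\xf6\\1.txt"


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, or describe where and why decoding failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed at byte {exc.start}: {exc.reason}"


print(try_decode(raw, "utf-8"))
```

The byte 0xf6 can never start a valid UTF-8 sequence, so a strict UTF-8 decode stops at position 3, right after `d:\`.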

Owner

strayge commented Jul 30, 2022

Can you share the broken lnk file?

I can't reproduce it myself.
Win10 En writes a UTF-8 path with diacritics into LinkInfo.
Win7 Ru writes an ASCII path (converting diacritic symbols to the closest Latin ones) into LinkInfo.
Both are read without errors with the new branch.
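The "closest Latin" conversion can be approximated in Python to see why changing the charset has no effect on such a file. The sketch below uses Unicode decomposition as a stand-in; it is an assumption that this resembles what Windows does when it writes the ANSI field, and `strip_diacritics` is a hypothetical helper.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after decomposition, e.g. 'ö' becomes 'o'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = strip_diacritics("c:\\ööö\\1.txt")
raw = folded.encode("ascii")

# Pure ASCII bytes decode identically under every ASCII-compatible charset,
# so the choice of DEFAULT_CHARSET cannot bring the diacritics back.
print(folded, raw.decode("cp1251") == raw.decode("utf-8"))
```

Once the diacritics are folded away before the bytes are written, the information is gone from that field and only a separate Unicode field could still hold it.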

Author

andry81 commented Jul 31, 2022

Can you share the broken lnk file?

1_buggy.txt.zip

I suspect the encoding is something other than UTF-8.

By the way, 0xf6 is the code of the ö character.

Owner

strayge commented Jul 31, 2022

It's cp1252, but I don't know how to choose the correct encoding.

Author

andry81 commented Jul 31, 2022

It's cp1252

How did you find that? There are at least 3 other code pages that show no difference here: 1250, 1257, 1258.
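The ambiguity is easy to check with Python's built-in codecs. This sketch decodes the single byte 0xf6 under the code pages mentioned in this thread, plus cp1251, the original DEFAULT_CHARSET; the `CANDIDATES` list and `decode_table` are just illustrative names.

```python
CANDIDATES = ["cp1250", "cp1251", "cp1252", "cp1257", "cp1258"]


def decode_table(byte: bytes) -> dict:
    """Map each candidate code page to the character it assigns to one byte."""
    return {name: byte.decode(name) for name in CANDIDATES}


table = decode_table(b"\xf6")
for name, char in table.items():
    print(name, char, f"U+{ord(char):04X}")
```

Four of the five code pages agree on ö (U+00F6), so this byte alone cannot tell them apart; only cp1251 differs, giving the Cyrillic ц (U+0446).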

but I don't know how to choose the correct encoding

You could add a --chcp <str> parameter or something similar for that, and also add --ignore-decode-errors to call decode(..., errors='ignore') instead.
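A rough sketch of how such options could look with argparse follows. Both flags are only the proposal from this comment, not options that pylnk3 is known to have, and `build_parser` and `decode_path` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two proposed (hypothetical) options."""
    parser = argparse.ArgumentParser(prog="pylnk3-sketch")
    parser.add_argument("--chcp", default="cp1252", metavar="ENCODING",
                        help="code page used to decode ANSI strings")
    parser.add_argument("--ignore-decode-errors", action="store_true",
                        help="drop bytes that cannot be decoded")
    return parser


def decode_path(raw: bytes, args: argparse.Namespace) -> str:
    """Decode raw path bytes according to the parsed options."""
    errors = "ignore" if args.ignore_decode_errors else "strict"
    return raw.decode(args.chcp, errors=errors)


raw = b"d:\\\xf6\xf6\xf6\\1.txt"  # assumed stored bytes, each "ö" as 0xf6
args = build_parser().parse_args(["--chcp", "utf-8", "--ignore-decode-errors"])
print(decode_path(raw, args))
```

Note that errors='ignore' silently drops the undecodable bytes, so the folder name disappears from the result here; it avoids the exception but does not recover the path.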

Owner

strayge commented Jul 31, 2022

How did you find that? There are at least 3 other code pages that show no difference here: 1250, 1257, 1258.

Just a guess. It's the default for English Windows, and it decodes the path correctly.
You can try the master branch with DEFAULT_CHARSET changed to cp1252.
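Since one of the files in this thread holds UTF-8 and the other a single-byte code page, one possible heuristic is to try strict UTF-8 first and fall back to a code page only when that fails. This is a sketch of that idea, not something pylnk3 is known to do; `decode_with_fallback` and its default encoding order are assumptions.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes strictly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going with replacement characters.
    last = encodings[-1]
    return raw.decode(last, errors="replace"), last


print(decode_with_fallback(b"d:\\\xf6\xf6\xf6\\1.txt"))
print(decode_with_fallback("d:\\ööö\\1.txt".encode("utf-8")))
```

The fallback code page is still a guess: as noted above, several code pages agree on ö, and a file written on a system with a different code page would decode without errors but to the wrong characters.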

Author

andry81 commented Jul 31, 2022

Are there instructions on how to build the executable in Scripts?

Author

andry81 commented Dec 25, 2022

Author

andry81 commented Dec 25, 2022

Here is another possible solution. If you use --json to print:

{
    "relative_path": ".\\\u00f6\u00f6\u00f6\\1.txt",
    "work_dir": "D:\\\u00f6\u00f6\u00f6",
    "link_info": {
        "local_base_path": "D:\\\u0446\u0446\u0446\\1.txt"
    },
}

It does print the correct characters for the relative_path property. Maybe add an option to decode the TargetPath property as a composition of work_dir + relative_path as an alternative?
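A sketch of that composition, using the values from the JSON above, might look like this. It assumes relative_path is relative to the folder containing the .lnk file (d:\ in this thread) rather than to work_dir; `resolve_target` is a hypothetical helper, and `ntpath` is used so the Windows path rules apply on any platform.

```python
import ntpath


def resolve_target(lnk_path: str, relative_path: str) -> str:
    """Resolve relative_path against the folder that holds the .lnk file."""
    base_dir = ntpath.dirname(lnk_path)
    return ntpath.normpath(ntpath.join(base_dir, relative_path))


target = resolve_target("D:\\1.txt.lnk", ".\\ööö\\1.txt")
print(target)

# Joining with work_dir instead would repeat the folder in this example.
naive = ntpath.normpath(ntpath.join("D:\\ööö", ".\\ööö\\1.txt"))
print(naive)
```

Under that assumption the first result matches the real target, while joining against work_dir duplicates the ööö folder, so the base folder would need to be chosen with care.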
