UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence #12771

danerlt · 2024-06-18T01:46:29Z

Description

Running pip install -r requirements.txt fails with UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence. The error log is as follows:

error trace:

ERROR: Exception:
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\base_command.py", line 180, in exc_logging_wrapper
    status = run_func(*args)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\req_command.py", line 245, in wrapper
    return func(self, options, args)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\commands\install.py", line 342, in run
    reqs = self.get_requirements(args, options, finder, session)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\req_command.py", line 433, in get_requirements
    for parsed_req in parse_requirements(
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 156, in parse_requirements
    for parsed_line in parser.parse(filename, constraint):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 337, in parse
    yield from self._parse_and_recurse(filename, constraint)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 342, in _parse_and_recurse
    for line in self._parse_file(filename, constraint):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 373, in _parse_file
    _, content = get_file_content(filename, self._session)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 551, in get_file_content
    content = auto_decode(f.read())
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\utils\encoding.py", line 34, in auto_decode
    return data.decode(
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence

Expected behavior

Properly install dependencies.

pip version

24.0

Python version

3.10.14

OS

Windows 10

How to Reproduce

Run pip install -r requirements.txt with the requirements.txt posted in the comments below. The command fails with the same traceback shown in the Description.

Output

No response

Code of Conduct

I agree to follow the PSF Code of Conduct.


matthewhughes934 · 2024-06-18T06:13:11Z

Are you able to share the contents of the requirements.txt file you were using?

danerlt · 2024-06-18T06:17:34Z

@matthewhughes934
The contents of my requirements.txt are as follows:

# server
supervisor==4.2.5
gunicorn==21.2.0
gevent==23.9.1

# web
Werkzeug==2.3.7
celery==5.2.7
click==8.1.7
dataclasses_json==0.6.4
Flask==2.3.3
Flask_Cors==3.0.10
Flask_Login==0.6.2
Flask_Migrate==4.0.5
Flask_RESTful==0.3.9
flask_sqlalchemy==3.0.5
SQLAlchemy==2.0.0
minio==7.2.4
psycopg2-binary==2.9.9
python-dotenv==1.0.1
redis==5.0.2
requests==2.31.0

# rag
langchain==0.1.16
llama-index==0.10.30
llama-index-core==0.10.30  # 这个必须手指定，不然构建的时候会去获取最新的版本，可能会有bug。
llama-index-retrievers-bm25==0.1.3
llama-index-storage-index-store-redis==0.1.2
llama-index-storage-kvstore-redis==0.1.3
llama-index-storage-docstore-mongodb==0.1.3
llama-index-vector-stores-milvus==0.1.10
llama-index-vector-stores-qdrant==0.2.5
llama-parse==0.4.1
rank-bm25==0.2.2
ragas==0.1.1
qdrant-client==1.9.0
pymongo==4.6.3
motor==3.4.0
asyncpg==0.29.0
spacy==3.7.4
jieba==0.42.1
./zh_core_web_sm-3.7.0-py3-none-any.whl
scikit-learn==1.4.2


# data loader 相关依赖
pypdf==4.2.0
pdfminer-six==20231228
PyMuPDF==1.24.2
docx2txt==0.8
python-docx==1.1.0
openpyxl==3.1.2

# 评估相关
dashscope==1.19.2
zhipuai==2.1.0

danerlt · 2024-06-18T06:20:42Z

@matthewhughes934
I modified pip\_internal\utils\encoding.py, adding the ignore error handler to its data.decode call, which resolved the issue.

uranusjr · 2024-06-18T06:23:17Z

It’s probably best to always decode as ASCII with errors="replace". We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

A PR would be much welcomed.

matthewhughes934 · 2024-06-18T19:19:42Z

> I modified pip\_internal\utils\encoding.py, adding the ignore error handler to its data.decode call, which resolved the issue.

I guess the underlying issue is that the file is UTF-8 encoded but you're working in an environment with a Simplified Chinese locale, so pip decodes it as GBK. An alternative solution would be to run Python in UTF-8 mode (https://docs.python.org/3/using/windows.html#utf-8-mode).
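The mismatch described above can be reproduced without pip. The following is a minimal sketch, assuming the requirements file was saved as UTF-8; the sample line is shortened from the file posted earlier, and `try_decode` is a hypothetical helper, not part of pip. The full stop "。" encodes to the bytes E3 80 82 in UTF-8, so a trailing 0x82 followed by a newline is one plausible source of the reported byte, though the thread does not confirm this.

```python
# Sketch: the same bytes decoded as UTF-8 and as GBK.
# The sample line is shortened from the requirements.txt posted above.
sample = "llama-index-core==0.10.30  # 会有bug。\n"
data = sample.encode("utf-8")


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        bad_byte = raw[exc.start]
        return (
            f"{encoding} failed: {exc.reason} "
            f"(byte {bad_byte:#04x} at position {exc.start})"
        )


# UTF-8 round-trips the line; GBK rejects it with a decode error.
print(try_decode(data, "utf-8"))
print(try_decode(data, "gbk"))
```

The linked documentation describes two ways to enable UTF-8 mode: `python -X utf8 -m pip install -r requirements.txt`, or setting the environment variable `PYTHONUTF8=1` before running pip.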

matthewhughes934 · 2024-06-18T19:22:06Z

> It’s probably best to always decode as ASCII with errors="replace". We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.
>
> A PR would be much welcomed.

👍 happy to get a PR up. I'm wondering two things:

1. If I change auto_decode, are there places where we want decoding to fail (per errors="strict"), or would it be OK to always replace? Or is there code elsewhere that should be changed instead?
2. 🤔 Is there any potential for issues with multi-byte or extended (non-ASCII) encodings? I have no idea how common these might be, but I guess one consequence could be that, instead of a 'failed to decode' error, you get an error about pip failing to install a package named "��".

pfmoore · 2024-06-18T19:48:19Z

> We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

Unfortunately, requirements aren't the only things in a requirements file. A --requirement <path to file to include> line could contain arbitrary Unicode characters, and for that matter a simple local pathname is a valid entry (and could contain Unicode).
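To illustrate the concern, here is a small sketch of what decoding as ASCII with errors="replace" does to two kinds of lines. The file path `依赖/base.txt` and both helper functions are made up for illustration, and the comment stripping is deliberately simplified compared with pip's real parser.

```python
# Sketch: ASCII decoding with errors="replace" on a comment vs. a path.
line_with_comment = "requests==2.31.0  # 评估相关\n".encode("utf-8")
line_with_path = "-r 依赖/base.txt\n".encode("utf-8")  # hypothetical path


def decode_ascii_replace(raw: bytes) -> str:
    """Decode as ASCII, turning every non-ASCII byte into U+FFFD."""
    return raw.decode("ascii", errors="replace")


def strip_comment(line: str) -> str:
    """Simplified comment removal; not pip's actual implementation."""
    return line.split("#", 1)[0].strip()


# The requirement survives because the damage is confined to the comment.
print(strip_comment(decode_ascii_replace(line_with_comment)))

# The include path does not survive: its directory name is now U+FFFD runs.
print(repr(decode_ascii_replace(line_with_path)))
```

In the first case the replacement characters are discarded along with the comment. In the second they end up inside the path that would later be opened, which is the failure mode raised here.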

However, the documentation states that requirement files should be UTF-8 by default, so this seems like a simple bug in auto_decode - https://github.com/pypa/pip/blob/main/src/pip/_internal/utils/encoding.py#L35 should be using UTF-8. (And arguably the BOM detection in there is in violation of the spec, but IMO it's not worth changing).

Of course, even though this is technically a bug fix, it is still a breaking change, potentially, so we need to consider how we handle that. (We could fall back to the system encoding if UTF8 fails, with a deprecation warning - this won't avoid mojibake, but it will catch outright encoding failures).
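A rough sketch of the fallback idea described above, under the assumption that UTF-8 is tried first and the system encoding is used only when that fails. The function name, the `fallback_encoding` parameter, and the warning text are hypothetical; this is not pip's code.

```python
import locale
import warnings
from typing import Optional


def decode_utf8_first(data: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Decode as UTF-8; on failure, warn and retry with the system encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical parameter: lets callers (and tests) override the locale.
        encoding = fallback_encoding or locale.getpreferredencoding(False)
        warnings.warn(
            f"File is not valid UTF-8; falling back to {encoding!r}. "
            "This fallback may be removed in the future.",
            DeprecationWarning,
            stacklevel=2,
        )
        return data.decode(encoding)


# A GBK-encoded file is still readable, but now triggers a warning.
print(decode_utf8_first("中文".encode("gbk"), fallback_encoding="gbk"))
```

As noted above, this only catches outright decoding failures. Bytes in a legacy encoding that happen to form valid UTF-8 would be decoded without any warning and come out as mojibake.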

uranusjr · 2024-06-19T03:13:59Z

Ah, right, I forgot about paths. Falling back with a deprecation warning sounds like the way to go.

For the case where:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

fall back to decoding as UTF-8 as a last resort (rather than crashing on the `UnicodeDecodeError`).

This behaviour was added when parsing the requirements file, rather than in `auto_decode`, as it didn't seem to belong in a generic util (though that util looks to only ever be called when parsing requirements files anyway).

Perhaps we should just go straight to UTF-8 without querying the system locale (unless there is a PEP-263 style comment), per the docs[1]:

> Requirements files are utf-8 encoding by default

But to avoid a breaking change, just warn if decoding with the locale fails, then fall back to UTF-8.

[1] https://pip.pypa.io/en/stable/reference/requirements-file-format/#encoding

Fixes: pypa#12771
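The order of attempts described in this message could be sketched as follows. This is an illustration only, not the code from the linked pull request: the function name, the `locale_encoding` parameter, and the warning text are assumptions, and the PEP-263 style check is a simplified regular expression over the first two lines.

```python
import re
import warnings

# Simplified PEP-263 style declaration, e.g. "# -*- coding: gbk -*-".
ENCODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def decode_locale_then_utf8(data: bytes, locale_encoding: str) -> str:
    """Honour a declared encoding, else try the locale, else fall back to UTF-8."""
    for line in data.split(b"\n")[:2]:
        if line.lstrip().startswith(b"#"):
            match = ENCODING_RE.search(line)
            if match:
                return data.decode(match.group(1).decode("ascii"))
    try:
        return data.decode(locale_encoding)
    except UnicodeDecodeError:
        warnings.warn(
            f"Could not decode file as {locale_encoding!r}; retrying as UTF-8."
        )
        return data.decode("utf-8")


text = "llama-index-core==0.10.30  # 会有bug。\n"
# With a GBK locale the first attempt fails and the UTF-8 fallback succeeds.
print(decode_locale_then_utf8(text.encode("utf-8"), "gbk"))
```

In real use the second argument would come from something like `locale.getpreferredencoding(False)`; it is passed explicitly here so the behaviour does not depend on the machine running the example.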

danerlt added the "S: needs triage" and "type: bug" labels on Jun 18, 2024

uranusjr added the "state: awaiting PR" label and removed the "S: needs triage" label on Jun 18, 2024

matthewhughes934 linked a pull request on Jun 25, 2024 that will close this issue: Handle req file decode failures on locale encoding #12795 (Open)
