UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence #12771

Open
1 task done
danerlt opened this issue Jun 18, 2024 · 8 comments · May be fixed by #12795
Labels
state: awaiting PR (Feature discussed, PR is needed), type: bug (A confirmed bug or unintended behavior)

Comments

@danerlt

danerlt commented Jun 18, 2024

Description

When I tried to install dependencies with the pip install -r requirements.txt command, the following error occurred: UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence. The error log is as follows:

error trace:

ERROR: Exception:
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\base_command.py", line 180, in exc_logging_wrapper
    status = run_func(*args)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\req_command.py", line 245, in wrapper
    return func(self, options, args)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\commands\install.py", line 342, in run
    reqs = self.get_requirements(args, options, finder, session)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\req_command.py", line 433, in get_requirements
    for parsed_req in parse_requirements(
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 156, in parse_requirements
    for parsed_line in parser.parse(filename, constraint):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 337, in parse
    yield from self._parse_and_recurse(filename, constraint)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 342, in _parse_and_recurse
    for line in self._parse_file(filename, constraint):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 373, in _parse_file
    _, content = get_file_content(filename, self._session)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 551, in get_file_content
    content = auto_decode(f.read())
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\utils\encoding.py", line 34, in auto_decode
    return data.decode(
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence

Expected behavior

Properly install dependencies.

pip version

24.0

Python version

3.10.14

OS

Windows 10

How to Reproduce

Run pip install -r requirements.txt against the requirements file. It fails with the same UnicodeDecodeError and traceback shown in the Description above.

Output

No response

Code of Conduct

danerlt added the "S: needs triage" and "type: bug" labels Jun 18, 2024
@matthewhughes934
Contributor

Are you able to share the contents of the requirements.txt file you were using?

@danerlt
Author

danerlt commented Jun 18, 2024

@matthewhughes934
The contents of my requirements.txt are as follows:

# server
supervisor==4.2.5
gunicorn==21.2.0
gevent==23.9.1

# web
Werkzeug==2.3.7
celery==5.2.7
click==8.1.7
dataclasses_json==0.6.4
Flask==2.3.3
Flask_Cors==3.0.10
Flask_Login==0.6.2
Flask_Migrate==4.0.5
Flask_RESTful==0.3.9
flask_sqlalchemy==3.0.5
SQLAlchemy==2.0.0
minio==7.2.4
psycopg2-binary==2.9.9
python-dotenv==1.0.1
redis==5.0.2
requests==2.31.0

# rag
langchain==0.1.16
llama-index==0.10.30
llama-index-core==0.10.30  # 这个必须手指定,不然构建的时候会去获取最新的版本,可能会有bug。
llama-index-retrievers-bm25==0.1.3
llama-index-storage-index-store-redis==0.1.2
llama-index-storage-kvstore-redis==0.1.3
llama-index-storage-docstore-mongodb==0.1.3
llama-index-vector-stores-milvus==0.1.10
llama-index-vector-stores-qdrant==0.2.5
llama-parse==0.4.1
rank-bm25==0.2.2
ragas==0.1.1
qdrant-client==1.9.0
pymongo==4.6.3
motor==3.4.0
asyncpg==0.29.0
spacy==3.7.4
jieba==0.42.1
./zh_core_web_sm-3.7.0-py3-none-any.whl
scikit-learn==1.4.2


# data loader 相关依赖
pypdf==4.2.0
pdfminer-six==20231228
PyMuPDF==1.24.2
docx2txt==0.8
python-docx==1.1.0
openpyxl==3.1.2

# 评估相关
dashscope==1.19.2
zhipuai==2.1.0

@danerlt
Author

danerlt commented Jun 18, 2024

@matthewhughes934
I modified pip\_internal\utils\encoding.py and added the ignore error handler to its data.decode call, which resolved the issue.

@uranusjr
Member

It’s probably best to always use ascii with replace. We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.
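
A minimal sketch of what the "ascii with replace" suggestion would do to one line of the file, assuming the file is stored as UTF-8. The sample line is adapted from the requirements.txt posted above; this only illustrates the codec behaviour and is not pip's code.

```python
# One line adapted from the reported requirements.txt, as UTF-8 bytes on disk.
raw = "dashscope==1.19.2  # 评估相关\n".encode("utf-8")

# Decode as ASCII; each non-ASCII byte becomes U+FFFD instead of raising.
decoded = raw.decode("ascii", errors="replace")

# The requirement specifier in front of the comment is untouched.
requirement = decoded.split("#", 1)[0].strip()
print(requirement)

# Each of the four CJK characters is three UTF-8 bytes, so the comment
# now holds twelve replacement characters.
replaced = decoded.count("\ufffd")
print(replaced)
```

The specifier survives because it is pure ASCII; only the comment is degraded, and the parser discards comments anyway.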

A PR would be much welcomed.

uranusjr added the "state: awaiting PR" label and removed the "S: needs triage" label Jun 18, 2024
@matthewhughes934
Contributor

matthewhughes934 commented Jun 18, 2024

> I modified pip\_internal\utils\encoding.py and added the ignore error handler to its data.decode call, which resolved the issue.

I guess the underlying issue is that the file appears to be UTF-8 encoded, but you're working in an environment with a Simplified Chinese locale, so it gets decoded as GBK. An alternative solution would be to run Python in UTF-8 mode (https://docs.python.org/3/using/windows.html#utf-8-mode).
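
The diagnosis above can be reproduced without a Chinese locale. The sketch below is an illustration under assumptions: the sample comment is made up (it ends in the same full-width full stop as the comment in the reported file), and GBK is passed explicitly instead of being read from the system locale. The second half starts a child interpreter with the -X utf8 option to show the preferred encoding that UTF-8 mode reports.

```python
import subprocess
import sys

# A comment ending in a full-width full stop, encoded as UTF-8.
raw = "# bug。\n".encode("utf-8")

# Decoding the UTF-8 bytes as GBK fails: a byte that GBK treats as a lead
# byte is followed by the newline, which is not a valid trail byte.
try:
    raw.decode("gbk")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)

# Decoding with the encoding the file was written in round-trips cleanly.
text = raw.decode("utf-8")

# In UTF-8 mode the preferred encoding no longer follows the system locale.
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c",
     "import locale; print(locale.getpreferredencoding(False))"],
    capture_output=True,
    text=True,
    check=True,
)
preferred = result.stdout.strip()
print(preferred)
```

Setting the PYTHONUTF8=1 environment variable enables the same mode without changing the command line.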

@matthewhughes934
Contributor

> It’s probably best to always use ascii with replace. We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.
>
> A PR would be much welcomed.

👍 happy to get a PR up. I'm wondering two things:

  • If I change auto_decode: are there places where we want decoding to fail (per errors="strict") or would it be ok to always replace? Or is there code elsewhere that should be changed?
  • 🤔 Is there any potential for issues with multi-byte/non-ascii-extended encodings: I have no idea how common these might be, but I guess a consequence could be instead of getting a 'failed to decode' error you could get an error about pip failing to install a package named "����"

@pfmoore
Member

pfmoore commented Jun 18, 2024

> We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

Unfortunately, requirements aren't the only things in a requirement file. --requirement <path to file to include> could include arbitrary Unicode characters, and for that matter a simple local pathname is valid (and could be Unicode).
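
To make the path concern concrete, here is a small sketch. The included file name ./依赖/base.txt is hypothetical and the splitting is deliberately naive (it is not pip's parser); the point is only that decoding as ASCII with replace destroys a non-ASCII path, while decoding as UTF-8 keeps it.

```python
# A requirements-file line that includes another file via a non-ASCII path.
# The path is a made-up example.
raw = "-r ./依赖/base.txt\n".encode("utf-8")

def included_path(line: str) -> str:
    """Return the argument of a '-r <path>' line (naive split, for illustration)."""
    return line.split(maxsplit=1)[1].strip()

ascii_path = included_path(raw.decode("ascii", errors="replace"))
utf8_path = included_path(raw.decode("utf-8"))

print(ascii_path)  # the directory name is now a run of U+FFFD characters
print(utf8_path)
```

With the ASCII decoding the resulting path no longer names any real file, so the include would fail later with a less helpful error.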

However, the documentation states that requirement files should be UTF-8 by default, so this seems like a simple bug in auto_decode - https://github.com/pypa/pip/blob/main/src/pip/_internal/utils/encoding.py#L35 should be using UTF-8. (And arguably the BOM detection in there is in violation of the spec, but IMO it's not worth changing).

Of course, even though this is technically a bug fix, it is still potentially a breaking change, so we need to consider how we handle that. (We could fall back to the system encoding if UTF-8 fails, with a deprecation warning. This won't avoid mojibake, but it will catch outright encoding failures.)
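
The fallback idea in the previous paragraph could look roughly like the sketch below. The function name decode_requirements and its fallback parameter are hypothetical, and BOM and PEP 263 handling are left out; it is a sketch of the proposal, not what pip implements.

```python
import locale
import warnings
from typing import Optional


def decode_requirements(data: bytes, fallback: Optional[str] = None) -> str:
    """Decode as UTF-8 first; on failure, warn and use the system encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical parameter: lets callers (and tests) name the fallback
        # encoding instead of relying on the machine's locale.
        encoding = fallback or locale.getpreferredencoding(False)
        warnings.warn(
            f"requirements file is not valid UTF-8; decoding as {encoding}",
            DeprecationWarning,
            stacklevel=2,
        )
        return data.decode(encoding)


# UTF-8 input decodes directly, with no warning.
utf8_text = decode_requirements("# 中文\n".encode("utf-8"))

# GBK input is not valid UTF-8, so the fallback path is taken.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    gbk_text = decode_requirements("# 中文\n".encode("gbk"), fallback="gbk")
```

As the comment notes, bytes that happen to be valid UTF-8 but were written in another encoding still decode silently to mojibake; only outright failures reach the fallback.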

@uranusjr
Member

Ah, right, I forgot about paths. Falling back with a deprecation warning sounds like the way to go.

matthewhughes934 added a commit to matthewhughes934/pip that referenced this issue Jun 25, 2024
For the case where:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

In this case, fall back to decoding as UTF-8 as a last resort (rather
than crashing on the `UnicodeDecodeError`). This behaviour was added
when parsing the requirements file, rather than in `auto_decode`, as it
didn't seem to belong in a generic util (though that util looks to only
ever be called when parsing requirements files anyway).

Perhaps we should just go straight to UTF-8 without querying the system
locale (unless there is a PEP-263 style comment), per the docs[1]:

> Requirements files are utf-8 encoding by default

But to avoid a breaking change, just warn if decoding with the locale
encoding fails, then fall back to UTF-8

[1] https://pip.pypa.io/en/stable/reference/requirements-file-format/#encoding

Fixes: pypa#12771
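
The order described in this commit message (locale encoding first, UTF-8 as a last resort with a warning) can be sketched as follows. The function name, the locale_encoding parameter, and the warning text are assumptions made for illustration; the referenced commit may differ in all of them.

```python
import locale
import warnings
from typing import Optional


def decode_with_utf8_fallback(data: bytes, locale_encoding: Optional[str] = None) -> str:
    """Decode with the locale encoding; on failure, warn and retry as UTF-8."""
    # Hypothetical parameter so the locale can be simulated on any machine.
    encoding = locale_encoding or locale.getpreferredencoding(False)
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(
            f"could not decode requirements file as {encoding}; retrying as UTF-8"
        )
        return data.decode("utf-8")


# UTF-8 bytes that GBK cannot decode, mimicking the situation in this issue.
raw = "# bug。\n".encode("utf-8")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = decode_with_utf8_fallback(raw, locale_encoding="gbk")
print(text)
```

This keeps existing behaviour for files that decode under the locale and changes the outcome only for files that previously crashed.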
4 participants