Certain .pyi files are not encoded as UTF-8 in Windows #124897

Open
marovira opened this issue Apr 25, 2024 · 9 comments

@marovira

marovira commented Apr 25, 2024

🐛 Describe the bug

On Windows platforms, PyTorch 2.3.0 causes mypy to fail with the following error messages:

mypy: can't decode file '.venv\Lib\site-packages\torch\_C\_VariableFunctions.pyi': 'utf-8' codec can't decode byte 0x93 in position 290995: invalid start byte
mypy: can't decode file '.venv\Lib\site-packages\torch\_VF.pyi': 'utf-8' codec can't decode byte 0x93 in position 290995: invalid start byte

Inspecting _VariableFunctions.pyi and _VF.pyi reveals that the files are encoded with Latin1 encoding instead of UTF-8 like the rest of the pyi files. Looking deeper at the contents of the wheel obtained from https://download.pytorch.org/whl/cu121/torch-2.3.0%2Bcu121-cp311-cp311-win_amd64.whl shows that the files are incorrectly encoded there as well. It is important to note that PyTorch 2.2.2 does not have this problem. It was introduced in 2.3.0.

As a workaround, the files can be re-encoded to UTF-8, which allows mypy to work correctly.
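A minimal sketch of that workaround follows. It assumes the affected files are actually Windows-1252 (`cp1252`), which is a guess based on byte 0x93 being the left curly quote in that code page; `reencode_to_utf8` is a hypothetical helper name, not something shipped with PyTorch or mypy.

```python
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str = "cp1252") -> bool:
    """Rewrite `path` as UTF-8 if it is not already valid UTF-8.

    Returns True when the file was rewritten, False when it was left alone.
    `source_encoding` is an assumption about how the file was written.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    # Decode with the assumed legacy encoding, then write the bytes back as
    # UTF-8. Working on bytes keeps the original line endings untouched.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True
```

It would be called once per affected stub, for example on `torch/_VF.pyi` and `torch/_C/_VariableFunctions.pyi` inside the virtual environment's `site-packages` directory.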

Versions

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: 17.0.1
CMake version: version 3.27.4
Libc version: N/A

Python version: 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1650
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=4701
DeviceID=CPU0
Family=107
L2CacheSize=12288
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=4701
Name=AMD Ryzen 9 7900X 12-Core Processor
ProcessorType=3
Revision=24834

Versions of relevant libraries:
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.17.3
[pip3] torch==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121
[conda] Could not collect

cc @seemethere @malfet @osalpekar @atalman @ezyang @xuzhao9 @gramster

marovira changed the title from "Certain .pyi files are not encoded as UTF-8" to "Certain .pyi files are not encoded as UTF-8 in Windows" Apr 25, 2024
malfet added the module: typing and module: regression labels Apr 25, 2024
malfet self-assigned this Apr 25, 2024
malfet added this to the 2.3.1 milestone Apr 25, 2024
malfet added the module: binaries label Apr 25, 2024
@malfet
Contributor

malfet commented Apr 25, 2024

I wonder how it happened, as I don't see the problem on trunk right now

malfet added the triaged label Apr 25, 2024
@marovira
Author

I have no idea. I double-checked the files in the source tree just to ensure that no changes had been made, and I double-checked the encoding (which is set to UTF-8) so it's not from the source files. My guess is it's coming from the process that fills in those files.

@malfet
Contributor

malfet commented Apr 25, 2024

@marovira do you mind sharing the line around the offending characters? As at least on Mac builds all sequences are valid unicode ones

import torch
import os
print(torch.__version__)
x=open(os.path.join(os.path.dirname(torch.__file__), "_VF.pyi"), "rb").read()
y=x.decode("utf-8")
print(len(x), len(y))

Does not fail and produces (1137705, 1137683) for me
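When the decode does fail, a small variation on the snippet above can report where. This is only a sketch: `find_invalid_utf8` is a hypothetical helper, and the 20-byte context window is an arbitrary choice.

```python
def find_invalid_utf8(data: bytes, context: int = 20):
    """Return (byte_offset, line_number, surrounding_bytes) for each invalid byte run."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid UTF-8
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            bad = pos + exc.start
            line_no = data.count(b"\n", 0, bad) + 1
            snippet = data[max(0, bad - context): bad + context]
            problems.append((bad, line_no, snippet))
            pos += exc.end  # resume scanning after the bad bytes
    return problems
```

Passing it the bytes read from `_VF.pyi` (the `x` variable above) would list each offending offset together with its line number, which is the information requested here.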

@marovira
Author

Sure. I'll post them after I've had dinner.

@malfet
Contributor

malfet commented Apr 25, 2024

Same here, I can get to a Windows machine but probably after dinner...

@marovira
Author

I dug around the file a bit and found where the characters are coming from. In the docstring for cov(input: Tensor, *, ...), jump to line 6320. The offending text is this one:

            These relative weights are typically large for observations considered �important� and smaller for
            observations considered less �important�. Its numel must equal the number of columns of :attr:`input`.
            Must have floating point dtype. Ignored if ``None``. Defaults to ``None``.

The same issue is present in _VariableFunctions.pyi in the exact same line (6320).

@marovira
Author

marovira commented Apr 25, 2024

Looking through the PyTorch source code, I think I know what's going on. The docstring comes from torch/_torch_docs.py. Reading through the docstring for torch.cov which starts at 2263, it appears that the issue is coming from this:

        These relative weights are typically large for observations considered “important” and smaller for
        observations considered less “important”. Its numel must equal the number of columns of :attr:`input`.

Note the double-quotes around important. It seems that for some reason those characters are getting mangled, which results in invalid UTF-8 characters. No idea why this would be happening, but that's where the problem is coming from. Also important to note that the file itself is valid UTF-8, so the issue is coming from whatever system takes that docstring and sets it in the pyi files.
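The byte 0x93 in the mypy error is consistent with this. As a sketch, and assuming the generator wrote the text with a Windows-1252 (`cp1252`) default encoding (not confirmed anywhere in this thread), the curly quotes turn into single bytes that are not valid UTF-8:

```python
LEFT_QUOTE = "\u201c"   # “
RIGHT_QUOTE = "\u201d"  # ”


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


docstring = f"considered {LEFT_QUOTE}important{RIGHT_QUOTE}"

as_cp1252 = docstring.encode("cp1252")  # assumed encoding of the bad stubs
as_utf8 = docstring.encode("utf-8")     # what mypy expects

print(as_cp1252)                 # b'considered \x93important\x94'
print(is_valid_utf8(as_cp1252))  # False: 0x93 is an invalid start byte
print(is_valid_utf8(as_utf8))    # True

# Viewing the cp1252 bytes as UTF-8 with replacement characters gives
# U+FFFD in place of each quote.
replaced = as_cp1252.decode("utf-8", errors="replace")
print(replaced == "considered \ufffdimportant\ufffd")  # True
```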

danieldk added a commit to danieldk/spacy-transformers that referenced this issue Apr 25, 2024
@malfet
Contributor

malfet commented Apr 25, 2024

The above-mentioned changes were introduced by #58311, and there haven't been any significant changes to either torchgen/utils.py or tools/pyi/gen_pyi.py that would warrant that change, so it probably has something to do with the build environment itself.

But let's fix it in three ways:

  • Get rid of unicode characters in _torch_docs.py
  • Specify file encoding in torchgen to be ascii (or utf-8?)
  • Figure out why flake8 didn't raise the error, because torch_docs is missing encoding magic
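For the second item, the general idea can be sketched as below. `write_generated_file` is a hypothetical helper for illustration only and is not torchgen's actual API; the point is just that passing `encoding` explicitly makes the output independent of the build machine's locale.

```python
def write_generated_file(path: str, contents: str) -> None:
    """Write generated text with an explicit encoding.

    Without `encoding=`, open() falls back to the platform's locale
    encoding, which on Windows is often a legacy code page, not UTF-8.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(contents)
```

With this, a docstring containing “important” ends up as the same UTF-8 bytes on every platform, regardless of locale settings.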

malfet added a commit that referenced this issue Apr 25, 2024
Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's`

Fixes the obvious symptoms of #124897
svlandeg pushed a commit to explosion/spacy-transformers that referenced this issue Apr 25, 2024
* CI: Fix macOS build

Constraints:

- `macos-latest` is now ARM64.
- Python 3.11 is the first macOS ARM release available in the action.

So, we use macos-13 for older versions and stop unconditionally setting
the architecture to x86, so that `macos-latest` uses ARM64 builds.

* Pin to PyTorch 2.2.2 on Windows

Until pytorch/pytorch#124897 is resolved.
malfet added a commit that referenced this issue Apr 25, 2024
Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's`

Fixes the obvious symptoms of #124897

Test plan: Download [wheel](https://github.com/pytorch/pytorch/actions/runs/8833051644/artifacts/1447995459) and check that generated VF.pyi does not have any unicode characters by running following command:
```
% python3 -c "x=open('_VF.pyi', encoding='utf-8').read();uc=[(i, x[i]) for i in range(len(x)) if ord(x[i])>127];print(uc);assert(len(uc)==0)"
```
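A more readable variant of the one-liner above, as a sketch: `find_non_ascii` is a hypothetical helper that reports the line and column of every character above code point 127 so that any hit is easy to locate.

```python
def find_non_ascii(text: str):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits
```

Running it on the text of the generated `_VF.pyi` (opened with `encoding='utf-8'`) and asserting that the result is empty is equivalent to the check in the test plan.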
raphaelreme added a commit to raphaelreme/torch-tps that referenced this issue Apr 26, 2024
Install torch != 2.3.0 in CI until pytorch/pytorch#124897 is fixed
@fingoldo

Wondering if there's some flag for mypy to skip such files.

atalman added a commit that referenced this issue May 15, 2024
Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's`

Fixes the obvious symptoms of #124897

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>