RuntimeError: false INTERNAL ASSERT FAILED at "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\BatchLinearAlgebra.cpp":1538, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library. #125892

Open
liudading opened this issue May 10, 2024 · 4 comments
Labels: module: linear algebra, module: windows, triaged

liudading commented May 10, 2024

🐛 Describe the bug

When I run the linked notebook (https://github.com/KindXiaoming/pykan/blob/master/hellokan.ipynb), I get this error. The message suggests a bug in a library file, and I don't know how to fix it.

The error is raised by the penultimate line of code: model.train(dataset, opt="LBFGS", steps=50);. The full code is below.


from kan import *
import torch
import torchvision

# create a KAN: 2D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = KAN(width=[2,5,1], grid=5, k=3, device='cpu', seed=0)

# create dataset f(x,y) = exp(sin(pi*x)+y^2)
f = lambda x: torch.exp(torch.sin(torch.pi*x[:,[0]]) + x[:,[1]]**2)
dataset = create_dataset(f, n_var=2)
dataset['train_input'].shape, dataset['train_label'].shape

# plot KAN at initialization
model(dataset['train_input']);
model.plot(beta=100)

# train the model
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.);

model.plot()

model.prune()
model.plot(mask=True)

model = model.prune()
model(dataset['train_input'])
model.plot()

model.train(dataset, opt="LBFGS", steps=50);

mode = "auto" # "manual"

if mode == "manual":
    # manual mode
    model.fix_symbolic(0,0,0,'sin');
    model.fix_symbolic(0,1,0,'x^2');
    model.fix_symbolic(1,0,0,'exp');
elif mode == "auto":
    # automatic mode
    lib = ['x','x^2','x^3','x^4','exp','log','sqrt','tanh','sin','abs']
    model.auto_symbolic(lib=lib)

model.train(dataset, opt="LBFGS", steps=50); # The line of code that reported the error

model.symbolic_formula()[0][0]

Versions

pykan 0.0.3
matplotlib 3.8.4
numpy 1.26.4
scikit_learn 1.4.2
setuptools 69.5.1
sympy 1.12
torch 2.3.0
torchvision 0.18.0
tqdm 4.66.4

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @vladimir-aubrecht @iremyux @Blackhex @cristianPanaite @jianyuh @nikitaved @pearu @mruberry @walterddr @xwang233 @lezcano

shink (Contributor) commented May 10, 2024

I'm on it. Could you please share your environment details by running the following commands?

wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
python collect_env.py
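
If downloading the script is not convenient, a shorter summary can be gathered from Python directly. This is only a rough sketch and not a replacement for collect_env.py; `env_summary` is a hypothetical helper name, and the fields chosen here are an assumption about what is most relevant to a CPU `torch.linalg.lstsq` failure.

```python
import platform

import torch


def env_summary() -> dict:
    """Collect a few version facts relevant to a CPU linear algebra bug."""
    return {
        "torch": torch.__version__,
        "python": platform.python_version(),
        "platform": platform.platform(),
        # None for CPU-only builds of PyTorch.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # Whether this PyTorch build was compiled with Intel MKL.
        "mkl_available": torch.backends.mkl.is_available(),
    }


if __name__ == "__main__":
    for key, value in env_summary().items():
        print(f"{key}: {value}")
```

The full collect_env.py output is still preferable when reporting, since it also lists the installed pip and conda packages.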

I ran the code and got the following result, though I am using pykan 0.0.5.

$ curl -fsSL https://gist.githubusercontent.com/shink/ff8e666f17dd6f7f115cae2fae8e075b/raw/9d0d5e2047ac838174faa3cc626e068281bc5a84/kan.py | python -
torch.Size([1000, 2])
torch.Size([1000, 1])
train loss: 1.34e-01 | test loss: 1.39e-01 | reg: 2.49e+01 : 100%|██| 20/20 [00:06<00:00,  2.90it/s]
train loss: 2.02e-03 | test loss: 2.05e-03 | reg: 1.69e+01 : 100%|██| 50/50 [00:12<00:00,  3.87it/s]
fixing (0,0,0) with sin, r2=0.9914394617080688
fixing (0,0,1) with sin, r2=0.9999473690986633
fixing (0,0,2) with sin, r2=0.9954459071159363
fixing (0,1,0) with sin, r2=0.9153558611869812
fixing (0,1,1) with x^2, r2=0.9999995231628418
fixing (0,1,2) with sin, r2=0.9454471468925476
fixing (1,0,0) with log, r2=0.1887553185224533
fixing (1,1,0) with exp, r2=0.9999992251396179
fixing (1,2,0) with abs, r2=0.6482225060462952
train loss: nan | test loss: nan | reg: nan :  10%|█▊                | 5/50 [00:01<00:14,  3.13it/s]
Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.
train loss: nan | test loss: nan | reg: nan :  10%|█▊                | 5/50 [00:01<00:14,  3.07it/s]
Traceback (most recent call last):
  File "<stdin>", line 46, in <module>
  File "/home/jyh/program/anaconda/anaconda3/envs/ai/lib/python3.11/site-packages/kan/KAN.py", line 898, in train
    self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
  File "/home/jyh/program/anaconda/anaconda3/envs/ai/lib/python3.11/site-packages/kan/KAN.py", line 244, in update_grid_from_samples
    self.act_fun[l].update_grid_from_samples(self.acts[l])
  File "/home/jyh/program/anaconda/anaconda3/envs/ai/lib/python3.11/site-packages/kan/KANLayer.py", line 218, in update_grid_from_samples
    self.coef.data = curve2coef(x_pos, y_eval, self.grid, self.k, device=self.device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyh/program/anaconda/anaconda3/envs/ai/lib/python3.11/site-packages/kan/spline.py", line 137, in curve2coef
    coef = torch.linalg.lstsq(mat.to('cpu'), y_eval.unsqueeze(dim=2).to('cpu')).solution[:, :, 0]  # sometimes 'cuda' version may diverge
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: false INTERNAL ASSERT FAILED at "../aten/src/ATen/native/BatchLinearAlgebra.cpp":1537, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.
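
In the log above, the loss is already nan a few steps before the SGELSY error appears. One possible caller-side guard, assuming (not confirmed in this thread) that non-finite values reach the solver, is to check the inputs before calling `torch.linalg.lstsq`. The sketch below uses a hypothetical helper, `checked_lstsq`, which is not part of pykan or PyTorch; the batched shapes mirror the call in spline.py shown in the traceback.

```python
import torch


def checked_lstsq(mat: torch.Tensor, rhs: torch.Tensor) -> torch.Tensor:
    """Solve batched least squares, failing early on non-finite inputs.

    mat: (batch, m, n), rhs: (batch, m, k). Returns the solution (batch, n, k).
    """
    for name, tensor in (("mat", mat), ("rhs", rhs)):
        finite = torch.isfinite(tensor)
        if not finite.all():
            bad = int((~finite).sum())
            raise ValueError(
                f"{name} contains {bad} non-finite value(s); "
                "refusing to call torch.linalg.lstsq"
            )
    return torch.linalg.lstsq(mat, rhs).solution


if __name__ == "__main__":
    mat = torch.randn(3, 10, 4)
    rhs = torch.randn(3, 10, 1)
    coef = checked_lstsq(mat, rhs)[:, :, 0]  # same indexing as in curve2coef
    print(coef.shape)

    rhs[0, 0, 0] = float("nan")
    try:
        checked_lstsq(mat, rhs)
    except ValueError as err:
        print("caught:", err)
```

A guard like this does not fix the nan loss itself; it only turns the internal assert into an error message that points at the offending input.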
versions
$ python pytorch/torch/utils/collect_env.py
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   42 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
Stepping:                        4
BogoMIPS:                        6000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat md_clear flush_l1d
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB (4 instances)
L1i cache:                       128 KiB (4 instances)
L2 cache:                        4 MiB (4 instances)
L3 cache:                        24.8 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Retbleed:          Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchaudio                2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi

shink (Contributor) commented May 10, 2024

It works fine on the nightly version. You can install it with:

pip install --pre --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly

output:

$ curl -fsSL https://gist.githubusercontent.com/shink/ff8e666f17dd6f7f115cae2fae8e075b/raw/9d0d5e2047ac838174faa3cc626e068281bc5a84/kan.py | python -
torch.Size([1000, 2])
torch.Size([1000, 1])
train loss: 1.21e-01 | test loss: 1.24e-01 | reg: 2.61e+01 : 100%|██| 20/20 [00:38<00:00,  1.94s/it]
train loss: 1.82e-03 | test loss: 1.96e-03 | reg: 1.17e+01 : 100%|██| 50/50 [01:00<00:00,  1.21s/it]
fixing (0,0,0) with x^4, r2=0.8006412982940674
fixing (0,0,1) with sin, r2=0.9999699592590332
fixing (0,1,0) with sin, r2=0.9219197034835815
fixing (0,1,1) with x^2, r2=0.9999983310699463
fixing (1,0,0) with log, r2=0.7761432528495789
fixing (1,1,0) with exp, r2=1.0000001192092896
train loss: nan | test loss: nan | reg: nan : 100%|█████████████████| 50/50 [01:10<00:00,  1.40s/it]
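
When retrying on a nightly build, it can help to confirm that the interpreter actually imports the nightly rather than a leftover stable install. The following is a minimal sketch, assuming nightly version strings carry a `.devYYYYMMDD` segment as in the `2.4.0.dev20240509` version reported in this comment; `is_nightly` is a hypothetical helper name.

```python
import re


def is_nightly(version: str) -> bool:
    """Return True if the version string has a .devYYYYMMDD segment."""
    return re.search(r"\.dev\d{8}", version) is not None


if __name__ == "__main__":
    # Version strings taken from the environment reports in this thread.
    for version in ("2.4.0.dev20240509", "2.3.0+cu121"):
        print(version, is_nightly(version))
```

In a real session the argument would be `torch.__version__`.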
versions
$ python torch/utils/collect_env.py
Collecting environment information...
PyTorch version: 2.4.0.dev20240509
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.9 (main, Apr 19 2024, 11:43:47) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.4.0.dev20240509
[pip3] torchaudio==2.2.0.dev20240509
[pip3] torchvision==0.19.0.dev20240509
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.4.0.dev20240509          pypi_0    pypi
[conda] torchaudio                2.2.0.dev20240509          pypi_0    pypi
[conda] torchvision               0.19.0.dev20240509          pypi_0    pypi

malfet added the module: windows and module: linear algebra labels May 10, 2024
soulitzer added the triaged label May 10, 2024
Zha-Miku commented

In the same environment, hellokan.ipynb sometimes runs and sometimes fails. My torch build is v2.3.0 (cu118, cp310). The device must be set explicitly for each model and for the data, such as CPU or CUDA:0, followed by a restart of Jupyter.

shink (Contributor) commented May 15, 2024

@Zha-Miku Hi, could you please try re-running your code on the nightly version? This issue may have been fixed.
