ImportError undefined symbol: iJIT_NotifyEvent encountered when MKL 2024.1 is installed. #123097

Open
LiutongZhou opened this issue Apr 1, 2024 · 12 comments
Labels
module: binaries (Anything related to official binaries that we release to users)
module: mkl (Related to our MKL support)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

LiutongZhou commented Apr 1, 2024

The bug

Importing torch fails with "undefined symbol: iJIT_NotifyEvent" from torch/lib/libtorch_cpu.so when PyTorch and MKL 2024.1.0 are installed together. Downgrading MKL to 2024.0.0 resolves it.

import torch
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
----> 1 import torch

File ~/.../lib/python3.10/site-packages/torch/__init__.py:237
    235     if USE_GLOBAL_DEPS:
    236         _load_global_deps()
--> 237     from torch._C import *  # noqa: F403
    239 # Appease the type checker; ordinarily this binding is inserted by the
    240 # torch._C module initialization code in C
    241 if TYPE_CHECKING:

ImportError: /.../lib/python3.10/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent

To Reproduce

mamba create -y -n test_pytorch_mkl python=3.10 pytorch=2.2 pytorch-cuda=12.1 mkl=2024.1 -c pytorch  -c nvidia -c intel
mamba activate test_pytorch_mkl
python -c "import torch"
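
To tell this failure apart from other import problems, the loader message can be parsed. This is a minimal sketch, assuming the message has the "<library>: undefined symbol: <symbol>" shape shown in the traceback above; `parse_undefined_symbol` and `check_torch_import` are hypothetical helper names, not part of PyTorch.

```python
import re
from typing import Dict, Optional

# Assumed shape of the dynamic-loader message: "<library>: undefined symbol: <symbol>"
_UNDEFINED_SYMBOL = re.compile(r"^(?P<library>.+?): undefined symbol: (?P<symbol>\S+)$")


def parse_undefined_symbol(message: str) -> Optional[Dict[str, str]]:
    """Return the library path and symbol name, or None if the message has another shape."""
    match = _UNDEFINED_SYMBOL.match(message.strip())
    if match is None:
        return None
    return {"library": match.group("library"), "symbol": match.group("symbol")}


def check_torch_import() -> Optional[Dict[str, str]]:
    """Try to import torch; report the undefined symbol if that is why the import failed."""
    try:
        import torch  # noqa: F401
    except ImportError as exc:
        return parse_undefined_symbol(str(exc))
    return None
```

If `check_torch_import()` returns a dict whose `symbol` is `iJIT_NotifyEvent`, the environment is probably hitting this issue; any other import failure returns `None` and needs a separate look.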

Versions

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.13 (tags/v3.10.13-25-g07fbd8e9251-dirty:07fbd8e9251, Sep 27 2023, 23:32:09) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-174-generic-x86_64-with-glibc2.31
Is CUDA available: N/A
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 525.60.13
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             32
On-line CPU(s) list:                0-31
Thread(s) per core:                 1
Core(s) per socket:                 32
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              85
Model name:                         Intel Xeon Processor (Skylake)
Stepping:                           4
CPU MHz:                            2394.374
BogoMIPS:                           4788.74
Virtualization:                     VT-x
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1 MiB
L1i cache:                          1 MiB
L2 cache:                           128 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.2.2
[pip3] triton==2.2.0
[conda] blas                      1.0                         mkl    intel
[conda] mkl                       2024.1.0              intel_691    intel
[conda] pytorch                   2.2.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchtriton               2.2.0                     py310    pytorch

cc @seemethere @malfet @osalpekar @atalman

LiutongZhou changed the title from "undefined symbol: iJIT_NotifyEvent encountered when MKL 2024.1 is installed." to "ImportError undefined symbol: iJIT_NotifyEvent encountered when MKL 2024.1 is installed." on Apr 1, 2024
janeyx99 added the "module: binaries", "triaged", and "module: mkl" labels on Apr 1, 2024
@min-jean-cho
Collaborator

cc. @CuiYifeng

btbest added a commit to btbest/ilastik that referenced this issue Apr 4, 2024
This breaks the tiktorch backend, see pytorch/pytorch#123097
@walidabualafia

Hi all,

I am currently hitting this bug as well. It breaks the installation of the latest tbepler/topaz. Is there an ETA for a fix?

Thank you! :)

btbest added a commit to btbest/ilastik that referenced this issue Apr 5, 2024
This breaks the tiktorch backend, see pytorch/pytorch#123097
ElHouas commented Apr 5, 2024

I am experiencing this issue as well, when installing in a docker container using conda. Here is my conda env.yaml:

name: ml_env
channels:
  - pytorch
  - nvidia
  - conda-forge
  - nodefaults
dependencies:
  - python=3.10.7
  - mamba
  - pip
  - poetry=1.6.1
  - pytorch::pytorch=2.0.1
  - pytorch::torchaudio=2.0.2
  - pytorch::torchvision=0.15.2
  - pytorch::pytorch-cuda=11.8
platforms:
  - linux-64

Any advice?

Thanks! :)

@StefanGitHuber

I can reproduce this issue as well:

  1. First I had the issue with
    ImportError: intel_extension_for_pytorch xpu libintel-ext-pt-gpu.so: undefined symbol for _ZNK5torch8autograd4Node4nameB5cxx11Ev
    Following the thread "GPU examples undefined symbol" (intel-analytics/ipex-llm#8803), I ran
    ldd /home/suhu/.local/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
    linux-vdso.so.1 (0x00007ffe60f8a000)
    libtorch.so => not found
    libtorch_cpu.so => not found
    libc10.so => not found
    libxetla_kernels.so => /home/suhu/.local/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libxetla_kernels.so (0x0000739e4f600000)
    libmkl_intel_lp64.so.2 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_intel_lp64.so.2 (0x0000739e4e000000)
    libmkl_core.so.2 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_core.so.2 (0x0000739e49e00000)
    libmkl_gnu_thread.so.2 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_gnu_thread.so.2 (0x0000739e48400000)
    libmkl_sycl_blas.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_blas.so.4 (0x0000739e42c00000)
    libmkl_sycl_lapack.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_lapack.so.4 (0x0000739e40400000)
    libmkl_sycl_sparse.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_sparse.so.4 (0x0000739e39e00000)
    libmkl_sycl_dft.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_dft.so.4 (0x0000739e36e00000)
    libmkl_sycl_vm.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_vm.so.4 (0x0000739e2e000000)
    libmkl_sycl_rng.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_rng.so.4 (0x0000739e26200000)
    libmkl_sycl_stats.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_stats.so.4 (0x0000739e24200000)
    libmkl_sycl_data_fitting.so.4 => /opt/intel/oneapi/mkl/2024.1/lib/libmkl_sycl_data_fitting.so.4 (0x0000739e23800000)
    libze_loader.so.1 => /lib/x86_64-linux-gnu/libze_loader.so.1 (0x0000739ee9ac8000)
    libOpenCL.so.1 => /opt/intel/oneapi/compiler/2024.1/opt/oclfpga/host/linux64/lib/libOpenCL.so.1 (0x0000739e23400000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000739ee9ac3000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000739ee9abe000)
    libsvml.so => /opt/intel/oneapi/compiler/2024.1/lib/libsvml.so (0x0000739e21c00000)
    libirng.so => /opt/intel/oneapi/compiler/2024.1/lib/libirng.so (0x0000739e49d07000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000739e21800000)
    libimf.so => /opt/intel/oneapi/compiler/2024.1/lib/libimf.so (0x0000739e21200000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000739e52f19000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000739e4f5e0000)
    libintlc.so.5 => /opt/intel/oneapi/compiler/2024.1/lib/libintlc.so.5 (0x0000739e4df9e000)
    libsycl.so.7 => /opt/intel/oneapi/compiler/2024.1/lib/libsycl.so.7 (0x0000739e20e00000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000739e20a00000)
    /lib64/ld-linux-x86-64.so.2 (0x0000739ee9b45000)

pip install transformers==4.31.0
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

  2. Now, right at the beginning with
    import torch
    even before importing ipex with
    import intel_extension_for_pytorch as ipex
    I receive

File "~/.local/lib/python3.10/site-packages/torch/__init__.py", line 235, in <module>
from torch._C import * # noqa: F403
ImportError: ~/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent
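
The `ldd` output in step 1 can also be scanned for unresolved libraries programmatically. This is a minimal sketch, assuming the usual "name => target" layout of `ldd` lines; `unresolved_libraries` is a hypothetical helper name.

```python
from typing import List


def unresolved_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines look like "libfoo.so => /path/libfoo.so (0x...)" or "libfoo.so => not found".
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing
```

To run it against a real file, the text can come from `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`. For the output above it would list libtorch.so, libtorch_cpu.so and libc10.so.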

Here is my output from collect_env.py:

PyTorch version: N/A
PyTorch CXX11 ABI: N/A
IPEX version: N/A
IPEX commit: N/A
Build type: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
IGC version: 2024.1.0 (2024.1.0.20240308)
CMake version: version 3.26.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.35
Is XPU available: N/A
DPCPP runtime version: 2024.1
MKL version: 2024.1
GPU models and configuration:
N/A
Intel OpenCL ICD version: 23.52.28202.39-821~22.04
Level Zero version: 1.3.28202.39-821~22.04

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-12700H
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 3
CPU max MHz: 4700.0000
CPU min MHz: 400.0000
BogoMIPS: 5376.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 544 KiB (14 instances)
L1i cache: 704 KiB (14 instances)
L2 cache: 11.5 MiB (8 instances)
L3 cache: 24 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==1.9.4
[pip3] pytorch-metric-learning==1.7.3
[pip3] torch==2.1.0a0+cxx11.abi
[pip3] torch-audiomentations==0.11.0
[pip3] torch-pitch-shift==1.2.4
[pip3] torch-stft==0.1.4
[pip3] torchaudio==2.1.0.post0+cxx11.abi
[pip3] torchmetrics==0.11.4
[pip3] torchvision==0.16.0a0+cxx11.abi
[conda] N/A

Marcel-Mueck added a commit to PMBio/deeprvat that referenced this issue Apr 9, 2024
* added all changes from annotation-speedups branch

* added gtf and genotype mock file for github tests

* Delete example/annotations/preprocessing_workdir/preprocessed directory

* Update annotation_colnames_filling_values.yaml

* Corrected fill values for maf columns

* Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

* included rulegraph instead dag

* based on  suggestions from @endast

* added version info for rockdb.yaml file

* updated rulegraph

Updated Documentation

corrected nonfunctional links

* added support for X/Y chromosomes, removed dependency on pvcf file

* excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

* changed way file stems are assumed to include 'double ending' on input files.

* removed unused lines, removed pvcf from config file

* changed if statement for gene_id_file

---------

Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit to PMBio/deeprvat that referenced this issue Apr 10, 2024
commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit to PMBio/deeprvat that referenced this issue Apr 10, 2024
commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & repliation)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation like trainign

    * add missing cv_utils

    * write average burdens or each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replictaion and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocssing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * udpate utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
syedazi commented Apr 11, 2024

Is there a resolution to this issue? I faced the same issue using conda.

conda create -n fsdp python=3.10
conda activate fsdp

# Install pytorch and other dependencies
conda install -y pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
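
One possible workaround, based on the original report that downgrading MKL to 2024.0.0 resolves the error, is to add an explicit MKL pin to the install specs. The sketch below is an assumption and not an official fix: `with_mkl_pin` and `conda_install_command` are hypothetical helpers, and the pin `mkl=2024.0` is assumed to select a 2024.0.x build.

```python
import re
import shlex
from typing import List

# Assumption: a 2024.0.x MKL works, per the report that downgrading to 2024.0.0 helps.
MKL_PIN = "mkl=2024.0"


def with_mkl_pin(specs: List[str], pin: str = MKL_PIN) -> List[str]:
    """Append an MKL pin unless the specs already mention the mkl package."""
    for spec in specs:
        # Drop an optional "channel::" prefix, then cut at the first version operator.
        name = re.split(r"[=<>! ]", spec.split("::")[-1], maxsplit=1)[0]
        if name == "mkl":
            return list(specs)
    return list(specs) + [pin]


def conda_install_command(specs: List[str], channels: List[str]) -> str:
    """Build a 'conda install' command line for the pinned specs."""
    args = ["conda", "install", "-y", *with_mkl_pin(specs)]
    for channel in channels:
        args += ["-c", channel]
    return shlex.join(args)
```

Calling `conda_install_command` with the specs and channels from the command above yields the same command with `mkl=2024.0` added before the channel flags.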

titeup commented Apr 12, 2024

I don't have an exact explanation for the missing symbol, but here is what worked for me, in case it helps others.

I had the same error after installing ipex_llm.
I painfully found my way out: Python was mixing packages from the Intel Python in oneAPI 2024.1 with those in my local cache.

After loading the Intel environment, I ran:
python -m pip install torch==2.1.0.post0 torchvision==0.16.0.post0 torchaudio==2.1.0.post0 intel-extension-for-pytorch==2.1.20+xpu oneccl_bind_pt==2.1.200+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

This magic combination comes from this Doc

And then (needed in my case)
pip install transformers==4.36.2

I suppose those who want to use another Python can load only the MKL environment variables and use the PyTorch install that worked for them before (I haven't tried this myself).

Then everything worked fine.
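
Mixed installs like the one described above can be spotted by checking how many copies of a package are visible on the import path. This is a minimal sketch; `find_package_copies` is a hypothetical helper that only looks for regular packages (directories containing `__init__.py`) in the given search path, which defaults to `sys.path`.

```python
import sys
from pathlib import Path
from typing import List, Optional


def find_package_copies(package: str, search_path: Optional[List[str]] = None) -> List[str]:
    """List every directory on the search path that holds a copy of the package."""
    paths = sys.path if search_path is None else search_path
    found: List[str] = []
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        candidate = Path(entry or ".") / package
        if (candidate / "__init__.py").is_file() and str(candidate) not in found:
            found.append(str(candidate))
    return found
```

If `find_package_copies("torch")` returns more than one directory, for example one under a oneAPI Python and one under `~/.local`, the first entry is the copy Python will import.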

endast added a commit to PMBio/deeprvat that referenced this issue Apr 12, 2024

    * udpate utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
@jingxu10
Collaborator

The cause is that PyTorch was built against an older MKL distribution that still contained this symbol. The symbol was removed in MKL 2024.1.
The PyTorch binary released via the conda channel links to MKL dynamically, so it resolves the symbol against whichever MKL is installed in the environment, and that is why you get this error.
The PyTorch binary released via pip (pip install) links to MKL statically. You can switch to the pip-installed one to get rid of this error with MKL 2024.1.
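
As a rough illustration of the explanation above, the following sketch flags MKL releases that would be expected to break a dynamically linked PyTorch. It assumes, based only on this thread, that 2024.1 is the first release without `iJIT_NotifyEvent` and that later releases do not restore it. The helper names `mkl_breaks_conda_torch` and `installed_mkl_version` are hypothetical, and `importlib.metadata` only finds MKL when its package metadata is visible to Python, which may not hold in every conda environment.

```python
from importlib import metadata


def mkl_breaks_conda_torch(mkl_version: str) -> bool:
    """Return True if this MKL version is expected to lack iJIT_NotifyEvent.

    Assumption: 2024.1 is the first release without the symbol (per this
    thread) and later releases do not restore it.
    """
    major, minor = (int(part) for part in mkl_version.split(".")[:2])
    return (major, minor) >= (2024, 1)


def installed_mkl_version():
    """Return the installed MKL version string, or None if Python cannot see it."""
    try:
        return metadata.version("mkl")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_mkl_version()
    if version is None:
        print("No MKL distribution metadata found in this environment.")
    elif mkl_breaks_conda_torch(version):
        print(f"MKL {version}: keep MKL at 2024.0 or use the pip wheel of torch.")
    else:
        print(f"MKL {version}: not expected to trigger the missing-symbol error.")
```

For the conda case, the issue description reports that downgrading MKL to 2024.0.0 resolves the import error, so a check like this would mainly serve as an early warning before `import torch` fails.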

endast added a commit to PMBio/deeprvat that referenced this issue Apr 15, 2024
commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
@StefanGitHuber

Hi Jing Xu,
I have never used conda, only pip, to install PyTorch. I want to use the XPU of my Intel Arc A730M on my notebook, which runs Ubuntu 22.04.4 LTS.

Here are my steps again:

  1. Install oneAPI

bash ./intelpython3-2024.1.0_814-Linux-x86_64.sh -b -u -p ~/intel/oneapi/intelpython
source ~/intel/oneapi/intelpython/env/vars.sh

This opens the environment in bash:
(oneapi-intelpython)

  2. Install PyTorch extension

github.com/intel/intel-extension-for-pytorch/tree/xpu-main
python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

import torch
import intel_extension_for_pytorch as ipex
ImportError: ~/intel/oneapi/intelpython/lib/python3.9/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev

  3. Solve _ZNK5torch8autograd4Node4nameB5cxx11Ev

https://community.intel.com/t5/Intel-Developer-Cloud/ImportError-libintel-ext-pt-gpu-so-undefined-symbol/m-p/1561667
pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu

  4. Solve iJIT_NotifyEvent

import torch
ImportError: ~/intel/oneapi/intelpython/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent

Possibly I shouldn't have performed step 3: the "upgrade" via installing bigdl-llm[xpu_2.1] is what leads me to this iJIT_NotifyEvent ImportError. Showing only the diff from collect_env.py, it actually downgrades from

PyTorch version: 2.2.2+cu121
PyTorch CXX11 ABI: No
[pip3] intel-extension-for-pytorch==2.1.20+xpu
[pip3] torch==2.2.2
[pip3] torchvision==0.16.0.post0+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.20+xpu pypi_0 pypi
[conda] torch 2.2.2 pypi_0 pypi
[conda] torchvision 0.16.0.post0+cxx11.abi pypi_0 pypi

to

PyTorch version: N/A
PyTorch CXX11 ABI: N/A
[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] numpy==1.26.4
[pip3] torch==2.1.0a0+cxx11.abi
[pip3] torchaudio==2.1.0.post0+cxx11.abi
[pip3] torchvision==0.16.0a0+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.10+xpu pypi_0 pypi
[conda] torch 2.1.0a0+cxx11.abi pypi_0 pypi
[conda] torchvision 0.16.0a0+cxx11.abi pypi_0 pypi

Question:
Apparently there are two torch versions, one installed via pip3 and one installed via conda. How do I switch to the one installed via pip? Do I have to set an environment variable pointing to it? If so, how exactly?

Thanks in advance ...
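
On the question above: as far as I know, the `[conda]` lines in `collect_env.py` output come from `conda list`, where a build string of `pypi_0` with channel `pypi` marks a package that pip installed into the conda environment. Under that reading, the `[pip3]` and `[conda]` lines in the listings most likely describe one and the same torch install rather than two copies. A minimal sketch that makes the check explicit follows. The function name `installed_by_pip` is hypothetical, and the parsing assumes the whitespace-separated `name version build channel` layout seen in the listings.

```python
def installed_by_pip(conda_line: str) -> bool:
    """Return True if a '[conda] ...' line describes a pip-installed package."""
    fields = conda_line.split()
    # Drop the '[conda]' prefix added by collect_env.py, if present.
    if fields and fields[0] == "[conda]":
        fields = fields[1:]
    # Expected layout (assumption): name, version, build, channel.
    if len(fields) < 4:
        return False
    build, channel = fields[2], fields[3]
    return build.startswith("pypi") and channel == "pypi"


lines = [
    "[conda] torch 2.2.2 pypi_0 pypi",
    "[conda] torchvision 0.16.0.post0+cxx11.abi pypi_0 pypi",
]
for line in lines:
    source = "pip" if installed_by_pip(line) else "conda"
    print(f"{line} -> installed by {source}")
```

If every torch line carries `pypi_0 pypi`, there is no separate conda build of torch to switch away from.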

Marcel-Mueck pushed a commit to PMBio/deeprvat that referenced this issue Apr 16, 2024
* Add new test files

* Update test_preprocess.py

* Use parquet

* Add brians code

* Update preprocess.py

* sort samples

* Remove threads

* Update exclude calls logic

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & repliation)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation like trainign

    * add missing cv_utils

    * write average burdens or each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replictaion and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocssing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * udpate utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

* Revert "Squashed commit of the following:"

This reverts commit ebde7c1.

* Remove unused import

* don't use mkl 2024.1.0

* update micromamba@v1.8.1

* Isolate failing test

* test genotype matrix

* Revert "test genotype matrix"

This reverts commit 6deee9b.

* Revert "Isolate failing test"

This reverts commit 6a11fe3.

* fixup! Format Python code with psf/black pull_request

* remove files

* Delete variants.tsv.gz

* Update test_preprocess.py

* Update test_preprocess.py

* fixup! Format Python code with psf/black pull_request

* Update test_preprocess.py

* Update test-runner.yml

* one test

* Revert "one test"

This reverts commit 05e4578.

* Revert "Update test-runner.yml"

This reverts commit ff78d30.

* update call filter test data

* Update expected data

* Update deeprvat_preprocessing_env.yml

Remove joblib

* Squashed commit of the following: 101feb2, 628af87, 1356ed2, 4d9ef64, ada0aaa (commit messages identical to the first squashed commit quoted earlier in this message)

* Revert change of micromamba

* Ruff check

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit to PMBio/deeprvat that referenced this issue Apr 16, 2024
commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add brians code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following: 101feb2, 628af87, 1356ed2, 4d9ef64, ada0aaa (commit messages identical to those quoted in full in the push event above)

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & repliation)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like trainign

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replictaion and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocssing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * udpate utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit to PMBio/deeprvat that referenced this issue Apr 16, 2024
* add qc_indmiss

* Update preprocess_with_qc.snakefile

* Fix csv

* add process_individual_missingness cmd

* add process_individual_missingness

* Use separate variable for sample_path

* Only write sample to indmiss file

* add test_process_individual_missingness tests

* Add sample missingness to workflow

* Update dag images in doc

* Update test_preprocess.py

* add back create_excluded_samples_dir

* Cleanup pipeline

* fixup! Format Python code with psf/black pull_request

* Update preprocess.py

* fixup! Format Python code with psf/black pull_request

* Fix ruff errors

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add brians code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & repliation)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like trainign

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replictaion and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocssing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * udpate utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & repliation)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like trainign

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replictaion and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocssing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * udpate utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Revert "Squashed commit of the following:"

This reverts commit 4e9b47d.

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
@yanbing-j
Collaborator

Hi @LiutongZhou, according to https://github.com/pytorch/builder/blob/main/conda/pytorch-nightly/meta.yaml#L18-L45, the PyTorch conda wheel links MKL dynamically, and the expected version is MKL 2023.x. It is compatible up to MKL 2024.0. Please downgrade MKL <= 2024.0, or do not specify an MKL version.

Meanwhile, the PyTorch conda wheel and pip wheel use different MKL versions: the conda wheel dynamically links MKL 2023, while the pip wheel statically links MKL 2022. Hi @malfet, could you please clarify why the conda and pip wheels use different channels to build? Does https://github.com/pytorch/builder/blob/main/conda/pytorch-nightly/meta.yaml#L18-L45 control the MKL version in the conda wheel? Does https://github.com/pytorch/builder/blob/main/common/install_mkl.sh control the MKL version in the pip wheel? Thanks!
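
The compatibility ceiling mentioned in this comment (up to MKL 2024.0) could be checked with a small version comparison. This is only a sketch: the helper names `parse_version` and `mkl_is_compatible` and the hard-coded ceiling are assumptions for illustration, not part of PyTorch or MKL.

```python
def parse_version(text):
    """Turn a dotted version such as '2024.1.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Assumption: ceiling taken from the comment above (compatible up to MKL 2024.0).
MAX_COMPATIBLE = (2024, 0)


def mkl_is_compatible(version):
    """Return True if the major.minor part of `version` is at most 2024.0."""
    return parse_version(version)[:2] <= MAX_COMPATIBLE


if __name__ == "__main__":
    for candidate in ("2023.1.0", "2024.0.0", "2024.1.0"):
        status = "ok" if mkl_is_compatible(candidate) else "too new"
        print(candidate, status)
```

Only the major and minor components are compared, so any 2024.0.x patch release is treated as compatible.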

JSchlensok added a commit to JSchlensok/VespaG that referenced this issue Apr 23, 2024
@moi90
Contributor

moi90 commented Apr 24, 2024

Please downgrade MKL <= 2024.0

Thanks, @yanbing-j! This fixed the problem for me.

@Tianci-Wen

I solved this problem by adding - mkl==2024.0 to the environment.yml.

name: scaffold_gs
channels:
  - pytorch
  - pyg
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.6
  - plyfile=0.8.1
  - python=3.7.13
  - pip=22.3.1
  - pytorch=1.12.1
  - torchaudio=0.12.1
  - torchvision=0.13.1
  - pytorch-scatter
  - tqdm
  - mkl==2024.0
  - pip:
    - einops
    - wandb
    - lpips
    - laspy
    - submodules/diff-gaussian-rasterization
    - submodules/simple-knn
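
To confirm that an environment file like the one above actually carries an MKL pin, a line-based scan is one option. This is a rough sketch that avoids a YAML parser; the function name `find_mkl_pin` and the set of operators it recognizes are assumptions for illustration.

```python
import re

# Matches list entries such as "- mkl==2024.0", "- mkl=2024.0" or "- mkl<=2024.0".
MKL_PIN = re.compile(r"^\s*-\s*mkl\s*(==|<=|=)\s*([0-9][0-9.]*)\s*$")


def find_mkl_pin(env_text):
    """Return (operator, version) for the first mkl pin, or None if unpinned."""
    for line in env_text.splitlines():
        match = MKL_PIN.match(line)
        if match:
            return match.group(1), match.group(2)
    return None
```

Packages whose names merely start with `mkl` (for example `mkl-service`) are not matched, because the operator must follow the name directly.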

@avivko

avivko commented May 11, 2024

Seems to be mkl indeed. All it takes is the following (shown here with mamba, but conda would work too), and then you can import torch successfully:
mamba install mkl==2024.0
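
Whether a failing import is this particular MKL mismatch can be told from the error text, since the issue reports `undefined symbol: iJIT_NotifyEvent`. The sketch below assumes that matching on this symbol name is a good enough signal; the function names and the returned labels are hypothetical.

```python
import importlib


def classify_import_error(message):
    """Label an ImportError message based on the symbol named in this issue."""
    if "iJIT_NotifyEvent" in message:
        return "mkl-mismatch"
    return "other-import-error"


def diagnose_import(module_name="torch"):
    """Try to import a module and return 'ok' or a label for the failure."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return classify_import_error(str(exc))
    return "ok"
```

Calling `diagnose_import()` in the affected environment should return `"mkl-mismatch"` before the downgrade and `"ok"` afterwards, provided the error message matches the one reported above.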
