Issue with unordered_map in remap.pyx scikit-image 0.17.2 #5232

Closed
anirudh2290 opened this issue Feb 11, 2021 · 8 comments

Comments

@anirudh2290

Description

We found an issue when using SageMaker Distributed Model Parallel together with scikit-image 0.17.2 on Python 3.6.

To give more detail on the crash: unordered_map indexing in the SageMaker Distributed Model Parallel code (smdistributed/modelparallel/backend/threads.cc) calls into a symbol that lives in scikit-image's _remap extension:

#6 0x00007fad51fa2a89 in std::_Hashtable<int, std::pair<int const, int>, std::allocator<std::pair<int const, int> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<int const, int>, false>*) ()
from /opt/conda/lib/python3.6/site-packages/skimage/util/_remap.cpython-36m-x86_64-linux-gnu.so

Our workaround was to downgrade skimage to 0.16.0 or change the import order.

Below is the import order with the issue:

import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms
from albumentations import (
    RandomResizedCrop, HorizontalFlip, IAAPerspective, ShiftScaleRotate, CLAHE, RandomRotate90,
    Transpose, Blur, OpticalDistortion, GridDistortion, HueSaturationValue,
    IAAAdditiveGaussianNoise, GaussNoise, MotionBlur, MedianBlur, RandomBrightnessContrast, IAAPiecewiseAffine,
    IAASharpen, IAAEmboss, Flip, OneOf, Compose, Resize, VerticalFlip, CenterCrop, Normalize
)
import smdistributed.modelparallel.torch as smp 

Below is the import order without the issue:

from albumentations import (
    RandomResizedCrop, HorizontalFlip, IAAPerspective, ShiftScaleRotate, CLAHE, RandomRotate90,
    Transpose, Blur, OpticalDistortion, GridDistortion, HueSaturationValue,
    IAAAdditiveGaussianNoise, GaussNoise, MotionBlur, MedianBlur, RandomBrightnessContrast, IAAPiecewiseAffine,
    IAASharpen, IAAEmboss, Flip, OneOf, Compose, Resize, VerticalFlip, CenterCrop, Normalize
)
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms
import smdistributed.modelparallel.torch as smp 

scikit-image is a dependency of the albumentations package.

Would appreciate any insights on the issue.

@jni
Member

jni commented Feb 15, 2021

Wow. All I can say is that I would not put it past PyTorch to redefine C++ stdlib linkage at import time. remap.pyx uses unordered_map from the C++ stdlib, which needs a relatively recent stdlib; see these lines from #4612.

Overall, my feeling is that if:

from skimage.util import map_array

works, but

import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms
import smdistributed.modelparallel.torch as smp 
from skimage.util import map_array

fails, then the issue is a pytorch issue and not a scikit-image issue... Could you check that for us, @anirudh2290?
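
One way to probe the idea that an earlier import changes how later extension modules get linked is to compare the interpreter's dlopen flags before and after that import. The sketch below is a diagnostic built on assumptions, not something confirmed in this thread: it uses only `sys.getdlopenflags()` and `os.RTLD_GLOBAL` from the standard library (Unix only), and `describe_dlopen_flags` is a hypothetical helper name. If `RTLD_GLOBAL` were to switch on after importing the suspect package, symbols from shared objects loaded afterwards would be visible process-wide, which could let one library's template instantiation be picked up by another.

```python
import os
import sys


def describe_dlopen_flags(flags=None):
    """Report whether RTLD_GLOBAL is set in the given dlopen flags.

    With no argument, inspects the flags the interpreter currently uses
    when it loads extension modules (Unix only).
    """
    if flags is None:
        flags = sys.getdlopenflags()
    return {
        "raw": flags,
        "RTLD_GLOBAL": bool(flags & os.RTLD_GLOBAL),
    }


if __name__ == "__main__":
    print("before:", describe_dlopen_flags())
    # Hypothetical usage: import the suspect package here, for example
    #   import torchvision
    # and then print the flags again to see whether they changed.
    print("after: ", describe_dlopen_flags())
```

Unchanged flags would not rule out a linkage problem; they would only rule out this one mechanism.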

@anirudh2290
Author

Thank you @jni for your reply! We observe the following:
1. Below works fine:

from skimage.util import map_array
from torchvision import datasets, models, transforms
import smdistributed.modelparallel.torch as smp

2. Below fails with the error I pasted above:

from torchvision import datasets, models, transforms
from skimage.util import map_array
import smdistributed.modelparallel.torch as smp

3. Below works fine:

from torchvision import datasets, models, transforms
import smdistributed.modelparallel.torch as smp
from skimage.util import map_array
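
Because the outcome depends on import order, each ordering is best checked in a fresh interpreter so that modules loaded by one attempt cannot affect the next. The following is a minimal sketch of such a harness, assuming the three orderings listed above; `ORDERINGS`, `run_in_fresh_interpreter`, `check_all`, and the `extra_code` parameter are hypothetical names introduced for illustration. Since the reported abort happens when an smp.step-decorated function runs rather than at import time, a real reproduction would have to pass that failing call as `extra_code`.

```python
import subprocess
import sys

# The three orderings reported above, keyed by their number in the comment.
ORDERINGS = {
    1: [
        "from skimage.util import map_array",
        "from torchvision import datasets, models, transforms",
        "import smdistributed.modelparallel.torch as smp",
    ],
    2: [
        "from torchvision import datasets, models, transforms",
        "from skimage.util import map_array",
        "import smdistributed.modelparallel.torch as smp",
    ],
    3: [
        "from torchvision import datasets, models, transforms",
        "import smdistributed.modelparallel.torch as smp",
        "from skimage.util import map_array",
    ],
}


def run_in_fresh_interpreter(lines, extra_code="", timeout=600):
    """Run the import lines (plus optional extra code) in a new Python process.

    Returns (returncode, stderr). Written without capture_output so it also
    runs on Python 3.6.
    """
    code = "\n".join(list(lines) + [extra_code])
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def check_all(extra_code=""):
    """Return {ordering number: return code} for every ordering."""
    results = {}
    for key, lines in ORDERINGS.items():
        returncode, _ = run_in_fresh_interpreter(lines, extra_code)
        results[key] = returncode
    return results
```

On POSIX systems a negative return code -N means the child was killed by signal N, so an abort like the one in this issue would show up as a nonzero, negative value for the failing ordering.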

@jni
Member

jni commented Feb 16, 2021

Wow, that's insane. LOL

To clarify, does

from torchvision import datasets, models, transforms
from skimage.util import map_array

fail by itself? Or do you need the third line?

I have no insight whatsoever into how the smdistributed import could "fix" the skimage import.

@anirudh2290
Author

anirudh2290 commented Feb 16, 2021

Hi @jni. To clarify, below is the stack trace that we see:

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fad77a6b921 in __GI_abort () at abort.c:79
#2 0x00007fad77ab4967 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7fad77be1b0d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fad77abb9da in malloc_printerr (
str=str@entry=0x7fad77be3720 "munmap_chunk(): invalid pointer") at malloc.c:5342
#4 0x00007fad77ac2fbc in munmap_chunk (p=0x7fac40d284e8 <smp::backend::state+5512>)
at malloc.c:2846
#5 __GI___libc_free (mem=0x7fac40d284f8 <smp::backend::state+5528>) at malloc.c:3127
#6 0x00007fad51fa2a89 in std::_Hashtable<int, std::pair<int const, int>, std::allocator<std::pair<int const, int> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<int const, int>, false>*) ()
from /opt/conda/lib/python3.6/site-packages/skimage/util/_remap.cpython-36m-x86_64-linux-gnu.so
#7 0x00007fac40c512ef in std::__detail::_Map_base<int, std::pair<int const, int>, std::allocator<std::pair<int const, int> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[] (
__k=@0x7fac11ae292c: 1, this=<optimized out>)
at /usr/include/c++/7/bits/hashtable_policy.h:728
#8 0x00007fac40c5a231 in std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >::operator[] (__k=@0x7fac11ae292c: 1,
this=<optimized out>) at /usr/include/c++/7/bits/unordered_map.h:973

The unordered_map in frame #8 (the operator[] call) comes from a .cc file in smdistributed and is hit when Python executes an smp.step-decorated function.

So without the smdistributed import we cannot reproduce the issue, though the same thing could potentially happen with other Python libraries that have a C++ backend using unordered_map.

As you mentioned, could it be that the torch import causes the remap.pyx to link with an incompatible stdlib causing this issue?
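
One cheap check for the "incompatible stdlib" hypothesis is to see whether more than one copy of libstdc++ is mapped into the failing process. The sketch below assumes Linux, where `/proc/self/maps` lists every mapped file; `loaded_libraries` and `current_process_libraries` are hypothetical helper names. Finding two different libstdc++ paths would be suggestive rather than conclusive, and finding only one would not rule out a symbol clash between the extension modules themselves.

```python
def loaded_libraries(maps_text, needle="libstdc++"):
    """Return the sorted, unique mapped file paths that contain `needle`.

    `maps_text` is the content of a /proc/<pid>/maps file, whose lines have
    the form: address perms offset dev inode pathname
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Anonymous mappings have no pathname and therefore only 5 fields.
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def current_process_libraries(needle="libstdc++"):
    """Inspect the running interpreter itself (Linux only)."""
    with open("/proc/self/maps") as maps_file:
        return loaded_libraries(maps_file.read(), needle)
```

Calling `current_process_libraries()` right after the imports in the failing order, and again in a working order, would show whether the set of loaded C++ runtimes differs between the two.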

@jni
Member

jni commented Feb 16, 2021

Sorry, can you point me to where the unordered_map call lives in smp? Looking at the source code I don't find any .cc files?

As you mentioned, could it be that the torch import causes the remap.pyx to link with an incompatible stdlib causing this issue?

I kinda thought so, but I have no insight as to why putting smp in the middle would fix it, nor why it would crash when smp is imported after...

By the way, is any of this code actually calling skimage code? Or is it really just an import?

@anirudh2290
Author

SMP (SageMaker Distributed Model Parallel) is closed source, so I won't be able to share the source, but it doesn't depend on skimage. It's just that importing skimage "in between" breaks the unordered_map operator[] call inside smdistributed.
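
In the stack trace, frame #6 resolves into skimage's `_remap` shared object while frame #7 is in smdistributed. That pattern suggests, though nothing in this thread confirms it, that `_remap` exports its `std::_Hashtable` instantiation as a dynamic symbol and the smdistributed call binds to it. A rough way to check is to list the matching symbols each shared object exports. The sketch below assumes GNU `nm` from binutils is installed; `exported_symbols` and `nm_dynamic_symbols` are hypothetical helper names.

```python
import subprocess


def exported_symbols(nm_output, needle="std::_Hashtable"):
    """Filter demangled `nm` output down to (type, name) pairs containing `needle`.

    Each line of `nm -D -C --defined-only` has the form: address type name,
    where a demangled name may itself contain spaces.
    """
    matches = []
    for line in nm_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and needle in parts[2]:
            matches.append((parts[1], parts[2]))
    return matches


def nm_dynamic_symbols(path):
    """Run GNU nm on a shared object and return its demangled dynamic symbols."""
    result = subprocess.run(
        ["nm", "-D", "-C", "--defined-only", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Running `exported_symbols(nm_dynamic_symbols(path))` on both the `_remap` shared object named in the trace and the smdistributed backend library would show whether both define the same `_M_insert_unique_node` symbol. Type `W` in the first element marks a weak definition, which is how template instantiations are typically emitted.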

@anirudh2290
Author

Thanks a lot @jni for the help and inputs. Since I am unable to provide more information to help with the issue here, I am going to close the issue.

@jni
Member

jni commented Feb 24, 2021

🙏 I hope the discussion was somehow useful in helping you move forward at least... If you do figure out something we can do differently to prevent it, we'd be happy to consider it, but for now I don't see anything we're doing obviously wrong to cause this... Sorry we couldn't get to the bottom of it.
