Error in `python': double free or corruption (!prev) #6968

Closed
yaroslavvb opened this Issue Jan 20, 2017 · 44 comments

yaroslavvb (Contributor) commented Jan 20, 2017

I'm consistently getting this error when stopping training (Ctrl+C) on a version built from head on Jan 17. On the other hand, a version built from the Jan 5 head does not exhibit this behavior.

tf.__git_version__ = '0.12.1-1934-g27fca7d-dirty'

session.run completed in 0.01 sec with .0.500000 acc
session.run completed in 0.02 sec with .0.000000 acc
^CTraceback (most recent call last):
  File "train.py", line 247, in <module>
    a,_ = sess.run([train_acc,optimizer], feed_dict)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
KeyboardInterrupt
*** Error in `python': double free or corruption (!prev): 0x00000000016c55d0 ***
Aborted (core dumped)

Looking at the core dump, it looks like a dictionary deletion.

#0  0x00007fe9cbf8a01f in _int_free (av=0x7fe9cc2c9760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3996
#1  0x00007fe9cceb500a in dict_dealloc (mp=0x7fe9558073c8) at Objects/dictobject.c:1596
#2  0x00007fe9cced121f in subtype_dealloc (self=0x7fe95580a080) at Objects/typeobject.c:1193
#3  0x00007fe9cceb023f in free_keys_object (keys=0x24f9620) at Objects/dictobject.c:354
#4  0x00007fe9cced3936 in type_clear (type=0x24f9c68) at Objects/typeobject.c:3270
#5  0x00007fe9ccf8a97c in delete_garbage (old=<optimized out>, collectable=<optimized out>) at Modules/gcmodule.c:866
#6  collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1014
#7  0x00007fe9ccf8aedd in _PyGC_CollectNoFail () at Modules/gcmodule.c:1605
#8  0x00007fe9ccf5e6d5 in PyImport_Cleanup () at Python/import.c:428
#9  0x00007fe9ccf6a90e in Py_Finalize () at Python/pylifecycle.c:576
#10 0x00007fe9ccf891b9 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:789
#11 0x0000000000400add in main (argc=2, argv=0x7ffde1cf3f98) at ./Programs/python.c:65
yaroslavvb (Contributor) commented Jan 20, 2017

I suspect this error is connected to jemalloc since that got added recently. Turning on tcmalloc through export LD_PRELOAD="/usr/lib/libtcmalloc.so.4" gets rid of the error

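For reference, a minimal sketch of that preload workaround (the package name and library path are Ubuntu defaults and are assumptions; adjust them for your distribution):

# install gperftools' tcmalloc, then preload it before Python starts
sudo apt-get install libgoogle-perftools4
export LD_PRELOAD="/usr/lib/libtcmalloc.so.4"
python train.py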

yaroslavvb changed the title from Error in `python': double free or corruption (!prev) to (jemalloc) Error in `python': double free or corruption (!prev) on Jan 20, 2017

mrry (Contributor) commented Jan 20, 2017

It's possible that the malloc changes are responsible... @jhseu would know best, since he made those changes.


jhseu (Member) commented Jan 24, 2017

I haven't been able to reproduce it yet. Do you happen to have a script that I can use?

It seems unlikely to be related to the jemalloc change. We don't (and technically can't) override Python's malloc/free. My guess is that setting tcmalloc is just hiding the error and that the issue is in the script itself.

If you have time, mind disabling jemalloc in ./configure and trying again?
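For anyone trying that, a rough sketch of a rebuild with jemalloc disabled (the configure prompt and bazel target follow the TF 1.x source tree and may differ between versions):

cd tensorflow
./configure    # answer "n" when asked whether to use jemalloc as the malloc implementation
bazel build -c opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl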


yaroslavvb (Contributor) commented Jan 24, 2017

I think I'll close this for now as unreproducible and reopen when someone can provide more info


yaroslavvb closed this on Jan 24, 2017

Montmorency commented Feb 7, 2017

Not sure if this is related, but I am unable to build with jemalloc on a cluster with gcc 4.8.2 and an older version of libc.so.6 (old enough that I have to build TensorFlow from source). The build fails with this message:

ERROR:~/.cache/bazel/bazel/e924d9c3ba75314415252c6f4f93bb86/external/jemalloc/BUILD:10:1: C++ compilation of rule '@jemalloc//:jemalloc' failed: gcc failed: error executing command /opt/apps/compilers/gcc/4.8.2/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -B/opt/apps/compilers/gcc/4.8.2/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 38 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.

Disabling jemalloc at the configure stage means I can finish the TensorFlow build and install. I can then train my models (i.e. model.fit()) on the cluster, but when I try to run model.predict_prob() I get this error:
*** glibc detected *** python: double free or corruption (!prev): 0x00000000013fcda0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3e22e75e66]
/lib64/libc.so.6[0x3e22e789b3]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x67)[0x3e226112f7]
/lib64/libpthread.so.0[0x3e2320675d]
/lib64/libpthread.so.0[0x3e232078ea]
/lib64/libpthread.so.0(pthread_join+0xd4)[0x3e232081f4]
/opt/apps/compilers/gcc/4.8.2/lib64/libstdc++.so.6(_ZNSt6thread4joinEv+0x27)[0x7f657b6263c7]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x2567e60)[0x7f657dde2e60]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow6thread10ThreadPool4ImplD0Ev+0xb3)[0x7f657ddbf5b3]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow6thread10ThreadPoolD1Ev+0x1a)[0x7f657ddbfc1a]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow10FileSystem16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0x592)[0x7f657dddf8a2]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow3Env16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0x9b)[0x7f657dddbc1b]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0xadc66b)[0x7f657c35766b]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0xade610)[0x7f657c359610]

So just wondering if there is some conflict with compiler/glibc libraries that is causing these issues?


jhseu (Member) commented Feb 8, 2017

Strangely, I'm getting the exact same stack trace for #7338 (and it happens even with jemalloc disabled). Pretty hard to track down the root cause...


dennybritz (Contributor) commented Feb 10, 2017

Can confirm this. I ran into this on my CI server and the following fixed it:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
jhseu (Member) commented Feb 10, 2017

What Linux distribution and version were you running?

In #7338, it seemed to be an issue with pypi's numpy on Ubuntu 14.04. Compiling from source fixed it.


dennybritz (Contributor) commented Feb 10, 2017

Using Ubuntu 14.04 - https://circleci.com/docs/build-image-trusty/. I was using the tf1.0-rc2 pip package.


jhseu (Member) commented Feb 11, 2017

So I think this is some interaction between pypi's numpy and Ubuntu 14.04. Either upgrading to Ubuntu 16.04 or building numpy from source (pip install --no-binary) fixes it in every case I've tried (GPU/no GPU, jemalloc disabled/enabled, Python 2.7/3.4/3.5)
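For concreteness, a sketch of the numpy-from-source route (requires a reasonably recent pip; uninstall the binary wheel first so the source build actually replaces it):

pip uninstall -y numpy
pip install --no-binary=:all: numpy    # build numpy from source instead of the prebuilt wheel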


AjayTalati commented Feb 11, 2017

Hi, I got a similar Error in `python': double free or corruption (!prev) error using TF 0.12, built from source today on Ubuntu 14.04.

This was solved with @dennybritz's fix, four posts above. Thank you very much @dennybritz

Will try upgrading to 16.04 as @jhseu recommends


taion (Contributor) commented Feb 13, 2017

Would it be possible to rebuild the Docker images on gcr.io? The underlying nvidia/cuda:8.0-cudnn5-devel uses Ubuntu 16.04 now, but the gcr.io/tensorflow/tensorflow:1.0.0-rc2-devel-gpu image still uses Ubuntu 14.04.


jhseu (Member) commented Feb 13, 2017

I'm updating all the Docker images, but the gcr.io ones won't be updated until after 1.0. In the meantime, you can start from nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04 and just add RUN pip install tensorflow_gpu
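A minimal sketch of such an image, written as a shell snippet; installing python-pip first is my assumption, since the CUDA base images don't ship pip:

cat > Dockerfile <<'EOF'
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
RUN apt-get update && apt-get install -y python-pip && pip install tensorflow_gpu
EOF
docker build -t tf-gpu-ubuntu16.04 .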


taion (Contributor) commented Feb 13, 2017

Sorry, I mean the Docker images for the 1.0.0 RCs specifically – or at least any upcoming ones?

This is actually a bit weird to me, because the nvidia/cuda:8.0-cudnn5-devel (with no OS suffix) image on Docker Hub has been pointing at Ubuntu 16.04 since at least 24 days ago, which predates all the TensorFlow 1.0.0 RC releases.


jhseu (Member) commented Feb 13, 2017

Yeah, Docker caches image versions and we had 14.04 cached. @caisq is planning to upgrade those.


jhseu (Member) commented Feb 15, 2017

To narrow down the issue: someone internally noticed that this crash only happens when numpy is installed with OpenBLAS support on Ubuntu 14.04. I haven't tested whether upgrading libopenblas fixes it.

So, if you're on Ubuntu, the workaround is to make sure you don't have libopenblas-dev installed and to run pip install --no-binary=:all: numpy

If someone encounters this bug and has time to test out newer versions of libopenblas-dev, that'd be useful.
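To check whether you're in the problematic configuration before reinstalling anything, something along these lines works (Debian/Ubuntu tooling; numpy.show_config() reports the BLAS the installed numpy was linked against):

python -c "import numpy; numpy.show_config()"    # look for openblas entries
dpkg -l | grep openblas                          # see which OpenBLAS packages are installed
sudo apt-get remove libopenblas-dev              # then rebuild numpy from source as above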


brando90 commented Feb 23, 2017

I found this problem when using the 1.0.0 Docker image gcr.io/tensorflow/tensorflow:1.0.0-devel-py3. Is this the source of the problem? Do I need to change anything in my containers/images/Dockerfile to avoid this issue?


jhseu (Member) commented Feb 23, 2017

The nightly docker images are on Ubuntu 16.04, which should make the problem go away. For now, I'd recommend using the nightly.

The official builds will be on Ubuntu 16.04 when TensorFlow 1.1 is out.


brando90 commented Feb 23, 2017

nightly docker?


caisq (Contributor) commented Feb 23, 2017

@brando90: Nightly TF docker images are pushed to Docker Hub. Example command line to use them:
docker run -it --rm tensorflow/tensorflow:nightly /bin/bash
nvidia-docker run -it --rm tensorflow/tensorflow:nightly-gpu /bin/bash
nvidia-docker run -it --rm tensorflow/tensorflow:nightly-gpu-py3 /bin/bash


brando90 commented Feb 23, 2017

ah ok, thanks. I will try nightly.

When is 1.1 out?


jhseu (Member) commented Feb 24, 2017

Sometime in March


crimsonlander commented Feb 26, 2017

I am getting a segmentation fault at exit on Ubuntu 14.04, TensorFlow 1.0.0 from pip. Using tcmalloc makes things worse: it crashes on session creation.


yaroslavvb (Contributor) commented Feb 26, 2017

@crimsonlander stack traces might be useful, as well as possible ideas of what could be special about your configuration, since I've used tcmalloc with TensorFlow from 0.9 on Ubuntu 14.04 with no crashes


crimsonlander commented Feb 26, 2017

@yaroslavvb I had the issue with LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4". Setting it to LD_PRELOAD="/usr/lib/libtcmalloc.so.4" resolved the issue. Both are from the standard 14.04 repository.


jhseu (Member) commented Feb 26, 2017

@crimsonlander Try the numpy/disabling OpenBLAS workaround above? Upgrading to Ubuntu 16.04 is the most reliable fix, though.


yaroslavvb (Contributor) commented Feb 26, 2017

@jhseu is internal testing done on 16.04 now? We've been holding off on upgrading from 14.04 partly because TF testing was on 14.04


jhseu (Member) commented Feb 26, 2017

@yaroslavvb Yep, Jenkins and all our Docker images use 16.04 now.


jhseu changed the title from (jemalloc) Error in `python': double free or corruption (!prev) to Error in `python': double free or corruption (!prev) on Mar 7, 2017

estelll commented Mar 9, 2017

I have hit the same problem when running the MNIST tutorial script fully_connected_feed.py.
After executing 2 episodes it stopped and printed "*** Error in '/usr/bin/python3.4': double free or corruption (!prev): 0x000000000242efa0 ***".
Then I ran it again (python fully_connected_feed.py); it still stopped after 2 episodes and printed "*** Error in '/usr/bin/python3.4': invalid pointer: 0x0000000002960900 ***".
Does it matter, and how can I fix it? What should I do?


jtoy commented Mar 9, 2017

As of March 9, I get this error testing im2txt on all Docker images up to 1.0.1, but the error is not in latest-devel.


AKSHAYUBHAT commented Mar 31, 2017

I am facing this issue regardless of the Ubuntu version used (Xenial vs. Trusty) or the Docker image (devel, nightly, or 1.0), if numpy is installed via apt-get install python-numpy or gets pulled in while installing, say, OpenCV. A simple
import numpy
import tensorflow
is sometimes sufficient to cause a segmentation fault. I also think this happened due to changes in the upstream Dockerfile rather than TensorFlow itself, since it stopped working a few weeks ago even when I had pinned the Dockerfile to a specific version of TensorFlow.


jhseu (Member) commented Mar 31, 2017

@AKSHAYUBHAT Your error is unrelated.

Looking at the stack trace, it's pulling a symbol from torch when it shouldn't be:
/usr/local/lib/python2.7/dist-packages/torch/lib/libshm.so(_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l+0x1c5)[0x7f3edd107235]

No guesses as to why that's happening without understanding your environment.


jhseu (Member) commented Mar 31, 2017

A quick glance at pytorch code: it's exporting some libstdc++ symbols with RTLD_GLOBAL when it shouldn't be. The bug is likely in pytorch.
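A quick diagnostic for this kind of symbol clash is to look at which libstdc++ copies a process has mapped after the import (torch here is only an example of the module under suspicion):

python -c "import torch, os; os.system('grep stdc++ /proc/%d/maps' % os.getpid())"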


AKSHAYUBHAT commented Mar 31, 2017

@jhseu Thanks a lot!

AKSHAYUBHAT commented Apr 1, 2017

I checked and it started working after removing pytorch, so pytorch was the likely culprit.


ClimbsRocks added a commit to ClimbsRocks/auto_ml that referenced this issue Apr 11, 2017

works on saving any number of deep learning models throughout the pipeline

travis issues: trying to minimize what we import

tries not importing tensorflow

fixes two bugs, and tries to fix travis issue by installing ocl-icd-opencl-dev

tries changing the order of imports to bypass travis error

tries fix from tensorflow/tensorflow#6968
OmarMAmin commented Jul 12, 2017

I got the same error when I just import cv2. I installed OpenCV 3 after installing OpenCV 2, and this caused the error below.

The code that causes this error is just:
import cv2
print m

NameError: name 'm' is not defined
*** Error in `python': double free or corruption (out): 0x0000000000d74fd0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fdecbba67e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7fdecbbaee0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fdecbbb298c]
/usr/lib/x86_64-linux-gnu/libprotobuf.so.9(_ZN6google8protobuf8internal28DestroyDefaultRepeatedFieldsEv+0x1f)[0x7fdebc49c8af]
/usr/lib/x86_64-linux-gnu/libprotobuf.so.9(_ZN6google8protobuf23ShutdownProtobufLibraryEv+0x8b)[0x7fdebc49bb3b]
/usr/lib/x86_64-linux-gnu/libmirprotobuf.so.3(+0x20329)[0x7fde6fd4f329]
/lib64/ld-linux-x86-64.so.2(+0x10c17)[0x7fdecc125c17]
/lib/x86_64-linux-gnu/libc.so.6(+0x39ff8)[0x7fdecbb68ff8]
/lib/x86_64-linux-gnu/libc.so.6(+0x3a045)[0x7fdecbb69045]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0x7fdecbb4f837]
python(_start+0x29)[0x49d9d9]


tjdevWorks commented Sep 24, 2017

For people who run into this type of issue on Arch Linux:

*** glibc detected *** python: double free or corruption (!prev): 0x00000000013fcda0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3e22e75e66]
/lib64/libc.so.6[0x3e22e789b3]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x67)[0x3e226112f7]
/lib64/libpthread.so.0[0x3e2320675d]
/lib64/libpthread.so.0[0x3e232078ea]
/lib64/libpthread.so.0(pthread_join+0xd4)[0x3e232081f4]
..... and so on

Install the gperftools package

sudo pacman -S gperftools

It will most likely solve the issue.
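If installing the package alone doesn't help, the same preload trick from earlier in the thread should apply; the library path below is where Arch's gperftools package is expected to install tcmalloc and may need adjusting:

export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
python train.py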


alsrgv (Contributor) commented Oct 29, 2017

I got this same issue trying to run real data benchmark with Horovod on TF 1.4.0rc1, with Open MPI OpenIB transport (which installs memory hooks). TCP transport is unaffected.

Generating model                                                                                                                                                 [992/9693]
*** Error in `python': double free or corruption (!prev): 0x0000000001f721f0 ***
[opusgpu39-wbu2:32804] *** Process received signal ***
[opusgpu39-wbu2:32804] Signal: Aborted (6)
[opusgpu39-wbu2:32804] Signal code:  (-6)
[opusgpu39-wbu2:32804] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f0c06bba890]
[opusgpu39-wbu2:32804] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f0c05f12067]
[opusgpu39-wbu2:32804] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f0c05f13448]
[opusgpu39-wbu2:32804] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x731b4)[0x7f0c05f501b4]
[opusgpu39-wbu2:32804] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x7898e)[0x7f0c05f5598e]
[opusgpu39-wbu2:32804] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x79696)[0x7f0c05f56696]
[opusgpu39-wbu2:32804] [ 6] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x58)[0x7f0c06dd9958]
[opusgpu39-wbu2:32804] [ 7] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7107)[0x7f0c06bb2107]
[opusgpu39-wbu2:32804] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x721f)[0x7f0c06bb221f]
[opusgpu39-wbu2:32804] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(pthread_join+0xe4)[0x7f0c06bb44d4]
[opusgpu39-wbu2:32804] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread4joinEv+0x27)[0x7f0b69baa837]
[opusgpu39-wbu2:32804] [11] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x5131d0)[0x7f0b708b61d0]
[opusgpu39-wbu2:32804] [12] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6thread10ThreadPool4ImplD0Ev+0xbb)[0x7f0b7088d43b]
[opusgpu39-wbu2:32804] [13] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6thread10ThreadPoolD1Ev+0x1a)[0x7f0b7088d73a]
[opusgpu39-wbu2:32804] [14] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow10FileSystem16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0x56c)[0x7f0b708b2aec]
[opusgpu39-wbu2:32804] [15] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow3Env16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0xa3)[0x7f0b708aca43]
[opusgpu39-wbu2:32804] [16] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z16GetMatchingFilesRKSsP9TF_Status+0x4b)[0x7f0b7227090b]
[opusgpu39-wbu2:32804] [17] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x1080604)[0x7f0b72273604]
[opusgpu39-wbu2:32804] [18] python(PyEval_EvalFrameEx+0x614)[0x4cddf4]
[opusgpu39-wbu2:32804] [19] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [20] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [21] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [22] python(PyEval_EvalFrameEx+0x5e0a)[0x4d35ea]
[opusgpu39-wbu2:32804] [23] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [24] python(PyEval_EvalFrameEx+0x5e0a)[0x4d35ea]
[opusgpu39-wbu2:32804] [25] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [26] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [27] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [28] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [29] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] *** End of error message ***

I was able to make it work by adding -x LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4.4.5.

Seems other folks are still hitting this issue in other use cases, too. Any ideas or plans for the fix?
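For other Horovod users hitting this, a hedged sketch of that workaround; the rank count, script name, and tcmalloc path are illustrative:

mpirun -np 4 \
    -x LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4.4.5 \
    python train.py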


jhseu (Member) commented Nov 2, 2017

We still don't think there's a bug in TensorFlow here. Any Python module that messes with memory allocation can cause this, so perhaps try importing those last?
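As a concrete sketch of that suggestion (torch is just one example from this thread of a native-extension module that brings its own allocator/libstdc++ symbols):

# import TensorFlow first, then the modules that mess with allocation
python -c "import tensorflow; import torch; print('imported cleanly')"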


DMTSource commented Nov 24, 2017

I ran into this and other "Error in `python'" issues using the module mayavi.mlab with TensorFlow when creating multiple mlab figures. @jhseu's suggestion to import the module after TF fixed it instantly. Thank you.


LogicHolmes commented Dec 10, 2017

I don't know why, but when I run this command:
bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=1000000

I get this error:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcufft.so.8.0. LD_LIBRARY_PATH:
I tensorflow/stream_executor/cuda/cuda_fft.cc:344] Unable to load cuFFT DSO.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
*** Error in `/usr/bin/python': double free or corruption (!prev): 0x000000000231f8e0 ***
I don't understand this error.


siraj0019 commented Mar 26, 2018

I was facing the same issue, and later I found that I had written a destructor for my class that cleared memory I had already cleared in another function called before the destructor, so the same memory was freed twice, which caused this error. When I removed the destructor, the problem was solved. (I was loading my pretrained model.)


inscite commented Apr 6, 2018

I am experiencing this issue while restoring partial weights/biases through tensorflow.contrib.framework.python.ops.assign_from_checkpoint. I'm currently on CUDA 8/9, TF r1.4 built natively from source, and Horovod 0.11.2. I'm also planning to upgrade TF to a more recent version (r1.5), and I'll report back on whether the error persists.

Below is an example of process backtrace. (automatically generated)
process_dead_backtrace.txt


yijie0710 commented Apr 11, 2018

I used jemalloc instead of tcmalloc and the problem was also solved on CentOS 6.
Just install jemalloc and run with LD_PRELOAD=/usr/local/lib/libjemalloc.so
(follow this post: https://zapier.com/engineering/celery-python-jemalloc/)

