Could not allocate pinned host memory of size: 2147483648 #2

Closed
GrahamboJangles opened this issue Sep 12, 2019 · 16 comments

Comments

@GrahamboJangles

Running !python2 generation.py --model_dir "/content/ctrl/seqlen256_v1.ckpt" in Colab outputs this:

WARNING: Logging before flag parsing goes to stderr.
W0912 03:52:40.595153 139689530402688 deprecation_wrapper.py:119] From generation.py:6: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

W0912 03:52:40.605669 139689530402688 deprecation_wrapper.py:119] From generation.py:35: The name tf.random.set_random_seed is deprecated. Please use tf.compat.v1.random.set_random_seed instead.

246534 unique words
2019-09-12 03:52:40.930801: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-12 03:52:40.971309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:40.971914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-09-12 03:52:40.972273: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:52:40.973635: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-12 03:52:40.975007: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 03:52:40.975404: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 03:52:40.976992: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 03:52:40.978135: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 03:52:40.981770: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-12 03:52:40.981927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:40.982547: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:40.983109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-12 03:52:40.983494: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-09-12 03:52:41.114324: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.115113: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5574d0e20d80 executing computations on platform CUDA. Devices:
2019-09-12 03:52:41.115150: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-09-12 03:52:41.117511: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000170000 Hz
2019-09-12 03:52:41.117862: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5574d0e212c0 executing computations on platform Host. Devices:
2019-09-12 03:52:41.117916: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-09-12 03:52:41.118114: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.118668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-09-12 03:52:41.118728: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:52:41.118748: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-12 03:52:41.118766: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 03:52:41.118784: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 03:52:41.118811: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 03:52:41.118840: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 03:52:41.118858: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-12 03:52:41.118934: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.119479: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.120052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-12 03:52:41.120121: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:52:41.121241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-12 03:52:41.121268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-09-12 03:52:41.121280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-09-12 03:52:41.121403: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.121995: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.122491: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:40] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-09-12 03:52:41.122537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14221 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
W0912 03:52:58.330300 139689530402688 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0912 03:52:58.330642 139689530402688 deprecation_wrapper.py:119] From generation.py:124: The name tf.train.AdagradOptimizer is deprecated. Please use tf.compat.v1.train.AdagradOptimizer instead.

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 256)]        0                                            
__________________________________________________________________________________________________
tied_embedding_softmax (TiedEmb multiple             315810054   input_1[0][0]                    
                                                                 encoder[0][0]                    
__________________________________________________________________________________________________
encoder (Encoder)               (None, 256, 1280)    1322154496  tied_embedding_softmax[0][0]     
==================================================================================================
Total params: 1,637,964,550
Trainable params: 1,637,964,550
Non-trainable params: 0
__________________________________________________________________________________________________
None
2019-09-12 03:52:58.496625: W tensorflow/core/framework/allocator.cc:107] Allocation of 1262254080 exceeds 10% of system memory.
tcmalloc: large alloc 1262256128 bytes == 0x557523406000 @  0x7f0c00918b6b 0x7f0c00938379 0x7f0bbd80d754 0x7f0bbd7c8c8a 0x7f0bbd505f11 0x7f0bbd518f08 0x7f0bc366a00c 0x7f0bc3660298 0x7f0bc10448c7 0x7f0bc0fbc97c 0x7f0bc0fbed9d 0x5574cfe6af6e 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe7d03c 0x5574cfe4cf1e 0x5574cfe662d5 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a
tcmalloc: large alloc 1262256128 bytes == 0x55756e7ce000 @  0x7f0c009361e7 0x7f0bfe37c771 0x7f0bfe3e4028 0x7f0bfe3d90d5 0x7f0bfe46ff77 0x5574cfe63e8a 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe60fb9 0x5574cfe91e7f 0x5574cfe8cc12 0x5574cfe8c09d 0x5574cfe3ad6b 0x7f0c00533b97 0x5574cfe3a5ea
W0912 03:53:06.230777 139689530402688 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/initializers.py:143: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0912 03:53:11.251795 139689530402688 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2019-09-12 03:53:24.403230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.403729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-09-12 03:53:24.403847: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:53:24.403869: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-12 03:53:24.403910: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 03:53:24.403931: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 03:53:24.403952: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 03:53:24.403975: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 03:53:24.403994: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-12 03:53:24.404096: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.404475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.404802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-12 03:53:24.404864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-12 03:53:24.404878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-09-12 03:53:24.404901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-09-12 03:53:24.405005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.405377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.405756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14221 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-09-12 03:53:32.494371: E tensorflow/stream_executor/cuda/cuda_driver.cc:890] failed to alloc 2147483648 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
2019-09-12 03:53:32.511468: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2147483648
@minimaxir

minimaxir commented Sep 12, 2019

I set up a notebook independently in Colaboratory and hit the same stack trace. I suspect the Colaboratory VM has too little system RAM (13 GB) and is going OOM.

@AdamDanielKing

AdamDanielKing commented Sep 12, 2019

Like Max said, I think you're running out of RAM. For me, generation.py takes 36 GB of RAM even while idle.
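
For a rough sense of scale, some of the sizes in the log above can be reproduced from the model summary. This is only a back-of-the-envelope sketch: it assumes the weights are stored as 4-byte float32 values (the thread does not state the dtype), and it covers the raw weights only, not whatever extra copies or buffers make up the rest of the 36 GB.

```python
# Back-of-the-envelope sizes, using numbers from the log in this issue.
# Assumption: weights are float32 (4 bytes each); the thread does not say.
BYTES_PER_FLOAT32 = 4

vocab_size = 246534        # "246534 unique words"
embedding_dim = 1280       # last dimension of the encoder output shape
total_params = 1637964550  # "Total params" in the model summary

embedding_bytes = vocab_size * embedding_dim * BYTES_PER_FLOAT32
total_weight_bytes = total_params * BYTES_PER_FLOAT32
total_weight_gib = total_weight_bytes / float(2 ** 30)

# The failed pinned host allocation reported at the end of the log.
pinned_request_bytes = 2147483648

print("embedding matrix: %d bytes" % embedding_bytes)
print("all weights: %.2f GiB" % total_weight_gib)
print("pinned request is exactly 2 GiB: %s" % (pinned_request_bytes == 2 ** 31))
```

Under that assumption the embedding matrix comes out to 1262254080 bytes, the same figure as in the "Allocation of 1262254080 exceeds 10% of system memory" warning.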

@AdamDanielKing

AdamDanielKing commented Sep 12, 2019

I see you're using a T4 GPU. After some testing, it appears that while you can start generation.py on a T4, you cannot actually generate text without running out of memory, even with TF_FORCE_GPU_ALLOW_GROWTH=true.
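
For reference, TF_FORCE_GPU_ALLOW_GROWTH has to be in the environment of the process that runs generation.py. Below is a minimal sketch of launching it that way from Python. The helper name build_launch is hypothetical, and the model path is the one from the original report. The log in the original report suggests Colab already sets this variable, so this is not expected to fix the T4 case.

```python
import os


def build_launch(model_dir):
    """Return (command, environment) for running generation.py.

    build_launch is a hypothetical helper, for illustration only.
    """
    env = dict(os.environ)
    # Ask TensorFlow to grow GPU memory on demand instead of reserving it all.
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    cmd = ["python2", "generation.py", "--model_dir", model_dir]
    return cmd, env


cmd, env = build_launch("/content/ctrl/seqlen256_v1.ckpt")
print(" ".join(cmd))
# To actually start it (requires the repo and the checkpoint to be present):
#   import subprocess; subprocess.call(cmd, env=env)
```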

A V100 GPU with 16 GB works fine. Edit: A P100 (16 GB) also works, which makes it the cheapest GPU with enough memory. 🙂

@minimaxir

minimaxir commented Sep 12, 2019

That's weird; a T4 and a V100 both have 16GB of VRAM. AI is funny.

Volta may have more memory efficiencies but I thought Turing got some of those too.

@AdamDanielKing

@minimaxir It is strange. I'm not sure what makes the difference, but the amount of memory shown as available from a T4 (in TensorFlow logs or nvidia-smi) is less than with a V100. Right now nvidia-smi shows me:

GPU     Memory
T4      15079 MiB
V100    16130 MiB
P100    16280 MiB

Bryan McCann tweeted that the model needs 15458 MiB so this seems to explain why the T4 is the only "16 GB" GPU that can't fit it. I also noticed this in one of my own projects: a batch size that worked on a V100 would be too much for a T4.
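
The comparison can be written out as a quick check. This sketch uses only the nvidia-smi figures and the 15458 MiB requirement quoted in this comment; it ignores any other memory TensorFlow may reserve.

```python
# Memory reported by nvidia-smi in this thread, in MiB.
gpu_memory_mib = {"T4": 15079, "V100": 16130, "P100": 16280}

# Requirement quoted from Bryan McCann's tweet, in MiB.
required_mib = 15458

# Positive headroom means the model fits; negative means it falls short.
headroom_mib = {}
for name in sorted(gpu_memory_mib):
    headroom_mib[name] = gpu_memory_mib[name] - required_mib
    verdict = "fits" if headroom_mib[name] >= 0 else "short"
    print("%-5s %6d MiB  %s by %d MiB"
          % (name, gpu_memory_mib[name], verdict, abs(headroom_mib[name])))
```

On these numbers the T4 falls 379 MiB short, while the V100 and P100 have 672 MiB and 822 MiB to spare.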

@keskarnitish
Contributor

There might be a way to hack a version of the code with (slightly) smaller memory requirement.
Let me explore and update here.

@Disciple7

I'm doing the same thing on Colab too. I'd be grateful if you could fix it ^_^

@AdamDanielKing

AdamDanielKing commented Sep 12, 2019

@Disciple7 One option with Colab is to create a more capable machine with one of Google's Deep Learning VM images, then configure Colab to use it. Similar to this blog post but with a P100 and at least about 45 GB of disk space (the model is big). For this you will need to request a quota increase from Google for global GPUs 0->1 and P100 GPUs 0->1 in a region that has P100s such as us-central1-f (find other regions here).

Don't forget to delete the machine after, since it's fairly expensive.

@Disciple7

@AdamDanielKing Thank you, I'll take a look.

@minimaxir

Given that this app only has a CLI at the moment, using a local runtime for Colab seems redundant; might as well run it directly on the VM by SSHing into the instance if we're going to have one up.

The VMs can be built as preemptible: for the config described it'll be about $0.50/hr, which is reasonable. I also believe that new GCP projects come with some GPU quota by default now; I'll double check.

Additionally, the VMs must be launched with full GCP API access in order for gsutil to be able to get the model.

I can write up a guide once I get things working.

@GrahamboJangles
Author

GrahamboJangles commented Sep 12, 2019

So I have a couple of questions:

  1. When you guys say RAM, do you mean just GPU RAM?
  2. I was only trying to use the 256 model. I'm guessing the 512 model needs even more RAM.
  3. Like @keskarnitish said, there must be a way to decrease the RAM usage just enough to get it to run, even if it has some downsides.
  4. @minimaxir You can SSH into a Colab instance??

Also, there should be a RAM requirement mentioned on the README or somewhere, unless I'm missing that.

@AdamDanielKing

@GrahamboJangles

  1. Both RAM and GPU RAM. The process seems to use about 36 GB of host RAM when idle, and it requires about 15.5 GB of GPU memory.
  2. I've only tried the 256 model as well. But I think the 512 model is the same size. The readme says: "The model architecture is identical for both checkpoints. The former is trained with lower training sequence length (256) while the latter is trained with a larger one (512)." In a transformer, the sequence length doesn't affect the number of parameters.
  3. I think @keskarnitish is talking about reducing the GPU memory requirement from 15458 MiB to less than the 15079 MiB that a T4 GPU can handle.
  4. @minimaxir is talking about SSHing into a raw Google Compute Engine instance, which is probably a good idea. Trying to use Colab for this might just make things more difficult.
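
To illustrate point 2, here is a rough parameter count for a model of this shape. The vocabulary size (246534) and width (1280) are from the log at the top of this issue. The layer count (48) and feed-forward width (8192) are assumptions taken from the CTRL paper, not from this thread, as is the use of fixed (non-learned) positional encodings.

```python
# Rough parameter count for a CTRL-sized transformer.
# From the log in this issue: vocabulary size and model width.
# Assumptions (from the CTRL paper, not this thread): 48 layers, a
# feed-forward width of 8192, and fixed (non-learned) positional encodings.
vocab_size = 246534
d_model = 1280
d_ff = 8192
num_layers = 48


def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(width):
    """Scale and shift vectors of a layer normalization."""
    return 2 * width


# Tied embedding/softmax: one shared matrix plus an output bias per word.
embedding_params = vocab_size * d_model + vocab_size

# One block: Q, K, V and output projections, a two-layer feed-forward
# network, and two layer normalizations.
attention = 4 * dense(d_model, d_model)
feed_forward = dense(d_model, d_ff) + dense(d_ff, d_model)
block = attention + feed_forward + 2 * layer_norm(d_model)

# Stack of blocks followed by one final layer normalization.
encoder_params = num_layers * block + layer_norm(d_model)

total_params = embedding_params + encoder_params
print("%d %d %d" % (embedding_params, encoder_params, total_params))
```

None of these terms involves the sequence length. Under the stated assumptions, the totals match the model summary in the original report (315810054, 1322154496 and 1637964550).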

@minimaxir

Yeah, I meant SSHing into the raw GCE instance.

@minimaxir

Even with a P100/V100 and generous system RAM, loading the model hits the VRAM ceiling and errors out.

2019-09-14 21:02:27.021420: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15928269056 memory_limit_: 15928269210 available bytes: 154 curr_region_allocation_bytes_: 31856538624
2019-09-14 21:02:27.021430: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: 
Limit:                 15928269210
InUse:                 15591875840
MaxInUse:              15633793280
NumAllocs:                    4093
MaxAllocSize:           1262254080

...

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Dst tensor is not initialized.
	 [[{{node _arg_Placeholder_0_0}}]]
	 [[ReadVariableOp_269/_5009]]
  (1) Internal: Dst tensor is not initialized.
	 [[{{node _arg_Placeholder_0_0}}]]
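
A small sketch for reading the allocator stats above. It only parses the numbers copied from this log and compares the remaining headroom with the largest single allocation. The interpretation is a guess rather than a diagnosis.

```python
# Allocator statistics copied from the log above (values in bytes).
stats_text = """
Limit:                 15928269210
InUse:                 15591875840
MaxInUse:              15633793280
NumAllocs:                    4093
MaxAllocSize:           1262254080
"""

# Turn the "Key: value" lines into a dictionary of integers.
stats = {}
for line in stats_text.strip().splitlines():
    key, value = line.split(":")
    stats[key.strip()] = int(value)

headroom_bytes = stats["Limit"] - stats["InUse"]
mib = float(2 ** 20)

print("free under the limit: %.1f MiB" % (headroom_bytes / mib))
print("largest allocation so far: %.1f MiB" % (stats["MaxAllocSize"] / mib))
```

Roughly 321 MiB remains under the limit, while the largest allocation so far was about 1204 MiB. That is consistent with, though not proof of, the allocator being unable to place another tensor of that size.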

@keskarnitish
Contributor

I added a new branch which allows for inference on GPUs with lower available memory. I tested it on K80s on Colaboratory here: https://colab.research.google.com/drive/1hVveBQShDru1Mjnhe4C21uQv4A2eH1tV

The details on how to use it can be found at the top of the README (Update @ Sep 19, 2019 subsection).

This is still in testing phase so expect a few bumps.
I will merge it into master once it stabilizes.

Closing this for now, please reopen if there are issues.

@dimitri320

On a similar topic, I've managed to start generating words on a V100, but the only word generated is the last word from the input prompt, over and over again. Any advice on what's wrong? I'm using the 512 version.
