[Question]: Out of memory error on GPU 1 when saving the model during PaddleNLP training #5267

Closed
ht896632 opened this issue Mar 16, 2023 · 3 comments
Labels: question (Further information is requested), triage

ht896632 commented Mar 16, 2023

Please describe your question

Out-of-memory error is reported when saving the model during PaddleNLP training.
Paddle version: 2.4.2
PaddleNLP version: 2.5.2.post
Python version: 3.7.13

Task type: text classification (single-label)
Hardware: 4× V100 32 GB GPUs
Total training dataset size: 88.3 MB

Full error output:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I0316 08:06:42.398526 41084 tcp_utils.cc:130] Successfully connected to 172.17.0.2:41382
W0316 08:06:47.663002 41084 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0316 08:06:47.665423 41084 gpu_resources.cc:91] device: 1, cuDNN Version: 8.2.
[2023-03-16 08:06:53,052] [ INFO] topology.py:215 - HybridParallelInfo: rank_id: 1, mp_degree: 1, sharding_degree: 4, pp_degree: 1, dp_degree: 1, mp_group: [1], sharding_group: [0, 1, 2, 3], pp_group: [1], dp_group: [1], check/clip group: [0, 1, 2, 3]
[2023-03-16 08:06:53,054] [ INFO] - +==============================================================================+
| |
| DistributedStrategy Overview |
| |
+==============================================================================+
| a_sync=True <-> a_sync_configs |
+------------------------------------------------------------------------------+
| k_steps -1 |
| max_merge_var_num 1 |
| send_queue_size 16 |
| independent_recv_thread False |
| min_send_grad_num_before_recv 1 |
| thread_pool_size 1 |
| send_wait_times 1 |
| runtime_split_send_recv False |
| launch_barrier True |
| heter_worker_device_guard cpu |
| lr_decay_steps 10 |
| use_ps_gpu 0 |
+==============================================================================+
| Environment Flags, Communication Flags |
+------------------------------------------------------------------------------+
| mode 1 |
| elastic False |
| auto False |
| sync_nccl_allreduce True |
| nccl_comm_num 1 |
| use_hierarchical_allreduce False |
| hierarchical_allreduce_inter_nranks 1 |
| sync_batch_norm False |
| fuse_all_reduce_ops True |
| fuse_grad_size_in_MB 32 |
| fuse_grad_size_in_TFLOPS 50.0 |
| cudnn_exhaustive_search False |
| conv_workspace_size_limit 512 |
| cudnn_batchnorm_spatial_persistent False |
| fp16_allreduce False |
| last_comm_group_size_MB 1.0 |
| find_unused_parameters False |
| without_graph_optimization False |
| fuse_grad_size_in_num 8 |
| calc_comm_same_stream False |
| asp False |
| fuse_grad_merge False |
| semi_auto False |
| adam_d2sum False |
| auto_search False |
| heter_ccl_mode False |
| is_fl_ps_mode False |
| with_coordinator False |
| split_data True |
| downpour_table_param [] |
| fs_client_param |
+==============================================================================+
| Build Strategy |
+------------------------------------------------------------------------------+
| enable_sequential_execution False |
| fuse_elewise_add_act_ops False |
| fuse_bn_act_ops False |
| fuse_relu_depthwise_conv False |
| fuse_broadcast_ops False |
| fuse_all_optimizer_ops False |
| enable_inplace False |
| enable_backward_optimizer_op_deps True |
| cache_runtime_context False |
| fuse_bn_add_act_ops True |
| enable_auto_fusion False |
| enable_addto False |
| fix_op_run_order False |
| allow_cuda_graph_capture False |
| reduce_strategy 0 |
| fuse_gemm_epilogue False |
| debug_graphviz_path |
+==============================================================================+
| Execution Strategy |
+------------------------------------------------------------------------------+
| num_threads 1 |
| num_iteration_per_drop_scope 10 |
| num_iteration_per_run 1 |
| use_thread_barrier False |
+==============================================================================+

[2023-03-16 08:06:53,055] [ INFO] - The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-03-16 08:06:53,056] [ INFO] - ============================================================
[2023-03-16 08:06:53,056] [ INFO] - Model Configuration Arguments
[2023-03-16 08:06:53,056] [ INFO] - paddle commit id :0e92adceae06b6b7463f2dc7790ffb0601730009
[2023-03-16 08:06:53,056] [ INFO] - export_model_dir :/home/project/deploy/export/sort/cn
[2023-03-16 08:06:53,056] [ INFO] - model_name_or_path :ernie-3.0-tiny-micro-v2-zh
[2023-03-16 08:06:53,056] [ INFO] -
[2023-03-16 08:06:53,056] [ INFO] - ============================================================
[2023-03-16 08:06:53,056] [ INFO] - Data Configuration Arguments
[2023-03-16 08:06:53,057] [ INFO] - paddle commit id :0e92adceae06b6b7463f2dc7790ffb0601730009
[2023-03-16 08:06:53,057] [ INFO] - bad_case_path :./data/bad_case.txt
[2023-03-16 08:06:53,057] [ INFO] - debug :False
[2023-03-16 08:06:53,057] [ INFO] - dev_path :/home/project/paddle_class/data/target/sort/cn/dev.txt
[2023-03-16 08:06:53,057] [ INFO] - early_stopping :True
[2023-03-16 08:06:53,057] [ INFO] - early_stopping_patience :3
[2023-03-16 08:06:53,057] [ INFO] - label_path :/home/project/paddle_class/data/target/sort/cn/label.txt
[2023-03-16 08:06:53,057] [ INFO] - max_length :128
[2023-03-16 08:06:53,057] [ INFO] - test_path :./data/dev.txt
[2023-03-16 08:06:53,057] [ INFO] - train_path :/home/project/paddle_class/data/target/sort/cn/train.txt
[2023-03-16 08:06:53,057] [ INFO] -
[2023-03-16 08:06:53,148] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-tiny-micro-v2-zh'.
[2023-03-16 08:06:54,312] [ INFO] - All model checkpoint weights were used when initializing ErnieForSequenceClassification.

[2023-03-16 08:06:54,312] [ WARNING] - Some weights of ErnieForSequenceClassification were not initialized from the model checkpoint at ernie-3.0-tiny-micro-v2-zh and are newly initialized: ['ernie.pooler.dense.bias', 'classifier.bias', 'ernie.pooler.dense.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2023-03-16 08:06:54,313] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-tiny-micro-v2-zh'.
[2023-03-16 08:06:54,313] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-tiny-micro-v2-zh/ernie_3.0_tiny_micro_v2_vocab.txt
[2023-03-16 08:06:54,335] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/ernie-3.0-tiny-micro-v2-zh/tokenizer_config.json
[2023-03-16 08:06:54,335] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/ernie-3.0-tiny-micro-v2-zh/special_tokens_map.json
[2023-03-16 08:06:55,785] [ WARNING] - Accessing enable_recompute through model.enable_recompute will be deprecated after v2.6.0. Instead, do model.config.enable_recompute
[2023-03-16 08:06:55,786] [ WARNING] - Accessing enable_recompute through model.enable_recompute will be deprecated after v2.6.0. Instead, do model.config.enable_recompute
[2023-03-16 08:06:55,786] [ INFO] - ============================================================
[2023-03-16 08:06:55,786] [ INFO] - Training Configuration Arguments
[2023-03-16 08:06:55,786] [ INFO] - paddle commit id :0e92adceae06b6b7463f2dc7790ffb0601730009
[2023-03-16 08:06:55,786] [ INFO] - _no_sync_in_gradient_accumulation:True
[2023-03-16 08:06:55,786] [ INFO] - activation_quantize_type :None
[2023-03-16 08:06:55,786] [ INFO] - adam_beta1 :0.9
[2023-03-16 08:06:55,786] [ INFO] - adam_beta2 :0.999
[2023-03-16 08:06:55,786] [ INFO] - adam_epsilon :1e-08
[2023-03-16 08:06:55,787] [ INFO] - algo_list :None
[2023-03-16 08:06:55,787] [ INFO] - batch_num_list :None
[2023-03-16 08:06:55,787] [ INFO] - batch_size_list :None
[2023-03-16 08:06:55,787] [ INFO] - bf16 :False
[2023-03-16 08:06:55,787] [ INFO] - bf16_full_eval :False
[2023-03-16 08:06:55,787] [ INFO] - bias_correction :False
[2023-03-16 08:06:55,787] [ INFO] - current_device :gpu:1
[2023-03-16 08:06:55,787] [ INFO] - dataloader_drop_last :False
[2023-03-16 08:06:55,787] [ INFO] - dataloader_num_workers :0
[2023-03-16 08:06:55,787] [ INFO] - device :gpu
[2023-03-16 08:06:55,787] [ INFO] - disable_tqdm :True
[2023-03-16 08:06:55,787] [ INFO] - do_compress :False
[2023-03-16 08:06:55,787] [ INFO] - do_eval :True
[2023-03-16 08:06:55,787] [ INFO] - do_export :True
[2023-03-16 08:06:55,788] [ INFO] - do_predict :False
[2023-03-16 08:06:55,788] [ INFO] - do_train :True
[2023-03-16 08:06:55,788] [ INFO] - dp_degree :1
[2023-03-16 08:06:55,788] [ INFO] - eval_batch_size :420
[2023-03-16 08:06:55,788] [ INFO] - eval_steps :None
[2023-03-16 08:06:55,788] [ INFO] - evaluation_strategy :IntervalStrategy.EPOCH
[2023-03-16 08:06:55,788] [ INFO] - flatten_param_grads :False
[2023-03-16 08:06:55,788] [ INFO] - fp16 :False
[2023-03-16 08:06:55,788] [ INFO] - fp16_full_eval :False
[2023-03-16 08:06:55,788] [ INFO] - fp16_opt_level :O1
[2023-03-16 08:06:55,788] [ INFO] - gradient_accumulation_steps :1
[2023-03-16 08:06:55,788] [ INFO] - greater_is_better :True
[2023-03-16 08:06:55,788] [ INFO] - ignore_data_skip :False
[2023-03-16 08:06:55,788] [ INFO] - input_dtype :int64
[2023-03-16 08:06:55,789] [ INFO] - input_infer_model_path :None
[2023-03-16 08:06:55,789] [ INFO] - label_names :None
[2023-03-16 08:06:55,789] [ INFO] - lazy_data_processing :True
[2023-03-16 08:06:55,789] [ INFO] - learning_rate :3e-05
[2023-03-16 08:06:55,789] [ INFO] - load_best_model_at_end :True
[2023-03-16 08:06:55,789] [ INFO] - local_process_index :1
[2023-03-16 08:06:55,789] [ INFO] - local_rank :1
[2023-03-16 08:06:55,789] [ INFO] - log_level :-1
[2023-03-16 08:06:55,789] [ INFO] - log_level_replica :-1
[2023-03-16 08:06:55,789] [ INFO] - log_on_each_node :True
[2023-03-16 08:06:55,789] [ INFO] - logging_dir :/home/project/paddle_class/checkpoint/sort/cn/runs/Mar16_08-06-42_6aac71f1edd1
[2023-03-16 08:06:55,789] [ INFO] - logging_first_step :False
[2023-03-16 08:06:55,789] [ INFO] - logging_steps :5
[2023-03-16 08:06:55,789] [ INFO] - logging_strategy :IntervalStrategy.STEPS
[2023-03-16 08:06:55,789] [ INFO] - lr_scheduler_type :SchedulerType.LINEAR
[2023-03-16 08:06:55,790] [ INFO] - max_grad_norm :1.0
[2023-03-16 08:06:55,790] [ INFO] - max_steps :-1
[2023-03-16 08:06:55,790] [ INFO] - metric_for_best_model :accuracy
[2023-03-16 08:06:55,790] [ INFO] - minimum_eval_times :None
[2023-03-16 08:06:55,790] [ INFO] - moving_rate :0.9
[2023-03-16 08:06:55,790] [ INFO] - no_cuda :False
[2023-03-16 08:06:55,790] [ INFO] - num_train_epochs :100.0
[2023-03-16 08:06:55,790] [ INFO] - onnx_format :True
[2023-03-16 08:06:55,790] [ INFO] - optim :OptimizerNames.ADAMW
[2023-03-16 08:06:55,790] [ INFO] - output_dir :/home/project/paddle_class/checkpoint/sort/cn
[2023-03-16 08:06:55,790] [ INFO] - overwrite_output_dir :False
[2023-03-16 08:06:55,790] [ INFO] - past_index :-1
[2023-03-16 08:06:55,790] [ INFO] - per_device_eval_batch_size :420
[2023-03-16 08:06:55,790] [ INFO] - per_device_train_batch_size :420
[2023-03-16 08:06:55,790] [ INFO] - prediction_loss_only :False
[2023-03-16 08:06:55,791] [ INFO] - process_index :1
[2023-03-16 08:06:55,791] [ INFO] - prune_embeddings :False
[2023-03-16 08:06:55,791] [ INFO] - recompute :True
[2023-03-16 08:06:55,791] [ INFO] - remove_unused_columns :True
[2023-03-16 08:06:55,791] [ INFO] - report_to :['visualdl']
[2023-03-16 08:06:55,791] [ INFO] - resume_from_checkpoint :True
[2023-03-16 08:06:55,791] [ INFO] - round_type :round
[2023-03-16 08:06:55,791] [ INFO] - run_name :/home/project/paddle_class/checkpoint/sort/cn
[2023-03-16 08:06:55,791] [ INFO] - save_on_each_node :False
[2023-03-16 08:06:55,791] [ INFO] - save_steps :100
[2023-03-16 08:06:55,791] [ INFO] - save_strategy :IntervalStrategy.EPOCH
[2023-03-16 08:06:55,791] [ INFO] - save_total_limit :1
[2023-03-16 08:06:55,791] [ INFO] - scale_loss :32768
[2023-03-16 08:06:55,791] [ INFO] - seed :42
[2023-03-16 08:06:55,792] [ INFO] - sharding :[<ShardingOption.SHARD_GRAD_OP: 'stage2'>]
[2023-03-16 08:06:55,792] [ INFO] - sharding_degree :4
[2023-03-16 08:06:55,792] [ INFO] - should_log :False
[2023-03-16 08:06:55,792] [ INFO] - should_save :False
[2023-03-16 08:06:55,792] [ INFO] - skip_memory_metrics :True
[2023-03-16 08:06:55,792] [ INFO] - strategy :dynabert+ptq
[2023-03-16 08:06:55,792] [ INFO] - train_batch_size :420
[2023-03-16 08:06:55,792] [ INFO] - use_pact :True
[2023-03-16 08:06:55,792] [ INFO] - warmup_ratio :0.1
[2023-03-16 08:06:55,792] [ INFO] - warmup_steps :0
[2023-03-16 08:06:55,792] [ INFO] - weight_decay :0.0
[2023-03-16 08:06:55,792] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2023-03-16 08:06:55,792] [ INFO] - width_mult_list :None
[2023-03-16 08:06:55,792] [ INFO] - world_size :4
[2023-03-16 08:06:55,793] [ INFO] -
WARNING:root:While using ClipGradByGlobalNorm in GroupShardedOptimizerStage2, the grad clip of original optimizer will be changed.
[2023-03-16 08:06:57,190] [ INFO] - ***** Running training *****
[2023-03-16 08:06:57,191] [ INFO] - Num examples = 800000
[2023-03-16 08:06:57,191] [ INFO] - Num Epochs = 100
[2023-03-16 08:06:57,191] [ INFO] - Instantaneous batch size per device = 420
[2023-03-16 08:06:57,191] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 1680
[2023-03-16 08:06:57,191] [ INFO] - Gradient Accumulation steps = 1
[2023-03-16 08:06:57,191] [ INFO] - Total optimization steps = 47700.0
[2023-03-16 08:06:57,191] [ INFO] - Total num train samples = 80000000.0
[2023-03-16 08:06:57,193] [ INFO] - Number of trainable parameters = 98054787
Can not add param: embedding_0.w_0, param's shape: [40000, 384], param align: 0, grad_storages fill: 0,
Can not add param: linear_25.w_0, param's shape: [384, 193923], param align: 0, grad_storages fill: 2419200,
[2023-03-16 08:10:42,126] [ INFO] - ***** Running Evaluation *****
[2023-03-16 08:10:42,126] [ INFO] - Num examples = 199993
[2023-03-16 08:10:42,127] [ INFO] - Total prediction steps = 120
[2023-03-16 08:10:42,127] [ INFO] - Pre device batch size = 420
[2023-03-16 08:10:42,127] [ INFO] - Total Batch size = 1680
terminate called after throwing an instance of 'paddle::memory::allocation::BadAlloc'
what():


C++ Traceback (most recent call last):

0 concat_ad_func(std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, paddle::experimental::ScalarBase<paddle::experimental::Tensor>)
1 paddle::experimental::concat(std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, paddle::experimental::ScalarBase<paddle::experimental::Tensor> const&)
2 void phi::ConcatKernel<float, phi::GPUContext>(phi::GPUContext const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::DenseTensor*)
3 phi::DenseTensor::mutable_data(phi::Place const&, paddle::experimental::DataType, unsigned long)
4 paddle::memory::AllocShared(phi::Place const&, unsigned long)
5 paddle::memory::allocation::AllocatorFacade::AllocShared(phi::Place const&, unsigned long)
6 paddle::memory::allocation::AllocatorFacade::Alloc(phi::Place const&, unsigned long)
7 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8 paddle::memory::allocation::Allocator::Allocate(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13 phi::enforce::GetCurrentTraceBackString[abi:cxx11]


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 1. Cannot allocate 14.563977GB memory on GPU 1, 17.327148GB memory has been allocated and available memory is only 14.421387GB.

Please check whether there is any other process using GPU 1.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.
    If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false.
    (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)
ht896632 added the question label Mar 16, 2023
w5688414 (Contributor) commented

The batch_size is too large; try reducing it.
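
A minimal sketch of that suggestion, assuming the training script builds its arguments with PaddleNLP's `TrainingArguments` (the field names below are the ones shown in the "Training Configuration Arguments" log above; the concrete values are only illustrative):

```python
from paddlenlp.trainer import TrainingArguments

# Illustrative values only: the log above shows both batch sizes at 420.
training_args = TrainingArguments(
    output_dir="/home/project/paddle_class/checkpoint/sort/cn",
    per_device_train_batch_size=32,  # was 420
    per_device_eval_batch_size=32,   # was 420; the OOM is raised during evaluation
    gradient_accumulation_steps=4,   # optional: keeps the effective train batch larger
)
```

If the script is launched from the command line instead, the same fields are normally exposed as `--per_device_train_batch_size` and `--per_device_eval_batch_size` flags.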

ht896632 (Author) commented

> The batch_size is too large; try reducing it.

I've tried that; the same error occurs even with a smaller batch size. GPU memory usage spikes sharply when the model is being saved, even though the total dataset is not large. On 4 V100 32 GB cards, essentially all of the GPU memory gets used up.
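
Since the spike is reported around saving/evaluation rather than during the training steps themselves, one thing worth trying (a sketch only, not a confirmed fix) is releasing the allocator's cached blocks at the epoch boundary, just before the epoch-end evaluate/save runs. This assumes the installed PaddleNLP version exposes Hugging Face-style `TrainerCallback` hooks, and note that `paddle.device.cuda.empty_cache()` only frees cached, unused blocks, so it cannot help if the memory is genuinely held by live tensors.

```python
import paddle
from paddlenlp.trainer import TrainerCallback  # assumption: exported by your PaddleNLP version


class EmptyCacheCallback(TrainerCallback):
    """Release cached (unused) GPU memory at each epoch boundary."""

    def on_epoch_end(self, args, state, control, **kwargs):
        paddle.device.cuda.empty_cache()


# Hypothetical usage:
# trainer = Trainer(..., callbacks=[EmptyCacheCallback()])
```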

526872163 commented

Lower it to 8 or even less first; once it runs successfully, raise it step by step.
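
When stepping the batch size back up, it may help to log how much headroom each setting leaves on a 32 GB V100. A small sketch using Paddle's CUDA memory statistics (present in recent 2.x releases; verify the exact function names against your installed version):

```python
import paddle


def report_gpu_memory(tag: str) -> None:
    """Print peak GPU memory use so the headroom left by a given batch size is visible."""
    gb = 1024 ** 3
    allocated = paddle.device.cuda.max_memory_allocated() / gb
    reserved = paddle.device.cuda.max_memory_reserved() / gb
    print(f"[{tag}] peak allocated: {allocated:.2f} GB, peak reserved: {reserved:.2f} GB")


# Hypothetical usage after a trial epoch with a small batch size:
# report_gpu_memory("per_device_batch_size=8")
```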
