[Question]: Out of memory error on GPU 1 when saving the model during PaddleNLP training #5267

Closed
ht896632 opened this issue Mar 16, 2023 · 3 comments
Labels: question (Further information is requested), triage

ht896632 commented Mar 16, 2023

Please describe your question

Out-of-memory error is reported when saving the model during PaddleNLP training.
Paddle version: 2.4.2
PaddleNLP version: 2.5.2.post
Python version: 3.7.13

Task type: text classification (single-label)
Hardware: 4× V100 32 GB GPUs
Total training dataset size: 88.3 MB

Full error output:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I0316 08:06:42.398526 41084 tcp_utils.cc:130] Successfully connected to 172.17.0.2:41382
W0316 08:06:47.663002 41084 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0316 08:06:47.665423 41084 gpu_resources.cc:91] device: 1, cuDNN Version: 8.2.
[2023-03-16 08:06:53,052] [ INFO] topology.py:215 - HybridParallelInfo: rank_id: 1, mp_degree: 1, sharding_degree: 4, pp_degree: 1, dp_degree: 1, mp_group: [1], sharding_group: [0, 1, 2, 3], pp_group: [1], dp_group: [1], check/clip group: [0, 1, 2, 3]
[2023-03-16 08:06:53,054] [ INFO] - +==============================================================================+
| |
| DistributedStrategy Overview |
| |
+==============================================================================+
| a_sync=True <-> a_sync_configs |
+------------------------------------------------------------------------------+
| k_steps -1 |
| max_merge_var_num 1 |
| send_queue_size 16 |
| independent_recv_thread False |
| min_send_grad_num_before_recv 1 |
| thread_pool_size 1 |
| send_wait_times 1 |
| runtime_split_send_recv False |
| launch_barrier True |
| heter_worker_device_guard cpu |
| lr_decay_steps 10 |
| use_ps_gpu 0 |
+==============================================================================+
| Environment Flags, Communication Flags |
+------------------------------------------------------------------------------+
| mode 1 |
| elastic False |
| auto False |
| sync_nccl_allreduce True |
| nccl_comm_num 1 |
| use_hierarchical_allreduce False |
| hierarchical_allreduce_inter_nranks 1 |
| sync_batch_norm False |
| fuse_all_reduce_ops True |
| fuse_grad_size_in_MB 32 |
| fuse_grad_size_in_TFLOPS 50.0 |
| cudnn_exhaustive_search False |
| conv_workspace_size_limit 512 |
| cudnn_batchnorm_spatial_persistent False |
| fp16_allreduce False |
| last_comm_group_size_MB 1.0 |
| find_unused_parameters False |
| without_graph_optimization False |
| fuse_grad_size_in_num 8 |
| calc_comm_same_stream False |
| asp False |
| fuse_grad_merge False |
| semi_auto False |
| adam_d2sum False |
| auto_search False |
| heter_ccl_mode False |
| is_fl_ps_mode False |
| with_coordinator False |
| split_data True |
| downpour_table_param [] |
| fs_client_param |
+==============================================================================+
| Build Strategy |
+------------------------------------------------------------------------------+
| enable_sequential_execution False |
| fuse_elewise_add_act_ops False |
| fuse_bn_act_ops False |
| fuse_relu_depthwise_conv False |
| fuse_broadcast_ops False |
| fuse_all_optimizer_ops False |
| enable_inplace False |
| enable_backward_optimizer_op_deps True |
| cache_runtime_context False |
| fuse_bn_add_act_ops True |
| enable_auto_fusion False |
| enable_addto False |
| fix_op_run_order False |
| allow_cuda_graph_capture False |
| reduce_strategy 0 |
| fuse_gemm_epilogue False |
| debug_graphviz_path |
+==============================================================================+
| Execution Strategy |
+------------------------------------------------------------------------------+
| num_threads 1 |
| num_iteration_per_drop_scope 10 |
| num_iteration_per_run 1 |
| use_thread_barrier False |
+==============================================================================+

[2023-03-16 08:06:53,055] [ INFO] - The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-03-16 08:06:53,056] [ INFO] - ============================================================
[2023-03-16 08:06:53,056] [ INFO] - Model Configuration Arguments
[2023-03-16 08:06:53,056] [ INFO] - paddle commit id :0e92adceae06b6b7463f2dc7790ffb0601730009
[2023-03-16 08:06:53,056] [ INFO] - export_model_dir :/home/project/deploy/export/sort/cn
[2023-03-16 08:06:53,056] [ INFO] - model_name_or_path :ernie-3.0-tiny-micro-v2-zh
[2023-03-16 08:06:53,056] [ INFO] -
[2023-03-16 08:06:53,056] [ INFO] - ============================================================
[2023-03-16 08:06:53,056] [ INFO] - Data Configuration Arguments
[2023-03-16 08:06:53,057] [ INFO] - paddle commit id :0e92adceae06b6b7463f2dc7790ffb0601730009
[2023-03-16 08:06:53,057] [ INFO] - bad_case_path :./data/bad_case.txt
[2023-03-16 08:06:53,057] [ INFO] - debug :False
[2023-03-16 08:06:53,057] [ INFO] - dev_path :/home/project/paddle_class/data/target/sort/cn/dev.txt
[2023-03-16 08:06:53,057] [ INFO] - early_stopping :True
[2023-03-16 08:06:53,057] [ INFO] - early_stopping_patience :3
[2023-03-16 08:06:53,057] [ INFO] - label_path :/home/project/paddle_class/data/target/sort/cn/label.txt
[2023-03-16 08:06:53,057] [ INFO] - max_length :128
[2023-03-16 08:06:53,057] [ INFO] - test_path :./data/dev.txt
[2023-03-16 08:06:53,057] [ INFO] - train_path :/home/project/paddle_class/data/target/sort/cn/train.txt
[2023-03-16 08:06:53,057] [ INFO] -
[2023-03-16 08:06:53,148] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-tiny-micro-v2-zh'.
[2023-03-16 08:06:54,312] [ INFO] - All model checkpoint weights were used when initializing ErnieForSequenceClassification.

[2023-03-16 08:06:54,312] [ WARNING] - Some weights of ErnieForSequenceClassification were not initialized from the model checkpoint at ernie-3.0-tiny-micro-v2-zh and are newly initialized: ['ernie.pooler.dense.bias', 'classifier.bias', 'ernie.pooler.dense.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2023-03-16 08:06:54,313] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-tiny-micro-v2-zh'.
[2023-03-16 08:06:54,313] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-tiny-micro-v2-zh/ernie_3.0_tiny_micro_v2_vocab.txt
[2023-03-16 08:06:54,335] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/ernie-3.0-tiny-micro-v2-zh/tokenizer_config.json
[2023-03-16 08:06:54,335] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/ernie-3.0-tiny-micro-v2-zh/special_tokens_map.json
[2023-03-16 08:06:55,785] [ WARNING] - Accessing enable_recompute through model.enable_recompute will be deprecated after v2.6.0. Instead, do model.config.enable_recompute
[2023-03-16 08:06:55,786] [ WARNING] - Accessing enable_recompute through model.enable_recompute will be deprecated after v2.6.0. Instead, do model.config.enable_recompute
[2023-03-16 08:06:55,786] [ INFO] - ============================================================
[2023-03-16 08:06:55,786] [ INFO] - Training Configuration Arguments
[2023-03-16 08:06:55,786] [ INFO] - paddle commit id :0e92adceae06b6b7463f2dc7790ffb0601730009
[2023-03-16 08:06:55,786] [ INFO] - _no_sync_in_gradient_accumulation:True
[2023-03-16 08:06:55,786] [ INFO] - activation_quantize_type :None
[2023-03-16 08:06:55,786] [ INFO] - adam_beta1 :0.9
[2023-03-16 08:06:55,786] [ INFO] - adam_beta2 :0.999
[2023-03-16 08:06:55,786] [ INFO] - adam_epsilon :1e-08
[2023-03-16 08:06:55,787] [ INFO] - algo_list :None
[2023-03-16 08:06:55,787] [ INFO] - batch_num_list :None
[2023-03-16 08:06:55,787] [ INFO] - batch_size_list :None
[2023-03-16 08:06:55,787] [ INFO] - bf16 :False
[2023-03-16 08:06:55,787] [ INFO] - bf16_full_eval :False
[2023-03-16 08:06:55,787] [ INFO] - bias_correction :False
[2023-03-16 08:06:55,787] [ INFO] - current_device :gpu:1
[2023-03-16 08:06:55,787] [ INFO] - dataloader_drop_last :False
[2023-03-16 08:06:55,787] [ INFO] - dataloader_num_workers :0
[2023-03-16 08:06:55,787] [ INFO] - device :gpu
[2023-03-16 08:06:55,787] [ INFO] - disable_tqdm :True
[2023-03-16 08:06:55,787] [ INFO] - do_compress :False
[2023-03-16 08:06:55,787] [ INFO] - do_eval :True
[2023-03-16 08:06:55,787] [ INFO] - do_export :True
[2023-03-16 08:06:55,788] [ INFO] - do_predict :False
[2023-03-16 08:06:55,788] [ INFO] - do_train :True
[2023-03-16 08:06:55,788] [ INFO] - dp_degree :1
[2023-03-16 08:06:55,788] [ INFO] - eval_batch_size :420
[2023-03-16 08:06:55,788] [ INFO] - eval_steps :None
[2023-03-16 08:06:55,788] [ INFO] - evaluation_strategy :IntervalStrategy.EPOCH
[2023-03-16 08:06:55,788] [ INFO] - flatten_param_grads :False
[2023-03-16 08:06:55,788] [ INFO] - fp16 :False
[2023-03-16 08:06:55,788] [ INFO] - fp16_full_eval :False
[2023-03-16 08:06:55,788] [ INFO] - fp16_opt_level :O1
[2023-03-16 08:06:55,788] [ INFO] - gradient_accumulation_steps :1
[2023-03-16 08:06:55,788] [ INFO] - greater_is_better :True
[2023-03-16 08:06:55,788] [ INFO] - ignore_data_skip :False
[2023-03-16 08:06:55,788] [ INFO] - input_dtype :int64
[2023-03-16 08:06:55,789] [ INFO] - input_infer_model_path :None
[2023-03-16 08:06:55,789] [ INFO] - label_names :None
[2023-03-16 08:06:55,789] [ INFO] - lazy_data_processing :True
[2023-03-16 08:06:55,789] [ INFO] - learning_rate :3e-05
[2023-03-16 08:06:55,789] [ INFO] - load_best_model_at_end :True
[2023-03-16 08:06:55,789] [ INFO] - local_process_index :1
[2023-03-16 08:06:55,789] [ INFO] - local_rank :1
[2023-03-16 08:06:55,789] [ INFO] - log_level :-1
[2023-03-16 08:06:55,789] [ INFO] - log_level_replica :-1
[2023-03-16 08:06:55,789] [ INFO] - log_on_each_node :True
[2023-03-16 08:06:55,789] [ INFO] - logging_dir :/home/project/paddle_class/checkpoint/sort/cn/runs/Mar16_08-06-42_6aac71f1edd1
[2023-03-16 08:06:55,789] [ INFO] - logging_first_step :False
[2023-03-16 08:06:55,789] [ INFO] - logging_steps :5
[2023-03-16 08:06:55,789] [ INFO] - logging_strategy :IntervalStrategy.STEPS
[2023-03-16 08:06:55,789] [ INFO] - lr_scheduler_type :SchedulerType.LINEAR
[2023-03-16 08:06:55,790] [ INFO] - max_grad_norm :1.0
[2023-03-16 08:06:55,790] [ INFO] - max_steps :-1
[2023-03-16 08:06:55,790] [ INFO] - metric_for_best_model :accuracy
[2023-03-16 08:06:55,790] [ INFO] - minimum_eval_times :None
[2023-03-16 08:06:55,790] [ INFO] - moving_rate :0.9
[2023-03-16 08:06:55,790] [ INFO] - no_cuda :False
[2023-03-16 08:06:55,790] [ INFO] - num_train_epochs :100.0
[2023-03-16 08:06:55,790] [ INFO] - onnx_format :True
[2023-03-16 08:06:55,790] [ INFO] - optim :OptimizerNames.ADAMW
[2023-03-16 08:06:55,790] [ INFO] - output_dir :/home/project/paddle_class/checkpoint/sort/cn
[2023-03-16 08:06:55,790] [ INFO] - overwrite_output_dir :False
[2023-03-16 08:06:55,790] [ INFO] - past_index :-1
[2023-03-16 08:06:55,790] [ INFO] - per_device_eval_batch_size :420
[2023-03-16 08:06:55,790] [ INFO] - per_device_train_batch_size :420
[2023-03-16 08:06:55,790] [ INFO] - prediction_loss_only :False
[2023-03-16 08:06:55,791] [ INFO] - process_index :1
[2023-03-16 08:06:55,791] [ INFO] - prune_embeddings :False
[2023-03-16 08:06:55,791] [ INFO] - recompute :True
[2023-03-16 08:06:55,791] [ INFO] - remove_unused_columns :True
[2023-03-16 08:06:55,791] [ INFO] - report_to :['visualdl']
[2023-03-16 08:06:55,791] [ INFO] - resume_from_checkpoint :True
[2023-03-16 08:06:55,791] [ INFO] - round_type :round
[2023-03-16 08:06:55,791] [ INFO] - run_name :/home/project/paddle_class/checkpoint/sort/cn
[2023-03-16 08:06:55,791] [ INFO] - save_on_each_node :False
[2023-03-16 08:06:55,791] [ INFO] - save_steps :100
[2023-03-16 08:06:55,791] [ INFO] - save_strategy :IntervalStrategy.EPOCH
[2023-03-16 08:06:55,791] [ INFO] - save_total_limit :1
[2023-03-16 08:06:55,791] [ INFO] - scale_loss :32768
[2023-03-16 08:06:55,791] [ INFO] - seed :42
[2023-03-16 08:06:55,792] [ INFO] - sharding :[<ShardingOption.SHARD_GRAD_OP: 'stage2'>]
[2023-03-16 08:06:55,792] [ INFO] - sharding_degree :4
[2023-03-16 08:06:55,792] [ INFO] - should_log :False
[2023-03-16 08:06:55,792] [ INFO] - should_save :False
[2023-03-16 08:06:55,792] [ INFO] - skip_memory_metrics :True
[2023-03-16 08:06:55,792] [ INFO] - strategy :dynabert+ptq
[2023-03-16 08:06:55,792] [ INFO] - train_batch_size :420
[2023-03-16 08:06:55,792] [ INFO] - use_pact :True
[2023-03-16 08:06:55,792] [ INFO] - warmup_ratio :0.1
[2023-03-16 08:06:55,792] [ INFO] - warmup_steps :0
[2023-03-16 08:06:55,792] [ INFO] - weight_decay :0.0
[2023-03-16 08:06:55,792] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2023-03-16 08:06:55,792] [ INFO] - width_mult_list :None
[2023-03-16 08:06:55,792] [ INFO] - world_size :4
[2023-03-16 08:06:55,793] [ INFO] -
WARNING:root:While using ClipGradByGlobalNorm in GroupShardedOptimizerStage2, the grad clip of original optimizer will be changed.
[2023-03-16 08:06:57,190] [ INFO] - ***** Running training *****
[2023-03-16 08:06:57,191] [ INFO] - Num examples = 800000
[2023-03-16 08:06:57,191] [ INFO] - Num Epochs = 100
[2023-03-16 08:06:57,191] [ INFO] - Instantaneous batch size per device = 420
[2023-03-16 08:06:57,191] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 1680
[2023-03-16 08:06:57,191] [ INFO] - Gradient Accumulation steps = 1
[2023-03-16 08:06:57,191] [ INFO] - Total optimization steps = 47700.0
[2023-03-16 08:06:57,191] [ INFO] - Total num train samples = 80000000.0
[2023-03-16 08:06:57,193] [ INFO] - Number of trainable parameters = 98054787
Can not add param: embedding_0.w_0, param's shape: [40000, 384], param align: 0, grad_storages fill: 0,
Can not add param: linear_25.w_0, param's shape: [384, 193923], param align: 0, grad_storages fill: 2419200,
[2023-03-16 08:10:42,126] [ INFO] - ***** Running Evaluation *****
[2023-03-16 08:10:42,126] [ INFO] - Num examples = 199993
[2023-03-16 08:10:42,127] [ INFO] - Total prediction steps = 120
[2023-03-16 08:10:42,127] [ INFO] - Pre device batch size = 420
[2023-03-16 08:10:42,127] [ INFO] - Total Batch size = 1680
terminate called after throwing an instance of 'paddle::memory::allocation::BadAlloc'
what():


C++ Traceback (most recent call last):

0 concat_ad_func(std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, paddle::experimental::ScalarBase<paddle::experimental::Tensor>)
1 paddle::experimental::concat(std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, paddle::experimental::ScalarBase<paddle::experimental::Tensor> const&)
2 void phi::ConcatKernel<float, phi::GPUContext>(phi::GPUContext const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::DenseTensor*)
3 phi::DenseTensor::mutable_data(phi::Place const&, paddle::experimental::DataType, unsigned long)
4 paddle::memory::AllocShared(phi::Place const&, unsigned long)
5 paddle::memory::allocation::AllocatorFacade::AllocShared(phi::Place const&, unsigned long)
6 paddle::memory::allocation::AllocatorFacade::Alloc(phi::Place const&, unsigned long)
7 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8 paddle::memory::allocation::Allocator::Allocate(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13 phi::enforce::GetCurrentTraceBackString[abi:cxx11]


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 1. Cannot allocate 14.563977GB memory on GPU 1, 17.327148GB memory has been allocated and available memory is only 14.421387GB.

Please check whether there is any other process using GPU 1.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.
    If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false.
    (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)
ht896632 added the question label Mar 16, 2023
w5688414 (Contributor) commented

The batch_size is too large; try reducing it.
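
A minimal sketch of that suggestion, assuming the training script builds its arguments with PaddleNLP's `TrainingArguments` (the field names below are the ones shown in the "Training Configuration Arguments" log above; the concrete values are only illustrative):

```python
from paddlenlp.trainer import TrainingArguments

# Illustrative values only: the log above shows both batch sizes at 420.
training_args = TrainingArguments(
    output_dir="/home/project/paddle_class/checkpoint/sort/cn",
    per_device_train_batch_size=32,  # was 420
    per_device_eval_batch_size=32,   # was 420; the OOM is raised during evaluation
    gradient_accumulation_steps=4,   # optional: keeps the effective train batch larger
)
```

If the script is launched from the command line instead, the same fields are normally exposed as `--per_device_train_batch_size` and `--per_device_eval_batch_size` flags.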

ht896632 (Author) commented

> The batch_size is too large; try reducing it.

I've tried that; the same error occurs even with a smaller batch size. GPU memory usage spikes sharply when the model is being saved, even though the total dataset is not large. On 4 V100 32 GB cards, essentially all of the GPU memory gets used up.
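
Since the spike is reported around saving/evaluation rather than during the training steps themselves, one thing worth trying (a sketch only, not a confirmed fix) is releasing the allocator's cached blocks at the epoch boundary, just before the epoch-end evaluate/save runs. This assumes the installed PaddleNLP version exposes Hugging Face-style `TrainerCallback` hooks, and note that `paddle.device.cuda.empty_cache()` only frees cached, unused blocks, so it cannot help if the memory is genuinely held by live tensors.

```python
import paddle
from paddlenlp.trainer import TrainerCallback  # assumption: exported by your PaddleNLP version


class EmptyCacheCallback(TrainerCallback):
    """Release cached (unused) GPU memory at each epoch boundary."""

    def on_epoch_end(self, args, state, control, **kwargs):
        paddle.device.cuda.empty_cache()


# Hypothetical usage:
# trainer = Trainer(..., callbacks=[EmptyCacheCallback()])
```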

526872163 commented

Lower it to 8 or even less first; once it runs successfully, raise it step by step.
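
When stepping the batch size back up, it may help to log how much headroom each setting leaves on a 32 GB V100. A small sketch using Paddle's CUDA memory statistics (present in recent 2.x releases; verify the exact function names against your installed version):

```python
import paddle


def report_gpu_memory(tag: str) -> None:
    """Print peak GPU memory use so the headroom left by a given batch size is visible."""
    gb = 1024 ** 3
    allocated = paddle.device.cuda.max_memory_allocated() / gb
    reserved = paddle.device.cuda.max_memory_reserved() / gb
    print(f"[{tag}] peak allocated: {allocated:.2f} GB, peak reserved: {reserved:.2f} GB")


# Hypothetical usage after a trial epoch with a small batch size:
# report_gpu_memory("per_device_batch_size=8")
```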
