fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #9

guathwa · 2023-02-01T07:24:28Z

Hi team,

I am trying to fine tune on a seqcls task (using the provided script) on my own dataset but was OOM on my gpu (DGX A100, 40Gb). So now I am trying to run it with deepspeed but encounter the following error.

Please advise. Thanks.

*I am a begineer in deep learning...

================
Try with deepspeed
deepspeed_config.json
{
"zero_optimization": {
"stage": 1,
"reduce_bucket_size": 5e8
},
"train_batch_size":"auto"
}

================

cd /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr
export task=tr
export datadir=data/$task
export outdir=runs/$task/GPT2
export seed=100
export name=test2
export lr=4e-5
export OMP_NUM_THREADS=1
export WANDB_DISABLED=True
export train_batch_size=4
export max_seq_length=128
export grad_accum=4

(pubmedgpt) dro@dro-DGX-Station-A100:~/guathwa/pubmedgpt/finetune/seqcls_tr$ python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 run_seqcls_gpt_tr_v0.2_dro.py --model_name_or_path "stanford-crfm-pubmedgpt" --train_file $datadir/train5000.csv --validation_file $datadir/dev.csv --test_file $datadir/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size $train_batch_size --learning_rate $lr --warmup_ratio 0.5 --num_train_epochs 1 --max_seq_length $max_seq_length --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir $outdir --overwrite_output_dir --fp16 --seed $seed --run_name $name --ddp_find_unused_parameters False --weight_decay 0.0 --deepspeed deepspeed_config.json

/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
01/30/2023 17:11:23 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
01/30/2023 17:11:23 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=deepspeed_config.json,
disable_tqdm=False,
do_eval=True,
do_predict=True,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=runs/tr/GPT2/runs/Jan30_17-11-23_dro-DGX-Station-A100
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_hf,
output_dir=runs/tr/GPT2,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=test2,
save_on_each_node=False,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=100,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.5,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
01/30/2023 17:11:23 - INFO - main - load a local file for train: data/tr/train5000.csv
01/30/2023 17:11:23 - INFO - main - load a local file for validation: data/tr/dev.csv
01/30/2023 17:11:23 - INFO - main - load a local file for test: data/tr/test.csv
01/30/2023 17:11:23 - WARNING - datasets.builder - Using custom data configuration default-b4448ac955faff7e
01/30/2023 17:11:23 - INFO - datasets.info - Loading Dataset Infos from /home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/datasets/packaged_modules/csv
01/30/2023 17:11:23 - INFO - datasets.builder - Overwrite dataset info from restored data version.
01/30/2023 17:11:23 - INFO - datasets.info - Loading Dataset info from /home/dro/.cache/huggingface/datasets/csv/default-b4448ac955faff7e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317
01/30/2023 17:11:23 - WARNING - datasets.builder - Found cached dataset csv (/home/dro/.cache/huggingface/datasets/csv/default-b4448ac955faff7e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
01/30/2023 17:11:23 - INFO - datasets.info - Loading Dataset info from /home/dro/.cache/huggingface/datasets/csv/default-b4448ac955faff7e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1399.35it/s]

label_list [0, 1, 2, 3]
[INFO|configuration_utils.py:652] 2023-01-30 17:11:23,271 >> loading configuration file stanford-crfm-pubmedgpt/config.json
[INFO|configuration_utils.py:706] 2023-01-30 17:11:23,272 >> Model config GPT2Config {
"_name_or_path": "stanford-crfm-pubmedgpt",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 28895,
"embd_pdrop": 0.1,
"eos_token_id": 28895,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2,
"LABEL_3": 3
},
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 2560,
"n_head": 20,
"n_inner": null,
"n_layer": 32,
"n_positions": 1024,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": true,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"torch_dtype": "float32",
"transformers_version": "4.24.0",
"use_cache": false,
"vocab_size": 28896
}

[INFO|tokenization_utils_base.py:1773] 2023-01-30 17:11:23,273 >> loading file vocab.json
[INFO|tokenization_utils_base.py:1773] 2023-01-30 17:11:23,273 >> loading file merges.txt
[INFO|tokenization_utils_base.py:1773] 2023-01-30 17:11:23,273 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:1773] 2023-01-30 17:11:23,273 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1773] 2023-01-30 17:11:23,273 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1773] 2023-01-30 17:11:23,273 >> loading file tokenizer_config.json
[INFO|modeling_utils.py:2155] 2023-01-30 17:11:23,302 >> loading weights file stanford-crfm-pubmedgpt/pytorch_model.bin
[WARNING|modeling_utils.py:2598] 2023-01-30 17:11:42,278 >> Some weights of the model checkpoint at stanford-crfm-pubmedgpt were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']

This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:2610] 2023-01-30 17:11:42,279 >> Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at stanford-crfm-pubmedgpt and are newly initialized: ['classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Adding [PAD] token to tokenizer and model word embeddings.
[INFO|tokenization_utils_base.py:898] 2023-01-30 17:11:42,597 >> Assigning [PAD] to the pad_token key of the tokenizer
01/30/2023 17:11:43 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/dro/.cache/huggingface/datasets/csv/default-b4448ac955faff7e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-84075b8e214e8641.arrow
01/30/2023 17:11:43 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/dro/.cache/huggingface/datasets/csv/default-b4448ac955faff7e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-f2c30d0b5792242f.arrow
01/30/2023 17:11:43 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/dro/.cache/huggingface/datasets/csv/default-b4448ac955faff7e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-a8ef74afd83d72ba.arrow
[INFO|trainer.py:557] 2023-01-30 17:11:43,480 >> Using cuda_amp half precision backend
[INFO|trainer.py:725] 2023-01-30 17:11:43,480 >> The following columns in the training set don't have a corresponding argument in GPT2ForSequenceClassification.forward and have been ignored: sentence. If sentence are not expected by GPT2ForSequenceClassification.forward, you can safely ignore this message.
/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
[2023-01-30 17:11:43,488] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
Traceback (most recent call last):
File "run_seqcls_gpt_tr_v0.1.py", line 638, in
main()
File "run_seqcls_gpt_tr_v0.1.py", line 563, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1501, in train
return inner_training_loop(
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1570, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/init.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 301, in init
self._configure_distributed_model(model)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_distributed_model
self._broadcast_model()
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1102, in _broadcast_model
dist.broadcast(p,
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
return func(*args, **kwargs)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 232, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 70, in broadcast
return torch.distributed.broadcast(tensor=tensor,
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1201, in broadcast
work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48532) of binary: /home/dro/anaconda3/envs/pubmedgpt/bin/python
Traceback (most recent call last):
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dro/anaconda3/envs/pubmedgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_seqcls_gpt_tr_v0.1.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-01-30_17:11:47
host : ...
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 48532)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The text was updated successfully, but these errors were encountered:

J38 · 2023-02-01T12:16:44Z

Hi I'll work today on running a basic fine tuning example and looking at the memory footprint and get back to you!

guathwa · 2023-02-01T14:18:22Z

Thanks!

J38 · 2023-02-02T13:14:45Z

Hi I'll get you a sample command in the next day or two, but this link here explains using deepspeed:

https://huggingface.co/docs/transformers/main_classes/deepspeed#deployment-with-one-gpu

You should be using the deepspeed command and make sure you've installed deepspeed.

I think if you only have 1 GPU you're going to need to try ZeRO-offload ... this is explained in that link I provided.

But I'll try to get this working on my own and let you know ...

J38 · 2023-02-02T13:15:34Z

This is another good link:

huggingface/transformers#8771 (comment)

J38 · 2023-02-02T13:16:03Z

Over the next few days I'll try to get this working so we have a great working example of fine-tuning the model with 1 GPU !

guathwa · 2023-02-02T13:22:32Z

Thank you, I will read up n try out also on my 1 GPU!

J38 · 2023-02-04T11:53:34Z

I've gotten the code running and it uses 20GB of GPU memory and 50GB of RAM. So as long as the machine with your A100 has plenty of RAM this could work with 1 GPU.

Set up environment:

# create conda environment
conda create -n biomedlm python=3.8.12 pytorch=1.12.1 torchdata cudatoolkit=11.6.0 -c pytorch -c nvidia

# activate conda environment
conda activate biomedlm

# install python dependencies
# note that flash attention can take 30m to install so it is normal for it to do nothing for 30m
pip install flash-attn
pip install numpy
pip install transformers==4.26.0 datasets=2.9.0 omegaconf wandb
pip install fairscale
pip install accelerate

DeepSpeed config: deepspeed_config.json

{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-06,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.0
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_max_lr": 2e-06,
      "warmup_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  
  "train_batch_size": "auto",

  "fp16": {
   "enabled": true
  }

}

Command I ran in seqcls directory:

task=pubmedqa_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval

deepspeed --num_gpus 1 --num_nodes 1 run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /path/to/model --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20  --max_seq_length 560  --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir pubmedqa-finetune-demo --overwrite_output_dir --fp16 --use_flash true  --seed 1 --run_name pubmedqa-finetune-demo --deepspeed deepspeed_config.json

Please let me know if you can get this working!

J38 · 2023-02-04T12:08:47Z

So to summarize it looks like you can run the sequence classification with 1 GPU and 40GB GPU memory (maybe even 20GB GPU memory) ... but I do think you are going to need something like 50GB of machine RAM to take advantage of the CPU offloading

guathwa · 2023-02-04T12:53:43Z

That's really great news! Thank you so much for your help! I will try out and let you know. This machine has plenty of RAM too.

guathwa · 2023-02-09T11:10:04Z

Hi J38, I am happy to share that I am able to complete the training following your instructions, without using --use_flash true. If I include --use_flash true, it will give me the following error. Still trying to troubleshoot what could be the cause. If you have any clue, do let me know. Thanks.

(biomedlm) dro@dro-DGX-Station:~/guathwa/pubmedgpt/finetune/seqcls_tr_dro$ deepspeed --num_gpus 1 --num_nodes 1 run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt --train_file $datadir/train.csv --validation_file $datadir/dev.csv --test_file $datadir/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir tr-finetune-demo --overwrite_output_dir --fp16 --use_flash true --seed 1 --run_name tr-finetune-demo --deepspeed deepspeed_config.json
[2023-02-09 12:38:17,346] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-09 12:38:17,590] [INFO] [runner.py:548:main] cmd = /home/dro/anaconda3/envs/biomedlm/bin/python -u -m deepspeed.launcher.launch --world_info=xxx --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt --train_file data/tr/train.csv --validation_file data/tr/dev.csv --test_file data/tr/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir tr-finetune-demo --overwrite_output_dir --fp16 --use_flash true --seed 1 --run_name tr-finetune-demo --deepspeed deepspeed_config.json
[2023-02-09 12:38:18,956] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-09 12:38:18,956] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-09 12:38:18,956] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-09 12:38:18,956] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-09 12:38:18,956] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-09 12:38:24,928] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "run_seqcls_gpt.py", line 634, in
main()
File "run_seqcls_gpt.py", line 221, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/dro/anaconda3/envs/biomedlm/lib/python3.8/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--use_flash', 'true']
[2023-02-09 12:38:25,968] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 78542
[2023-02-09 12:38:25,969] [ERROR] [launch.py:324:sigkill_handler] ['/home/dro/anaconda3/envs/biomedlm/bin/python', '-u', 'run_seqcls_gpt.py', '--local_rank=0', '--tokenizer_name', 'stanford-crfm/pubmed_gpt_tokenizer', '--model_name_or_path', '/home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt', '--train_file', 'data/tr/train.csv', '--validation_file', 'data/tr/dev.csv', '--test_file', 'data/tr/test.csv', '--do_train', '--do_eval', '--do_predict', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--learning_rate', '2e-06', '--warmup_ratio', '0.5', '--num_train_epochs', '20', '--max_seq_length', '560', '--logging_steps', '100', '--save_strategy', 'no', '--evaluation_strategy', 'no', '--output_dir', 'tr-finetune-demo', '--overwrite_output_dir', '--fp16', '--use_flash', 'true', '--seed', '1', '--run_name', 'tr-finetune-demo', '--deepspeed', 'deepspeed_config.json'] exits with return code = 1

J38 · 2023-02-14T13:37:48Z

By the way I wasn't seeing any performance gain using flash attention, not sure if it just doesn't help or a bug in my system ... this bug you're reporting is because I forgot to push the updated code that has the flash attention option ... will try to push that soon !

J38 · 2023-02-14T13:40:59Z

Okay I pushed the updated code!

guathwa · 2023-02-14T14:50:47Z

Saw the updated codes. I will close this issue. Thanks for the great help!

brando90 · 2023-02-16T03:48:11Z

is there any code that works? e.g. colab? thanks!

guathwa closed this as completed Feb 14, 2023

J38 mentioned this issue Mar 13, 2023

How to run the evaluator for MedQA-USMLE #12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #9

fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #9

guathwa commented Feb 1, 2023

J38 commented Feb 1, 2023

guathwa commented Feb 1, 2023

J38 commented Feb 2, 2023

J38 commented Feb 2, 2023

J38 commented Feb 2, 2023

guathwa commented Feb 2, 2023

J38 commented Feb 4, 2023

J38 commented Feb 4, 2023

guathwa commented Feb 4, 2023

guathwa commented Feb 9, 2023 •

edited

Loading

J38 commented Feb 14, 2023

J38 commented Feb 14, 2023

guathwa commented Feb 14, 2023

brando90 commented Feb 16, 2023

fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #9

fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #9

Comments

guathwa commented Feb 1, 2023

Failures: <NO_OTHER_FAILURES>

J38 commented Feb 1, 2023

guathwa commented Feb 1, 2023

J38 commented Feb 2, 2023

J38 commented Feb 2, 2023

J38 commented Feb 2, 2023

guathwa commented Feb 2, 2023

J38 commented Feb 4, 2023

J38 commented Feb 4, 2023

guathwa commented Feb 4, 2023

guathwa commented Feb 9, 2023 • edited Loading

J38 commented Feb 14, 2023

J38 commented Feb 14, 2023

guathwa commented Feb 14, 2023

brando90 commented Feb 16, 2023

Failures:
<NO_OTHER_FAILURES>

guathwa commented Feb 9, 2023 •

edited

Loading