"OOM during optimization" when fine-tuning NLLB #4930

Open
zgerrard opened this issue Dec 29, 2022 · 10 comments

Comments

zgerrard commented Dec 29, 2022

❓ Questions and Help

What is your question?

Hi, I am getting an "OOM during optimization, irrecoverable" error when trying to fine-tune the 3.3B-parameter NLLB model.

Stack trace:
Traceback (most recent call last):
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/trainer.py", line 1147, in train_step
    raise e
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/trainer.py", line 1099, in train_step
    self.task.optimizer_step(
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/tasks/fairseq_task.py", line 550, in optimizer_step
    optimizer.step()
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/optim/fp16_optimizer.py", line 440, in step
    self.wrapped_optimizer.step(
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/optim/fairseq_optimizer.py", line 120, in step
    self.optimizer.step(closure, scale=scale)
  File "/home/x/.local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/x/projects/nllb/fairseq/slurm_snapshot_code/2022-12-28T22_01_31.150636/fairseq/optim/fused_adam.py", line 209, in step
    exp_avg = exp_avg.float() * state["exp_avg_scale"]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.11 GiB (GPU 0; 23.70 GiB total capacity; 20.43 GiB already allocated; 2.13 GiB free; 20.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
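
The allocator hint at the end of the trace is cheap to try. A minimal sketch, assuming the standard PyTorch caching-allocator setting applies here; the value 128 is an arbitrary starting point, not something validated in this thread:

    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

This only mitigates fragmentation (the "reserved >> allocated" case); it cannot help if the card is genuinely out of capacity.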

Any ideas? Any help will be greatly appreciated.

What have you tried?

I tried fine-tuning the smaller models; only the 600M-parameter (smallest) model did not hit the error above.

What's your environment?

  • GPU models and configuration: 24 GB GPU (RTX 3090)
@zgerrard changed the title from "OOM during optimization when fine-tuning NLLB" to ""OOM during optimization" when fine-tuning NLLB" on Dec 29, 2022
FayZ676 commented Dec 29, 2022

What were your hyperparameter settings?

zgerrard (Author) commented Dec 29, 2022

@FayZ676 I used the default parameters from nllb200_dense3.3B_finetune_on_fbseed.yaml and just changed the dataset path. I also tried lowering max_tokens, but it didn't fix the error.

zgerrard (Author) commented Dec 30, 2022

All hyperparameters:

{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': 'out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', 'wandb_project': None, 'azureml_logging': False, 'seed': 2, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False, 'moe_generation': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False, 'freeze_up_to_layer': None}, 'dataset': {'_name': None, 'num_workers': 1, 'num_workers_valid': 0, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 100, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1000, 'validate_interval_updates': 10, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 100, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [5e-05], 'stop_min_lr': 1e-09, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': 
'out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': '/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', 'ignore_suffix': False, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1000, 'save_interval_updates': 50, 'keep_interval_updates': 1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_best_checkpoints': False, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'synchronize_checkpoints_before_copy': False, 'symlink_best_and_last_checkpoints': False, 'best_checkpoint_metric': 'nll_loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 's3_upload_path': None, 'replication_count': 1, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'stats_path': None, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(no_progress_bar=False, log_interval=100, log_format='json', log_file=None, tensorboard_logdir='out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', wandb_project=None, azureml_logging=False, seed=2, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=True, memory_efficient_fp16=True, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', 
log_nvidia_smi=False, use_tutel_moe=False, tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='inverse_sqrt', scoring='bleu', criterion='label_smoothed_cross_entropy', task='translation_multi_simple_epoch', num_workers=1, num_workers_valid=0, skip_invalid_size_inputs_valid_test=True, max_tokens=100, batch_size=None, required_batch_size_multiple=8, required_seq_len_multiple=1, dataset_impl=None, data_buffer_size=10, train_subset='train', valid_subset='valid', combine_valid_subsets=None, ignore_unused_valid_subsets=False, validate_interval=1000, validate_interval_updates=10, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=100, batch_size_valid=None, max_valid_steps=None, curriculum=0, gen_subset='test', num_shards=1, shard_id=0, grouped_shuffling=False, update_epoch_batch_itr=False, update_ordered_indices_seed=False, distributed_world_size=1, distributed_num_procs=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='fully_sharded', ddp_comm_hook='none', bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=False, gradient_as_bucket_view=False, fast_stat_sync=False, heartbeat_timeout=-1, broadcast_buffers=False, slowmo_momentum=None, slowmo_base_algorithm='localsgd', localsgd_frequency=3, nprocs_per_node=1, pipeline_model_parallel=False, pipeline_balance=None, pipeline_devices=None, pipeline_chunks=0, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_checkpoint='never', zero_sharding='none', no_reshard_after_forward=False, fp32_reduce_scatter=False, cpu_offload=False, use_sharded_state=False, not_fsdp_flatten_parameters=False, freeze_up_to_layer=None, arch='transformer', max_epoch=0, max_update=50, stop_time_hours=0, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[5e-05], stop_min_lr=1e-09, use_bmuf=False, train_with_epoch_remainder_batch=False, save_dir='out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', restore_file='checkpoint_last.pt', continue_once=None, finetune_from_model='/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', ignore_suffix=False, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1000, save_interval_updates=50, keep_interval_updates=1, keep_interval_updates_pattern=-1, keep_last_epochs=1, keep_best_checkpoints=-1, no_save=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_best_checkpoints=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, synchronize_checkpoints_before_copy=False, symlink_best_and_last_checkpoints=False, best_checkpoint_metric='nll_loss', maximize_best_checkpoint_metric=False, patience=-1, checkpoint_suffix='', checkpoint_shard_count=1, load_checkpoint_on_all_dp_ranks=False, write_checkpoints_asynchronously=False, s3_upload_path=None, replication_count=1, store_ema=False, ema_decay=0.9999, ema_start_update=0, ema_seed_model=None, ema_update_freq=1, ema_fp32=False, source_lang=None, target_lang=None, lang_pairs='eng_Latn-fra_Latn', keep_inference_langtok=False, one_dataset_per_batch=False, sampling_method='temperature', sampling_temperature=1.0, 
data='/home/x/projects/stopes/stopes/pipelines/prepare_data/outputs/2022-12-27/22-02-43/prepped_data_new_valid_skip_blank/data_bin/shard000', langs=['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn'], lang_dict=None, source_dict=None, target_dict=None, lang_tok_style='multilingual', load_alignments=False, left_pad_source='True', left_pad_target='False', upsample_primary=1, truncate_source=False, encoder_langtok='src', decoder_langtok=True, lang_tok_replacing_bos_eos=False, enable_lang_ids=False, enable_reservsed_directions_shared_datasets=False, extra_data=None, extra_lang_pairs=None, fixed_dictionary=None, langtoks_specs=['main'], langtoks=None, sampling_weights_from_file=None, sampling_weights=None, virtual_epoch_size=None, virtual_data_size=None, pad_to_fixed_length=False, use_local_shard_size=True, enable_m2m_validation=True, add_data_source_prefix_tags=True, add_ssl_task_tokens=False, tokens_per_sample=512, sample_break_mode='eos', mask=0.1, mask_random=0.0, insert=0.0, permute=0.0, rotate=0.0, poisson_lambda=3.0, permute_sentences=0.0, mask_length='subword', replace_length=1, ignore_mmt_main_data=False, mixed_multitask_denoising_prob=0.5, 
eval_lang_pairs=None, finetune_dict_specs=None, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.0, use_old_adam=False, fp16_adam_stats=True, block_wise=False, warmup_updates=10, warmup_init_lr=1e-07, pad=1, eos=2, unk=3, label_smoothing=0.1, report_accuracy=False, ignore_prefix_size=0, dropout=0.1, max_source_positions=512, max_target_positions=512, share_all_embeddings=True, decoder_normalize_before=True, encoder_normalize_before=True, min_params_to_wrap=100000000, encoder_layers=24, decoder_layers=24, encoder_ffn_embed_dim=8192, decoder_ffn_embed_dim=8192, encoder_embed_dim=2048, decoder_embed_dim=2048, encoder_attention_heads=16, decoder_attention_heads=16, attention_dropout=0.1, relu_dropout=0.0, no_seed_provided=False, encoder_embed_path=None, encoder_learned_pos=False, decoder_embed_path=None, decoder_learned_pos=False, activation_dropout=0.0, activation_fn='relu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, share_decoder_input_output_embed=False, no_token_positional_embeddings=False, adaptive_input=False, no_cross_attention=False, cross_self_attention=False, decoder_output_dim=2048, decoder_input_dim=2048, no_scale_embedding=False, layernorm_embedding=False, tie_adaptive_weights=False, checkpoint_activations=False, offload_activations=False, encoder_layers_to_keep=None, decoder_layers_to_keep=None, encoder_layerdrop=0, decoder_layerdrop=0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, _name='transformer'), 'task': Namespace(no_progress_bar=False, log_interval=100, log_format='json', log_file=None, tensorboard_logdir='out/tb/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', wandb_project=None, azureml_logging=False, seed=2, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=True, memory_efficient_fp16=True, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', log_nvidia_smi=False, use_tutel_moe=False, tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='inverse_sqrt', scoring='bleu', criterion='label_smoothed_cross_entropy', task='translation_multi_simple_epoch', num_workers=1, num_workers_valid=0, skip_invalid_size_inputs_valid_test=True, max_tokens=100, batch_size=None, required_batch_size_multiple=8, required_seq_len_multiple=1, dataset_impl=None, data_buffer_size=10, train_subset='train', valid_subset='valid', combine_valid_subsets=None, ignore_unused_valid_subsets=False, validate_interval=1000, validate_interval_updates=10, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=100, batch_size_valid=None, max_valid_steps=None, curriculum=0, gen_subset='test', num_shards=1, shard_id=0, grouped_shuffling=False, update_epoch_batch_itr=False, update_ordered_indices_seed=False, distributed_world_size=1, distributed_num_procs=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='fully_sharded', ddp_comm_hook='none', 
bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=False, gradient_as_bucket_view=False, fast_stat_sync=False, heartbeat_timeout=-1, broadcast_buffers=False, slowmo_momentum=None, slowmo_base_algorithm='localsgd', localsgd_frequency=3, nprocs_per_node=1, pipeline_model_parallel=False, pipeline_balance=None, pipeline_devices=None, pipeline_chunks=0, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_checkpoint='never', zero_sharding='none', no_reshard_after_forward=False, fp32_reduce_scatter=False, cpu_offload=False, use_sharded_state=False, not_fsdp_flatten_parameters=False, freeze_up_to_layer=None, arch='transformer', max_epoch=0, max_update=50, stop_time_hours=0, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[5e-05], stop_min_lr=1e-09, use_bmuf=False, train_with_epoch_remainder_batch=False, save_dir='out/dense.mfp16.mu50.uf1.lss.tmp1.lr5e-05.drop0.1.maxtok100.seed2.max_pos512.shem.NBF.adam16bit.fully_sharded.entsrc.det.transformer.ELS24.DLS24.E2048.H16.ATTDRP0.1.RELDRP0.0.ngpu1', restore_file='checkpoint_last.pt', continue_once=None, finetune_from_model='/media/x/data/nllb_checkpoint/3.3/checkpoint.pt', ignore_suffix=False, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1000, save_interval_updates=50, keep_interval_updates=1, keep_interval_updates_pattern=-1, keep_last_epochs=1, keep_best_checkpoints=-1, no_save=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_best_checkpoints=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, synchronize_checkpoints_before_copy=False, symlink_best_and_last_checkpoints=False, best_checkpoint_metric='nll_loss', maximize_best_checkpoint_metric=False, patience=-1, checkpoint_suffix='', checkpoint_shard_count=1, load_checkpoint_on_all_dp_ranks=False, write_checkpoints_asynchronously=False, s3_upload_path=None, replication_count=1, store_ema=False, ema_decay=0.9999, ema_start_update=0, ema_seed_model=None, ema_update_freq=1, ema_fp32=False, source_lang=None, target_lang=None, lang_pairs='eng_Latn-fra_Latn', keep_inference_langtok=False, one_dataset_per_batch=False, sampling_method='temperature', sampling_temperature=1.0, data='/home/x/projects/stopes/stopes/pipelines/prepare_data/outputs/2022-12-27/22-02-43/prepped_data_new_valid_skip_blank/data_bin/shard000', langs=['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 
'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn'], lang_dict=None, source_dict=None, target_dict=None, lang_tok_style='multilingual', load_alignments=False, left_pad_source='True', left_pad_target='False', upsample_primary=1, truncate_source=False, encoder_langtok='src', decoder_langtok=True, lang_tok_replacing_bos_eos=False, enable_lang_ids=False, enable_reservsed_directions_shared_datasets=False, extra_data=None, extra_lang_pairs=None, fixed_dictionary=None, langtoks_specs=['main'], langtoks=None, sampling_weights_from_file=None, sampling_weights=None, virtual_epoch_size=None, virtual_data_size=None, pad_to_fixed_length=False, use_local_shard_size=True, enable_m2m_validation=True, add_data_source_prefix_tags=True, add_ssl_task_tokens=False, tokens_per_sample=512, sample_break_mode='eos', mask=0.1, mask_random=0.0, insert=0.0, permute=0.0, rotate=0.0, poisson_lambda=3.0, permute_sentences=0.0, mask_length='subword', replace_length=1, ignore_mmt_main_data=False, mixed_multitask_denoising_prob=0.5, eval_lang_pairs=None, finetune_dict_specs=None, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.0, use_old_adam=False, fp16_adam_stats=True, block_wise=False, warmup_updates=10, warmup_init_lr=1e-07, pad=1, eos=2, unk=3, label_smoothing=0.1, report_accuracy=False, ignore_prefix_size=0, dropout=0.1, max_source_positions=512, max_target_positions=512, share_all_embeddings=True, decoder_normalize_before=True, encoder_normalize_before=True, min_params_to_wrap=100000000, encoder_layers=24, decoder_layers=24, encoder_ffn_embed_dim=8192, decoder_ffn_embed_dim=8192, encoder_embed_dim=2048, decoder_embed_dim=2048, encoder_attention_heads=16, decoder_attention_heads=16, attention_dropout=0.1, relu_dropout=0.0, no_seed_provided=False, encoder_embed_path=None, encoder_learned_pos=False, decoder_embed_path=None, decoder_learned_pos=False, activation_dropout=0.0, activation_fn='relu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, share_decoder_input_output_embed=False, no_token_positional_embeddings=False, adaptive_input=False, no_cross_attention=False, cross_self_attention=False, decoder_output_dim=2048, decoder_input_dim=2048, no_scale_embedding=False, layernorm_embedding=False, tie_adaptive_weights=False, 
checkpoint_activations=False, offload_activations=False, encoder_layers_to_keep=None, decoder_layers_to_keep=None, encoder_layerdrop=0, decoder_layerdrop=0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, _name='translation_multi_simple_epoch'), 'criterion': {'_name': 'label_smoothed_cross_entropy', 'label_smoothing': 0.1, 'report_accuracy': False, 'ignore_prefix_size': 0, 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.0, 'use_old_adam': False, 'fp16_adam_stats': True, 'tpu': False, 'lr': [5e-05], 'block_wise': False}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 10, 'warmup_init_lr': 1e-07, 'lr': [5e-05]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
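
For what it's worth, the dump shows several memory-relevant switches left off: checkpoint_activations=False, offload_activations=False, cpu_offload=False. These are standard fairseq training options that trade compute (or host RAM) for GPU memory; a hedged sketch of turning them on, assuming the same entry point as above (whether this is enough for 3.3B on one 24 GB card is untested in this thread):

    python train.py ... --ddp-backend fully_sharded --checkpoint-activations --cpu-offload

Note that the failing allocation is in the optimizer step, so activation-side savings alone may not be enough; --cpu-offload (which requires the fully_sharded backend already in use here) moves parameters, and with them much of the optimizer work, toward host memory.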

FayZ676 commented Dec 30, 2022

For reference, I tried fine-tuning GPT-NeoX-20B on my setup (4x 3090s) and was told by the devs that I needed at least 13 bytes of memory per parameter. The largest model I could successfully fine-tune was the 2B-parameter one.

It looks like you're using the config for a 3.3B-parameter model on a single 3090, so you may simply not have enough memory to fine-tune models larger than 600M.

I don't know for sure, so it would be great if someone could confirm the memory requirements for fairseq.
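
A back-of-the-envelope check along those lines, as a sketch (the byte counts are the usual rough conventions, not measured fairseq numbers):

    # Hypothetical per-parameter footprint for plain fp16 Adam fine-tuning:
    # fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
    # + fp32 Adam exp_avg (4) + fp32 exp_avg_sq (4) = 16 bytes/param
    params = 3.3e9
    print(f"{params * 16 / 2**30:.1f} GiB")  # ~49.2 GiB, far above 24 GiB

    # With memory_efficient_fp16 and fp16_adam_stats (as in the config above),
    # roughly 2 + 2 + 2 + 2 = 8 bytes/param:
    print(f"{params * 8 / 2**30:.1f} GiB")   # ~24.6 GiB, still above 24 GiB

Either way the budget is exhausted before activations are even counted. It also matches the symptom: the traceback fails in fused_adam on exp_avg.float(), i.e. while materializing an fp32 copy of Adam state, which scales with parameter count rather than batch size, so lowering max_tokens would not be expected to help.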

edvardasast commented

@zgerrard Hi, do you have a step-by-step tutorial on how to fine-tune the 600M model? It would be really helpful. Could you share your fine-tuning project in a git repository?

yugaljain1999 commented

@edvardasast Did you find any git repository for finetuning?

edvardasast commented

@edvardasast Did you find any git repository for finetuning?

unfortunately not :(
I have successfully preprocessed the data with this command:
python preprocess.py -s eng_Latn -t deu_Latn --task multilingual_translation --trainpref my_dataset/train --destdir processed_data --validpref my_dataset/train --testpref my_dataset/train
But when I try to fine-tune with this command:
python train.py processed_data --task multilingual_translation --arch multilingual_transformer --save-dir fine_tuned_model --finetune-from-model model_checkpoints/checkpoint.pt --lang-pairs eng_Latn-deu_Latn --max-tokens 4096
I am getting this error:
Exception: Cannot load model parameters from checkpoint model_checkpoints/checkpoint.pt; please ensure that the architectures match.
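
One observation from the config dump earlier in this thread: the NLLB dense checkpoints were trained with task=translation_multi_simple_epoch and arch=transformer, not multilingual_translation / multilingual_transformer, so the architecture check is expected to fail with the command above regardless of the data. A hedged sketch of the direction to try, with the remaining options elided (layer counts, embedding dims, and the dictionary all still have to match the checkpoint, e.g. the 3.3B values visible in the dump):

    python train.py processed_data \
        --task translation_multi_simple_epoch \
        --arch transformer \
        --finetune-from-model model_checkpoints/checkpoint.pt \
        --lang-pairs eng_Latn-deu_Latn \
        ...

This is untested here; the NLLB training configs (e.g. the yaml mentioned above) are probably the safer reference for the full option set.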

robotsp commented Feb 20, 2023

@edvardasast

(quoting @edvardasast's reply above)

@edvardasast Would you please share your complete steps for fine-tuning NLLB? Thanks!

martinbombin commented

(quoting @edvardasast's reply above)

I am getting the same error. It seems that preprocessing uses the vocab of my data instead of the vocab of the trained NLLB model, which gives the model a different number of parameters.
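
If the dictionary is the mismatch, one hedged direction is to stop preprocess.py from building a new vocab from the data and make it reuse the dictionary shipped with the NLLB checkpoint instead. --srcdict/--tgtdict are standard fairseq preprocessing options; dictionary.txt below is a placeholder for wherever the NLLB dictionary file actually lives:

    python preprocess.py -s eng_Latn -t deu_Latn \
        --trainpref my_dataset/train --validpref my_dataset/valid --testpref my_dataset/test \
        --destdir processed_data \
        --srcdict dictionary.txt --tgtdict dictionary.txt

Since the NLLB config has share_all_embeddings=True, source and target must share one dictionary (hence the same file passed twice here).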

zhanbaohang commented

Where is the code for fine-tuning the NLLB model? Thanks.
