
EVA2.0 model files #20

Closed
xiaoqiao opened this issue Jan 28, 2022 · 30 comments

Comments

@xiaoqiao

Will the EVA2.0 model files be released?
The results look quite good. How many parameters does the model have?

@t1101675
Member

The example in the README is the 2.8-billion-parameter EVA2.0. The model files will definitely be released, but the exact timing depends on when BAAI (the Beijing Academy of Artificial Intelligence) finishes its evaluation of the model, probably in February or March.

@xiaoqiao
Author

xiaoqiao commented Jan 28, 2022 via email

@jiangliqin

@t1101675 Will it be possible to release EVA2.0 in March?

@Jiaxin-Wen
Member

Hi, the model release is still under review; we are aiming to complete the open-source release in late March.

@t1101675
Member

Our EVA2.0 model has now been released, and the links have been added to the README. We have also published a technical report on EVA2.0, which explores, through experiments, several important questions in building dialogue systems with large-scale pre-training. Feel free to check it out!

@jiangliqin

Thank you very much!

@jiangliqin

I tried out EVA2.0 on a few topics, and the response quality seems quite poor. Could you help check whether this is a problem with my configuration? Thanks @t1101675
(screenshot of sample conversations attached)

@t1101675
Member

Could you check whether the model config file has been switched to the eva2.0 one?

@jiangliqin

CONFIG_PATH has been changed to eva2.0_model_config.json

@t1101675
Member

Something must be misconfigured somewhere; we'll check on our end first.

@Jiaxin-Wen Jiaxin-Wen reopened this Mar 21, 2022
@jiangliqin

OK, it does look like a configuration issue.

@jiangliqin

I deployed it after changing the model parallel size to 1; you should be able to reproduce it that way.

@Jiaxin-Wen
Member

Which script are you running?

@jiangliqin

jiangliqin commented Mar 21, 2022

I changed the model parallel size with change_mp: `python3 src/change_mp.py checkpoints/eva2.0_4 checkpoints/eva2.0 1`
Then I ran interactive inference with eva_inference_interactive_beam.
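For context on what such a conversion involves: changing a checkpoint's model parallel size means re-splitting or merging each rank's weight shards. The following is a hypothetical numpy sketch of the merge direction only (going from several column-parallel shards down to one rank), not the actual change_mp.py logic:

```python
import numpy as np

def merge_column_parallel(shards):
    """Merge weight shards that were split along the output (column) axis.

    Going from model parallel size 4 to 1 requires, for every partitioned
    weight, concatenating each rank's slice back into the full matrix.
    (Illustrative sketch; the real change_mp.py handles many weight layouts.)
    """
    return np.concatenate(shards, axis=1)

# A full 8x8 weight split across 4 model-parallel ranks (2 columns each).
full = np.arange(64, dtype=np.float32).reshape(8, 8)
shards = np.split(full, 4, axis=1)

merged = merge_column_parallel(shards)
assert np.array_equal(merged, full)  # mp=4 shards reassemble into the mp=1 weight
```

A conversion bug in this kind of script (e.g. concatenating along the wrong axis, or in the wrong rank order) produces a checkpoint that loads without errors but generates garbage, which matches the symptom discussed in this thread.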

@Jiaxin-Wen
Member

Hi, I re-downloaded the model we uploaded to BAAI and ran eva_inference_interactive_beam.sh; it runs normally, and I could not reproduce your problem.
Is the checkpoints/eva2.0_4 file you mentioned the original downloaded file, or did you modify it in some way?

@jiangliqin

jiangliqin commented Mar 21, 2022

checkpoints/eva2.0_4 is the original downloaded file; I only changed its name.
(screenshot of the checkpoint directory attached)

@Jiaxin-Wen
Member

Could you post the full log from the run?

@jiangliqin

`python -m torch.distributed.launch --master_port 1256 --nproc_per_node 1 /mnt/src/eva_interactive.py --model-config /mnt/src/configs/model/eva2.0_model_config.json --model-parallel-size 1 --load /mnt/checkpoints/eva2.0 --no_load_strict --distributed-backend nccl --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /mnt/bpe_dialog_new --temperature 0.9 --top_k 0 --top_p 0.9 --num-beams 4 --length-penalty 1.6 --repetition-penalty 1.6 --rule-path /mnt/rules --fp16 --deepspeed --deepspeed_config /mnt/src/configs/deepspeed/eva_ds_config.json
Loading Model ...
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-03-21 05:40:58,880] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3140 and data parallel seed: 422
building Enc-Dec model ...
number of parameters on model parallel rank 0: 2841044992
DeepSpeed is enabled.
[2022-03-21 05:41:46,998] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.9+59e4dbb, git-hash=59e4dbb, git-branch=master
[2022-03-21 05:41:47,031] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fc44d1fc0d0>
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] allreduce_always_fp32 ........ False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] amp_enabled .................. False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] amp_params ................... False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] disable_allgather ............ False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] dump_state ................... False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 2000, 'delayed_shift': 4, 'min_scale': 256}
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] elasticity_enabled ........... False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] fp16_enabled ................. True
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] global_rank .................. 0
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] gradient_accumulation_steps .. 1
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] gradient_clipping ............ 1.0
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] gradient_predivide_factor .... 1.0
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] initial_dynamic_scale ........ 65536
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] loss_scale ................... 0
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] memory_breakdown ............. False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] optimizer_legacy_fusion ...... False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] optimizer_name ............... None
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] optimizer_params ............. None
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] pld_enabled .................. False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] pld_params ................... False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] prescale_gradients ........... False
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] scheduler_name ............... None
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] scheduler_params ............. None
[2022-03-21 05:41:47,032] [INFO] [config.py:709:print] sparse_attention ............. None
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] sparse_gradients_enabled ..... False
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] steps_per_print .............. 10
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] tensorboard_enabled .......... False
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] tensorboard_output_path ......
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] train_batch_size ............. 32
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] train_micro_batch_size_per_gpu 32
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] wall_clock_breakdown ......... True
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] world_size ................... 1
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] zero_allow_untested_optimizer True
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": false,
"cpu_offload": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 1
}
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] zero_enabled ................. True
[2022-03-21 05:41:47,033] [INFO] [config.py:709:print] zero_optimization_stage ...... 1
[2022-03-21 05:41:47,033] [INFO] [config.py:711:print] json = {
"activation_checkpointing":{
"contiguous_memory_optimization":false,
"partition_activations":false
},
"fp16":{
"enabled":true,
"hysteresis":4,
"initial_scale_power":16,
"loss_scale":0,
"loss_scale_window":2000,
"min_loss_scale":256
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"steps_per_print":10,
"train_micro_batch_size_per_gpu":32,
"wall_clock_breakdown":true,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"stage":1
}
}
[2022-03-21 05:41:47,036] [INFO] [engine.py:1286:_load_checkpoint] rank: 0 loading checkpoint: /mnt/checkpoints/eva2.0/1/mp_rank_00_model_states.pt
[2022-03-21 05:41:53,622] [WARNING] [engine.py:1384:_get_all_zero_checkpoints] Client provided zero checkpoint load paths: ['/mnt/checkpoints/eva2.0/1/zero_pp_rank_0_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_1_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_2_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_3_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_4_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_5_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_6_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_7_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_8_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_9_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_10_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_11_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_12_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_13_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_14_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_15_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_16_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_17_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_18_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_19_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_20_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_21_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_22_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_23_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_24_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_25_mp_rank_00optim_states.pt', 
'/mnt/checkpoints/eva2.0/1/zero_pp_rank_26_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_27_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_28_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_29_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_30_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_31_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_32_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_33_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_34_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_35_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_36_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_37_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_38_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_39_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_40_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_41_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_42_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_43_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_44_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_45_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_46_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_47_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_48_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_49_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_50_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_51_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_52_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_53_mp_rank_00optim_states.pt', 
'/mnt/checkpoints/eva2.0/1/zero_pp_rank_54_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_55_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_56_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_57_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_58_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_59_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_60_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_61_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_62_mp_rank_00optim_states.pt', '/mnt/checkpoints/eva2.0/1/zero_pp_rank_63_mp_rank_00optim_states.pt'] does not exist
successfully loaded /mnt/checkpoints/eva2.0/1/mp_rank_00_model_states.pt
Model Loaded!`

@Jiaxin-Wen
Member

The log looks fine.
The original checkpoint already has model parallel size 1, so please try loading the original files directly, without running change_mp.py on them.

Also, to confirm: are you using docker?

@jiangliqin

Does it matter which of eva2.0_model_config.json, eva2.0_base_model_config.json, and eva2.0_large_model_config.json I use?

@jiangliqin

OK, I'll try the original model. Yes, I'm using docker.

@Jiaxin-Wen
Member

Does it matter which of eva2.0_model_config.json, eva2.0_base_model_config.json, and eva2.0_large_model_config.json I use?

  • eva2.0_model_config.json corresponds to the xLarge (2.8B) version of EVA2.0.
  • base and large correspond to the base (300M) and large (700M) versions of EVA2.0 respectively; we will release these two smaller models later as well.

@jiangliqin

It was the change of model parallel size that caused the problem; the model chats normally now.

@jiangliqin

I seem to remember the default parallel size used to be 4, so I changed it out of habit. Thanks for your patient help!

@Jiaxin-Wen
Member

It was the change of model parallel size that caused the problem; the model chats normally now.

We have since fixed the change_mp.py script; please pull the latest version and try it. In my testing it should no longer cause this problem.

@jiangliqin

It was the change of model parallel size that caused the problem; the model chats normally now.

We have since fixed the change_mp.py script; please pull the latest version and try it. In my testing it should no longer cause this problem.

OK, thanks.

@Jiaxin-Wen Jiaxin-Wen changed the title from "Will the EVA2.0 model files be released?" to "EVA2.0 model files" Mar 21, 2022
@jiangliqin

The paper mentions several issues with the model: consistency, knowledge, safety, and empathy. Are there plans to improve these later?

@t1101675
Member

We are working on improving these, but since they are fairly cutting-edge research problems with a lot of uncertainty, it is not yet decided when an improved version will be released.

@thu-coai thu-coai deleted a comment from jiangliqin Mar 22, 2022
@jiangliqin

About the combination of beam search and top-p sampling hyperparameters: if I want more stable generations with less diversity, how should I adjust the parameters? Tuning the different parameters, I find there is a trade-off between stability and diversity.

@t1101675
Member

You can try lowering the temperature parameter.
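For readers tuning these knobs: lowering the temperature divides the logits by a smaller value before the softmax, which concentrates probability on the top tokens (more stable, less diverse), while lowering top_p shrinks the nucleus of candidate tokens. A minimal illustrative numpy sketch (not EVA's actual sampling code):

```python
import numpy as np

def sample_probs(logits, temperature=1.0, top_p=1.0):
    """Turn raw logits into a sampling distribution.

    temperature < 1 sharpens the distribution (less diverse, more stable);
    top_p < 1 keeps only the smallest set of tokens whose cumulative
    probability reaches top_p (nucleus sampling). Illustrative sketch only.
    """
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # shift by max for stability
    probs /= probs.sum()
    # Nucleus filtering: zero out tokens outside the top-p probability mass.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
sharp = sample_probs(logits, temperature=0.5)
flat = sample_probs(logits, temperature=1.5)
assert sharp[0] > flat[0]  # lower temperature puts more mass on the top token
```

In the inference command above, the same intent would mean lowering --temperature (and optionally --top_p), at the cost of more repetitive outputs.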

4 participants