
C++ Adam speed #67

Closed
feifeibear opened this issue Aug 27, 2021 · 5 comments

Comments

@feifeibear
Collaborator

feifeibear commented Aug 27, 2021

Performance results from Aug 10:
log.GPT2small_gpu_1_cs_64_bs_128_cpueb_1_margin_0.8_warmup_0.2_gpu_0.8_adamcvt_1

2021-08-10:14:34:53,509 INFO [memory_monitor.py:65] CPU Virtual Memory: used = 15.08 GB, percent = 96.6%
2021-08-10:14:34:53,509 INFO [test_bert.py:223] ckp True fp16 True ps True: step elapse 5.177955627441406 sec/iter, 18.463766371092152 Tflops
2021-08-10:14:34:53,509 INFO [test_bert.py:225] model 0.72940493
2021-08-10:14:34:53,509 INFO [global_timer.py:45] *********** PROFILE RESULTS *************
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_prepare_device, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_allocate_payload, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access, 0.019408226013183594, 0.338427821424322 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release, 0.014924049377441406, 0.2602357121256555 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_cpu_gpu_move, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access_dist, 0.03873419761657715, 0.6754213447995139 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release_dist, 0.3606679439544678, 6.289089298897653 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_gpu_cpu_move, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_chunk_move, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] FWD, 0.28232502937316895, 4.9229973187357 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] BWD, 2.9886157512664795, 52.1135067722565 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy, 0.2039637565612793, 3.5565852198787224 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data, 0.22702884674072266, 3.958779022397416 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_compute, 0.013135433197021484, 0.2290470049819615 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_param_fp32_to_fp16, 0.5844182968139648, 10.190700111226695 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_release_data, 0.016661882400512695, 0.29053889612597344 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM, 0.9849364757537842, 17.174671477149886 %
2021-08-10:14:34:53,509 INFO [global_timer.py:76] *********** DATA MOVE RESULTS *************
2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_cpu_gpu_move: 0.0 MB
2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_gpu_cpu_move: 0.0 MB
2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy: 2782.4589920043945 MB, 393 times, 13641.92854120348 MB/s
2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_param_fp32_to_fp16: 2782.4589920043945 MB, 393 times, 4761.0744002597885 MB/s

@feifeibear
Collaborator Author

Current results on the develop branch:

2021-08-27:19:18:33,294 INFO [test_bert.py:236] ckp True fp16 True ps True: step elapse 5.9073121547698975 sec/iter, 16.184105474731595 Tflops
2021-08-27:19:18:33,294 INFO [test_bert.py:238] model 0.72940493
2021-08-27:19:18:33,295 INFO [global_timer.py:45] *********** PROFILE RESULTS *************
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CHUNK_LIST_prepare_device, 0, 0.0 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CHUNK_allocate_payload, 0, 0.0 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CLIENT_access, 0.02212691307067871, 0.33153822024058927 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CLIENT_release, 0.019531965255737305, 0.2926568644258493 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] chunk_cpu_gpu_move, 0, 0.0 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CLIENT_access_dist, 0.04144024848937988, 0.6209192482752114 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CLIENT_release_dist, 0.027328968048095703, 0.4094831212440634 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] chunk_gpu_cpu_move, 0, 0.0 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] CHUNK_LIST_chunk_move, 0, 0.0 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] FWD, 0.2834486961364746, 4.247048648242367 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] BWD, 3.0267438888549805, 45.351164838479654 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy, 0.1033015251159668, 1.5478166193218406 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] ADAM_prepare_data, 0.1308901309967041, 1.9611900195517777 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] ADAM_compute, 1.1683895587921143, 17.50654479602667 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] ADAM_param_fp32_to_fp16, 0.17122220993041992, 2.5655050284088605 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] ADAM_release_data, 0.02155280113220215, 0.3229360239155347 %
2021-08-27:19:18:33,295 INFO [global_timer.py:50] ADAM, 1.658038854598999, 24.843196571867583 %
2021-08-27:19:18:33,295 INFO [global_timer.py:76] *********** DATA MOVE RESULTS *************
2021-08-27:19:18:33,295 INFO [global_timer.py:86] chunk_cpu_gpu_move: 0.0 MB
2021-08-27:19:18:33,295 INFO [global_timer.py:86] chunk_gpu_cpu_move: 0.0 MB
2021-08-27:19:18:33,295 INFO [global_timer.py:83] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy: 1391.2294960021973 MB, 393 times, 13467.656885417677 MB/s
2021-08-27:19:18:33,295 INFO [global_timer.py:83] ADAM_param_fp32_to_fp16: 2782.4589920043945 MB, 393 times, 16250.572826592477 MB/s

@feifeibear
Collaborator Author

CPU model: AMD Ryzen 7 3700X 8-Core Processor.
The new ds_adam kernel seems much slower than DeepSpeed's.
I suggest adding a performance comparison to the unit tests.
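
A performance comparison inside a unit test could look roughly like the sketch below. It is only an illustration: `adam_step_reference`, `time_adam`, and `slowdown` are hypothetical names, not part of this repository or of DeepSpeed, and the plain-Python Adam step merely stands in for whichever kernels are being compared.

```python
import time

def adam_step_reference(params, grads, exp_avg, exp_avg_sq, step,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain-Python Adam update (in place); a stand-in for a kernel under test."""
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    for i, g in enumerate(grads):
        exp_avg[i] = beta1 * exp_avg[i] + (1.0 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1.0 - beta2) * g * g
        denom = (exp_avg_sq[i] / bias2) ** 0.5 + eps
        params[i] -= lr * (exp_avg[i] / bias1) / denom

def time_adam(step_fn, numel=10000, iters=5):
    """Best wall-clock time in seconds of one step_fn call over `iters` runs."""
    best = float("inf")
    for _ in range(iters):
        # Fresh buffers each run so every call does the same work.
        params = [0.5] * numel
        grads = [0.1] * numel
        exp_avg = [0.0] * numel
        exp_avg_sq = [0.0] * numel
        start = time.perf_counter()
        step_fn(params, grads, exp_avg, exp_avg_sq, 1)
        best = min(best, time.perf_counter() - start)
    return best

def slowdown(candidate_fn, baseline_fn, **kwargs):
    """Ratio of candidate time to baseline time; above 1.0 means slower."""
    return time_adam(candidate_fn, **kwargs) / time_adam(baseline_fn, **kwargs)
```

A test could then assert that `slowdown(new_kernel_step, baseline_step)` stays under a chosen threshold, with both step functions wrapping the real kernels; the threshold would have to be tuned per machine.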

@zhuzilin
Collaborator

I suspect this is caused by loss scale, because it effectively computes a sum over all the parameters... Try removing loss scale and see.
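
For context, a minimal sketch of why loss scaling can cost a full pass over the data: a common way to detect fp16 overflow is to sum every gradient value and check whether the result is finite. Whether the kernel discussed here does exactly this is an assumption; `has_overflow` and `unscale_or_skip` are hypothetical names used only for illustration.

```python
import math

def has_overflow(grads):
    """Sum all gradient values; an inf or NaN anywhere makes the sum non-finite."""
    total = 0.0
    for g in grads:  # touches every element, hence the extra cost
        total += g
    return not math.isfinite(total)

def unscale_or_skip(grads, loss_scale):
    """Return unscaled gradients, or None if the step should be skipped."""
    if has_overflow(grads):
        return None
    inv_scale = 1.0 / loss_scale
    return [g * inv_scale for g in grads]
```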

@feifeibear
Collaborator Author

If loss scale is the cause, it would be better to apply loss scale on the GPU, at the point where the backward pass produces the gradients.

@feifeibear
Collaborator Author

There is no significant difference in speed. The ADAM_compute time increased because it now includes the fp16->fp32 conversion time.
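
To see which timers moved between the two runs, the ADAM entries from the logs above can be compared side by side. The sketch below copies the values from the two logs, rounded to four decimals; the dictionary and function names are illustrative only.

```python
# ADAM timer values in seconds, rounded from the Aug 10 and Aug 27 logs.
aug10 = {
    "ADAM_prepare_data_fp16_grad_to_fp32_grad_copy": 0.2040,
    "ADAM_prepare_data": 0.2270,
    "ADAM_compute": 0.0131,
    "ADAM_param_fp32_to_fp16": 0.5844,
    "ADAM_release_data": 0.0167,
    "ADAM": 0.9849,
}
aug27 = {
    "ADAM_prepare_data_fp16_grad_to_fp32_grad_copy": 0.1033,
    "ADAM_prepare_data": 0.1309,
    "ADAM_compute": 1.1684,
    "ADAM_param_fp32_to_fp16": 0.1712,
    "ADAM_release_data": 0.0216,
    "ADAM": 1.6580,
}

def timer_deltas(before, after):
    """Per-timer change in seconds (after minus before)."""
    return {name: after[name] - before[name] for name in before}

delta = timer_deltas(aug10, aug27)
for name, change in delta.items():
    print(f"{name}: {change:+.4f} s")
```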
