Multi-GPU parallel training problem #11

Open
dodgaga opened this issue Nov 12, 2019 · 2 comments


dodgaga commented Nov 12, 2019

Below is the loss from multi-GPU parallel training:

image

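During the first epoch the loss and the corresponding acc look normal; the problem appears in the second epoch. I suspect something goes wrong when the parameters are merged?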

Author

dodgaga commented Nov 12, 2019

Text log:
batch size 16, parallel training on 4 GPUs

1055/1061 [============================>.] - ETA: 5s - loss: 3.5930 - acc: 0.0740
1056/1061 [============================>.] - ETA: 4s - loss: 3.5926 - acc: 0.0743
1057/1061 [============================>.] - ETA: 3s - loss: 3.5921 - acc: 0.0743
1058/1061 [============================>.] - ETA: 2s - loss: 3.5919 - acc: 0.0744
1059/1061 [============================>.] - ETA: 1s - loss: 3.5917 - acc: 0.0745
1060/1061 [============================>.] - ETA: 0s - loss: 3.5915 - acc: 0.0746
1061/1061 [==============================] - 919s 867ms/step - loss: 3.5911 - acc: 0.0748 - val_loss: 1.1921e-07 - val_acc: 0.0207
save weights file ./model_snapshots_multi/weights_000_0.0207.h5
Epoch 2/60

1/1061 [..............................] - ETA: 12:46 - loss: 3.2449 - acc: 0.1875
2/1061 [..............................] - ETA: 12:55 - loss: 3.3741 - acc: 0.1250
3/1061 [..............................] - ETA: 13:06 - loss: 3.2832 - acc: 0.1667
4/1061 [..............................] - ETA: 13:00 - loss: 3.2520 - acc: 0.1719
5/1061 [..............................] - ETA: 13:00 - loss: 3.2117 - acc: 0.1875
6/1061 [..............................] - ETA: 12:26 - loss: 3.3068 - acc: 0.1562
7/1061 [..............................] - ETA: 12:30 - loss: 2.8344 - acc: 0.1339
8/1061 [..............................] - ETA: 12:33 - loss: 2.4801 - acc: 0.1172
9/1061 [..............................] - ETA: 12:35 - loss: 2.2046 - acc: 0.1181
10/1061 [..............................] - ETA: 12:33 - loss: 1.9841 - acc: 0.1062
11/1061 [..............................] - ETA: 12:34 - loss: 1.8037 - acc: 0.0966
12/1061 [..............................] - ETA: 12:36 - loss: 1.6534 - acc: 0.0938
13/1061 [..............................] - ETA: 12:38 - loss: 1.5262 - acc: 0.0865
14/1061 [..............................] - ETA: 12:39 - loss: 1.4172 - acc: 0.0804
15/1061 [..............................] - ETA: 12:38 - loss: 1.3227 - acc: 0.0750
16/1061 [..............................] - ETA: 12:38 - loss: 1.2401 - acc: 0.0703
17/1061 [..............................] - ETA: 12:38 - loss: 1.1671 - acc: 0.0662
18/1061 [..............................] - ETA: 12:38 - loss: 1.1023 - acc: 0.0660
19/1061 [..............................] - ETA: 12:38 - loss: 1.0443 - acc: 0.0625
20/1061 [..............................] - ETA: 12:38 - loss: 0.9921 - acc: 0.0625
21/1061 [..............................] - ETA: 12:38 - loss: 0.9448 - acc: 0.0595
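
Two quick checks on the numbers in this log may help narrow down where things go wrong. The sketch below assumes the progress bar prints the running mean of the loss over the epoch (as Keras does by default) and that the loss clips predictions with an epsilon of 1e-7 in float32 (the Keras default, not confirmed for this project). The helper names are hypothetical and only used for illustration.

```python
import math
import struct

# Running-mean loss values copied from the Epoch 2 progress-bar lines above
# (steps 1..21).
running_mean = [
    3.2449, 3.3741, 3.2832, 3.2520, 3.2117, 3.3068, 2.8344,
    2.4801, 2.2046, 1.9841, 1.8037, 1.6534, 1.5262, 1.4172,
    1.3227, 1.2401, 1.1671, 1.1023, 1.0443, 0.9921, 0.9448,
]


def per_batch_from_running_mean(means):
    """Invert a running mean: batch_n = n * mean_n - (n - 1) * mean_(n-1)."""
    batches = []
    previous_total = 0.0
    for n, mean in enumerate(means, start=1):
        total = n * mean
        batches.append(total - previous_total)
        previous_total = total
    return batches


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def clipped_loss_floor(epsilon=1e-7):
    """-log(1 - epsilon), with the clip bound rounded to float32.

    Assumption: predictions are clipped to [epsilon, 1 - epsilon] before
    the log is taken, so this is the smallest nonzero per-sample loss.
    """
    return -math.log(to_float32(1.0 - epsilon))


per_batch = per_batch_from_running_mean(running_mean)
for step, loss in enumerate(per_batch, start=1):
    print(f"step {step:2d}: approx. batch loss {loss:7.4f}")

print(f"clipped-loss floor: {clipped_loss_floor():.4e}")
```

Under these assumptions, the recovered loss for the first six batches of epoch 2 is between about 3.0 and 3.8, and every batch from step 7 onward is zero to within the rounding of the printed values. The falling number on the progress bar is therefore a fixed sum divided by a growing step count, not gradual learning. The clipped-loss floor also formats as 1.1921e-07, the same value as the val_loss reported at the end of epoch 1, which may point to a degenerate loss value rather than a real validation result.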

@wusaifei
Owner

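@dodgaga Hi, could it be that you are not using pretrained weights? With pretrained weights, the accuracy would already be high in the first epoch. I recommend using pretrained weights.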
