Pretraining loss is increasing #124

Closed
huseinzol05 opened this issue Jul 5, 2019 · 18 comments
@huseinzol05

Right now I am pretraining for the Malay language. I have my own dataset collected from Wikipedia, social media, and public news. Everything runs fine, except that the loss keeps increasing:

I0704 18:35:04.634176 140514656708352 train_gpu.py:303] [500] | gnorm 1.29 lr 0.000249 | loss 7.61 | pplx 2012.05, bpc 10.9745
I0704 18:40:19.443927 140514656708352 train_gpu.py:303] [1000] | gnorm 1.14 lr 0.000249 | loss 7.49 | pplx 1798.12, bpc 10.8123
I0704 18:45:34.236683 140514656708352 train_gpu.py:303] [1500] | gnorm 1.20 lr 0.000248 | loss 7.52 | pplx 1843.51, bpc 10.8482
I0704 18:50:49.070508 140514656708352 train_gpu.py:303] [2000] | gnorm 1.16 lr 0.000248 | loss 7.53 | pplx 1855.24, bpc 10.8574
I0704 18:56:03.973169 140514656708352 train_gpu.py:303] [2500] | gnorm 0.80 lr 0.000247 | loss 7.50 | pplx 1809.21, bpc 10.8211
I0704 19:01:18.817846 140514656708352 train_gpu.py:303] [3000] | gnorm 0.68 lr 0.000246 | loss 7.47 | pplx 1751.64, bpc 10.7745
I0704 19:06:33.646725 140514656708352 train_gpu.py:303] [3500] | gnorm 0.68 lr 0.000246 | loss 7.50 | pplx 1813.33, bpc 10.8244
I0704 19:11:48.491064 140514656708352 train_gpu.py:303] [4000] | gnorm 0.63 lr 0.000245 | loss 7.48 | pplx 1765.44, bpc 10.7858
I0704 19:17:03.302957 140514656708352 train_gpu.py:303] [4500] | gnorm 0.54 lr 0.000244 | loss 7.40 | pplx 1643.27, bpc 10.6824
I0704 19:22:18.108561 140514656708352 train_gpu.py:303] [5000] | gnorm 0.43 lr 0.000244 | loss 7.48 | pplx 1768.99, bpc 10.7887
I0704 19:27:32.939702 140514656708352 train_gpu.py:303] [5500] | gnorm 0.52 lr 0.000243 | loss 7.41 | pplx 1647.01, bpc 10.6856
I0704 19:32:47.666982 140514656708352 train_gpu.py:303] [6000] | gnorm 0.58 lr 0.000243 | loss 7.44 | pplx 1700.44, bpc 10.7317
I0704 19:38:02.447965 140514656708352 train_gpu.py:303] [6500] | gnorm 0.47 lr 0.000242 | loss 7.42 | pplx 1669.42, bpc 10.7051
I0704 19:43:17.212873 140514656708352 train_gpu.py:303] [7000] | gnorm 0.56 lr 0.000241 | loss 7.43 | pplx 1692.20, bpc 10.7247
I0704 19:48:31.992203 140514656708352 train_gpu.py:303] [7500] | gnorm 0.54 lr 0.000241 | loss 7.47 | pplx 1759.98, bpc 10.7813
I0704 19:53:46.838080 140514656708352 train_gpu.py:303] [8000] | gnorm 0.40 lr 0.000240 | loss 7.42 | pplx 1675.03, bpc 10.7100
I0704 19:59:01.705397 140514656708352 train_gpu.py:303] [8500] | gnorm 0.60 lr 0.000239 | loss 7.45 | pplx 1713.91, bpc 10.7431
I0704 20:04:16.556568 140514656708352 train_gpu.py:303] [9000] | gnorm 0.31 lr 0.000239 | loss 7.45 | pplx 1717.98, bpc 10.7465
I0704 20:09:31.360584 140514656708352 train_gpu.py:303] [9500] | gnorm 0.31 lr 0.000238 | loss 7.42 | pplx 1667.16, bpc 10.7032
I0704 20:14:46.129139 140514656708352 train_gpu.py:303] [10000] | gnorm 0.32 lr 0.000238 | loss 7.41 | pplx 1658.86, bpc 10.6960
I0704 20:14:54.924502 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0704 20:20:09.735051 140514656708352 train_gpu.py:303] [10500] | gnorm 0.71 lr 0.000237 | loss 7.32 | pplx 1515.01, bpc 10.5651
I0704 20:25:24.431047 140514656708352 train_gpu.py:303] [11000] | gnorm 0.30 lr 0.000236 | loss 7.38 | pplx 1601.43, bpc 10.6451
I0704 20:30:39.190253 140514656708352 train_gpu.py:303] [11500] | gnorm 0.56 lr 0.000236 | loss 7.05 | pplx 1150.96, bpc 10.1686
I0704 20:35:54.004818 140514656708352 train_gpu.py:303] [12000] | gnorm 0.37 lr 0.000235 | loss 7.12 | pplx 1230.52, bpc 10.2651
I0704 20:41:08.760111 140514656708352 train_gpu.py:303] [12500] | gnorm 0.36 lr 0.000234 | loss 7.31 | pplx 1499.00, bpc 10.5498
I0704 20:46:23.480738 140514656708352 train_gpu.py:303] [13000] | gnorm 0.31 lr 0.000234 | loss 7.43 | pplx 1689.70, bpc 10.7226
I0704 20:51:38.286542 140514656708352 train_gpu.py:303] [13500] | gnorm 0.29 lr 0.000233 | loss 7.37 | pplx 1581.20, bpc 10.6268
I0704 20:56:53.045661 140514656708352 train_gpu.py:303] [14000] | gnorm 0.33 lr 0.000233 | loss 7.37 | pplx 1585.96, bpc 10.6311
I0704 21:02:07.842073 140514656708352 train_gpu.py:303] [14500] | gnorm 0.28 lr 0.000232 | loss 7.31 | pplx 1496.60, bpc 10.5475
I0704 21:07:22.611250 140514656708352 train_gpu.py:303] [15000] | gnorm 0.46 lr 0.000231 | loss 7.36 | pplx 1570.65, bpc 10.6171
I0704 21:12:37.345983 140514656708352 train_gpu.py:303] [15500] | gnorm 0.34 lr 0.000231 | loss 7.43 | pplx 1692.69, bpc 10.7251
I0704 21:17:52.026112 140514656708352 train_gpu.py:303] [16000] | gnorm 0.47 lr 0.000230 | loss 7.33 | pplx 1522.32, bpc 10.5721
I0704 21:23:06.814132 140514656708352 train_gpu.py:303] [16500] | gnorm 0.28 lr 0.000229 | loss 7.38 | pplx 1610.54, bpc 10.6533
I0704 21:28:21.642250 140514656708352 train_gpu.py:303] [17000] | gnorm 0.35 lr 0.000229 | loss 7.47 | pplx 1751.44, bpc 10.7743
I0704 21:33:36.417810 140514656708352 train_gpu.py:303] [17500] | gnorm 0.26 lr 0.000228 | loss 7.66 | pplx 2127.10, bpc 11.0547
I0704 21:38:51.233461 140514656708352 train_gpu.py:303] [18000] | gnorm 0.48 lr 0.000228 | loss 7.64 | pplx 2081.33, bpc 11.0233
I0704 21:44:06.010222 140514656708352 train_gpu.py:303] [18500] | gnorm 0.26 lr 0.000227 | loss 7.62 | pplx 2035.16, bpc 10.9909
I0704 21:49:20.783527 140514656708352 train_gpu.py:303] [19000] | gnorm 0.29 lr 0.000226 | loss 7.63 | pplx 2067.63, bpc 11.0138
I0704 21:54:35.625918 140514656708352 train_gpu.py:303] [19500] | gnorm 0.29 lr 0.000226 | loss 7.64 | pplx 2074.96, bpc 11.0189
I0704 21:59:50.491468 140514656708352 train_gpu.py:303] [20000] | gnorm 0.24 lr 0.000225 | loss 7.60 | pplx 2005.06, bpc 10.9694
I0704 21:59:58.156530 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0704 22:05:13.066222 140514656708352 train_gpu.py:303] [20500] | gnorm 0.37 lr 0.000224 | loss 7.61 | pplx 2023.57, bpc 10.9827
I0704 22:10:27.896057 140514656708352 train_gpu.py:303] [21000] | gnorm 0.23 lr 0.000224 | loss 7.59 | pplx 1972.14, bpc 10.9455
I0704 22:15:42.730550 140514656708352 train_gpu.py:303] [21500] | gnorm 0.25 lr 0.000223 | loss 7.64 | pplx 2081.29, bpc 11.0233
I0704 22:20:57.537832 140514656708352 train_gpu.py:303] [22000] | gnorm 0.28 lr 0.000223 | loss 7.65 | pplx 2098.58, bpc 11.0352
I0704 22:26:12.292067 140514656708352 train_gpu.py:303] [22500] | gnorm 0.28 lr 0.000222 | loss 7.60 | pplx 1996.29, bpc 10.9631
I0704 22:31:27.120922 140514656708352 train_gpu.py:303] [23000] | gnorm 0.31 lr 0.000221 | loss 7.60 | pplx 1990.51, bpc 10.9589
I0704 22:36:41.893491 140514656708352 train_gpu.py:303] [23500] | gnorm 0.36 lr 0.000221 | loss 7.63 | pplx 2064.10, bpc 11.0113
I0704 22:41:56.750755 140514656708352 train_gpu.py:303] [24000] | gnorm 0.24 lr 0.000220 | loss 7.61 | pplx 2026.16, bpc 10.9845
I0704 22:47:11.568091 140514656708352 train_gpu.py:303] [24500] | gnorm 0.33 lr 0.000219 | loss 7.61 | pplx 2012.64, bpc 10.9749
I0704 22:52:26.442100 140514656708352 train_gpu.py:303] [25000] | gnorm 0.44 lr 0.000219 | loss 7.63 | pplx 2068.76, bpc 11.0146
I0704 22:57:41.386162 140514656708352 train_gpu.py:303] [25500] | gnorm 0.24 lr 0.000218 | loss 7.61 | pplx 2014.27, bpc 10.9760
I0704 23:02:56.267807 140514656708352 train_gpu.py:303] [26000] | gnorm 0.22 lr 0.000218 | loss 7.72 | pplx 2255.77, bpc 11.1394
I0704 23:08:11.103214 140514656708352 train_gpu.py:303] [26500] | gnorm 0.27 lr 0.000217 | loss 7.93 | pplx 2782.29, bpc 11.4421
I0704 23:13:25.904230 140514656708352 train_gpu.py:303] [27000] | gnorm 0.23 lr 0.000216 | loss 7.96 | pplx 2875.61, bpc 11.4897
I0704 23:18:40.707365 140514656708352 train_gpu.py:303] [27500] | gnorm 0.35 lr 0.000216 | loss 7.92 | pplx 2748.53, bpc 11.4244
I0704 23:23:55.549267 140514656708352 train_gpu.py:303] [28000] | gnorm 0.21 lr 0.000215 | loss 7.89 | pplx 2671.65, bpc 11.3835
I0704 23:29:10.405974 140514656708352 train_gpu.py:303] [28500] | gnorm 0.39 lr 0.000215 | loss 7.90 | pplx 2684.09, bpc 11.3902
I0704 23:34:25.217416 140514656708352 train_gpu.py:303] [29000] | gnorm 0.39 lr 0.000214 | loss 8.00 | pplx 2978.12, bpc 11.5402
I0704 23:39:40.074820 140514656708352 train_gpu.py:303] [29500] | gnorm 0.31 lr 0.000213 | loss 7.96 | pplx 2865.29, bpc 11.4845
I0704 23:44:54.826035 140514656708352 train_gpu.py:303] [30000] | gnorm 0.26 lr 0.000213 | loss 7.98 | pplx 2925.37, bpc 11.5144
I0704 23:45:02.488555 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0704 23:50:17.176898 140514656708352 train_gpu.py:303] [30500] | gnorm 0.29 lr 0.000212 | loss 7.96 | pplx 2876.71, bpc 11.4902
I0704 23:55:31.884260 140514656708352 train_gpu.py:303] [31000] | gnorm 0.27 lr 0.000211 | loss 7.96 | pplx 2860.68, bpc 11.4821
I0705 00:00:46.648993 140514656708352 train_gpu.py:303] [31500] | gnorm 0.23 lr 0.000211 | loss 7.97 | pplx 2885.09, bpc 11.4944
I0705 00:06:01.459876 140514656708352 train_gpu.py:303] [32000] | gnorm 0.23 lr 0.000210 | loss 7.97 | pplx 2878.87, bpc 11.4913
I0705 00:11:16.248872 140514656708352 train_gpu.py:303] [32500] | gnorm 0.30 lr 0.000210 | loss 7.95 | pplx 2824.14, bpc 11.4636
I0705 00:16:31.057889 140514656708352 train_gpu.py:303] [33000] | gnorm 0.67 lr 0.000209 | loss 7.95 | pplx 2843.30, bpc 11.4734
I0705 00:21:45.813891 140514656708352 train_gpu.py:303] [33500] | gnorm 0.30 lr 0.000208 | loss 7.93 | pplx 2791.75, bpc 11.4470
I0705 00:27:00.544517 140514656708352 train_gpu.py:303] [34000] | gnorm 0.26 lr 0.000208 | loss 7.91 | pplx 2724.91, bpc 11.4120
I0705 00:32:15.378443 140514656708352 train_gpu.py:303] [34500] | gnorm 0.30 lr 0.000207 | loss 7.90 | pplx 2710.08, bpc 11.4041
I0705 00:37:30.203745 140514656708352 train_gpu.py:303] [35000] | gnorm 0.27 lr 0.000206 | loss 7.91 | pplx 2728.60, bpc 11.4139
I0705 00:42:45.133219 140514656708352 train_gpu.py:303] [35500] | gnorm 0.33 lr 0.000206 | loss 7.94 | pplx 2819.00, bpc 11.4610
I0705 00:47:59.879519 140514656708352 train_gpu.py:303] [36000] | gnorm 0.34 lr 0.000205 | loss 7.87 | pplx 2624.20, bpc 11.3577
I0705 00:53:14.583731 140514656708352 train_gpu.py:303] [36500] | gnorm 0.40 lr 0.000205 | loss 7.90 | pplx 2705.18, bpc 11.4015
I0705 00:58:29.223032 140514656708352 train_gpu.py:303] [37000] | gnorm 0.27 lr 0.000204 | loss 7.88 | pplx 2646.44, bpc 11.3698
I0705 01:03:43.887945 140514656708352 train_gpu.py:303] [37500] | gnorm 0.45 lr 0.000203 | loss 7.89 | pplx 2657.59, bpc 11.3759
I0705 01:08:58.620047 140514656708352 train_gpu.py:303] [38000] | gnorm 0.60 lr 0.000203 | loss 7.90 | pplx 2688.38, bpc 11.3925
I0705 01:14:13.428936 140514656708352 train_gpu.py:303] [38500] | gnorm 0.21 lr 0.000202 | loss 7.86 | pplx 2599.27, bpc 11.3439
I0705 01:19:28.241355 140514656708352 train_gpu.py:303] [39000] | gnorm 0.22 lr 0.000201 | loss 7.91 | pplx 2726.29, bpc 11.4127
I0705 01:24:43.089514 140514656708352 train_gpu.py:303] [39500] | gnorm 0.24 lr 0.000201 | loss 7.90 | pplx 2688.34, bpc 11.3925
I0705 01:29:57.828010 140514656708352 train_gpu.py:303] [40000] | gnorm 0.24 lr 0.000200 | loss 7.97 | pplx 2895.85, bpc 11.4998
I0705 01:30:05.489835 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0705 01:35:20.313300 140514656708352 train_gpu.py:303] [40500] | gnorm 0.48 lr 0.000200 | loss 7.97 | pplx 2893.86, bpc 11.4988
I0705 01:40:35.002449 140514656708352 train_gpu.py:303] [41000] | gnorm 0.27 lr 0.000199 | loss 7.89 | pplx 2680.88, bpc 11.3885
I0705 01:45:49.796684 140514656708352 train_gpu.py:303] [41500] | gnorm 0.31 lr 0.000198 | loss 7.89 | pplx 2676.68, bpc 11.3862
I0705 01:51:04.649299 140514656708352 train_gpu.py:303] [42000] | gnorm 0.30 lr 0.000198 | loss 7.91 | pplx 2726.73, bpc 11.4130
I0705 01:56:19.464629 140514656708352 train_gpu.py:303] [42500] | gnorm 0.60 lr 0.000197 | loss 7.83 | pplx 2525.32, bpc 11.3023
I0705 02:01:34.198068 140514656708352 train_gpu.py:303] [43000] | gnorm 0.38 lr 0.000196 | loss 7.93 | pplx 2770.81, bpc 11.4361
I0705 02:06:48.962699 140514656708352 train_gpu.py:303] [43500] | gnorm 0.60 lr 0.000196 | loss 7.93 | pplx 2783.54, bpc 11.4427

The first 12k steps look fine, but after that the loss increases. Is this normal or not?
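
The pplx and bpc columns in these logs are simple transforms of the loss column, so all three curves tell the same story. As a minimal sketch (assuming the logged loss is the mean token-level cross-entropy in nats):

import math

def metrics_from_loss(loss_nats):
    """Derive perplexity and bits-per-token from a mean cross-entropy loss in nats."""
    pplx = math.exp(loss_nats)      # perplexity = e^loss
    bpc = loss_nats / math.log(2)   # bits per (sub)token = the loss expressed in bits
    return pplx, bpc

# Check against the first log line above (loss 7.61 -> pplx ~2012, bpc ~10.97):
print(metrics_from_loss(7.607))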

@bzantium

bzantium commented Jul 6, 2019

Can I see your command line for running train_gpu.py?

@SuMarsss

SuMarsss commented Jul 6, 2019

Can I see your command line for running train_gpu.py?

I have the same issue; below are my parameters.

python -u train_gpu.py \
  --record_info_dir=data/tfrecords \
  --model_dir=data \
  --train_batch_size=4 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=74 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=8 \
  --d_head=16 \
  --d_inner=2048 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --uncased=True \
  --num_core_per_host=2 \
 $@

The loss is about 5.76, which is still increasing and too high.

I0706 14:36:55.499520 139664167343936 train_gpu.py:300] [98000] | gnorm 1.15 lr 0.000002 | loss 5.76 | pplx 318.14, bpc 8.3135
I0706 14:37:05.705401 139664167343936 train_gpu.py:306] Model saved in path: data/model.ckpt
I0706 14:38:14.884466 139664167343936 train_gpu.py:300] [98100] | gnorm 1.20 lr 0.000002 | loss 5.78 | pplx 323.11, bpc 8.3359
I0706 14:39:24.123670 139664167343936 train_gpu.py:300] [98200] | gnorm 1.28 lr 0.000002 | loss 5.74 | pplx 309.57, bpc 8.2741
I0706 14:40:33.218821 139664167343936 train_gpu.py:300] [98300] | gnorm 1.18 lr 0.000002 | loss 5.78 | pplx 323.92, bpc 8.3395
I0706 14:41:42.791546 139664167343936 train_gpu.py:300] [98400] | gnorm 1.09 lr 0.000002 | loss 5.83 | pplx 339.70, bpc 8.4081
I0706 14:42:51.939918 139664167343936 train_gpu.py:300] [98500] | gnorm 1.16 lr 0.000002 | loss 5.79 | pplx 327.70, bpc 8.3562
I0706 14:44:01.089597 139664167343936 train_gpu.py:300] [98600] | gnorm 1.08 lr 0.000001 | loss 5.82 | pplx 337.25, bpc 8.3977
I0706 14:45:10.066946 139664167343936 train_gpu.py:300] [98700] | gnorm 1.42 lr 0.000001 | loss 5.75 | pplx 313.39, bpc 8.2918
I0706 14:46:19.434098 139664167343936 train_gpu.py:300] [98800] | gnorm 1.15 lr 0.000001 | loss 5.79 | pplx 328.49, bpc 8.3597
I0706 14:47:28.619400 139664167343936 train_gpu.py:300] [98900] | gnorm 1.18 lr 0.000001 | loss 5.79 | pplx 326.12, bpc 8.3493
I0706 14:48:37.849606 139664167343936 train_gpu.py:300] [99000] | gnorm 1.13 lr 0.000001 | loss 5.81 | pplx 333.03, bpc 8.3795
I0706 14:48:48.501792 139664167343936 train_gpu.py:306] Model saved in path: data/model.ckpt
I0706 14:49:57.396394 139664167343936 train_gpu.py:300] [99100] | gnorm 1.14 lr 0.000001 | loss 5.83 | pplx 339.86, bpc 8.4088
I0706 14:51:06.429167 139664167343936 train_gpu.py:300] [99200] | gnorm 1.12 lr 0.000001 | loss 5.84 | pplx 344.38, bpc 8.4279
I0706 14:52:15.260298 139664167343936 train_gpu.py:300] [99300] | gnorm 1.11 lr 0.000001 | loss 5.82 | pplx 337.40, bpc 8.3983
I0706 14:53:24.383842 139664167343936 train_gpu.py:300] [99400] | gnorm 1.22 lr 0.000001 | loss 5.82 | pplx 337.58, bpc 8.3991
I0706 14:54:33.653107 139664167343936 train_gpu.py:300] [99500] | gnorm 1.08 lr 0.000001 | loss 5.80 | pplx 331.66, bpc 8.3735
I0706 14:55:43.001544 139664167343936 train_gpu.py:300] [99600] | gnorm 1.16 lr 0.000000 | loss 5.81 | pplx 334.59, bpc 8.3862
I0706 14:56:52.218285 139664167343936 train_gpu.py:300] [99700] | gnorm 1.26 lr 0.000000 | loss 5.79 | pplx 326.63, bpc 8.3515
I0706 14:58:01.616966 139664167343936 train_gpu.py:300] [99800] | gnorm 1.49 lr 0.000000 | loss 5.82 | pplx 335.38, bpc 8.3897
I0706 14:59:10.785248 139664167343936 train_gpu.py:300] [99900] | gnorm 1.41 lr 0.000000 | loss 5.79 | pplx 328.28, bpc 8.3588
I0706 15:00:20.072597 139664167343936 train_gpu.py:300] [100000] | gnorm 1.04 lr 0.000000 | loss 5.78 | pplx 324.33, bpc 8.3413
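
Note that the learning rate in these logs has already decayed to nearly zero by step ~98k, which is consistent with a linear decay of the peak learning rate over train_steps. A minimal sketch of such a warmup-plus-linear-decay schedule (peak_lr and warmup_steps here are assumptions, not values read from train_gpu.py):

def linear_decay_lr(step, peak_lr=1e-4, warmup_steps=0, total_steps=100_000):
    """Warmup, then linear decay of the learning rate to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - progress)

print(linear_decay_lr(98_000))  # ~2e-6, matching the "lr 0.000002" lines above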

@huseinzol05
Author

After 200k steps, the loss is still not decreasing:

I0706 02:15:43.688152 140514656708352 train_gpu.py:303] [181000] | gnorm 0.35 lr 0.000025 | loss 7.44 | pplx 1709.01, bpc 10.7389
I0706 02:20:58.767530 140514656708352 train_gpu.py:303] [181500] | gnorm 1.21 lr 0.000024 | loss 7.42 | pplx 1675.25, bpc 10.7102
I0706 02:26:13.828760 140514656708352 train_gpu.py:303] [182000] | gnorm 0.88 lr 0.000023 | loss 7.40 | pplx 1636.98, bpc 10.6768
I0706 02:31:28.962370 140514656708352 train_gpu.py:303] [182500] | gnorm 0.52 lr 0.000023 | loss 7.40 | pplx 1637.76, bpc 10.6775
I0706 02:36:43.978382 140514656708352 train_gpu.py:303] [183000] | gnorm 0.38 lr 0.000022 | loss 7.35 | pplx 1552.99, bpc 10.6008
I0706 02:41:59.120646 140514656708352 train_gpu.py:303] [183500] | gnorm 0.34 lr 0.000022 | loss 7.44 | pplx 1707.32, bpc 10.7375
I0706 02:47:14.255108 140514656708352 train_gpu.py:303] [184000] | gnorm 0.27 lr 0.000021 | loss 7.37 | pplx 1585.88, bpc 10.6311
I0706 02:52:29.404605 140514656708352 train_gpu.py:303] [184500] | gnorm 0.57 lr 0.000020 | loss 7.52 | pplx 1837.07, bpc 10.8432
I0706 02:57:44.552990 140514656708352 train_gpu.py:303] [185000] | gnorm 0.47 lr 0.000020 | loss 7.44 | pplx 1699.02, bpc 10.7305
I0706 03:02:59.544211 140514656708352 train_gpu.py:303] [185500] | gnorm 0.37 lr 0.000019 | loss 7.39 | pplx 1623.65, bpc 10.6650
I0706 03:08:14.601415 140514656708352 train_gpu.py:303] [186000] | gnorm 0.51 lr 0.000018 | loss 7.41 | pplx 1644.99, bpc 10.6839
I0706 03:13:29.635721 140514656708352 train_gpu.py:303] [186500] | gnorm 0.24 lr 0.000018 | loss 7.43 | pplx 1677.59, bpc 10.7122
I0706 03:18:44.677387 140514656708352 train_gpu.py:303] [187000] | gnorm 0.55 lr 0.000017 | loss 7.42 | pplx 1666.01, bpc 10.7022
I0706 03:23:59.776241 140514656708352 train_gpu.py:303] [187500] | gnorm 0.49 lr 0.000017 | loss 7.37 | pplx 1592.12, bpc 10.6367
I0706 03:29:14.835679 140514656708352 train_gpu.py:303] [188000] | gnorm 0.78 lr 0.000016 | loss 7.40 | pplx 1641.50, bpc 10.6808
I0706 03:34:29.897017 140514656708352 train_gpu.py:303] [188500] | gnorm 0.99 lr 0.000015 | loss 7.46 | pplx 1738.75, bpc 10.7638
I0706 03:39:45.000864 140514656708352 train_gpu.py:303] [189000] | gnorm 1.16 lr 0.000015 | loss 7.40 | pplx 1627.83, bpc 10.6687
I0706 03:45:00.103481 140514656708352 train_gpu.py:303] [189500] | gnorm 0.61 lr 0.000014 | loss 7.41 | pplx 1647.68, bpc 10.6862
I0706 03:50:15.244863 140514656708352 train_gpu.py:303] [190000] | gnorm 0.92 lr 0.000013 | loss 7.43 | pplx 1687.13, bpc 10.7204
I0706 03:50:22.979946 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0706 03:55:38.055033 140514656708352 train_gpu.py:303] [190500] | gnorm 0.42 lr 0.000013 | loss 7.40 | pplx 1640.72, bpc 10.6801
I0706 04:00:53.074167 140514656708352 train_gpu.py:303] [191000] | gnorm 0.38 lr 0.000012 | loss 7.41 | pplx 1647.22, bpc 10.6858
I0706 04:06:08.058250 140514656708352 train_gpu.py:303] [191500] | gnorm 0.52 lr 0.000012 | loss 7.44 | pplx 1700.88, bpc 10.7321
I0706 04:11:23.095156 140514656708352 train_gpu.py:303] [192000] | gnorm 0.29 lr 0.000011 | loss 7.45 | pplx 1718.52, bpc 10.7470
I0706 04:16:38.141488 140514656708352 train_gpu.py:303] [192500] | gnorm 2.47 lr 0.000010 | loss 7.37 | pplx 1582.47, bpc 10.6280
I0706 04:21:53.163613 140514656708352 train_gpu.py:303] [193000] | gnorm 0.30 lr 0.000010 | loss 7.44 | pplx 1696.28, bpc 10.7282
I0706 04:27:08.224829 140514656708352 train_gpu.py:303] [193500] | gnorm 0.97 lr 0.000009 | loss 7.36 | pplx 1576.44, bpc 10.6225
I0706 04:32:23.283202 140514656708352 train_gpu.py:303] [194000] | gnorm 1.07 lr 0.000008 | loss 7.40 | pplx 1634.35, bpc 10.6745
I0706 04:37:38.382075 140514656708352 train_gpu.py:303] [194500] | gnorm 0.75 lr 0.000008 | loss 7.38 | pplx 1600.65, bpc 10.6444
I0706 04:42:53.470290 140514656708352 train_gpu.py:303] [195000] | gnorm 0.37 lr 0.000007 | loss 7.41 | pplx 1648.10, bpc 10.6866
I0706 04:48:08.537125 140514656708352 train_gpu.py:303] [195500] | gnorm 0.49 lr 0.000007 | loss 7.42 | pplx 1673.96, bpc 10.7090
I0706 04:53:23.645965 140514656708352 train_gpu.py:303] [196000] | gnorm 0.52 lr 0.000006 | loss 7.40 | pplx 1640.49, bpc 10.6799
I0706 04:58:38.749410 140514656708352 train_gpu.py:303] [196500] | gnorm 0.52 lr 0.000005 | loss 7.41 | pplx 1646.09, bpc 10.6848
I0706 05:03:53.834642 140514656708352 train_gpu.py:303] [197000] | gnorm 0.75 lr 0.000005 | loss 7.43 | pplx 1684.26, bpc 10.7179
I0706 05:09:08.898273 140514656708352 train_gpu.py:303] [197500] | gnorm 0.83 lr 0.000004 | loss 7.38 | pplx 1600.85, bpc 10.6446
I0706 05:14:23.854666 140514656708352 train_gpu.py:303] [198000] | gnorm 0.59 lr 0.000003 | loss 7.38 | pplx 1609.59, bpc 10.6525
I0706 05:19:38.868615 140514656708352 train_gpu.py:303] [198500] | gnorm 1.34 lr 0.000003 | loss 7.30 | pplx 1484.79, bpc 10.5360
I0706 05:24:53.900978 140514656708352 train_gpu.py:303] [199000] | gnorm 0.85 lr 0.000002 | loss 7.36 | pplx 1578.47, bpc 10.6243
I0706 05:30:08.955977 140514656708352 train_gpu.py:303] [199500] | gnorm 0.69 lr 0.000002 | loss 7.18 | pplx 1316.83, bpc 10.3629
I0706 05:35:24.005298 140514656708352 train_gpu.py:303] [200000] | gnorm 0.48 lr 0.000001 | loss 7.25 | pplx 1404.16, bpc 10.4555

This is my command for running train_gpu.py:

python3 train_gpu.py \
  --corpus_info_path=save-location/corpus_info.json \
  --record_info_dir=save-location/tfrecords \
  --train_batch_size=4 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=12 \
  --d_model=512 \
  --d_embed=512 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=2048 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --model_dir=output-model \
  --uncased=True \
  --num_core_per_host=1 \
  --train_steps=200000

So it sounds like train_batch_size=4 is really important?

@huseinzol05
Author

After 200k steps, I tried my luck finetuning on just 200 sentences, purposely keeping the dataset small to show that the model is able to overfit (which would mean it is able to learn). After running with 20 different random seeds, the loss stays the same:

learning_rate = 5e-5
batch_size = 10
MAX_SEQ_LENGTH = 128

I just duplicated the same code from my notebook for finetuning XLNet-Large, https://github.com/huseinzol05/NLP-Models-Tensorflow/blob/master/text-classification/72.xlnet-large.ipynb (which is able to learn properly).

train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.70it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.65it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.68it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.67it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.64it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.66it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.69it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.69it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.70it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.69it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.67it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.65it/s, accuracy=0.6, cost=0.678]
train minibatch loop: 100%|██████████| 20/20 [00:02<00:00, 8.66it/s, accuracy=0.6, cost=0.678]
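
The sanity check above (forcing the model to overfit a tiny dataset to prove it can learn at all) can be written down generically. A minimal sketch, assuming an ordinary Keras classifier rather than the actual XLNet finetuning code from the notebook:

import tensorflow as tf

def can_overfit_tiny_set(model, x, y, epochs=200, target_loss=0.05):
    """Train on a handful of examples; a healthy model should drive the loss near zero.
    A flat loss (like the constant cost=0.678 above) points to a problem with the
    checkpoint, the learning rate, or the data pipeline."""
    model.compile(optimizer=tf.keras.optimizers.Adam(5e-5),
                  loss="sparse_categorical_crossentropy")
    history = model.fit(x, y, batch_size=10, epochs=epochs, verbose=0)
    return history.history["loss"][-1] < target_loss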

@bzantium

bzantium commented Jul 6, 2019

I'm not sure since I haven't tried it with the Malay language; how about trying uncased=False? In my case, I'm training on Korean, which should be cased.

@SuMarsss

SuMarsss commented Jul 6, 2019

I'm not sure since I haven't tried it with the Malay language; how about trying uncased=False? In my case, I'm training on Korean, which should be cased.

Hi, could you show your parameters for train_gpu.py?

@bzantium

bzantium commented Jul 6, 2019

Hi, could you show your parameters for train_gpu.py?

Sure!

python train_gpu.py \
--num_core_per_host=2 \
--corpus_info_path=proc_data/processed_wiki/corpus_info.json \
--record_info_dir=proc_data/processed_wiki/tfrecords \
--train_batch_size=64 \
--seq_len=512 \
--reuse_len=256 \
--perm_size=256 \
--mem_len=384 \
--n_layer=6 \
--d_model=768 \
--d_embed=768 \
--n_head=6 \
--d_head=64 \
--d_inner=3072 \
--untie_r=True \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--model_dir=model_output/model \
--use_tpu=False

@SuMarsss

SuMarsss commented Jul 6, 2019

Hi, could you show your parameters for train_gpu.py?

Sure!

python train_gpu.py \
--num_core_per_host=2 \
--corpus_info_path=proc_data/processed_wiki/corpus_info.json \
--record_info_dir=proc_data/processed_wiki/tfrecords \
--train_batch_size=64 \
--seq_len=512 \
--reuse_len=256 \
--perm_size=256 \
--mem_len=384 \
--n_layer=6 \
--d_model=768 \
--d_embed=768 \
--n_head=6 \
--d_head=64 \
--d_inner=3072 \
--untie_r=True \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--model_dir=model_output/model \
--use_tpu=False

Wow, your model seems quite big 👍. How much GPU memory do you have?
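
For a rough sense of scale, the size of the model above can be estimated from n_layer, d_model, n_head, d_head and d_inner; this is a back-of-the-envelope count that ignores biases, layer norms and XLNet's relative-attention extras, and the vocabulary size is an assumption:

def approx_transformer_params(n_layer, d_model, n_head, d_head, d_inner, vocab_size=32_000):
    """Very rough parameter count for a transformer language model."""
    attn = 4 * d_model * n_head * d_head   # Q, K, V and output projections
    ffn = 2 * d_model * d_inner            # the two feed-forward matrices
    embed = vocab_size * d_model           # token embedding (vocab size assumed)
    return n_layer * (attn + ffn) + embed

# The config above: 6 layers, d_model=768, 6 heads of size 64, d_inner=3072
print(approx_transformer_params(6, 768, 6, 64, 3072))  # roughly 6e7 parameters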

@bzantium

bzantium commented Jul 6, 2019

Wow, your model seems quite big 👍. How much GPU memory do you have?

I have 2 GPUs (Tesla V100) with 32GB of RAM each :)

@huseinzol05
Author

I tried increasing the batch size to 32, but had to reduce the sequence length, and tested on a very small dataset (100 sentences); surprisingly, the loss can go down to 0.4X. It seems like batch size is very important here. I pretrained a BERT model before, and because of hardware limitations my batch size was very small, yet the accuracy of that pretrained model was still very good; it also finetuned very well and beat multilingual BERT.

But for XLNet, I cannot achieve the same or similar results with a very small batch size.

@huseinzol05
Author

After changing the learning rate to --learning_rate=2.5e-5 on the small dataset, I achieve a loss lower than 0.99. I believe that is a good sign and I am going to try it on a larger dataset.

@huseinzol05
Author

I0706 14:43:50.303417 139845453784832 train_gpu.py:303] [10356] | gnorm 2.38 lr 0.000024 | loss 4.64 | pplx  103.34, bpc  6.6913
I0706 14:43:51.560345 139845453784832 train_gpu.py:303] [10358] | gnorm 2.44 lr 0.000024 | loss 4.85 | pplx  127.21, bpc  6.9911
I0706 14:43:52.817217 139845453784832 train_gpu.py:303] [10360] | gnorm 2.29 lr 0.000024 | loss 4.24 | pplx   69.74, bpc  6.1238
I0706 14:43:54.073326 139845453784832 train_gpu.py:303] [10362] | gnorm 2.34 lr 0.000024 | loss 4.21 | pplx   67.51, bpc  6.0769
I0706 14:43:55.329735 139845453784832 train_gpu.py:303] [10364] | gnorm 2.47 lr 0.000024 | loss 4.91 | pplx  135.14, bpc  7.0784
I0706 14:43:56.585866 139845453784832 train_gpu.py:303] [10366] | gnorm 2.45 lr 0.000024 | loss 4.84 | pplx  125.86, bpc  6.9757
I0706 14:43:57.840969 139845453784832 train_gpu.py:303] [10368] | gnorm 2.44 lr 0.000024 | loss 4.28 | pplx   72.35, bpc  6.1769
I0706 14:43:59.096822 139845453784832 train_gpu.py:303] [10370] | gnorm 2.51 lr 0.000024 | loss 4.53 | pplx   93.07, bpc  6.5402

After 10k steps, my loss is around 4.X, which looks good to me.

@3NFBAGDU

3NFBAGDU commented Jul 8, 2019

[Resource exhausted: OOM when allocating tensor with shape[512,4,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.],
Does this mean that my GPU memory is too small?
I have 8 GPUs, each with 11GB of RAM.

@SuMarsss

SuMarsss commented Jul 8, 2019

[Resource exhausted: OOM when allocating tensor with shape[512,4,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.],
Does this mean that my GPU memory is too small?
I have 8 GPUs, each with 11GB of RAM.

Yeah, you had better decrease parameters such as batch size and seq_len.
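
To put the OOM message in perspective: the tensor it names is small on its own, but an activation like it is kept for backpropagation in every layer and attention head, and each one scales linearly with batch size and sequence length, which is why reducing those two parameters helps. A minimal sketch of the per-tensor arithmetic (assuming float32, i.e. 4 bytes per element):

def tensor_megabytes(shape, bytes_per_elem=4):
    """Size of a single float32 activation tensor in MB."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / 1024 ** 2

# The tensor from the error message: shape [512, 4, 1024], i.e. (seq_len, batch, d_model)
print(tensor_megabytes([512, 4, 1024]))  # ~8 MB each; hundreds of these live on the GPU at once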

@huseinzol05
Author

Done with 200k steps.
Now trying to finetune for classification:

epoch: 0, pass acc: 0.000000, current acc: 0.891119
time taken: 104.25944423675537
epoch: 0, training loss: 0.322514, training acc: 0.865841, valid loss: 0.260226, valid acc: 0.891119
epoch: 1, pass acc: 0.891119, current acc: 0.891621
time taken: 98.91772723197937
epoch: 1, training loss: 0.157596, training acc: 0.939767, valid loss: 0.317146, valid acc: 0.891621

Looks perfect, closing this issue.

@3NFBAGDU

3NFBAGDU commented Jul 8, 2019

@bzantium
How long did your training take?

@bzantium

bzantium commented Jul 9, 2019

@bzantium
How long did your training take?

For 200k steps, it took 5 days with two 32GB Tesla V100 GPUs.

@3NFBAGDU

3NFBAGDU commented Jul 9, 2019

@bzantium
Thank you for your response.
For what reason are you using XLNet?
I would like to know whether your pretrained XLNet model was better than a BERT model.
I have used sent2vec and word2vec for my task, and I want to pretrain XLNet to improve on my previous models.

I have 1,600,000 sentences.
I ran this:

sudo python3 train_gpu.py \
  --record_info_dir=fix3/tfrecords \
  --train_batch_size=32 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=6 \
  --d_model=768 \
  --d_embed=768 \
  --n_head=6 \
  --d_head=64 \
  --d_inner=3072 \
  --untie_r=True \
  --model_dir=my_model \
  --uncased=False \
  --num_predict=85

Are these parameters enough to make a good model?
Thank you
