
please mention which GPU: "WATCHDOG: T-Rex has a problem with GPU, terminating..." #20

Closed
aleqx opened this issue May 18, 2019 · 13 comments

Comments

@aleqx

aleqx commented May 18, 2019

You give the error:

20190518 14:29:50 WARN: WATCHDOG: T-Rex has a problem with GPU, terminating...

But it doesn't say which GPU#, so I don't know which one is the culprit in order to reduce the overclock for that particular GPU only.

Could you please include the GPU# in the error?

@trexminer
Owner

There should be a message saying which GPU is idle prior to that. Please upgrade to 0.11.0 and send me the full log if the issue occurs again.

@aleqx
Author

aleqx commented May 20, 2019

That was the first error/warning message in the session. Regardless, I don't see a reason why you shouldn't display the GPU# in there (or don't you know it?).

@trexminer
Owner

If there was no message with GPU# as you've just said, then yes, the miner doesn't know which GPU caused the problem. The behaviour you're describing is not expected and appears to be a bug that needs investigation. If you start the miner with the --log-path trex.log -P parameters, it'll create a detailed log file which will help troubleshoot the issue.
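For example, a full command line with logging enabled might look something like this (just a sketch: the algorithm, pool URL and wallet below are placeholders, not values from this thread; only --log-path and -P are the options referred to above):

./t-rex -a bcd -o stratum+tcp://POOL_HOST:PORT -u YOUR_WALLET --log-path trex.log -P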

@aleqx
Author

aleqx commented Jun 9, 2019

Example from mining BCD. There were no Xid errors reported by the driver in kern.log either. It seems odd that the miner can't identify which card; it's definitely not all 9 cards as T-Rex seems to suggest. It's the only miner with this problem out of all the miners I have used in the past 2+ years.

[...]
20190609 13:33:52 GPU #0: Gigabyte GTX 1080 Ti - 38.31 MH/s
20190609 13:33:52 GPU #1: ASUS GTX 1070        - 22.72 MH/s
20190609 13:33:52 GPU #2: EVGA GTX 1070        - 21.59 MH/s
20190609 13:33:52 GPU #3: ASUS GTX 1070        - 22.95 MH/s
20190609 13:33:52 GPU #4: EVGA GTX 1080 Ti     - 38.42 MH/s
20190609 13:33:52 GPU #5: MSI GTX 1080 Ti      - 35.71 MH/s
20190609 13:33:52 GPU #6: Gigabyte GTX 1080 Ti - 38.40 MH/s
20190609 13:33:52 GPU #7: Gigabyte GTX 1080 Ti - 37.86 MH/s
20190609 13:33:52 GPU #8: EVGA GTX 1070        - 22.21 MH/s
20190609 13:33:52 Shares/min: 3.425 (Avr. 8.086)
20190609 13:33:52 Uptime: 2 hours 53 mins 8 secs | Algo: bcd | T-Rex v0.11.1
20190609 13:34:35 [ OK ] 1401/1401 - 278.40 MH/s, 201ms
20190609 13:34:36 [ OK ] 1402/1402 - 278.43 MH/s, 202ms
20190609 13:34:46 [ OK ] 1403/1403 - 278.49 MH/s, 202ms
20190609 13:35:35 Dev fee mined (44 secs)
20190609 13:35:56 WARN: GPU #0: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #1: ASUS GeForce GTX 1070 is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #2: EVGA GeForce GTX 1070 is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #3: ASUS GeForce GTX 1070 is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #4: EVGA GeForce GTX 1080 Ti is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #5: MSI GeForce GTX 1080 Ti is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #6: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #7: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 21 secs ago
20190609 13:35:56 WARN: GPU #8: EVGA GeForce GTX 1070 is idle, last activity was 21 secs ago
20190609 13:36:01 WARN: GPU #0: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #1: ASUS GeForce GTX 1070 is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #2: EVGA GeForce GTX 1070 is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #3: ASUS GeForce GTX 1070 is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #4: EVGA GeForce GTX 1080 Ti is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #5: MSI GeForce GTX 1080 Ti is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #6: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #7: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 26 secs ago
20190609 13:36:01 WARN: GPU #8: EVGA GeForce GTX 1070 is idle, last activity was 26 secs ago
20190609 13:36:06 WARN: GPU #0: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #1: ASUS GeForce GTX 1070 is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #2: EVGA GeForce GTX 1070 is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #3: ASUS GeForce GTX 1070 is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #4: EVGA GeForce GTX 1080 Ti is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #5: MSI GeForce GTX 1080 Ti is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #6: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #7: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 31 secs ago
20190609 13:36:06 WARN: GPU #8: EVGA GeForce GTX 1070 is idle, last activity was 31 secs ago
20190609 13:36:11 WARN: GPU #0: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #1: ASUS GeForce GTX 1070 is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #2: EVGA GeForce GTX 1070 is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #3: ASUS GeForce GTX 1070 is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #4: EVGA GeForce GTX 1080 Ti is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #5: MSI GeForce GTX 1080 Ti is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #6: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #7: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 36 secs ago
20190609 13:36:11 WARN: GPU #8: EVGA GeForce GTX 1070 is idle, last activity was 36 secs ago
20190609 13:36:16 WARN: GPU #0: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #1: ASUS GeForce GTX 1070 is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #2: EVGA GeForce GTX 1070 is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #3: ASUS GeForce GTX 1070 is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #4: EVGA GeForce GTX 1080 Ti is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #5: MSI GeForce GTX 1080 Ti is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #6: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #7: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 41 secs ago
20190609 13:36:16 WARN: GPU #8: EVGA GeForce GTX 1070 is idle, last activity was 41 secs ago
20190609 13:36:21 WARN: GPU #0: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #1: ASUS GeForce GTX 1070 is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #2: EVGA GeForce GTX 1070 is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #3: ASUS GeForce GTX 1070 is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #4: EVGA GeForce GTX 1080 Ti is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #5: MSI GeForce GTX 1080 Ti is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #6: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #7: Gigabyte GeForce GTX 1080 Ti is idle, last activity was 46 secs ago
20190609 13:36:21 WARN: GPU #8: EVGA GeForce GTX 1070 is idle, last activity was 46 secs ago
20190609 13:36:22 WARN: WATCHDOG: T-Rex has a problem with GPU, terminating...

@trexminer
Owner

We've been chasing an issue where the miner stops hashing after a dev fee session, and the log you provided indicates it might be the same issue. In this case, however, the watchdog correctly did its job and restarted the miner, but we would like to fix the root cause. Would you be willing to help us with the investigation? If so, we'll prepare a build that produces extra debugging info, so if you could run it and then send us the log file, that would be much appreciated. How long does it usually take for the problem to show itself? Which CUDA version do you use?

@aleqx
Author

aleqx commented Jun 10, 2019

CUDA 10.0

Sadly, my time to help with testing is very limited, and I don't mine with T-Rex all the time either, but I can give it a try (I'm just not promising anything).

@trexminer
Owner

Please try 0.12.0 when you have time; there is a chance that the error is fixed, although we are not 100% sure.

@CBDMINER

CBDMINER commented Dec 3, 2019

I have the same issue as outlined above. Is there any new information on how to resolve the error?

@OverchenkoDev

OverchenkoDev commented Dec 9, 2020

I'm using v0.19.1 and have the same issue. The problem occurs right after the miner starts. This is my output:

20201209 15:12:49 T-Rex NVIDIA GPU miner v0.19.1 - [CUDA v10.0]
20201209 15:12:49 r.99a7206c3590
20201209 15:12:49
20201209 15:12:49 NVIDIA Driver v450.80.02
20201209 15:12:49 CUDA devices available: 3
20201209 15:12:49
20201209 15:12:49 WARN: DevFee 1% (ethash)
20201209 15:12:49
20201209 15:12:49 URL : my_pool
20201209 15:12:49 USER: my_user
20201209 15:12:49 PASS:
20201209 15:12:49
20201209 15:12:49 Starting on: my_pool
20201209 15:12:49 ApiServer: HTTP server started on 0.0.0.0:4067
20201209 15:12:49 ----------------------------------------------------
20201209 15:12:49 For control navigate to: http://172.17.0.1:4067/trex
20201209 15:12:49 ----------------------------------------------------
20201209 15:12:49 ApiServer: Telnet server started on 127.0.0.1:3333
20201209 15:12:49 WARN: GPU #2(000600): MSI GeForce GTX 1070 Ti, intensity set to 22
20201209 15:12:49 WARN: GPU #0(000100): MSI GeForce GTX 1070 Ti, intensity set to 22
20201209 15:12:49 WARN: GPU #1(000500): MSI GeForce GTX 1070 Ti, intensity set to 22
20201209 15:12:54 Using protocol: stratum1.
20201209 15:12:54 Authorizing...
20201209 15:12:54 Authorized successfully.
20201209 15:13:10 WARN: GPU #0: MSI GeForce GTX 1070 Ti is idle, last activity was 20 secs ago
20201209 15:13:10 WARN: GPU #1: MSI GeForce GTX 1070 Ti is idle, last activity was 20 secs ago
20201209 15:13:10 WARN: GPU #2: MSI GeForce GTX 1070 Ti is idle, last activity was 20 secs ago
20201209 15:13:15 WARN: GPU #0: MSI GeForce GTX 1070 Ti is idle, last activity was 25 secs ago
20201209 15:13:15 WARN: GPU #1: MSI GeForce GTX 1070 Ti is idle, last activity was 25 secs ago
20201209 15:13:15 WARN: GPU #2: MSI GeForce GTX 1070 Ti is idle, last activity was 25 secs ago
20201209 15:13:20 WARN: GPU #0: MSI GeForce GTX 1070 Ti is idle, last activity was 30 secs ago
20201209 15:13:20 WARN: GPU #1: MSI GeForce GTX 1070 Ti is idle, last activity was 30 secs ago
20201209 15:13:20 WARN: GPU #2: MSI GeForce GTX 1070 Ti is idle, last activity was 30 secs ago
20201209 15:13:25 WARN: GPU #0: MSI GeForce GTX 1070 Ti is idle, last activity was 35 secs ago
20201209 15:13:25 WARN: GPU #1: MSI GeForce GTX 1070 Ti is idle, last activity was 35 secs ago
20201209 15:13:25 WARN: GPU #2: MSI GeForce GTX 1070 Ti is idle, last activity was 35 secs ago
20201209 15:13:30 WARN: GPU #0: MSI GeForce GTX 1070 Ti is idle, last activity was 40 secs ago
20201209 15:13:30 WARN: GPU #1: MSI GeForce GTX 1070 Ti is idle, last activity was 40 secs ago
20201209 15:13:30 WARN: GPU #2: MSI GeForce GTX 1070 Ti is idle, last activity was 40 secs ago
20201209 15:13:35 WARN: GPU #0: MSI GeForce GTX 1070 Ti is idle, last activity was 45 secs ago
20201209 15:13:35 WARN: GPU #1: MSI GeForce GTX 1070 Ti is idle, last activity was 45 secs ago
20201209 15:13:35 WARN: GPU #2: MSI GeForce GTX 1070 Ti is idle, last activity was 45 secs ago
20201209 15:13:36 WARN: WATCHDOG: T-Rex has a problem with GPU, terminating...
20201209 15:13:36 WARN: WATCHDOG: recovering T-Rex
20201209 15:13:38 T-Rex NVIDIA GPU miner v0.19.1 - [CUDA v10.0]
....
20201209 16:42:39 WARN: shutdown t-rex, signal [2] received
20201209 16:42:39 Main loop finished. Cleaning up resources...
20201209 16:42:39 ApiServer: stopped listening on 0.0.0.0:4067
20201209 16:42:39 ApiServer: stopped listening on 127.0.0.1:3333
terminate called after throwing an instance of 'std::runtime_error'
what(): wrong device nonce index

I tried adding different GPU parameters, such as indexing or GPU index settings, but it didn't help. How can I solve it?

@aleqx
Author

aleqx commented Dec 9, 2020

@OverchenkoDev It would be best if the miner showed which GPU, but it still doesn't do that. If you are on Linux, you can run grep Xid /var/log/kern.log | tail, since Xid errors are Nvidia hardware errors. They are usually overclocking problems (you pushed the o/c too far).

Note that the explanation of the Xid errors doesn't always help you work out which particular o/c setting you need to bring down (mem, gpu, pow).
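Each Xid line does include the PCI bus ID of the card that raised the error (the PCI:0000:BB:DD part of the message). As a rough sketch, assuming nvidia-smi is available on the rig, you can map that bus ID back to a GPU index:

# print each GPU index next to its PCI bus ID
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
# then filter for the bus shown in the Xid line, e.g. 01:00
nvidia-smi --query-gpu=index,pci.bus_id --format=csv | grep -i "01:00"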

@OverchenkoDev

OverchenkoDev commented Dec 9, 2020

@aleqx That's what I see after this command:

coser@rige0d55e8357e3:~$ grep Xid /var/log/kern.log | tail
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089746] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 0000000b 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089759] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 0000000c 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089772] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 0000000d 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089785] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 0000000e 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089798] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 0000000f 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089811] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 00000010 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089824] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 00000011 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089837] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 00000012 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089849] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 00000013 00000ffc ffffffff 00000007 00ffffff
Dec 9 12:34:44 rige0d55e8357e3 kernel: [ 7561.089862] NVRM: Xid (PCI:0000:01:00): 56, pid=2628, CMDre 00000014 00000ffc ffffffff 00000007 00ffffff

I find it difficult to understand this.

Also, I used benchmark mode and there were no problems. Output:

20201209 16:47:11 NVIDIA Driver v450.80.02
20201209 16:47:11 CUDA devices available: 3
20201209 16:47:11
20201209 16:47:11 WARN: BENCHMARK MODE (ethash)
20201209 16:47:11 WARN: EPOCH 1
20201209 16:47:11
20201209 16:47:11 WARN: GPU #0(000100): MSI GeForce GTX 1070 Ti, intensity set to 22
20201209 16:47:11 WARN: GPU #1(000500): MSI GeForce GTX 1070 Ti, intensity set to 22
20201209 16:47:11 WARN: GPU #2(000600): MSI GeForce GTX 1070 Ti, intensity set to 22
20201209 16:47:12 GPU #1: generating DAG 1.01 GB for epoch 1 ...
20201209 16:47:12 GPU #2: generating DAG 1.01 GB for epoch 1 ...
20201209 16:47:12 GPU #0: generating DAG 1.01 GB for epoch 1 ...
20201209 16:47:14 GPU #0: DAG generated [time: 1867 ms], memory left: 6.81 GB
20201209 16:47:14 GPU #2: DAG generated [time: 1887 ms], memory left: 6.81 GB
20201209 16:47:14 GPU #1: DAG generated [time: 1890 ms], memory left: 6.81 GB
20201209 16:47:31 GPU #0: using kernel #3
20201209 16:47:31 GPU #2: using kernel #3
20201209 16:47:31 GPU #1: using kernel #3
20201209 16:47:32 Total: 22.66 MH/s
20201209 16:47:34 Total: 67.98 MH/s
20201209 16:47:36 Total: 67.99 MH/s
20201209 16:47:38 Total: 67.99 MH/s
20201209 16:47:40 Found 1 share(s)
20201209 16:47:40 Found 1 share(s)
20201209 16:47:40 Found 1 share(s)
20201209 16:47:40 Total: 66.77 MH/s
20201209 16:47:41 Found 1 share(s)
20201209 16:47:41 Found 1 share(s)
20201209 16:47:41 Found 1 share(s)
20201209 16:47:42 Total: 68.15 MH/s
20201209 16:47:43 Found 1 share(s)
20201209 16:47:43 Found 1 share(s)
20201209 16:47:43 Found 1 share(s)
20201209 16:47:44 Total: 68.13 MH/s
.....

@muratkavuncu

WARN: WATCHDOG: T-Rex has a problem with GPU, terminating...
I am getting this error and it keeps recurring. Restarting the system resolves it. Is there a way to fix this?

@Jerkysan

FYI: I found out that if the time changes on your computer, this error can be triggered.
