🐞 [Bug]: The GPU endpoint for the node returns a value that can be null. #1051

Closed
Mahmoud-Emad opened this issue May 30, 2024 · 4 comments
Labels: grid-proxy (belongs to grid proxy), process_wontfix (This will not be worked on), type_bug (Something isn't working)

@Mahmoud-Emad

What happened?

I rented more than one node, and all worked fine as I could view the GPU details of the nodes. However, I encountered a problem with one node as I could not list its GPU. After investigating, I discovered that the node's GPU returned a null value from the grid proxy.

This is a valid node and this is its GPU.
This is an invalid node and this is its GPU.
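
Until the proxy is updated, a client can guard against the null value. The following is a minimal sketch, assuming the GPU endpoint returns either a JSON array of GPU objects or a JSON `null`; the function name `normalize_gpus` and the payload shape are hypothetical and are not taken from the grid proxy documentation.

```python
from __future__ import annotations

import json
from typing import Any


def normalize_gpus(raw_body: str) -> list[dict[str, Any]]:
    """Parse a GPU response body, mapping a JSON null to an empty list.

    The payload shape (a list of objects, or null) is an assumption
    made for illustration only.
    """
    payload = json.loads(raw_body)
    if payload is None:
        # The node reports a GPU count but the endpoint has no GPU rows.
        return []
    if not isinstance(payload, list):
        raise ValueError(f"unexpected GPU payload type: {type(payload).__name__}")
    return payload
```

With this in place, a caller can iterate over the result without a separate null check, and can still flag the mismatch when the node's reported GPU count is non-zero but the list is empty.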

Which network/s did you face the problem on?

Main

Twin ID/s

8161

Version

2.4.2

Node ID/s

5856

Farm ID/s

1840

Contract ID/s

No response

Relevant log output

N/A
@Omarabdul3ziz
Contributor

node 5856 doesn't have a GPU, but its num_gpu=1. check the triggers

@Omarabdul3ziz
Contributor

tl;dr: the issue is fixed in the v0.15.0 release

here is what happened.

  • checking the processor_db on the main grid stack shows that the node_gpu table already has a GPU for this node twin

    tfgrid-graphql=# select * from node_gpu where node_twin_id=9545;
     node_twin_id |           id           |       vendor       |             device             | contract 
    --------------+------------------------+--------------------+--------------------------------+----------
            9545 | 0000:02:00.0/10de/1245 | NVIDIA Corporation | GF116 [GeForce GTS 450 Rev. 2] |        0
    (1 row)
    

    and in the cache table, it is reported correctly

    tfgrid-graphql=# select node_gpu_count from resources_cache where node_id=5856;
     node_gpu_count 
    ----------------
                  1
    (1 row)
    

    so it is not a problem with the triggers.

  • the most likely scenario is that the node had a GPU once, but it is not there anymore. an invalidation mechanism was introduced in "gpu indexer fixes #642" for a similar issue, but it wasn't working as expected, so it was modified later in the PR "refactor the indexer #726"

  • the node_gpu table doesn't have an updated_at field, which is what the new invalidation mechanism depends on to consider a GPU card expired. this means this proxy doesn't have the latest changes for this part yet.

  • according to the changelog, the latest expiration logic should be in release v0.15.0, but the main proxy is still at v0.14.13
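
To illustrate the idea of expiring GPU rows by `updated_at`, here is a rough sketch using an in-memory SQLite database. It is not the proxy's actual implementation: the `updated_at` column type (Unix seconds), the `max_age_seconds` threshold, and the helper names `make_demo_db` and `live_gpus` are all assumptions made for the example. Only the other column names and the sample row come from the query output above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory node_gpu table with a hypothetical updated_at column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node_gpu ("
        " node_twin_id INTEGER, id TEXT, vendor TEXT, device TEXT,"
        " contract INTEGER, updated_at INTEGER)"  # updated_at: assumed Unix seconds
    )
    conn.execute(
        "INSERT INTO node_gpu VALUES (?, ?, ?, ?, ?, ?)",
        (9545, "0000:02:00.0/10de/1245", "NVIDIA Corporation",
         "GF116 [GeForce GTS 450 Rev. 2]", 0, 1000),
    )
    return conn


def live_gpus(conn: sqlite3.Connection, node_twin_id: int,
              now: int, max_age_seconds: int) -> list:
    """Return GPU ids whose row was refreshed within max_age_seconds of now."""
    cutoff = now - max_age_seconds
    rows = conn.execute(
        "SELECT id FROM node_gpu"
        " WHERE node_twin_id = ? AND updated_at >= ? ORDER BY id",
        (node_twin_id, cutoff),
    ).fetchall()
    return [row[0] for row in rows]
```

Under these assumptions, a card that the indexer has not re-reported within the threshold no longer counts, so a node whose GPU was removed would stop showing a stale GPU. Without an `updated_at` column there is nothing to compare against, which matches the behavior seen on the v0.14.13 proxy.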

@Omarabdul3ziz
Contributor

this should be verified carefully on mainnet

@ashraffouda
Collaborator

already fixed in release 0.15.0, which is not on mainnet yet

@ashraffouda ashraffouda added the process_wontfix This will not be worked on label Jun 10, 2024