Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add disk and network metrics to Prometheus and fix dashboard #14144

Merged
merged 7 commits into from
Feb 17, 2021

Conversation

kathryn-zhou
Copy link
Contributor

@kathryn-zhou kathryn-zhou commented Feb 16, 2021

Why are these changes needed?

Replacement PR for #14061 because of #14133

Related issue number

Closes #14133

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

Hey can we add a regression test here to avoid a similar issue?

@rkooo567
Copy link
Contributor

What was the root cause of the previous failure?

@simon-mo
Copy link
Contributor

The root cause was a rename in variable net -> network_speed in Python the missed renaming of net -> networkSpeed in javascript. There's no way to test it using our current stack (because javascript fail to render). If we get someone to work on this, using the javascript testing stack (jest, puppeteer) should help.

@rkooo567
Copy link
Contributor

Why don't we have a simple sanity check with https://selenium-python.readthedocs.io/getting-started.html? (maybe one test that uses seleium to click a button or sth)

@simon-mo
Copy link
Contributor

simon-mo commented Feb 17, 2021 via email

@rkooo567
Copy link
Contributor

Can we just do this in this pr? Dashboard error we saw is a significant regression for ppl like me who have engagement with users (with the master wheel). If you guys have no bandwidth, I can try making a pr

@simon-mo
Copy link
Contributor

let's do another PR so this PR is unblocked. One issue is that these kind of browser based testing can be very flaky because you don't know how long javascript takes to render. And installing browser tool like selenium itself is hard to do it right. I would say it takes around 4-8 hours to finish end to end. @kathryn-zhou can you start taking a look at selenium, or even better i heard cypress is more reliable testing tool: https://docs.cypress.io/guides/overview/key-differences.html#Architecture

@rkooo567
Copy link
Contributor

I see. I wasn't aware that this could be complicated. I agree not to include it in this PR in that case. @kathryn-zhou can you create a separate issue?

@simon-mo
Copy link
Contributor

lint passed.

@simon-mo simon-mo merged commit f6b5e83 into ray-project:master Feb 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[dashboard] dashboard crashed on nightly
3 participants