Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dashboard] Select port in dashboard #13763

Merged
merged 14 commits into from
Feb 24, 2021

Conversation

fyrestone
Copy link
Contributor

@fyrestone fyrestone commented Jan 28, 2021

Why are these changes needed?

  • Problem
    • The dashboard test cases may fail due to port conflict. It's not stable to check port is free in services.py, then bind port in dashboard.
  • Solution
    • Dashboard select an available port by itself, then services.py get the bound port from redis.
  • Other
    • Fix the dashboard may hangs at exit.
    • Fix test_stats_collector.py::test_get_all_node_details.
    • Fix test_multi_node_3.py::test_calling_start_ray_head may fail.
    • Move some cases from test_advanced_3.py to test_advanced_2.py due to timeout.

Related issue number

Closes #13351

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

rkooo567 commented Feb 1, 2021

We have the same issue for the dashboard agent, and it is really hard for it to choose the same strategy for that (because other cpp components need to know the port number of the agent). Do you have any good idea to solve it?

@fyrestone
Copy link
Contributor Author

We have the same issue for the dashboard agent, and it is really hard for it to choose the same strategy for that (because other cpp components need to know the port number of the agent). Do you have any good idea to solve it?

raylet knows the port of dashboard agent because agent calls RegisterAgent after started, the request message is https://github.com/ray-project/ray/blob/master/src/ray/protobuf/agent_manager.proto#L28. Which cpp component needs to know the port of the agent?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 4, 2021
@rkooo567
Copy link
Contributor

rkooo567 commented Feb 4, 2021

Lmk when the PR is ready

@fyrestone
Copy link
Contributor Author

Lmk when the PR is ready

I am not sure if CI failure is related to this PR.

@fyrestone fyrestone added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 4, 2021
@fyrestone
Copy link
Contributor Author

Lmk when the PR is ready

All cases passed. The PR is ready.

@fyrestone fyrestone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 5, 2021
@fyrestone
Copy link
Contributor Author

@rkooo567 Is there any problem with this PR?

@rkooo567
Copy link
Contributor

rkooo567 commented Feb 7, 2021

Window tests seem to fail

@rkooo567 rkooo567 added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Feb 7, 2021
@fyrestone
Copy link
Contributor Author

Window tests seem to fail

The dashboard failure will be reported, so Windows CI should build the dashboard frontend before running tests. I found the failures in Windows CI logs:

2021-02-04T08:34:07.1279655Z E                   Exception: Failed to start the dashboard, return code 0. The last 10 lines of C:\Users\RUNNER~1\AppData\Local\Temp\ray\session_2021-02-04_08-33-29_280416_4784\logs\dashboard.log:
2021-02-04T08:34:07.1280906Z E                   2021-02-04 08:33:30,592	WARNING dashboard.py:220 -- The dashboard on node fv-az177-581 failed with the following error:
2021-02-04T08:34:07.1281711Z E                   Traceback (most recent call last):
2021-02-04T08:34:07.1282415Z E                     File "d:\a\ray\ray\python\ray\new_dashboard\dashboard.py", line 201, in <module>
2021-02-04T08:34:07.1283074Z E                       log_dir=args.log_dir)
2021-02-04T08:34:07.1283711Z E                     File "d:\a\ray\ray\python\ray\new_dashboard\dashboard.py", line 85, in __init__
2021-02-04T08:34:07.1284376Z E                       build_dir = setup_static_dir()
2021-02-04T08:34:07.1285143Z E                     File "d:\a\ray\ray\python\ray\new_dashboard\dashboard.py", line 46, in setup_static_dir
2021-02-04T08:34:07.1285842Z E                       "&& npm run build)", build_dir)
2021-02-04T08:34:07.1287263Z E                   FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard(cd python/ray/new_dashboard/client && npm install && npm ci && npm run build): 'd:\\a\\ray\\ray\\python\\ray\\new_dashboard\\client\\build'
2021-02-04T08:34:07.1288564Z 

But, the npm build is skipped on Windows:

build_dashboard_front_end() {
  if [ "${OSTYPE}" = msys ]; then
    { echo "WARNING: Skipping dashboard due to NPM incompatibilities with Windows"; } 2> /dev/null
  else
    (
      cd ray/new_dashboard/client

      if [ -z "${BUILDKITE-}" ]; then
        set +x  # suppress set -x since it'll get very noisy here
        . "${HOME}/.nvm/nvm.sh"
        nvm use --silent node
      fi
      install_npm_project
      npm run -s build
    )
  fi
}

If the npm is incompatible with Windows, I have to make the dashboard starts silently without frontend.

@fyrestone fyrestone force-pushed the dashboard_select_port branch 5 times, most recently from ee2023e to 120f033 Compare February 20, 2021 07:34
@fyrestone fyrestone added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Feb 22, 2021
@fyrestone
Copy link
Contributor Author

@rkooo567 It's hard to make all test cases green.

@rkooo567
Copy link
Contributor

@fyrestone no reason to make it all green. Just need to make sure failed tests are unrelated to this PR or also failing in the master

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Btw this is only for the dashboard, not agents right? I've seen similar issues from agents too I think

@fyrestone
Copy link
Contributor Author

LGTM. Btw this is only for the dashboard, not agents right? I've seen similar issues from agents too I think

This PR is only for the dashboard. I will look into the problem of dashboard agent port conflicts.

@rkooo567
Copy link
Contributor

Sounds good. I’ve seen issues from agents a lot recently. The issue is it is a bit trickier than others because raylet need the port of it (to report metrics). Are you planning to work on it soon?

@fyrestone
Copy link
Contributor Author

Sounds good. I’ve seen issues from agents a lot recently. The issue is it is a bit trickier than others because raylet need the port of it (to report metrics). Are you planning to work on it soon?

I will fix the dashboard agent port conflicts after this PR is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dashboard][CI] Improve dashboard port selection
2 participants