Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release v0.5.0 doc validation part 2 #999

Merged
merged 7 commits into from
Apr 4, 2023

Conversation

architkulkarni
Copy link
Contributor

@architkulkarni architkulkarni commented Mar 31, 2023

Why are these changes needed?

Continuation of #997. As in that PR,

  1. This PR mainly focuses on correctness validation rather than document improvement.
  2. We use the nightly KubeRay operator image to test the documents.
  • https://github.com/ray-project/kuberay/blob/master/docs/guidance/ingress.md

    • The script does not always work when run all at once, likely due to some race condition. Added a troubleshooting step to unblock the user if this happens. Ideally we should figure out the correct "wait" step to insert in the script to fix the issue.
  • https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayjob.md

  • https://github.com/ray-project/kuberay/blob/master/docs/guidance/tls.md

    • Passes. Only tested "TL;DR" path. We should explicitly add the command for "# Install v0.5.0 KubeRay operator". I used kind create cluster followed by helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly ~/kuberay/helm-chart/kuberay-operator
  • https://github.com/ray-project/kuberay/blob/master/docs/guidance/pod-command.md

    • Passes
  • https://github.com/ray-project/kuberay/blob/master/docs/guidance/pod-security.md

    • Passes. Note that the job "succeeds" but the script itself raises an error:
    Traceback (most recent call last):
    File "samples/xgboost_example.py", line 28, in <module>
      result = trainer.fit()
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/base_trainer.py", line 363, in fit
      result_grid = tuner.fit()
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 292, in fit
      return self._local_tuner.fit()
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 455, in fit
      analysis = self._fit_internal(trainable, param_space)
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 573, in _fit_internal
      **args,
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 679, in run
      callbacks, sync_config, metric=metric, progress_metrics=progress_metrics
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/callback.py", line 105, in _create_default_callbacks
      callbacks.append(TBXLoggerCallback())
    File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/logger/tensorboardx.py", line 165, in __init__
      from tensorboardX import SummaryWriter
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/__init__.py", line 5, in <module>
      from .torchvis import TorchVis
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/torchvis.py", line 10, in <module>
      from .writer import SummaryWriter
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/writer.py", line 16, in <module>
      from .comet_utils import CometLogger
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/comet_utils.py", line 7, in <module>
      from .summary import _clean_tag
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/summary.py", line 12, in <module>
      from .proto.summary_pb2 import Summary
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/proto/summary_pb2.py", line 16, in <module>
      from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/proto/tensor_pb2.py", line 16, in <module>
      from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
    File "/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 42, in <module>
      serialized_options=None, file=DESCRIPTOR),
    File "/home/ray/anaconda3/lib/python3.7/site-packages/google/protobuf/descriptor.py", line 561, in __new__
      _message.Message._CheckCalledFromGeneratedFile()
    
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower)

this doesn't affect the main point of the doc, which is to demonstrate security settings.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@@ -201,3 +201,12 @@ kubectl describe ingress raycluster-ingress-head-ingress
# [Note] The forward slash at the end of the address is necessary. `<ip>/raycluster-ingress`
# will report "404 Not Found".
```

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we just need to add a comment in Step 2 that says to execute the kubectl wait command after the Pod for the NGINX ingress controller is created or just sleep 10. This is an expected behavior for kubectl wait.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commands are copied from here.

@kevin85421
Copy link
Member

kevin85421 commented Apr 1, 2023

For pod-security.md, this bug is totally from Ray Docker images. It has been fixed by Ray 2.2 but happened again in Ray 2.3 images. See #873 for more details. Maybe we should update the example script in Step 5.2.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the section "Example: NGINX Ingress on KinD (built-in ingress support)" in ingress.md and remove https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.ingress.yaml.

@gvspraveen
Copy link
Contributor

For

does not gracefully handle multiple jobs with the same submission ID submitted in quick succession
May be in the description here link to your PR in 2.4 which fixes this

@architkulkarni
Copy link
Contributor Author

architkulkarni commented Apr 3, 2023

For

does not gracefully handle multiple jobs with the same submission ID submitted in quick succession
May be in the description here link to your PR in 2.4 which fixes this

Actually it works now after Kai-Hsun's fix #1000, I've updated the PR description

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni
Copy link
Contributor Author

For pod-security.md, this bug is totally from Ray Docker images. It has been fixed by Ray 2.2 but happened again in Ray 2.3 images. See #873 for more details. Maybe we should update the example script in Step 5.2.

I confirmed that this script passes with rayproject/ray-ml:nightly, so I've updated the pod security sample YAML to use ray-ml:latest. When the Ray 2.4 docker images are released, the test should pass.

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni architkulkarni marked this pull request as ready for review April 4, 2023 22:50
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni
Copy link
Contributor Author

Discussed offline, we'll use Ray 2.2.0 for the pod security doc for now until we can confirm ray:latest works (likely after 2.4.0 release of Ray)

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni architkulkarni changed the title [WIP] Release v0.5.0 doc validation part 2 Release v0.5.0 doc validation part 2 Apr 4, 2023
@kevin85421 kevin85421 merged commit 31c1e6a into ray-project:master Apr 4, 2023
@architkulkarni architkulkarni deleted the doc-updates branch June 12, 2023 21:53
@kevin85421 kevin85421 mentioned this pull request Jun 23, 2023
1 task
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Release v0.5.0 doc validation part 2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants