Releases: ray-project/kuberay
v1.1.1 release
Compared to KubeRay v1.1.0, KubeRay v1.1.1 includes four cherry-picked commits.
- [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
- [CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)
- [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
- [Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)
v1.1.0 release
Highlights
- RayJob improvements
  - Gang / priority scheduling with Kueue
  - `activeDeadlineSeconds` (new field): controls the lifecycle of a RayJob. See this doc and #1933 for more details.
  - `submissionMode` (new field): users can specify "K8sJobMode" or "HTTPMode". The default value is "K8sJobMode". In "HTTPMode", the submitter Kubernetes Job is not created; instead, KubeRay sends an HTTP request to the Ray head Pod to create the Ray job. See this doc and #1893 for more details.
  - Numerous stability fixes.
- Structured logging
  - In KubeRay v1.1.0, KubeRay logs are emitted in JSON format, and each log message includes context information such as the custom resource's name and the reconcileID. Hence, users can filter the logs associated with a RayCluster, RayJob, or RayService CR by its name.
- RayService improvements
  - Refactor the health check mechanism to improve stability.
  - Deprecate `deploymentUnhealthySecondThreshold` and `serviceUnhealthySecondThreshold` to avoid unintentionally preparing a new RayCluster custom resource.
- TPU multi-host PodSlice support
  - The KubeRay team is actively working with the Google GKE and TPU teams on this integration. The required changes in KubeRay are already complete, and the GKE team is finishing the remaining work on its side. Users will then be able to use multi-host TPU PodSlices with a static RayCluster (without autoscaling).
- Stop publishing images on Docker Hub; images are now published only on Quay.
  - https://quay.io/repository/kuberay/operator?tab=tags
  - Use `docker pull quay.io/kuberay/operator:v1.1.0` instead of `docker pull kuberay/operator:v1.1.0`.
RayJob
RayJob state machine refactor
- [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
- [RayJob][Status][2/n] Redefine `ready` for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
- [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
- [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
- [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
- [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
- [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
- [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
- [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
- [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
- [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
- [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
- [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status `Suspending` (#1798, @kevin85421)
- [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
- [RayJob][Status][15/n] Unify the codepath for the status transition to `Suspended` (#1805, @kevin85421)
- [RayJob][Status][16/n] Refactor `Running` status (#1807, @kevin85421)
- [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
- [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
- [RayJob][Status][19/n] Transition to `Complete` if the K8s Job fails (#1833, @kevin85421)
Others
- [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
- [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
- Add apply configurations to generated client (#1818, @astefanutti)
- RayJob: inject RAY_DASHBOARD_ADDRESS environment variable for user-provided submitter templates (#1852, @andrewsykim)
- [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus `Complete` and a JobStatus `SUCCEEDED` (#1919, @kevin85421)
- Add toleration for GPUs in sample PyTorch RayJob (#1914, @andrewsykim)
- Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
- rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
- RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
- rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
- chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
- [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
- [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
- [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
- [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
- [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
- [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
- [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
- [RayJob] Set missing CPU limit (#1899, @kevin85421)
- [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
- [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
- [RayJob] Improve dashboard client log (#1903, @kevin85421)
- [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
- [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
- [Test][RayJob] Transition to `Complete` if the JobStatus is STOPPED (#1871, @kevin85421)
- [RayJob] Inject RAY_SUBMISSION_ID env variable for user-provided submitter template (#1868, @kevin85421)
- [RayJob] Transition to `Complete` if the JobStatus is STOPPED (#1855, @kevin85421)
- [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
- [RayJob] Validate RayJob spec (#1813, @kevin85421)
- [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
- [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
- [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
- [Bug][RayJob] Avoid nil pointer dereference ([#1756](https://github.c...
v1.0.0 release
KubeRay is officially in General Availability!
- Bump the CRD version from v1alpha1 to v1.
- Relocate almost all documentation to the Ray website.
- Improve RayJob UX.
- Improve GCS fault tolerance.
GCS fault tolerance
- [GCS FT] Improve GCS FT cleanup UX (#1592, @kevin85421)
- [Bug][RayCluster] Fix RAY_REDIS_ADDRESS parsing with redis scheme and… (#1556, @rueian)
- [Bug] RayService with GCS FT HA issue (#1551, @kevin85421)
- [Test][GCS FT] End-to-end test for cleanup_redis_storage (#1422)(#1459) (#1466, @rueian)
- [Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted (#1412, @kevin85421)
- Update GCS fault tolerance YAML (#1404, @kevin85421)
- [GCS FT] Consider the case of sidecar containers (#1386, @kevin85421)
- [GCS FT] Give readiness / liveness probes good default values (#1364, @kevin85421)
- [GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events (#1341, @kevin85421)
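The GCS FT items above assume an external Redis instance that stores the GCS state. A minimal sketch, with the annotation and environment-variable names taken from the KubeRay GCS FT docs; the Redis address and image are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft
  annotations:
    ray.io/ft-enabled: "true"            # enable GCS fault tolerance
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0  # illustrative Ray image
            env:
              - name: RAY_REDIS_ADDRESS  # external Redis backing the GCS
                value: redis.default.svc.cluster.local:6379
```

Per #1412, KubeRay also cleans up the Redis storage once a GCS FT-enabled RayCluster is deleted.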
CRD versioning
- [CRD] Inject CRD version to the Autoscaler sidecar container (#1496, @kevin85421)
- [CRD][2/n] Update from CRD v1alpha1 to v1 (#1482, @kevin85421)
- [CRD][1/n] Create v1 CRDs (#1481, @kevin85421)
- [CRD] Set maxDescLen to 0 (#1449, @kevin85421)
RayService
- [Hotfix][Bug] Avoid unnecessary zero-downtime upgrade (#1581, @kevin85421)
- [Feature] Add an example for RayService high availability (#1566, @kevin85421)
- [Feature] Add a flag to make zero downtime upgrades optional (#1564, @kevin85421)
- [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. (#1420, @kevin85421)
- [Bug] Fix the filename of text summarizer YAML (#1415, @kevin85421)
- [serve] Change text ml yaml to use french in user config (#1403, @zcin)
- [services] Add text ml rayservice yaml (#1402, @zcin)
- [Bug] Fix flakiness of RayService e2e tests (#1385, @kevin85421)
- Add RayService sample test (#1377, @Darren221)
- [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (#1293, @kevin85421)
- [RayService][Observability] Add more loggings about networking issues (#1282, @kevin85421)
RayJob
- [Feature] Improve observability for flaky RayJob test (#1587, @kevin85421)
- [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running (#1583, @architkulkarni)
- [RayJob] Fix RayJob status reconciliation (#1539, @astefanutti)
- [RayJob]: Always use target RayCluster image as default RayJob submitter image (#1548, @astefanutti)
- [RayJob] Add default CPU and memory for job submitter pod (#1319, @architkulkarni)
- [Bug][RayJob] Check dashboard readiness before creating job pod (#1381) (#1429, @rueian)
- [Feature][RayJob] Use RayContainerIndex instead of 0 (#1397) (#1427, @rueian)
- [RayJob] Enable job log streaming by setting `PYTHONUNBUFFERED` in the job container (#1375, @architkulkarni)
- Add field to expose entrypoint num cpus in RayJob (#1359, @shubhscoder)
- [RayJob] Add runtime env YAML field (#1338, @architkulkarni)
- [Bug][RayJob] RayJob with custom head service name (#1332, @kevin85421)
- [RayJob] Add e2e sample yaml test for shutdownAfterJobFinishes (#1269, @architkulkarni)
RayCluster
- [Enhancement] Remove unused variables in constant.go (#1474, @evalaiyc98)
- [Enhancement] GPU RayCluster doesn't work on GKE Autopilot (#1470, @kevin85421)
- [Refactor] Parameterize TestGetAndCheckServeStatus (#1450, @evalaiyc98)
- [Feature] Make replicas optional for WorkerGroupSpec (#1443, @kevin85421)
- use raycluster app's name as podgroup name key word (#1446, @lowang-bh)
- [Refactor] Make port name variables consistent and meaningful (#1389, @evalaiyc98)
- [Feature] Use image of Ray head container as the default Ray Autoscaler container (#1401, @kevin85421)
- Update Autoscaler YAML for the Autoscaler tutorial (#1400, @kevin85421)
- [Feature] Ray container must be the first application container (#1379, @kevin85421)
- [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
- [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)
Helm charts
- Remove miniReplicas in raycluster-cluster.yaml (#1473, @evalaiyc98)
- Helm chart ray-cluster template reference fix (#1469, @chrisxstyles)
- fix: Issue #1391 - Custom labels not being pulled in (#1398, @rxraghu)
- Remove unnecessary kustomize in make helm (#1370, @shubhscoder)
- [Feature] Allow RayCluster Helm chart to specify different images for different worker groups (#1352, @Darren221)
- Allow manually creating init containers in Kuberay helm charts (#1287, @richardsliu)
KubeRay API Server
- Added Python API server client (#1561, @blublinsky)
- updating url use v1 (#1577, @blublinsky)
- Fixed processing of job submitter (#1562, @blublinsky)
- extended job APIs (#1537, @blublinsky)
- fixed volumes test in cluster test (#1498, @blublinsky)
- Add documentation for API Server monitoring (#1479, @blublinsky)
- created HA example for API server (#1461, @blublinsky)
- Numerous fixes to the API server to make RayJob APIs working (#1447, @blublinsky)
- Updated API server documentation (#1435, @z103cb)
- servev2 support for API server (#1419, @blublinsky)
- replacement for #1312 (#1409, @blublinsky)
- Updates to the apiserver swagger-ui (#1410, @z103cb)
- implemented liveness/readiness probe for the API server (#1369, @blublinsky)
- Operator support for OpenShift (#1371, @blublinsky)
- Removed use of the of BUILD_FLAGS in apiserver makefile (#1336, @z103cb)
- Api server makefile (#1301, @z103cb)
Documentation
- [Doc] Update release docs (#1621, @kevin85421)
- [Doc] Fix release doc format (#1578, @kevin85421)
- Update kuberay mcad integration doc (#1373, @tedhtchang)
- [Release][Doc] Add instructions to release Go modules. (#1546, @kevin85421)
- [Post v1.0.0-rc.1] Reenable sample YAML tests for latest release and update some docs (#1544, @kevin85421)
- Update operator development instruction ([#1458](https://g...
v0.6.0 release
Highlights
- RayService
  - RayService starts to support the Ray Serve multi-app API (#1136, #1156)
  - RayService stability improvements (#1231, #1207, #1173)
  - RayService observability (#1230)
  - RayService examples
    - [RayService] Stable Diffusion example (#1181, @kevin85421)
    - MobileNet example (#1175, @kevin85421)
  - RayService troubleshooting handbook (#1221)
- RayJob refactoring (#1177)
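The multi-app Serve API above (#1136, #1156) is driven by a YAML string embedded in the RayService spec. A sketch, assuming the v1alpha1 CRD of this release; the application names, import paths, and image are illustrative:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-multi-app
spec:
  serveConfigV2: |               # multi-app config as a YAML string
    applications:
      - name: app1
        route_prefix: /app1
        import_path: models.text_classifier.app   # illustrative import path
      - name: app2
        route_prefix: /app2
        import_path: models.image_classifier.app
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.5.0  # illustrative Ray image
```

Each application is deployed and reported independently, which is what the multi-app status support in #1136 surfaces in the CR status.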
RayService
- [RayService][Observability] Add more logging for RayService troubleshooting (#1230, @kevin85421)
- [Bug] Long image pull time will trigger blue-green upgrade after the head is ready (#1231, @kevin85421)
- [RayService] Stable Diffusion example (#1181, @kevin85421)
- [RayService] Update docs to use multi-app (#1179, @zcin)
- [RayService] Change runtime env for e2e autoscaling test (#1178, @zcin)
- [RayService] Add e2e tests (#1167, @zcin)
- [RayService][docs] Improve explanation for config file and in-place updates (#1229, @zcin)
- [RayService][Doc] RayService troubleshooting handbook (#1221, @kevin85421)
- [Doc] Improve RayService doc (#1235, @kevin85421)
- [Doc] Improve FAQ page and RayService troubleshooting guide (#1225, @kevin85421)
- [RayService] Add RayService alb ingress CR (#1169, @sihanwang41)
- [RayService] Add support for multi-app config in yaml-string format (#1156, @zcin)
- [rayservice] Add support for getting multi-app status (#1136, @zcin)
- [Refactor] Remove Dashboard Agent service (#1207, @kevin85421)
- [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (#1173, @kevin85421)
- MobileNet example (#1175, @kevin85421)
- [Bug] fix RayActorOptionSpec.items.spec.serveConfig.deployments.rayActorOptions.memory int32 data type (#1220, @kevin85421)
RayJob
- [RayJob] Submit job using K8s job instead of checking Status and using DashboardHTTPClient (#1177, @architkulkarni)
- [Doc] [RayJob] Add documentation for submitterPodTemplate (#1228, @architkulkarni)
Autoscaler
- [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
- [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)
Helm
- [Helm][RBAC] Introduce the option crNamespacedRbacEnable to enable or disable the creation of Role/RoleBinding for RayCluster preparation (#1162, @kevin85421)
- [Bug] Allow zero replica for workers for Helm (#968, @ducviet00)
- [Bug] KubeRay tries to create ClusterRoleBinding when singleNamespaceInstall and rbacEnable are set to true (#1190, @kevin85421)
KubeRay API Server
- Add support for openshift routes (#1183, @blublinsky)
- Adding API server support for service account (#1148, @blublinsky)
Documentation
- [release v0.6.0] Update tags and versions (#1270, @kevin85421)
- [release v0.6.0-rc.1] Update tags and versions (#1264, @kevin85421)
- [release v0.6.0-rc.0] Update tags and versions (#1237, @kevin85421)
- [Doc] Develop Ray Serve Python script on KubeRay (#1250, @kevin85421)
- [Doc] Fix the order of comments in sample Job YAML file (#1242, @architkulkarni)
- [Doc] Upload a screenshot for the Serve page in Ray dashboard (#1236, @kevin85421)
- [Doc] GKE GPU cluster setup (#1223, @kevin85421)
- [Doc][Website] Add complete document link (#1224, @yuxiaoba)
- Add FAQ page (#1150, @Yicheng-Lu-llll)
- [Doc] Add gofumpt lint instructions (#1180, @architkulkarni)
- [Doc] Add `helm update` command to chart validation step in release process (#1165, @architkulkarni)
- [Doc] Add `git fetch --tags` command to release instructions (#1164, @architkulkarni)
- Add KubeRay related blogs (#1147, @tedhtchang)
- [2.5.0 Release] Change version numbers 2.4.0 -> 2.5.0 (#1151, @ArturNiederfahrenhorst)
- [Sample YAML] Bump ray version in pod security YAML to 2.4.0 (#1160, @architkulkarni)
- Add instruction to skip unit tests in DEVELOPMENT.md (#1171, @architkulkarni)
- Fix typo (#1241, @mmourafiq)
- Fix typo (#1232, @mmourafiq)
CI
- [CI] Add `kind`-in-Docker test to Buildkite CI (#1243, @architkulkarni)
- [CI] Remove unnecessary release.yaml workflow (#1168, @architkulkarni)
Others
- Pin operator version in single namespace installation (#1193) (#1210, @wjzhou)
- RayCluster updates status frequently (#1211, @kevin85421)
- Improve the observability of the init container (#1149, @Yicheng-Lu-llll)
- [Ray Observability] Disk usage in Dashboard (#1152, @kevin85421)
v0.5.2 release
Changelog for v0.5.2
Highlights
The KubeRay 0.5.2 patch release includes the following improvements.
- Allow specifying the entire headService and serveService YAML spec. Previously, only certain special fields such as `labels` and `annotations` were exposed to the user.
  - Expose entire head pod Service to the user (#1040, @architkulkarni)
  - Exposing Serve Service (#1117, @kodwanis)
- RayService stability improvements
  - RayService object's Status is being updated due to frequent reconciliation (#1065, @kevin85421)
  - [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
  - Fix in HeadPod Service generation logic which was causing frequent reconciliation (#1056, @msumitjain)
- Allow watching multiple namespaces
  - [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
- Autoscaler stability improvements
  - [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
  - [Bug] Autoscaler not working properly in RayJob (#1064, @Yicheng-Lu-llll)
  - [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
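The headService/serveService exposure above means users can pass a complete Kubernetes Service object rather than just labels and annotations. A sketch of the serveService case; the field placement is an assumption based on the PR titles (#1040, #1117), and all values are illustrative:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serveService:                  # full Service spec (assumed placement per #1117)
    metadata:
      name: custom-serve-service
      labels:
        team: ml-platform
    spec:
      type: LoadBalancer
      ports:
        - name: serve
          port: 8000
          targetPort: 8000
```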
Contributors
We'd like to thank the following contributors for their contributions to this release:
@ByronHsu, @Yicheng-Lu-llll, @anishasthana, @architkulkarni, @blublinsky, @chrisxstyles, @dirtyValera, @ecurtin, @jasoonn, @jjyao, @kevin85421, @kodwanis, @msumitjain, @oginskis, @psschwei, @scarlet25151, @sihanwang41, @tedhtchang, @varungup90, @xubo245
Features
- Add a flag to enable/disable worker init container injection (#1069, @ByronHsu)
- Add a warning to discourage users from launching a KubeRay-incompatible autoscaler. (#1102, @kevin85421)
- Add consistency check for deepcopy generated files (#1127, @varungup90)
- Add kubernetes dependency in python client library (#998, @jasoonn)
- Add support for pvcs to apiserver (#1118, @psschwei)
- Add support for tolerations, env, annotations and labels (#1070, @blublinsky)
- Align Init Container's ImagePullPolicy with Ray Container's ImagePullPolicy (#1080, @Yicheng-Lu-llll)
- Connect Ray client with TLS using Nginx Ingress on Kind cluster (#729) (#1051, @tedhtchang)
- Expose entire head pod Service to the user (#1040, @architkulkarni)
- Exposing Serve Service (#1117, @kodwanis)
- [Test] Add e2e test for sample RayJob yaml on kind (#935, @architkulkarni)
- Parametrize ray-operator makefile (#1121, @anishasthana)
- RayService object's Status is being updated due to frequent reconciliation (#1065, @kevin85421)
- [Feature] Support suspend in RayJob (#926, @oginskis)
- [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
- [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
- feat: Rename instances of rayiov1alpha1 to rayv1alpha1 (#1112, @anishasthana)
- ray-operator: Reuse contexts across ray operator reconcilers (#1126, @anishasthana)
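The multi-namespace watch feature above (#1106) is typically enabled through the kuberay-operator Helm values; the key names below are assumptions based on the chart of this era, and the namespaces are illustrative:

```yaml
# values.yaml for the kuberay-operator chart (key names are assumptions)
singleNamespaceInstall: true   # create namespaced Role/RoleBinding instead of cluster-scoped RBAC
watchNamespace:                # list of namespaces the operator reconciles
  - team-a
  - team-b
```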
Fixes
- Fix CI (#1145, @kevin85421)
- Fix config frequent update (#1014, @sihanwang41)
- Fix for Sample YAML Config Test - 2.4.0 Failure due to 'suspend' Field (#1096, @Yicheng-Lu-llll)
- Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
- [Bug] Autoscaler doesn't support TLS (#1119, @chrisxstyles)
- [Bug] Enable ResourceQuota by adding Resources for the health-check init container (#1043, @kevin85421)
- [Bug] Fix null map handling in `BuildServiceForHeadPod` function (#1095, @architkulkarni)
- [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
- [Bug] Service (Serve) changing port from 8000 to 9000 doesn't work (#1081, @kevin85421)
- [Bug] autoscaler not working properly in rayjob (#1064, @Yicheng-Lu-llll)
- [Bug] compatibility test for the nightly Ray image fails (#1055, @kevin85421)
- [Bug] rayStartParams is required at this moment. (#1031, @kevin85421)
- [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
- [Bug][Doc] fix the link error of operator document (#1046, @xubo245)
- [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed (#1036, @kevin85421)
- [Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server (#1128, @kevin85421)
- [Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService (#1110, @kevin85421)
- [Helm][ray-cluster] Fix parsing envFrom field in additionalWorkerGroups (#1039, @dirtyValera)
Documentation
- [Doc] Copyedit dev guide (#1012, @architkulkarni)
- [Doc] Update nav to include missing files and reorganize nav (#1011, @architkulkarni)
- [Doc] Update version from 0.4.0 to 0.5.0 on remaining kuberay docs files (#1018, @architkulkarni)
- [Doc][Website] Update KubeRay introduction and fix layout issues (#1042, @kevin85421)
- [Docs][Website] One word typo fix in docs and README (#1068, @ecurtin)
- Add a document to outline the default settings for `rayStartParams` in KubeRay (#1057, @Yicheng-Lu-llll)
- Example Pod to connect Ray client to a remote Ray cluster with TLS enabled (#994, @tedhtchang)
- [Post release v0.5.0] Update CHANGELOG.md (#1026, @kevin85421)
- [Post release v0.5.0] Update release doc (#1028, @kevin85421)
- [Post Ray 2.4 Release] Update Ray versions to Ray 2.4.0 (#1049, @jjyao)
- [Post release v0.5.0] Remove block from rayStartParams (#1015, @kevin85421)
- [Post release v0.5.0] Remove block from rayStartParams for python client and KubeRay operator tests (#1050, @Yicheng-Lu-llll)
- [Post release v0.5.0] Remove serviceType (#1013, @kevin85421)
- [Post v0.5.0] Remove init containers from YAML files (#1010, @kevin85421)
- [Sample YAML] Bump ray version in pod security YAML to 2.4.0 (#1160) (#1161, @architkulkarni)
- Kuberay 0.5.0 docs validation update docs for GCS FT (#1004, @scarlet25151)
- Release v0.5.0 doc validation (#997, @kevin85421)
- Release v0.5.0 doc validation part 2 (#999, @architkulkarni)
- Release v0.5.0 python client library validation (#1006, @jasoonn)
- [release v0.5.2] Update tags and versions to 0.5.2 (#1159, @architkulkarni)
v0.5.0 release
Highlights
The KubeRay 0.5.0 release includes the following improvements.
- Interact with KubeRay via a Python client
- Integrate KubeRay with Kubeflow to provide an interactive development environment (link).
- Integrate KubeRay with Ray TLS authentication
- Improve the user experience for KubeRay on AWS EKS (link)
- Fix some Kubernetes networking issues
- Fix some stability bugs in RayJob and RayService
Contributors
The following individuals contributed to KubeRay 0.5.0. This list is alphabetical and incomplete.
@akanso @alex-treebeard @architkulkarni @cadedaniel @cskornel-doordash @davidxia @DmitriGekhtman @ducviet00 @gvspraveen @harryge00 @jasoonn @Jeffwan @kevin85421 @psschwei @scarlet25151 @sihanwang41 @wilsonwang371 @Yicheng-Lu-llll
Python client (alpha) (New!)
Kubeflow (New!)
- [Feature][Doc] Kubeflow integration (#937, @kevin85421)
- [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration (#750, @kevin85421)
TLS authentication (New!)
- [Feature] TLS authentication (#989, @kevin85421)
AWS EKS (New!)
- [Feature][Doc] Access S3 bucket from Pods in EKS (#958, @kevin85421)
Kubernetes networking (New!)
- Read cluster domain from resolv.conf or env (#951, @harryge00)
- [Feature] Replace service name with Fully Qualified Domain Name (#938, @kevin85421)
- [Feature] Add default init container in workers to wait for GCS to be ready (#973, @kevin85421)
Observability
- Fix issue with head pod not monitored by Prometheus under certain conditions (#963, @Yicheng-Lu-llll)
- [Feature] Improve and fix Prometheus & Grafana integrations (#895, @kevin85421)
- Add example and tutorial to explain how to create custom metrics for Prometheus (#914, @Yicheng-Lu-llll)
- feat: enrich `kubectl get` output (#878, @davidxia)
RayCluster
- Fix issue with operator OOM restart (#946, @wilsonwang371)
- [Feature][Hotfix] Add observedGeneration to the status of CRDs (#979, @kevin85421)
- Customize the Prometheus export port (#954, @Yicheng-Lu-llll)
- [Feature] The default ImagePullPolicy should be IfNotPresent (#947, @kevin85421)
- Inject the --block option to ray start command automatically (#932, @Yicheng-Lu-llll)
- Inject cluster name as an environment variable into head and worker pods (#934, @Yicheng-Lu-llll)
- Ensure container ports without names are also included in the head node service (#891, @Yicheng-Lu-llll)
- fix: `.status.availableWorkerReplicas` (#887, @davidxia)
- fix: only filter RayCluster events for reconciliation (#882, @davidxia)
- refactor: remove redundant import in `raycluster_controller.go` (#884, @davidxia)
- refactor: use equivalent, shorter `Builder.Owns()` method (#881, @davidxia)
- [RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change (#850, @architkulkarni)
- [Feature] Make head serviceType optional (#851, @kevin85421)
- [RayCluster controller] Add headServiceAnnotations field to RayCluster CR (#841, @cskornel-doordash)
RayJob (alpha)
- [Hotfix][release blocker][RayJob] HTTP client from submitting jobs before dashboard initialization completes (#1000, @kevin85421)
- [RayJob] Propagate error traceback string when GetJobInfo doesn't return valid JSON (#943, @architkulkarni)
- [RayJob][Doc] Fix RayJob sample config. (#807, @DmitriGekhtman)
RayService (alpha)
- [RayService] Skip update events without change (#811, @sihanwang41)
Helm
- Add rayVersion in the RayCluster chart (#975, @Yicheng-Lu-llll)
- [Feature] Support environment variables for KubeRay operator chart (#978, @kevin85421)
- [Feature] Add service account section in helm chart (#969, @ducviet00)
- Update apiserver chart location in readme (#896, @psschwei)
- add sidecar container option (#920, @akihikokuroda)
- match selector of service to pod labels (#918, @akihikokuroda)
- [Feature] Nodeselector/Affinity/Tolerations value to kuberay-apiserver chart (#879, @alex-treebeard)
- [Feature] Enable namespaced installs via helm chart (#860, @alex-treebeard)
- Remove unused fields from KubeRay operator and RayCluster charts (#839, @kevin85421)
- [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
- [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)
CI
- [Feature] Add python client test to action (#993, @jasoonn)
- [CI][Buildkite] Fix the PATH issue (#952, @kevin85421)
- [CI][Buildkite] An example test for Buildkite (#919, @kevin85421)
- refactor: Fix flaky tests by using RetryOnConflict (#904, @Yicheng-Lu-llll)
- Use k8sClient from client.New in controller test (#898, @Yicheng-Lu-llll)
- [Bug] Fix flaky test: should be able to update all Pods to Running (#893, @kevin85421)
- Enable test framework to install operator with custom config and put operator in a namespace with enforced PSS in security testing (#876, @Yicheng-Lu-llll)
- Ensure all temp files are deleted after the compatibility test (#886, @Yicheng-Lu-llll)
- Adding a test for the document for the Pod security standard (#866, @Yicheng-Lu-llll)
- [Feature] Run config tests with the latest release of KubeRay operator (#858, @kevin85421)
- [Feature] Define a general-purpose cleanup method for CREvent (#849, @kevin85421)
- [Feature] Remove Docker container and NodePort from compatibility test (#844, @kevin85421)
- Remove Docker from BasicRayTestCase (#840, @kevin85421)
- [Feature] Move some functions from prototype test framework to a new utils file (#837, @kevin85421)
- [CI] Add workflow to manually trigger release image push (#801, @DmitriGekhtman)
- [CI] Pin go version in CRD consistency check (#794, @DmitriGekhtman)
- [Feature] Improve the observability of integration tests (#775, @jasoonn)
Sample YAML files
- Improve ray-cluster.external-redis.yaml (#986, @Yicheng-Lu-llll)
- remove ray-cluster.getting-started.yaml (#987, @Yicheng-Lu-llll)
- [Feature] Read Redis password from Kubernetes Secret (#950, @kevin85421)
- [Ray 2.3.0] Update --redis-password for RayCluster (#929, @kevin85421)
- [Bug] KubeRay does not work on M1 macs. (#869, @kevin85421)
- [Post Ray 2.3 Release] Update Ray versions to Ray 2.3.0 (#925, @cadedaniel)
- [Post Ray...
v0.4.0 release
Highlights
The KubeRay 0.4.0 release includes the following improvements.
- Integrations for the MCAD and Volcano batch scheduling systems.
- Stable Helm support for the KubeRay Operator, KubeRay API Server, and Ray clusters. These charts are now hosted at a Helm repo.
- Critical stability improvements to the Ray Autoscaler integration. (To benefit from these improvements, use KubeRay >=0.4.0 and Ray >=2.2.0.)
- Numerous improvements to CI, tests, and developer workflows; a new configuration test framework.
- Numerous improvements to documentation.
- Bug fixes for alpha features, such as RayJobs and RayServices.
- Various improvements and bug fixes for the core RayCluster controller.
Contributors
The following individuals contributed to KubeRay 0.4.0. This list is alphabetical and incomplete.
@AlessandroPomponio @architkulkarni @Basasuya @DmitriGekhtman @IceKhan13 @asm582 @davidxia @dhaval0108 @haoxins @iycheng @jasoonn @Jeffwan @jianyuan @kaushik143 @kevin85421 @lizzzcai @orcahmlee @pcmoritz @peterghaddad @rafvasq @scarlet25151 @shrekris-anyscale @sigmundv @sihanwang41 @simon-mo @tbabej @tgaddair @ulfox @wilsonwang371 @wuisawesome
New features and integrations
- [Feature] Support Volcano for batch scheduling (#755, @tgaddair)
- kuberay int with MCAD (#598, @asm582)
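The Volcano integration above is opt-in per RayCluster. The scheduler label below follows the KubeRay Volcano docs; the queue name and image are illustrative, and the operator must also be started with its batch-scheduler support enabled:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster-volcano
  labels:
    ray.io/scheduler-name: volcano        # hand Pod scheduling to Volcano
    volcano.sh/queue-name: kuberay-queue  # illustrative Volcano queue
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.0.0  # illustrative Ray image
```

Volcano then gang-schedules the cluster's Pods, so the whole RayCluster is admitted only when the queue has capacity for all of them.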
Helm
These changes pertain to KubeRay's Helm charts.
- [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
- [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)
- [Helm] Expose security context in helm chart. (#773, @DmitriGekhtman)
- [Helm] Clean up RayCluster Helm chart ahead of KubeRay 0.4.0 release (#751, @DmitriGekhtman)
- [Feature] Expose initContainer image in RayCluster chart (#674, @kevin85421)
- [Feature][Helm] Expose the autoscalerOptions (#666, @orcahmlee)
- [Feature][Helm] Align the key of minReplicas and maxReplicas (#663, @orcahmlee)
- Helm: add service type configuration to head group for ray-cluster (#614, @IceKhan13)
- Allow annotations in ray cluster helm chart (#574, @sigmundv)
- [Feature][Helm] Enable sidecar configuration in Helm chart (#604, @kevin85421)
- [bugfix][apiserver helm]: Adding missing rbacenable value (#594, @dhaval0108)
- [Bug] Modification of nameOverride will cause label selector mismatch for head node (#572, @kevin85421)
- [Helm][minor] Make "disabled" flag for worker groups optional (#548, @kevin85421)
- helm: Uncomment the disabled key for the default workergroup (#543, @tbabej)
- Fix Helm chart default configuration (#530, @kevin85421)
- helm-chart/ray-cluster: Allow setting pod lifecycle (#494, @ulfox)
CI
The changes in this section pertain to KubeRay CI, testing, and developer workflows.
- [Feature] Improve the observability of integration tests (#775, @jasoonn)
- [CI] Pin go version in CRD consistency check (#794, @DmitriGekhtman)
- [Feature] Test sample RayService YAML to catch invalid or out of date one (#731, @jasoonn)
- Replace kubectl wait command with RayClusterAddCREvent (#705, @kevin85421)
- [Feature] Test sample RayCluster YAMLs to catch invalid or out of date ones (#678, @kevin85421)
- [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky (#650, @jasoonn)
- Configuration Test Framework Prototype (#605, @kevin85421)
- Update tests for better Mac M1 compatibility (#654, @shrekris-anyscale)
- [Bug] Update wait function in test_detached_actor (#635, @kevin85421)
- [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_detached_actor flaky (#619, @kevin85421)
- [Feature] Docker support for chart-testing (#623, @jasoonn)
- [Feature] Optimize the wait functions in E2E tests (#609, @kevin85421)
- [Feature] Running end-to-end tests on local machine (#589, @kevin85421)
- [CI] Use fixed version of gofumpt (#596, @wilsonwang371)
- update test files before separating them (#591, @wilsonwang371)
- Add reminders to avoid RBAC synchronization bug (#576, @kevin85421)
- [Feature] Consistency check for RBAC (#577, @kevin85421)
- [Feature] Sync for manifests and helm chart (#564, @kevin85421)
- [Feature] Add a chart-test script to enable chart lint error reproduction on laptop (#563, @kevin85421)
- [Feature] Add helm lint check in Github Actions (#554, @kevin85421)
- [Feature] Add consistency check for types.go, CRDs, and generated API in GitHub Actions (#546, @kevin85421)
- support ray 2.0.0 in compatibility test (#508, @wilsonwang371)
KubeRay Operator deployment
The changes in this section pertain to deployment of the KubeRay Operator.
- Fix finalizer typo and re-create manifests (#631, @AlessandroPomponio)
- Change Kuberay operator Deployment strategy type to Recreate (#566, @haoxins)
- [Bug][Doc] Increase default operator resource requirements, improve docs (#727, @kevin85421)
- [Feature] Sync logs to local file (#632, @Basasuya)
- [Bug] label rayNodeType is useless (#698, @kevin85421)
- Revise sample configs, increase memory requests, update Ray versions (#761, @DmitriGekhtman)
RayCluster controller
The changes in this section pertain to the RayCluster controller sub-component of the KubeRay Operator.
- [autoscaler] Expose autoscaler container security context. (#752, @DmitriGekhtman)
- refactor: log more descriptive info from initContainer (#526, @davidxia)
- [Bug] Fail to create ingress due to the deprecation of the ingress.class annotation (#646, @kevin85421)
- [kuberay] Fix inconsistent RBAC truncation for autoscaling clusters. (#689, @DmitriGekhtman)
- [raycluster controller] Always honor maxReplicas (#662, @DmitriGekhtman)
- [Autoscaler] Pass pod name to autoscaler, add pod patch permission (#740, @DmitriGekhtman)
- [Bug] Shallow copy causes different worker configurations (#714, @kevin85421)
- Fix duplicated volume issue (#690, @wilsonwang371)
- [fix][raycluster controller] No error if head ip cannot be determined. (#701, @DmitriGekhtman)
- [Feature] Set default appProtocol for Ray head service to tcp (#668, @kevin85421)
- [Telemetry] Inject env identifying KubeRay. (#562, @DmitriGekhtman)
- fix: correctly set GPUs in rayStartParams (#497, @davidxia)
- [operator] enable bashrc before container start (#427, @Basasuya)
- ...
v0.3.0 release
v0.3.0 (2022-08-17)
RayService (new feature!)
- [rayservice] Fix config names to match serve config format directly (#464, @edoakes)
- Disable pin on head for serve controller by default in service operator (#457, @iycheng)
- add wget timeout to probes (#448, @wilsonwang371)
- Disable async serve handler in Ray Service cluster. (#447, @iycheng)
- Add more env for RayService head or worker pods (#439, @brucez-anyscale)
- RayCluster created by RayService set death info env for ray container (#419, @brucez-anyscale)
- Add integration test for kuberay ray service and improve ray service operator (#415, @brucez-anyscale)
- Fix a potential reconcile issue for RayService and allow config unhealth time threshold in CR (#384, @brucez-anyscale)
- [Serve] Unify logger and add user facing events (#378, @simon-mo)
- Improve RayService Operator logic to handle head node crash (#376, @brucez-anyscale)
- Add serving service for users traffic with health check (#367, @brucez-anyscale)
- Create a service for dashboard agent (#324, @brucez-anyscale)
- Update RayService CR to integrate with Ray Nightly (#322, @brucez-anyscale)
- RayService: zero downtime update and healthcheck HA recovery (#307, @brucez-anyscale)
- RayService: Dev RayService CR and Controller logic (#287, @brucez-anyscale)
- KubeRay: kubebuilder create RayService Controller and CR (#270, @brucez-anyscale)
RayJob (new feature!)
- Properly convert unix time into meta time (#480, @pingsutw)
- Fix nil pointer dereference (#429, @pingsutw)
- Improve RayJob controller quality to alpha (#398, @Jeffwan)
- Submit ray job after cluster is ready (#405, @pingsutw)
- Add RayJob CRD and controller logic (#303, @harryge00)
Cluster Fault Tolerance (new feature!)
- tune readiness probe timeouts (#411, @wilsonwang371)
- enable ray external storage namespace (#406, @wilsonwang371)
- Initial support for external Redis and GCS HA (#294, @wilsonwang371)
Autoscaler (new feature!)
- [Autoscaler] Match autoscaler image to Ray head image for Ray >= 2.0.0 (#423, @DmitriGekhtman)
- [autoscaler] Better defaults and config options (#414, @DmitriGekhtman)
- [autoscaler] Make log file mount path more specific. (#391, @DmitriGekhtman)
- [autoscaler] Flip prioritize-workers-to-delete feature flag (#379, @DmitriGekhtman)
- Update autoscaler image (#371, @DmitriGekhtman)
- [minor] Update autoscaler image. (#313, @DmitriGekhtman)
- Provide override for autoscaler image pull policy. (#297, @DmitriGekhtman)
- [RFC][autoscaler] Add autoscaler container overrides and config options for scale behavior. (#278, @DmitriGekhtman)
- [autoscaler] Improve autoscaler auto-configuration, upstream recent improvements to Kuberay NodeProvider (#274, @DmitriGekhtman)
Operator
- correct gcs ha to gcs ft (#482, @wilsonwang371)
- Fix panic in cleanupInvalidVolumeMounts (#481, @MissiontoMars)
- fix: worker node can't connect to head node service (#445, @pingsutw)
- Add http resp code check for kuberay (#435, @brucez-anyscale)
- Fix wrong ray start command (#431, @pingsutw)
- fix controller: use Service's TargetPort (#383, @davidxia)
- Generate clientset for new specs (#392, @Basasuya)
- Add Ray address env. (#388, @DmitriGekhtman)
- Add the support to replace evicted head pod (#381, @Jeffwan)
- [Bug] Fix raycluster updatestatus list wrong label (#377, @scarlet25151)
- Make replicas optional for the head spec. (#362, @DmitriGekhtman)
- Add ray head service endpoints in status for expose raycluster's head node endpoints (#341, @scarlet25151)
- Support KubeRay management labels (#345, @Jeffwan)
- fix: bug in object store memory validation (#332, @davidxia)
- feat: add EventReason type for events (#334, @davidxia)
- minor refactor: fix camel-casing of unHealthy -> unhealthy (#333, @davidxia)
- refactor: remove redundant imports (#317, @davidxia)
- Fix GPU-autofill for rayStartParams (#328, @DmitriGekhtman)
- ray-operator: add missing space in controller log messages (#316, @davidxia)
- fix: use head group's ServiceAccount in autoscaler RoleBinding (#315, @davidxia)
- fix typos in comments and help messages (#304, @davidxia)
- enable force cluster upgrade (#231, @wilsonwang371)
- fix operator: correctly set head pod service account (#276, @davidxia)
- [hotfix] Fix Service account typo (#285, @DmitriGekhtman)
- Rename RayCluster folder to Ray since the group is Ray (#275, @brucez-anyscale)
- KubeRay: Relocate files to enable controller extension with Kubebuilder (#268, @brucez-anyscale)
- fix: use configured RayCluster service account when autoscaling (#259, @davidxia)
- suppress not found errors into regular logs (#222, @akanso)
- adding label check (#221, @akanso)
- Prioritize WorkersToDelete (#208, @sriram-anyscale)
- Simplify k8s client creation (#179, @chenk008)
- [ray-operator]Make log timestamp readable (#206, @chenk008)
- bump controller-runtime to 0.11.1 and Kubernetes to v1.23 (#180, @chenk008)
APIServer
- Add envs in cluster service api (#432, @MissiontoMars)
- Expose swallowed detail error messages (#422, @Jeffwan)
- fix: typo RAY_DISABLE_DOCKER_CPU_WRARNING -> RAY_DISABLE_DOCKER_CPU_WARNING (#421, @pingsutw)
- Add hostPathType and mountPropagationMode field for apiserver (#413, @scarlet25151)
- Fix `ListAllComputeTemplates` proto comments (#407, @MissiontoMars)
- Enable DefaultHTTPErrorHandler and Upgrade grpc-gateway to v2 (#369, @Jeffwan)
- Validate namespace consistency in the request when creating the cluster and the compute template (#365, @daikeshi)
- Update compute template service url to include namespace path param (#363, @Jeffwan)
- fix apiserver created raycluster metrics port missing and check (#356, @scarlet25151)
- Support mounting volumes in API request (#346, @Jeffwan)
- add standard label for the filtering of cluster (#342, @scarlet25151)
- expose kubernetes events in apiserver (#343, ...)
v0.3.0-rc.2 release
v0.3.0-rc.2 (2022-08-17)
- [doc] Config and doc updates ahead of KubeRay 0.3.0/Ray 2.0.0 (#486, @DmitriGekhtman)
- Properly convert unix time into meta time (#480, @pingsutw)
- correct gcs ha to gcs ft (#482, @wilsonwang371)
- Fix panic in cleanupInvalidVolumeMounts (#481, @MissiontoMars)
- document the raycluster status (#473, @scarlet25151)
- sync up helm chart's role (#472, @scarlet25151)
- Enable docker image push for release-0.3 branch (#462, @Jeffwan)
- helm-charts/ray-cluster: Allow extra workers (#451, @ulfox)
- [rayservice] Fix config names to match serve config format directly (#464, @edoakes)
- Update helm chart version to 0.3.0 (#461, @Jeffwan)
v0.3.0-rc.1 release
v0.3.0-rc.1 (2022-08-11)
- Disable pin on head for serve controller by default in service operator (#457, @iycheng)
- add wget timeout to probes (#448, @wilsonwang371)
- Disable async serve handler in Ray Service cluster. (#447, @iycheng)
- helm-chart/ray-cluster: allow head autoscaling (#443, @ulfox)
- fix: worker node can't connect to head node service (#445, @pingsutw)
- Add more env for RayService head or worker pods (#439, @brucez-anyscale)
- Clean up example samples (#434, @DmitriGekhtman)
- Add envs in cluster service api (#432, @MissiontoMars)
- Add http resp code check for kuberay (#435, @brucez-anyscale)
- Add ray state api doc link in ray service doc (#428, @brucez-anyscale)
- Fix wrong ray start command (#431, @pingsutw)
- Fix nil pointer dereference (#429, @pingsutw)