Insights: kubeflow/trainer
Overview
18 Pull requests merged by 5 people
-
fix type in model initializer entrypoint
#2489 merged
Mar 8, 2025 -
Move TrainJob marker defaulting and validation integration tests to test/integration/webhooks pkg
#2486 merged
Mar 7, 2025 -
fix(runtime): fix error label name.
#2487 merged
Mar 7, 2025 -
Implement MPI plugin UTs
#2481 merged
Mar 7, 2025 -
feat(controller): Integrate DependsOn API
#2484 merged
Mar 7, 2025 -
Implement MPIImplementation Enum CRD validation
#2482 merged
Mar 6, 2025 -
Implement MPI numProcPerNode defaulter
#2483 merged
Mar 6, 2025 -
Store E2E manifests to artifacts directory
#2478 merged
Mar 6, 2025 -
Add dependencies to RuntimeRegistrar
#2476 merged
Mar 6, 2025 -
Use large runner for building container image
#2475 merged
Mar 5, 2025 -
Add MPIMLPolicySource CRD defaulters
#2474 merged
Mar 5, 2025 -
feat(sdk): Generate external Kubernetes and JobSet models
#2466 merged
Mar 5, 2025 -
chore(test): Upload artifacts from dir
#2473 merged
Mar 5, 2025 -
Make MPIMLPolicySource optional fields as a pointer
#2472 merged
Mar 5, 2025 -
Implement UTs for PlainML plugin
#2469 merged
Mar 5, 2025 -
chore(test): Add E2E tests for Kubeflow Trainer
#2470 merged
Mar 5, 2025 -
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design
#2439 merged
Mar 3, 2025 -
fix: fix typos in script comments.
#2465 merged
Mar 3, 2025
4 Pull requests opened by 4 people
-
WIP: Implement torch plugin UTs
#2471 opened
Mar 5, 2025 -
chore: Add unit tests for `pkg/apply`
#2479 opened
Mar 6, 2025 -
feat(sdk): Migrate to OpenAPI V3
#2490 opened
Mar 8, 2025 -
Push images to GHCR
#2491 opened
Mar 9, 2025
4 Issues closed by 2 people
-
Training job restart enhancement
#2185 closed
Mar 10, 2025 -
Use JobSet DependsOn API for TrainJob
#2467 closed
Mar 7, 2025 -
Randomly failed to initialze runtime factories
#2477 closed
Mar 6, 2025 -
KEP-2170: Add E2E tests for Kubeflow Training V2
#2213 closed
Mar 5, 2025
4 Issues opened by 4 people
-
Add a workflow for publishing Helm charts
#2488 opened
Mar 7, 2025 -
[SDK] Failed to list runtime due to invalid integer type
#2485 opened
Mar 7, 2025 -
CONTRIBUTING.md should be updated
#2480 opened
Mar 6, 2025 -
Decouple UTs between Framework and Plugins packages
#2468 opened
Mar 3, 2025
19 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Add Helm chart for kubeflow trainer
#2435 commented on
Mar 7, 2025 • 72 new comments -
KEP-2401: Kubeflow LLM Trainer V2
#2410 commented on
Mar 7, 2025 • 61 new comments -
fix restart policy bug in mpi job UpdateJobConditions
#2344 commented on
Mar 5, 2025 • 5 new comments -
Add Initialized and ComponentsCreated conditions to TrainJob API
#2464 commented on
Mar 5, 2025 • 4 new comments -
Upgrade: K8s 1.32
#2448 commented on
Mar 9, 2025 • 1 new comment -
[feature] migrate images to ghcr
#2455 commented on
Mar 5, 2025 • 0 new comments -
Add internal-cert-controller disable flag
#2426 commented on
Mar 3, 2025 • 0 new comments -
KEP-2170: Create MPI Runtime
#2217 commented on
Mar 7, 2025 • 0 new comments -
Managing Pod Lifecycle in Distributed Training with TFJob
#2454 commented on
Mar 6, 2025 • 0 new comments -
Strategies for Deleting Successful Pods without Affecting Task Execution in TFJob
#2453 commented on
Mar 6, 2025 • 0 new comments -
Cap `nproc_per_node` based on the CPU resources of the node for PyTorch TrainJob
#2407 commented on
Mar 5, 2025 • 0 new comments -
Don't overwrite PYTHONUNBUFFERED in PyTorchJob (and others)
#1921 commented on
Mar 5, 2025 • 0 new comments -
Add unit tests that cover the `pkg/apply` package
#2452 commented on
Mar 5, 2025 • 0 new comments -
KEP-2170: Kubeflow Trainer V2 API
#2170 commented on
Mar 5, 2025 • 0 new comments -
mpi job bug
#2334 commented on
Mar 5, 2025 • 0 new comments -
KEP-2170: Add Kubeflow Trainer Pipeline Framework Concept page to Documentation
#2458 commented on
Mar 5, 2025 • 0 new comments -
Add migration guide from Training Operator to Kubeflow Trainer V2
#2412 commented on
Mar 4, 2025 • 0 new comments -
Create Slurm runtime for model training using V2 APIs
#2249 commented on
Mar 4, 2025 • 0 new comments -
Flaky Test: TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0]
#2460 commented on
Mar 4, 2025 • 0 new comments