Description
Describe the bug
There seems to be a bug with ModelTrainer and heterogeneous clusters. When I run a SageMaker training job on a heterogeneous cluster, I get the following error, even though I configure instance_groups in the compute and instance_group_names in the input channels:
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Some channels
have assigned instance groups: [test, train] while others not: [sm_drivers, code]
sm_drivers and code are private channels configured by the SDK.
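To make the failing condition concrete, here is a minimal sketch of the check the error message appears to describe: channels with an instance group assignment mixed with channels without one. It operates on plain dictionaries shaped like the CreateTrainingJob InputDataConfig; the helper name and the S3 URIs are hypothetical, and the actual service-side validation may differ.

```python
def split_channels_by_instance_group(channels):
    """Split channel names into those with and without InstanceGroupNames."""
    assigned, unassigned = [], []
    for channel in channels:
        s3_source = channel.get("DataSource", {}).get("S3DataSource", {})
        if s3_source.get("InstanceGroupNames"):
            assigned.append(channel["ChannelName"])
        else:
            unassigned.append(channel["ChannelName"])
    return assigned, unassigned


def make_channel(name, instance_group_names=None):
    """Build a minimal channel dict (hypothetical bucket name, for illustration only)."""
    s3_source = {"S3DataType": "S3Prefix", "S3Uri": f"s3://example-bucket/{name}"}
    if instance_group_names:
        s3_source["InstanceGroupNames"] = list(instance_group_names)
    return {"ChannelName": name, "DataSource": {"S3DataSource": s3_source}}


# User channels carry instance groups; the SDK-added channels do not.
channels = [
    make_channel("test", ["group_1", "group_2"]),
    make_channel("train", ["group_1", "group_2"]),
    make_channel("sm_drivers"),
    make_channel("code"),
]

assigned, unassigned = split_channels_by_instance_group(channels)
# A mix of assigned and unassigned channels is the situation the error reports.
is_consistent = not (assigned and unassigned)
```

With the four channels above, the user channels end up in one list and the SDK-managed ones in the other, which mirrors the two lists in the ValidationException.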
To reproduce
from sagemaker.modules.configs import (
    Compute,
    InputData,
    InstanceGroup,
    OutputDataConfig,
    RemoteDebugConfig,
    S3DataSource,
    SourceCode,
    StoppingCondition,
)
from sagemaker.modules.train import ModelTrainer
group_1 = InstanceGroup(
    instance_type="ml.g5.xlarge",
    instance_count=2,
    instance_group_name="group_1",
)
group_2 = InstanceGroup(
    instance_type="ml.t3.xlarge",
    instance_count=1,
    instance_group_name="group_2",
)
# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    command="python launcher.py -e train.py",
)
# Define the compute
compute_configs = Compute(
    instance_groups=[group_1, group_2],
    keep_alive_period_in_seconds=0,
)
# bucket_name, image_uri, role, and input_data are assumed to be defined earlier in the session
job_name = "train-ray-processing-train"
output_path = f"s3://{bucket_name}/{job_name}"
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    hyperparameters={
        "epochs": 100,
        "learning_rate": 0.001,
        "batch_size": 100,
    },
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    output_data_config=OutputDataConfig(
        s3_output_path=output_path, compression_type="NONE"
    ),
    role=role,
)
train_input = InputData(
    channel_name="processing",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=input_data,
        s3_data_distribution_type="FullyReplicated",
        instance_group_names=["group_1", "group_2"],
    ),
)
data = [train_input]
model_trainer.train(input_data_config=data, wait=False)
Expected behavior
This should match the Estimator behavior: sm_drivers and code are private channels configured by the SDK, so instance_group_names should be set on them automatically.
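As a rough illustration of the expected behavior, the sketch below fills in InstanceGroupNames on any S3 channel that lacks them, using the group names from the compute configuration. It works on plain dictionaries shaped like the CreateTrainingJob InputDataConfig rather than on SDK objects. The function name and S3 URIs are hypothetical, and this is not the SDK's actual implementation; whether private channels should receive all groups or only a subset is a decision for the maintainers.

```python
import copy


def propagate_instance_groups(channels, instance_group_names):
    """Return a copy of channels where unassigned S3 channels get the given groups."""
    updated = copy.deepcopy(channels)
    for channel in updated:
        s3_source = channel.get("DataSource", {}).get("S3DataSource")
        # Only touch S3 channels that have no instance groups assigned yet.
        if s3_source is not None and not s3_source.get("InstanceGroupNames"):
            s3_source["InstanceGroupNames"] = list(instance_group_names)
    return updated


# Hypothetical channel list: one user channel plus the two SDK-managed channels.
request_channels = [
    {
        "ChannelName": "processing",
        "DataSource": {
            "S3DataSource": {
                "S3Uri": "s3://example-bucket/input",
                "InstanceGroupNames": ["group_1", "group_2"],
            }
        },
    },
    {
        "ChannelName": "sm_drivers",
        "DataSource": {"S3DataSource": {"S3Uri": "s3://example-bucket/sm_drivers"}},
    },
    {
        "ChannelName": "code",
        "DataSource": {"S3DataSource": {"S3Uri": "s3://example-bucket/code"}},
    },
]

fixed_channels = propagate_instance_groups(request_channels, ["group_1", "group_2"])
groups_by_channel = {
    c["ChannelName"]: c["DataSource"]["S3DataSource"]["InstanceGroupNames"]
    for c in fixed_channels
}
```

After this step every channel carries an instance group assignment, so the request no longer mixes assigned and unassigned channels. The input list is left unmodified because the function works on a deep copy.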
System information
- SageMaker Python SDK version: 2.247.1
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): Any
- Framework version: Any
- Python version: 3.12
- CPU or GPU: CPU and GPU
- Custom Docker image (Y/N): N