ModelTrainer doesn't support heterogeneous clusters #5225

Open
@brunopistone

Describe the bug
There seems to be a bug with ModelTrainer and heterogeneous clusters. When I try to run a SageMaker training job on a heterogeneous cluster, I get the following error even though I configure instance_groups in the compute and instance_group_names in the input channels:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Some channels have assigned instance groups: [test, train] while others not: [sm_drivers, code]

sm_drivers and code are private channels configured by the SDK.
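To see which channels would trip this validation before submitting the job, a check like the one below can help. It is only a sketch: it works on a plain list of dicts shaped like the InputDataConfig field of the CreateTrainingJob request, the helper name split_channels_by_group_assignment is hypothetical (not part of the SDK), the S3 URIs are placeholders, and the rule it applies is inferred from the error message above rather than taken from the service implementation.

```python
def split_channels_by_group_assignment(input_data_config):
    """Split channel names by whether they set InstanceGroupNames.

    Hypothetical helper: input_data_config is a list of dicts shaped like
    the InputDataConfig entries of a CreateTrainingJob request.
    """
    assigned, unassigned = [], []
    for channel in input_data_config:
        s3_source = channel.get("DataSource", {}).get("S3DataSource", {})
        if s3_source.get("InstanceGroupNames"):
            assigned.append(channel["ChannelName"])
        else:
            unassigned.append(channel["ChannelName"])
    return assigned, unassigned


def make_channel(name, groups=None):
    """Build a minimal channel dict; the S3 URI is a placeholder."""
    s3_source = {
        "S3DataType": "S3Prefix",
        "S3Uri": f"s3://example-bucket/{name}",
        "S3DataDistributionType": "FullyReplicated",
    }
    if groups:
        s3_source["InstanceGroupNames"] = list(groups)
    return {"ChannelName": name, "DataSource": {"S3DataSource": s3_source}}


# Mirrors the situation in the error: user channels carry groups,
# the SDK-added channels do not.
channels = [
    make_channel("test", ["group_1", "group_2"]),
    make_channel("train", ["group_1", "group_2"]),
    make_channel("sm_drivers"),
    make_channel("code"),
]

assigned, unassigned = split_channels_by_group_assignment(channels)
if assigned and unassigned:
    print(f"Mixed assignment: {assigned} have groups, {unassigned} do not")
```

A mixed result (both lists non-empty) corresponds to the condition the ValidationException complains about.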

To reproduce

from sagemaker.modules.configs import InstanceGroup
from sagemaker.modules.configs import InputData, S3DataSource
from sagemaker.modules.configs import (
    Compute,
    OutputDataConfig,
    RemoteDebugConfig,
    SourceCode,
    StoppingCondition,
)
from sagemaker.modules.train import ModelTrainer

group_1 = InstanceGroup(
    instance_type="ml.g5.xlarge",
    instance_count=2,
    instance_group_name="group_1",
)

group_2 = InstanceGroup(
    instance_type="ml.t3.xlarge",
    instance_count=1,
    instance_group_name="group_2",
)

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    command="python launcher.py -e train.py",
)

# Define the compute
compute_configs = Compute(
    instance_groups=[group_1, group_2],
    keep_alive_period_in_seconds=0,
)

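# bucket_name, image_uri, role, and input_data are not defined in this snippet;
# they are assumed to be set earlier in the session.
job_name = "train-ray-processing-train"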

output_path = f"s3://{bucket_name}/{job_name}"

model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    hyperparameters={
        "epochs": 100,
        "learning_rate": 0.001,
        "batch_size": 100,
    },
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    output_data_config=OutputDataConfig(
        s3_output_path=output_path, compression_type="NONE"
    ),
    role=role,
)

train_input = InputData(
    channel_name="processing",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=input_data,
        s3_data_distribution_type="FullyReplicated",
        instance_group_names=["group_1", "group_2"],
    ),
)

data = [train_input]

model_trainer.train(input_data_config=data, wait=False)

Expected behavior
See the Estimator behavior. sm_drivers and code are private channels configured by the SDK, so their instance_group_names should be set automatically.
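To make the expected behavior concrete, here is a rough sketch of the kind of propagation being asked for. It is not the SDK's or the Estimator's actual implementation: fill_missing_instance_groups is a hypothetical helper, it operates on plain dicts shaped like the InputDataConfig field of the CreateTrainingJob request, and the S3 URIs are placeholders. The assumption is that any S3 channel without InstanceGroupNames should receive every instance group name.

```python
import copy


def fill_missing_instance_groups(input_data_config, instance_group_names):
    """Return a copy where S3 channels lacking InstanceGroupNames get all groups.

    Hypothetical helper illustrating the requested behavior; channels that
    already name their instance groups are left untouched.
    """
    patched = copy.deepcopy(input_data_config)
    for channel in patched:
        s3_source = channel.get("DataSource", {}).get("S3DataSource")
        if s3_source is not None and not s3_source.get("InstanceGroupNames"):
            s3_source["InstanceGroupNames"] = list(instance_group_names)
    return patched


all_groups = ["group_1", "group_2"]

request_channels = [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-bucket/train",  # placeholder URI
                "InstanceGroupNames": ["group_1"],
            }
        },
    },
    {
        "ChannelName": "code",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-bucket/code",  # placeholder URI
            }
        },
    },
]

patched_channels = fill_missing_instance_groups(request_channels, all_groups)
for channel in patched_channels:
    names = channel["DataSource"]["S3DataSource"]["InstanceGroupNames"]
    print(channel["ChannelName"], names)
```

The helper copies the input rather than mutating it, so the user's original channel configuration stays as written.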

System information

  • SageMaker Python SDK version: 2.247.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Any
  • Framework version: Any
  • Python version: 3.12
  • CPU or GPU: CPU and GPU
  • Custom Docker image (Y/N): N
