ModelTrainer doesn't support heterogeneous clusters

**Describe the bug**
There seems to be a bug when using [ModelTrainer](https://sagemaker.readthedocs.io/en/stable/api/training/model_trainer.html) with [heterogeneous clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/train-heterogeneous-cluster-configure.html). When I run a SageMaker training job on a heterogeneous cluster, I get the following error even though I configure `instance_groups` in the compute and `instance_group_names` in the input channels:

```
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Some channels 
have assigned instance groups: [test, train] while others not: [sm_drivers, code]
```

`sm_drivers` and `code` are private channels configured by the SDK.
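
To make the error easier to read, here is a rough sketch of the split it reports. It is not SDK or service code. It assumes the request follows the `CreateTrainingJob` `InputDataConfig` shape, where each channel may carry `InstanceGroupNames` under `DataSource.S3DataSource`. The helper name `split_by_instance_groups`, the bucket name, and the example channels are hypothetical.

```python
def split_by_instance_groups(input_data_config):
    """Return (assigned, unassigned) channel names.

    A channel counts as assigned when its S3DataSource has a non-empty
    InstanceGroupNames list.
    """
    assigned, unassigned = [], []
    for channel in input_data_config:
        s3 = channel.get("DataSource", {}).get("S3DataSource", {})
        target = assigned if s3.get("InstanceGroupNames") else unassigned
        target.append(channel["ChannelName"])
    return assigned, unassigned


def s3_channel(name, instance_group_names=None):
    """Build a minimal, hypothetical channel entry for illustration."""
    s3 = {"S3DataType": "S3Prefix", "S3Uri": f"s3://example-bucket/{name}"}
    if instance_group_names:
        s3["InstanceGroupNames"] = list(instance_group_names)
    return {"ChannelName": name, "DataSource": {"S3DataSource": s3}}


# Two user channels with instance groups, plus the two SDK-added channels without.
example_config = [
    s3_channel("test", ["group_1", "group_2"]),
    s3_channel("train", ["group_1", "group_2"]),
    s3_channel("sm_drivers"),
    s3_channel("code"),
]

assigned, unassigned = split_by_instance_groups(example_config)
print("assigned:", assigned)
print("unassigned:", unassigned)
```

A request where both lists are non-empty is the mixed case the `ValidationException` describes.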

**To reproduce**

```python
from sagemaker.modules.configs import (
    Compute,
    InputData,
    InstanceGroup,
    OutputDataConfig,
    S3DataSource,
    SourceCode,
    StoppingCondition,
)
from sagemaker.modules.train import ModelTrainer

group_1 = InstanceGroup(
    instance_type="ml.g5.xlarge",
    instance_count=2,
    instance_group_name="group_1",
)

group_2 = InstanceGroup(
    instance_type="ml.t3.xlarge",
    instance_count=1,
    instance_group_name="group_2",
)

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    command="python launcher.py -e train.py",
)

# Define the compute with two instance groups (heterogeneous cluster)
compute_configs = Compute(
    instance_groups=[group_1, group_2],
    keep_alive_period_in_seconds=0,
)

# bucket_name, image_uri, role and input_data are assumed to be defined earlier
job_name = "train-ray-processing-train"

output_path = f"s3://{bucket_name}/{job_name}"

model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    hyperparameters={
        "epochs": 100,
        "learning_rate": 0.001,
        "batch_size": 100,
    },
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    output_data_config=OutputDataConfig(
        s3_output_path=output_path, compression_type="NONE"
    ),
    role=role,
)

train_input = InputData(
    channel_name="processing",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=input_data,
        s3_data_distribution_type="FullyReplicated",
        instance_group_names=["group_1", "group_2"],
    ),
)

data = [train_input]

model_trainer.train(input_data_config=data, wait=False)
```

**Expected behavior**
The same as the Estimator behavior. `sm_drivers` and `code` are private channels configured by the SDK, so their `instance_group_names` should be set automatically.
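
As a rough sketch of the behavior requested here (not the SDK's or Estimator's actual implementation), the fix could amount to filling in the missing field on every channel before the request is sent. The example below works on plain dicts in the `CreateTrainingJob` `InputDataConfig` shape. The helper name `fill_instance_groups`, the bucket name, and the choice to default to all group names are assumptions made for illustration.

```python
import copy


def fill_instance_groups(input_data_config, instance_group_names):
    """Return a copy where channels lacking InstanceGroupNames get the given names.

    Channels that already specify InstanceGroupNames are left unchanged.
    """
    filled = copy.deepcopy(input_data_config)
    for channel in filled:
        s3 = channel.get("DataSource", {}).get("S3DataSource")
        if s3 is not None and not s3.get("InstanceGroupNames"):
            s3["InstanceGroupNames"] = list(instance_group_names)
    return filled


# Hypothetical request contents: one user channel and one SDK-added channel.
channels = [
    {
        "ChannelName": "processing",
        "DataSource": {
            "S3DataSource": {
                "S3Uri": "s3://example-bucket/input",
                "InstanceGroupNames": ["group_1", "group_2"],
            }
        },
    },
    {
        "ChannelName": "code",
        "DataSource": {"S3DataSource": {"S3Uri": "s3://example-bucket/code"}},
    },
]

patched = fill_instance_groups(channels, ["group_1", "group_2"])
for channel in patched:
    print(channel["ChannelName"], channel["DataSource"]["S3DataSource"]["InstanceGroupNames"])
```

The copy keeps the caller's configuration untouched, so the user-provided channels are never modified in place.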

**System information**
- **SageMaker Python SDK version**: 2.247.1
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**:  Any
- **Framework version**: Any
- **Python version**: 3.12
- **CPU or GPU**: CPU and GPU
- **Custom Docker image (Y/N)**: N

ModelTrainer doesn't support heterogeneous clusters #5225
