Old explanation of design (overhaul/integrate):

In contrast to earlier designs, create_step does not require a step_config to be passed as an argument. This simplifies the interface considerably, because we don't have to worry about getting the right type of step_config for a given step_type. The problem is that while we can simply *return more specific* step types in subclasses (such as a ProcessingStep instead of a ConfigurableRetryStep), we can**not** require a *more specific argument* in subclasses (for example, requiring a ProcessingStepConfig for a ProcessingStepFactory), as this would violate the Liskov Substitution Principle.

This problem would not be easily solved by generics either, because it is not obvious how to get from a given step type to the associated StepConfig.
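
The asymmetry between return types and argument types can be illustrated with a minimal sketch. All classes below (`Step`, `ProcessingStepStub`, `StepFactory`, `ProcessingFactory`) are hypothetical stand-ins, not SageMaker SDK types. Narrowing the return type in a subclass is safe, because a caller written against the base factory still receives something that is a `Step`.

```python
from abc import ABC, abstractmethod


class Step:
    """Hypothetical stand-in for ConfigurableRetryStep."""

    def __init__(self, name: str) -> None:
        self.name = name


class ProcessingStepStub(Step):
    """Hypothetical stand-in for ProcessingStep."""


class StepFactory(ABC):
    @abstractmethod
    def create_step(self) -> Step:
        ...


class ProcessingFactory(StepFactory):
    # Narrowing the *return* type is allowed: callers that expect a Step still get one.
    def create_step(self) -> ProcessingStepStub:
        return ProcessingStepStub(name="preprocessing")


def build(factory: StepFactory) -> Step:
    # Written against the base interface only, so it works for every subclass.
    return factory.create_step()


step = build(ProcessingFactory())
print(type(step).__name__, step.name)
```

Had `create_step` taken a config argument, `ProcessingFactory` could not have narrowed it to a processing-specific config type without breaking `build`, which is entitled to pass any config the base class accepts.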

In [21]:
# %load_ext nb_mypy

In [22]:
# to make local *folder* paths work (even though python paths work due to packaging)
import os
os.chdir(
    f'{os.environ["HOME"]}/repos/sagemaker-pipelines-abstraction/src/sm_pipelines_oo/'
)

In [23]:
from typing import ClassVar, TypeVar, TypeAlias, Any, final, TypedDict
from abc import ABC, abstractmethod
from pathlib import Path

from pydantic_settings import BaseSettings
from sagemaker.processing import Processor
from sagemaker.estimator import EstimatorBase

from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CreateModelStep, TransformStep, \
    TuningStep, ConfigurableRetryStep


## Why no generics?
- The end goal is simply to have an object that satisfies the (ConfigurableRetry)StepInterface. From the perspective of the pipeline, we don't care what type of step it is.
- The initial reason for looking into generics was to make sure that we are passing the right config for a given type of step. However, after a lot of trial and error, I still did not find a good way to create a simple class hierarchy based on what the Sagemaker SDK makes available to us. Instead, it looks more promising to create a very minimal interface for step factories, and let specific implementations decide the best way to create that kind of step.
  - Downside: less reuse of code between different step factories. This makes it somewhat harder to get started with creating new step factories, because less structure is imposed on how exactly to do it.
  - Upside: more flexibility for creating step factories. This may actually make it easier to create new step factories, and it will make it easier to maintain existing step factories as the interface of the Sagemaker SDK changes.
  - Note: neither of these points will affect a basic library user who only uses inbuilt step factories.

# Configuration
Goal: Abstract configuration into a single config class that loads all the configs it needs from the config directory (even if this requires traversing it). This will not only make the intent of this method clearer, but also make it easier to have a single config façade that abstracts which configs are global and which are step-specific (step configs simply need a reference to the shared config so they can fall back to it if necessary, but the concrete logic can be implemented differently for each step type). A config façade also makes it easy to define methods that compute derived values.
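
As a rough illustration of the fallback and derived-value ideas, here is a minimal sketch using plain dataclasses instead of pydantic. The names `SharedConfigSketch`, `StepConfigSketch`, `instance_type_override`, and `job_name_prefix` are hypothetical, and treating `instance_type` as a shared setting is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SharedConfigSketch:
    project_name: str
    instance_type: str  # assumed shared default, for illustration only


@dataclass(frozen=True)
class StepConfigSketch:
    step_name: str
    shared_config: SharedConfigSketch
    instance_type_override: Optional[str] = None

    @property
    def instance_type(self) -> str:
        # Fall back to the shared value when the step does not set its own.
        if self.instance_type_override is not None:
            return self.instance_type_override
        return self.shared_config.instance_type

    @property
    def job_name_prefix(self) -> str:
        # Derived value computed from shared and step-specific settings.
        return f"{self.shared_config.project_name}-{self.step_name}"


shared = SharedConfigSketch(project_name="test", instance_type="ml.m5.xlarge")
default_step = StepConfigSketch(step_name="preprocessing", shared_config=shared)
custom_step = StepConfigSketch(
    step_name="training",
    shared_config=shared,
    instance_type_override="ml.m5.2xlarge",
)
```

Here `default_step` inherits the shared instance type, while `custom_step` keeps its own.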

## Attempt 1: *Overarching* ConfigFacade

In [24]:
# class ConfigFacade:
#     def __init__(self, config_dir: Path):
#         # load all the yaml config files
#         shared_config_dict: dict[str, Any] = ...
#         step_configs_dicts: dict[str, dict[str, Any]] = ...

#         # Convert the dictionaries to pydantic models
#         self.shared_config: SharedConfig = SharedConfig(**shared_config_dict)
#         self.step_configs: dict[str, BaseSettings] = {}
#         for step_name, step_config_dict  in step_configs_dicts.items():

#             self.step_configs[step_name] = StepConfig(**step_config_dict)


Problem: how do we get the pydantic model for a given config? While it would be possible to have another lookup table, similar to how we find the right step factory, it makes more sense for each step factory to own the associated config model, because the key challenge is ensuring that the config model matches the specific factory.

As a result, it is better not to load all the configs upfront (except possibly into dictionaries).
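
A minimal sketch of this ownership idea, using a dataclass as a stand-in for the pydantic model. `ProcessingConfigSketch` and `ProcessingFactorySketch` are hypothetical names chosen for illustration. The raw dictionary is only converted once the factory, and therefore the matching model, is known.

```python
from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass
class ProcessingConfigSketch:
    """Hypothetical config model; the notebook uses pydantic models instead."""

    step_name: str
    instance_count: int


class ProcessingFactorySketch:
    # The factory owns the model that matches it, so no second lookup table is needed.
    _config_model: ClassVar[type] = ProcessingConfigSketch

    def __init__(self, step_config_dict: dict[str, Any]) -> None:
        # Conversion from dict to typed config happens here, not upfront.
        self._config = self._config_model(**step_config_dict)

    @property
    def config(self) -> ProcessingConfigSketch:
        return self._config


raw_config = {"step_name": "preprocessing", "instance_count": 1}
factory = ProcessingFactorySketch(raw_config)
```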

## Attempt 2: *Separate* Configs w/o facade, but *reference* to shared config

In [25]:
# from dataclasses import dataclass

# class SharedConfigInterface(BaseSettings):
#     """
#     This interface defines all the configs that our library code expects to be present in the shared_config.
#     """
#     project_name: str
#     project_version: str  # Versions data (and probably more in the future)


# class StepConfigInterface(BaseSettings):
#     """
#     This ensures every step config has a step_type (required to determine step factory),  as well as a reference to the shared_config.

#     Note: If the concrete step_config depends on any specific config values being set in the shared_config (in addition to the ones defined in the SharedConfigInterface), we should redefine the type of shared_config to this more specific type.
#     """
#     step_type: str # Identifies factory, which in turn identifies StepConfig model
#     shared_config: SharedConfigInterface # So that we have access to sharedconfig

# Simple factory
~~This makes better use of the factory, because depending on the arg passed to it, it creates a different type of step. Otherwise, we may as well use the strategy pattern (the only use of the factory is to construct the step later, when configs etc. are known, but a given factory always produces the same kind of step, apart from configuration).~~

In [26]:
from sagemaker.session import Session, get_execution_role
from sagemaker.workflow.pipeline_context import PipelineSession, LocalPipelineSession

class StepFactoryInterface(ABC):
    """
    In addition to the required methods defined below, it is recommended to implement the following attributes and methods in order to make implementing the required methods easier:
    - _config_model: ClassVar[type[BaseSettings]] (class used to convert config_dict to a pydantic model, to validate types and potentially compute derived attributes).
    """

    @abstractmethod
    def __init__(
        self,
        step_config_dict: dict[str, Any],
        role_arn: str,
        pipeline_session: PipelineSession | LocalPipelineSession, # todo: consider allowing normal session - probably should be separate argument though?
    ):
        ...


    # @staticmethod
    # @abstractmethod
    # def _get_config_model() -> type[BaseSettings]:
    #     """
    #     Pydantic model used to validate and convert the config_dict to an instance of pydantic.BaseSettings.
    #     """
    #     ...


    @abstractmethod
    def create_step(self) -> ConfigurableRetryStep:
        # Note that we don't have to worry about violating the LSP, even though the config is passed back in (via __init__), because at this stage the config is simply a dictionary. Thus, subclasses don't have to specify a more specific config type here yet.
        ...

In [27]:
# For Python < 3.12, don't use typing.TypedDict: https://docs.pydantic.dev/2.6/errors/usage_errors/#typed-dict-version
from typing_extensions import TypedDict

from sagemaker.processing import FrameworkProcessor
from sagemaker.estimator import EstimatorBase
from sagemaker.sklearn.estimator import SKLearn

from sm_pipelines_oo.shared_config_schema import SharedConfig

class _FWProcessorInitConfig(TypedDict):
    framework_version: str
    estimator_cls_name: str
    instance_count: int
    instance_type: str


class _FWProcessorRunConfig(TypedDict):
    code: str
    source_dir: str
    # todo: allow athena datasetdefinition instead
    input_files_s3paths: list[str]  # todo: validate it's an s3 path
    output_files_s3paths: list[str]  # todo: validate it's an s3 path


class FrameworkProcessingStepConfig(BaseSettings):
    # todo:
    step_name: str
    step_factory_class: str
    processor_init_args: _FWProcessorInitConfig
    processor_run_args: _FWProcessorRunConfig
    # For now, we will reload this for every step config to avoid dependency on pipeline wrapper.
    shared_config: SharedConfig
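
The two todos about validating S3 paths could be addressed with a small helper along these lines. This is only a sketch: `is_s3_uri` and `validate_s3_uris` are hypothetical names, not part of the library, and the rule used here (scheme `s3`, non-empty bucket, non-empty key) is an assumption about what should count as valid.

```python
from urllib.parse import urlparse


def is_s3_uri(path: str) -> bool:
    """Return True if `path` looks like s3://<bucket>/<key>."""
    parsed = urlparse(path)
    # netloc holds the bucket name; the key is the path without its leading slash.
    return (
        parsed.scheme == "s3"
        and bool(parsed.netloc)
        and bool(parsed.path.lstrip("/"))
    )


def validate_s3_uris(paths: list[str]) -> list[str]:
    """Return `paths` unchanged, or raise if any entry is not an S3 URI."""
    invalid = [p for p in paths if not is_s3_uri(p)]
    if invalid:
        raise ValueError(f"Not valid S3 URIs: {invalid}")
    return paths
```

A function like this could then be hooked into the config model as a validator for `input_files_s3paths` and `output_files_s3paths`.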


In [28]:
from sagemaker.processing import ProcessingInput, ProcessingOutput


class FrameworkProcessingStepFactory(StepFactoryInterface):
    # Note: this is a public attribute, so user can add support for additional estimators
    estimator_name_to_cls_mapping: ClassVar[dict[str, Any]] = {  # todo:  find supertype
        'SKLearn': SKLearn,
    }

    _config_model: ClassVar[type[FrameworkProcessingStepConfig]] = FrameworkProcessingStepConfig

    def __init__(
        self,
        step_config_dict: dict[str, Any],
        role_arn: str,
        pipeline_session: PipelineSession | LocalPipelineSession
    ):
        # Parse config, using the specific pydantic model that this factory has as a class variable.
        self._config: FrameworkProcessingStepConfig = self._config_model(**step_config_dict)
        self._role_arn = role_arn
        self._pipeline_session = pipeline_session

    @property
    def processor(self) -> FrameworkProcessor:
        # Start with init args from config, but convert TypedDict to dict so we can modify keys.
        init_args: dict[str, Any] = dict(self._config.processor_init_args)
        # Replace the string of estimator_cls_name with the actual estimator_cls
        estimator_cls_name = init_args.pop('estimator_cls_name')
        init_args['estimator_cls'] = self.estimator_name_to_cls_mapping[estimator_cls_name]
        return FrameworkProcessor(
            **init_args,
            role=self._role_arn,
            sagemaker_session=self._pipeline_session,
        )  # todo: check if typechecker catches wrong args. Otherwise, define typed dict for FWPInitArgs.

    def _construct_run_args(self) -> dict[str, Any]:
        # Start with run args from config, but convert TypedDict to dict so we can modify keys.
        run_args: dict[str, Any] = dict(self._config.processor_run_args)

        # Create ProcessingInputs from list of s3paths (strings)
        _input_files_s3paths: list[str] = run_args.pop('input_files_s3paths')
        _processing_inputs: list[ProcessingInput] = [
            ProcessingInput(
                source=s3path,
                # todo: Allow passing through extra arguments
            )
            for s3path in _input_files_s3paths
        ]
        run_args['inputs'] = _processing_inputs

        # Do the same for ProcessingOutputs
        _output_files_s3paths: list[str] = run_args.pop('output_files_s3paths')
        _processing_outputs: list[ProcessingOutput] = [
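            ProcessingOutput(
                # todo: `source` is the local path inside the container; the S3 URI probably belongs in `destination`.
                source=s3path,
                # todo: Allow passing through extra arguments
            )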
            for s3path in _output_files_s3paths
        ]

        run_args['outputs'] = _processing_outputs
        return run_args

    def create_step(self) -> ProcessingStep:
        _step_args = self.processor.run(
            **self._construct_run_args()
        )
        return ProcessingStep(
            name=self._config.step_name,
            step_args=_step_args,
        )


In [29]:
# To do: move those into unit tests
from sagemaker.session import Session, get_execution_role
from sagemaker.workflow.pipeline_context import PipelineSession, LocalPipelineSession

_fw_processor_config_dict = {
    'step_name': 'preprocessing',
    'step_factory_class': 'FrameworkProcessingStepFactory',
    'processor_init_args': {
        'framework_version': '0.23-1',
        'estimator_cls_name': 'SKLearn',
        'instance_count': 1,
        'instance_type': 'ml.m5.xlarge',
    },
    'processor_run_args': {
        'code': 'preprocess.py',
        'source_dir': 'code/preprocess',
        'input_files_s3paths': [],
        'output_files_s3paths': [],
    },
    'shared_config': {
        'project_name': 'test',
        'project_version': '0',
        'region': 'local',
        'project_bucket_name': 'test-bucket',
    }
}

fw_processing_step_factory = FrameworkProcessingStepFactory(
    step_config_dict=_fw_processor_config_dict,
    pipeline_session=LocalPipelineSession(),
    role_arn=get_execution_role(),
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/thomas-22/.config/sagemaker/config.yaml


INFO:botocore.tokens:Loading cached SSO token for ml


In [30]:
# fw_processing_step_factory.create_step()
# fwp = fw_processing_step_factory.processor

In [31]:
# [a for a in dir(fwp) if not a.startswith('_')]

In [32]:
# To do: add a unit test that provides inputs and outputs, and make sure that the run_args are constructed correctly.
fw_processing_step_factory._construct_run_args()


{'code': 'preprocess.py',
 'source_dir': 'code/preprocess',
 'inputs': [],
 'outputs': []}

In [33]:
fw_processing_step_factory.create_step()

ProcessingStep(name='preprocessing', display_name=None, description=None, step_type=<StepTypeEnum.PROCESSING: 'Processing'>, depends_on=None)

## StepFactory *Facade*

In [34]:
class StepFactoryFacadeInterface(ABC):
    """
    This interface decouples the pipeline façade from the specific step factory façade in use. The pipeline façade only cares about this one method.
    """
    @abstractmethod
    def create_all_steps(self) -> list[ConfigurableRetryStep]:
        ...

In [35]:
default_stepfactory_lookup_table: dict[str, type[StepFactoryInterface]] = {
    'FrameworkProcessor': FrameworkProcessingStepFactory,
}

class StepFactoryFacade(StepFactoryFacadeInterface):
    """
    Relationship between façade and concrete factories: A pipeline will generally have a *single* instance of this façade, which in turn will create an instance of a concrete factory for every step.

    This class serves as a façade for creating steps that abstracts the following tasks from the user:
    - It receives the configs for all steps as a list of dictionaries.
    - For each step config, it:
      - Looks up which factory it should use for creating that kind of step. To be able to do so, it has a lookup table that maps the `step_factory_class` value of a step config to a factory class. (This lookup table can be provided during instantiation of this class, but there is also a default lookup table for standard use cases.)
      - Creates an instance of that specific step factory.
      - Delegates the creation of the actual step to that specific factory.
    - Finally, it will return the resulting list containing all steps.
    """
    def __init__(
        self,
        step_config_dicts: list[dict[str, Any]],
        role_arn: str,
        pipeline_session: PipelineSession | LocalPipelineSession,
        # Generally, user does not set this, but it's useful for testing and custom use cases.
        stepfactory_lookup_table: dict[str, type[StepFactoryInterface]] = \
            default_stepfactory_lookup_table
    ):
        self._step_config_dicts = step_config_dicts
        self._role_arn = role_arn
        self._pipeline_session = pipeline_session
        self.stepfactory_lookup_table = stepfactory_lookup_table

    def _create_individual_step(
        self,
        step_config_dict: dict[str, Any]
    ) -> ConfigurableRetryStep:

        # Get the right *class* of step factory for a given step (based on its config)
        factory_cls_name: str = step_config_dict['step_factory_class']
        StepFactory_cls: type[StepFactoryInterface] = self.stepfactory_lookup_table[factory_cls_name]

        # Instantiate factory, using step config. Then create step
        step_factory: StepFactoryInterface = StepFactory_cls(
            step_config_dict=step_config_dict,
            role_arn=self._role_arn,
            pipeline_session=self._pipeline_session
        )
        return step_factory.create_step()

    def create_all_steps(self) -> list[ConfigurableRetryStep]:
        steps: list[ConfigurableRetryStep] = []
        for config in self._step_config_dicts:
            step: ConfigurableRetryStep = self._create_individual_step(config)
            steps.append(step)
        return steps
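
To show how a custom lookup table might be supplied, here is a self-contained sketch with stand-in classes. `FactorySketch`, `DefaultFactorySketch`, `CustomFactorySketch`, and the free function `create_all_steps` are hypothetical and return strings instead of real steps. The dispatch mirrors the façade: the `step_factory_class` entry of each config selects the factory.

```python
from abc import ABC, abstractmethod
from typing import Any


class FactorySketch(ABC):
    def __init__(self, step_config_dict: dict[str, Any]) -> None:
        self._config = step_config_dict

    @abstractmethod
    def create_step(self) -> str:
        ...


class DefaultFactorySketch(FactorySketch):
    def create_step(self) -> str:
        return f"default:{self._config['step_name']}"


class CustomFactorySketch(FactorySketch):
    def create_step(self) -> str:
        return f"custom:{self._config['step_name']}"


default_lookup: dict[str, type[FactorySketch]] = {"Default": DefaultFactorySketch}

# Copy the default table instead of mutating it, then register the custom factory.
custom_lookup = {**default_lookup, "Custom": CustomFactorySketch}


def create_all_steps(
    step_config_dicts: list[dict[str, Any]],
    lookup: dict[str, type[FactorySketch]],
) -> list[str]:
    # The config names the factory class to use; the factory builds the step.
    return [
        lookup[config["step_factory_class"]](config).create_step()
        for config in step_config_dicts
    ]


steps = create_all_steps(
    [
        {"step_name": "preprocessing", "step_factory_class": "Default"},
        {"step_name": "scoring", "step_factory_class": "Custom"},
    ],
    custom_lookup,
)
print(steps)
```

Copying the default table keeps the module-level default untouched for other users of the library.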

~~Note that the StepFactoryWrapper is decoupled from the specific StepFactory that will be used to create the step. The latter is determined by a lookup table, which is injected into the StepFactoryWrapper during instantiation.~~

~~The downside is that this is less convenient for simple use cases, where the user is content with choosing only from the default factories that ship with the library. To remediate this disadvantage, we can simply create a facade, which instantiates the StepFactoryWrapper with the default lookup table. More advanced users, by contrast, can directly import this default lookup table and customize it to point to custom StepFactory implementations. In a second step, they then initialize the StepFactoryWrapper directly, passing it the custom lookup table.~~

## Config Loading

In [36]:
from typing import final
from functools import cached_property

import yaml
from sm_pipelines_oo.shared_config_schema import Environment


class AbstractConfigLoader(ABC):
    """
    Abstract factory for loading configs as dictionaries.
    Concrete implementations implement a method for how to load a given config file, as well as an attribute specifying which file type to load.
    This abstract class provides the implementation for loading both the shared config and all the step configs.
    """

    def __init__(
        self,
        env: Environment,
        config_root_folder: str = 'config',  # relative path from project root
    ):
        self._env = env
        self._config_folder = Path(config_root_folder) / env

    @final
    @cached_property
    def shared_config_as_dict(self) -> dict[str, Any]:
        shared_config_path: Path = self._config_folder / f'shared_config.{self._file_type_to_load}'
        return self._load_config(shared_config_path)

    @final
    @cached_property
    def step_configs_as_dicts(self) -> list[dict[str, Any]]:
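        # Collects all config files of the expected type in the config directory. Each file other than the shared config corresponds to one step.
        # Note: Path.suffix includes the leading dot (e.g. '.yaml').
        step_config_paths: list[Path] = [
            path for path in self._config_folder.iterdir()
            if path.suffix == f'.{self._file_type_to_load}' and path.stem != 'shared_config'
        ]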
        return [
            self._load_config(config_path)
            for config_path in step_config_paths
        ]

    # Abstract methods that concrete implementations must implement
    # --------------------------------------------------------------
    @abstractmethod
    def _load_config(self, config_file: Path) -> dict[str, Any]:
        ...

    @property
    @abstractmethod
    def _file_type_to_load(self) -> str:
        """
        Returns file extension that identifies which files in config directory should be loaded.
        """
        ...


class YamlConfigLoader(AbstractConfigLoader):
    @property
    def _file_type_to_load(self) -> str:
        return 'yaml'

    def _load_config(self, config_file: Path) -> dict[str, Any]:
        with open(config_file, 'r') as file:
            return yaml.safe_load(file)
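
Supporting another file format should only require supplying the two abstract members. The sketch below illustrates this with a hypothetical `JsonLoaderSketch` built on a simplified `LoaderSketch` base class (both names are made up, and the base class takes the folder directly instead of an environment). Skipping `shared_config` when collecting step configs is an assumption of this sketch.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class LoaderSketch(ABC):
    def __init__(self, config_folder: Path) -> None:
        self._config_folder = config_folder

    def step_configs_as_dicts(self) -> list[dict[str, Any]]:
        # Path.suffix includes the leading dot (".json"), so add it before comparing.
        suffix = f".{self._file_type_to_load}"
        paths = sorted(
            p for p in self._config_folder.iterdir()
            if p.suffix == suffix and p.stem != "shared_config"
        )
        return [self._load_config(p) for p in paths]

    @abstractmethod
    def _load_config(self, config_file: Path) -> dict[str, Any]:
        ...

    @property
    @abstractmethod
    def _file_type_to_load(self) -> str:
        ...


class JsonLoaderSketch(LoaderSketch):
    @property
    def _file_type_to_load(self) -> str:
        return "json"

    def _load_config(self, config_file: Path) -> dict[str, Any]:
        with open(config_file, "r") as file:
            return json.load(file)


# Build a throwaway config folder to exercise the loader.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "shared_config.json").write_text(json.dumps({"project_name": "test"}))
    (folder / "preprocessing.json").write_text(json.dumps({"step_name": "preprocessing"}))
    (folder / "notes.txt").write_text("ignored")
    loaded = JsonLoaderSketch(folder).step_configs_as_dicts()

print(loaded)
```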


In [37]:
class MockConfigLoader(AbstractConfigLoader):
    def __init__(
        self,
        shared_config_dict: dict[str, Any],
        step_configs_dicts: list[dict[str, Any]],
    ):
        self._shared_config_dict = shared_config_dict
        self._step_configs_dicts = step_configs_dicts

    # Disable type checking, because we are overriding a `final` method with a mock implementation.
    @cached_property # type: ignore[misc]
    def shared_config_as_dict(self) -> dict[str, Any]:
        return self._shared_config_dict

    # Disable type checking, because we are overriding a `final` method with a mock implementation.
    @cached_property # type: ignore[misc]
    def step_configs_as_dicts(self) -> list[dict[str, Any]]:
        return self._step_configs_dicts

    # The following two members are not needed, but are required to make the class *concrete*. While we could silence this with `type: ignore[abstract]`, it is better to avoid silencing type errors where easily possible.
    @property
    def _file_type_to_load(self) -> str:
        raise NotImplementedError

    def _load_config(self, config_file: Path) -> dict[str, Any]:
        raise NotImplementedError

## *Pipeline* Facade

In [38]:
from functools import cached_property

from sagemaker.workflow.pipeline import Pipeline

from sm_pipelines_oo.aws_connector.interface import AWSConnectorInterface
from sm_pipelines_oo.aws_connector.implementation import create_aws_connector
from sm_pipelines_oo.shared_config_schema import SharedConfig, Environment


class PipelineFacade:
    def __init__(
        self,
        env: Environment,
        config_loader: AbstractConfigLoader | None = None,
    ):
        self._env = env
        # Allows providing a different config loader, especially for testing
        self._user_provided_config_loader = config_loader

        # Derived attributes
        # ------------------
        self._shared_config = SharedConfig(
            **self._config_loader.shared_config_as_dict
        )
        self.aws_connector: AWSConnectorInterface = create_aws_connector(
            shared_config=self._shared_config,
            environment=env,
        )
        self.step_factory_facade = StepFactoryFacade(
            step_config_dicts=self._config_loader.step_configs_as_dicts, # todo: pass in method call again?
            role_arn=self.aws_connector.role_arn,
            pipeline_session=self.aws_connector.pipeline_session,
        )

    @cached_property
    def _config_loader(self) -> AbstractConfigLoader:
        # Cached so that the default loader is created only once per facade.
        if self._user_provided_config_loader is not None:
            return self._user_provided_config_loader
        else:
            return YamlConfigLoader(env=self._env)

    def run(self) -> None:
        """This is the main way the user will interact with this class."""
        self._pipeline.upsert(
            role_arn=self.aws_connector.role_arn,
        )

    @property
    def _pipeline(self) -> Pipeline:
        steps: list[ConfigurableRetryStep] = self.step_factory_facade.create_all_steps()
        return Pipeline(
            name=self._shared_config.project_name,
            # parameters=[],
            steps=steps,
            sagemaker_session=self.aws_connector.pipeline_session,
        )
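
The optional `config_loader` argument follows a simple injection pattern: use what the caller provides, otherwise build a default. A stripped-down sketch, with hypothetical `LoaderStub` and `FacadeSketch` classes, might look as follows. Using `functools.cached_property` so that the default is built only once is a choice made for this sketch.

```python
from functools import cached_property
from typing import Optional


class LoaderStub:
    """Hypothetical loader that counts how often it is instantiated."""

    instances_created = 0

    def __init__(self, source: str) -> None:
        type(self).instances_created += 1
        self.source = source


class FacadeSketch:
    def __init__(self, env: str, config_loader: Optional[LoaderStub] = None) -> None:
        self._env = env
        self._user_provided_config_loader = config_loader

    @cached_property
    def _config_loader(self) -> LoaderStub:
        # Prefer the injected loader (e.g. a mock in tests); otherwise build the default once.
        if self._user_provided_config_loader is not None:
            return self._user_provided_config_loader
        return LoaderStub(source=f"config/{self._env}")


facade = FacadeSketch(env="dev")
first = facade._config_loader
second = facade._config_loader
```

Both accesses return the same loader object, so anything the loader caches is reused across them.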

# Usage of library code
## configs

In [39]:
_shared_config_dict = {
    'project_name': 'test',
    'project_version': '0',
    'region': 'us-east-1',
    'project_bucket_name': '',
}
_fw_processor_config_dict = {
    'step_name': 'preprocessing',
    'step_factory_class': 'FrameworkProcessor',
    'processor_init_args': {
        'framework_version': '0.23-1',
        'estimator_cls_name': 'SKLearn',
        'instance_count': 1,
        'instance_type': 'ml.m5.xlarge',
    },
    'processor_run_args': {
        'code': 'preprocess.py',
        'source_dir': 'code/preprocess',
        'input_files_s3paths': [],
        'output_files_s3paths': [],
    },
    'shared_config': {
        'project_name': 'test',
        'project_version': '0',
        'region': 'local',
        'project_bucket_name': 'test-bucket',
    }
}
_mock_config_loader = MockConfigLoader(
    shared_config_dict=_shared_config_dict,
    step_configs_dicts=[_fw_processor_config_dict,]
)

## Use Facade

In [40]:
# import pdb; pdb.set_trace()
p = PipelineFacade(
    env='dev',
    # Use different configs for testing
    config_loader=_mock_config_loader,
)
p.run()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/thomas-22/.config/sagemaker/config.yaml


INFO:botocore.tokens:Loading cached SSO token for ml
2024-02-22 12:05:21.850 | DEBUG    | sm_pipelines_oo.aws_connector.implementation:role_arn:83 - role: arn:aws:iam::338755209567:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_AdministratorAccess_7b40736629c71dd9


INFO:botocore.tokens:Loading cached SSO token for ml
2024-02-22 12:05:23.443 | DEBUG    | sm_pipelines_oo.aws_connector.implementation:role_arn:83 - role: arn:aws:iam::338755209567:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_AdministratorAccess_7b40736629c71dd9
INFO:botocore.tokens:Loading cached SSO token for ml
INFO:sagemaker.processing:Uploaded code/preprocess to s3://sagemaker-us-east-1-338755209567/test/code/c4e73b419a6046db5b2c11efa195e8e5/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-us-east-1-338755209567/test/code/bc2536a25d34e1ecae5238f42f4207c2/runproc.sh
INFO:botocore.tokens:Loading cached SSO token for ml


Using provided s3_resource


