Old explanation of design (overhaul/integrate):

In contrast to earlier designs, create_step does not require a step_config to be passed as an argument. This simplifies the interface considerably, because we don't have to worry about getting the right type of step_config for a given step_type. The problem is that while we can simply *return more specific* step types in subclasses (such as a ProcessingStep instead of a ConfigurableRetryStep), we can**not** require a *more specific argument* for subclasses (for example, requiring a ProcessingStepConfig for a ProcessingStepFactory), as this would violate the Liskov Substitution Principle.

This problem would not be easily solved by using generics either, because it is not obvious how to get from a given step type to the associated StepConfig.
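
The asymmetry between return types and argument types can be illustrated with a minimal sketch. All class names below are simplified, hypothetical stand-ins (they are not the SageMaker classes or this library's classes), and the `instance_type` attribute is made up for the demonstration. The `type: ignore[override]` comment silences the incompatible-override error that mypy reports for the narrowed argument.

```python
from abc import ABC, abstractmethod


# Hypothetical stand-ins for the step and config classes discussed above.
class Step: ...
class ProcessingStepLike(Step): ...


class StepConfig: ...
class ProcessingStepConfig(StepConfig):
    instance_type = "ml.m5.large"  # made-up attribute only the subtype has


class Factory(ABC):
    @abstractmethod
    def create_step(self, config: StepConfig) -> Step: ...


class ProcessingFactory(Factory):
    # Narrowing the *return* type is fine (return types are covariant).
    # Narrowing the *argument* type is the problematic part: callers that
    # only know about Factory may legitimately pass any StepConfig.
    def create_step(self, config: ProcessingStepConfig) -> ProcessingStepLike:  # type: ignore[override]
        _ = config.instance_type  # relies on the narrower config type
        return ProcessingStepLike()


def build(factory: Factory, config: StepConfig) -> Step:
    # Perfectly valid against the Factory interface...
    return factory.create_step(config)


try:
    # ...yet substituting the subclass breaks at runtime.
    build(ProcessingFactory(), StepConfig())
    substitution_failed = False
except AttributeError:
    substitution_failed = True
```

Passing the config to `__init__` as a plain dictionary instead, as done further down, sidesteps this because no subclass has to narrow an argument type.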

In [1]:
%load_ext nb_mypy

Version 1.0.5


In [2]:
from typing import  ClassVar, TypeVar, TypeAlias, Any, final, TypedDict
from abc import ABC, abstractmethod
from pathlib import Path

from pydantic_settings import BaseSettings
from sagemaker.processing import Processor
from sagemaker.estimator import EstimatorBase

from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CreateModelStep, TransformStep, \
    TuningStep, ConfigurableRetryStep


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/thomas-22/.config/sagemaker/config.yaml


## Why no generics?
- The end goal is simply to have an object that satisfies the (ConfigurableRetry)StepInterface. From the perspective of the pipeline, we don't care what type of step it is.
- The initial reason for looking into generics was to make sure that we are passing the right config for a given type of step. However, after a lot of trial and error, I still did not find a good way to create a simple class hierarchy based on what the SageMaker SDK makes available to us. Instead, it looks more promising to create a very minimal interface for step factories and let the specific implementations decide later what the best way to create that kind of step is.
  - Downside: less reuse of code between different step factories. This makes it somewhat harder to get started with creating new step factories, because there is less structure imposed on how exactly to do it.
  - Upside: more flexibility for creating step factories. This may actually make it easier to create new step factories, and it will make it easier to maintain existing step factories as the interface of the SageMaker SDK changes.
  - Note: neither of these points will affect a basic library user who only uses inbuilt step factories.

# Configuration
Goal: Abstract configuration into a single config class which loads all the configs it needs from the directory (even if this requires traversing subdirectories). This will not only make the intent of this method clearer, but it will also make it easier to have a single config façade that abstracts which configs are global and which are step-specific (step configs simply need a reference to the shared config, so they can fall back to it if necessary, but the concrete logic can be implemented differently for each step type). Also, having a config façade makes it easy to define methods that compute derived values.
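
To make the fallback and derived-value ideas concrete, here is a minimal sketch. It uses standard-library dataclasses instead of pydantic to stay dependency-free, and the `instance_type`, `instance_type_override` and `job_name_prefix` names are hypothetical; they are not part of the config models defined in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SharedConfig:
    project_name: str
    project_version: str
    instance_type: str = "ml.m5.large"  # hypothetical shared default


@dataclass(frozen=True)
class StepConfig:
    step_name: str
    shared_config: SharedConfig  # reference used for fallbacks
    instance_type_override: Optional[str] = None  # hypothetical step-level setting

    @property
    def instance_type(self) -> str:
        # The step-specific value wins; otherwise fall back to the shared config.
        if self.instance_type_override is not None:
            return self.instance_type_override
        return self.shared_config.instance_type

    @property
    def job_name_prefix(self) -> str:
        # A derived value computed from shared and step-level config.
        return f"{self.shared_config.project_name}-{self.step_name}"


shared = SharedConfig(project_name="demo", project_version="v1")
preprocessing = StepConfig(step_name="preprocessing", shared_config=shared)
training = StepConfig(
    step_name="training",
    shared_config=shared,
    instance_type_override="ml.m5.xlarge",
)
```

Here `preprocessing.instance_type` falls back to the shared value, while `training.instance_type` uses its own override.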

## Attempt 1: *Overarching* ConfigFacade

In [3]:
# class ConfigFacade:
#     def __init__(self, config_dir: Path):
#         # load all the yaml config files
#         shared_config_dict: dict[str, Any] = ...
#         step_configs_dicts: dict[str, dict[str, Any]] = ...

#         # Convert the dictionaries to pydantic models
#         self.shared_config: SharedConfig = SharedConfig(**shared_config_dict)
#         self.step_configs: dict[str, BaseSettings] = {}
#         for step_name, step_config_dict  in step_configs_dicts.items():

#             self.step_configs[step_name] = StepConfig(**step_config_dict)


Problem: how do we get the pydantic model for a given config? While it would be possible to have another lookup table, similar to how we find the right step factory, it makes more sense for each step factory to own the associated config model, because the key challenge is making sure that the config model matches the specific factory.

As a result, it is better not to load all the configs upfront (except possibly into dictionaries).

## Attempt 2: *Separate* Configs w/o facade, but *reference* to shared config

In [4]:
# from dataclasses import dataclass

class SharedConfigInterface(BaseSettings):
    """
    This interface defines all the configs that our library code expects to be present in the shared_config.
    """
    project_name: str
    project_version: str  # Versions data (and probably more in the future)


# class StepConfigInterface(BaseSettings):
#     """
#     This ensures every step config has a step_type (required to determine step factory),  as well as a reference to the shared_config.

#     Note: If the concrete step_config depends on any specific config values being set in the shared_config (in addition to the ones defined in the SharedConfigInterface), we should redefine the type of shared_config to this more specific type.
#     """
#     step_type: str # Identifies factory, which in turn identifies StepConfig model
#     shared_config: SharedConfigInterface # So that we have access to the shared_config

# Simple factory
~~This makes better use of the factory, because depending on the arg passed to it, it creates a different type of step. Otherwise, we may as well use the strategy pattern (the only use of the factory would be to construct the step later, when configs etc. are known, but a given factory would always produce the same kind of step, apart from configuration).~~

In [5]:
class StepFactoryInterface(ABC):
    """
    In addition to the required methods defined below, it is recommended to implement the following attributes and methods in order to make implementation of the required methods easiest:
    - _config_model: ClassVar[type[BaseSettings]] (class used to convert config_dict to a pydantic model, in order to validate types and potentially compute derived attributes).
    """

    @abstractmethod
    def __init__(self, step_config_dict: dict[str, Any]):
        ...


    # @staticmethod
    # @abstractmethod
    # def _get_config_model() -> type[BaseSettings]:
    #     """
    #     Pydantic model used to validate and convert the config_dict to an instance of pydantic.BaseSettings.
    #     """
    #     ...


    @abstractmethod
    def create_step(self) -> ConfigurableRetryStep:
        # Note that we don't have to worry about violating the LSP -  even though we are adding back an argument for the config – because at this stage that config will simply be of type dictionary. Thus, subclasses don't have to specify a more specific subtype of config here yet.
        ...

In [6]:
# For Python < 3.12, don't use typing.TypedDict: https://docs.pydantic.dev/2.6/errors/usage_errors/#typed-dict-version
from typing_extensions import TypedDict

from sagemaker.processing import FrameworkProcessor
from sagemaker.estimator import EstimatorBase
from sagemaker.sklearn.estimator import SKLearn

class _FWProcessorInitConfig(TypedDict):
    framework_version: str
    estimator_cls_name: str
    instance_count: int
    instance_type: str


class _FWProcessorRunConfig(TypedDict):
    code: str
    source_dir: str
    # todo: allow athena datasetdefinition instead
    input_files_s3paths: list[str]  # todo: validate it's an s3 path
    output_files_s3paths: list[str]  # todo: validate it's an s3 path


class FrameworkProcessingStepConfig(BaseSettings):
    # todo:
    step_name: str
    processor_init_args: _FWProcessorInitConfig
    processor_run_args: _FWProcessorRunConfig
    # For now, we will reload this for every step config to avoid dependency on pipeline wrapper.
    shared_config: SharedConfigInterface

In [7]:
from sagemaker.processing import ProcessingInput, ProcessingOutput


class FrameworkProcessingStepFactory(StepFactoryInterface):
    # Note: this is a public attribute, so user can add support for additional estimators
    estimator_name_to_cls_mapping: ClassVar[dict[str, Any]] = {  # todo:  find supertype
        'SKLearn': SKLearn,
    }

    _config_model: ClassVar[type[FrameworkProcessingStepConfig]] = FrameworkProcessingStepConfig

    def __init__(self, step_config_dict: dict[str, Any]):
        # Parse config, using the specific pydantic model that this factory has as a class variable.
        self._config: FrameworkProcessingStepConfig = self._config_model(**step_config_dict)

    @property
    def processor(self) -> FrameworkProcessor:
        # Start with init args from config, but convert TypedDict to dict so we can modify keys.
        init_args: dict[str, Any] = dict(self._config.processor_init_args)
        # Replace the string of estimator_cls_name with the actual estimator_cls
        estimator_cls_name = init_args.pop('estimator_cls_name')
        init_args['estimator_cls'] = self.estimator_name_to_cls_mapping[estimator_cls_name]
        return FrameworkProcessor(**init_args)  # todo: check if typechecker catches wrong args. Otherwise, define typed dict for FWPInitArgs.

    def get_run_args(self) -> dict[str, Any]:
        # Start with run args from config, but convert TypedDict to dict so we can modify keys.
        run_args: dict[str, Any] = dict(self._config.processor_run_args)
        # Create ProcessingInputs from list of s3paths (strings)
        _input_files_s3paths: list[str] = run_args.pop('input_files_s3paths')
        _processing_inputs: list[ProcessingInput] = [
            ProcessingInput(
                source=s3path,
                # todo: Allow passing through extra arguments
            )
            for s3path in _input_files_s3paths
        ]
        run_args['inputs'] = _processing_inputs
        return run_args

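    def create_step(self) -> ProcessingStep:
        # ProcessingStep requires a step name, which we take from the config.
        # todo: run_args still contains 'source_dir' and 'output_files_s3paths'. As far as I can tell,
        # ProcessingStep does not accept these as keyword arguments, so convert or remove them first.
        return ProcessingStep(
            name=self._config.step_name,
            processor=self.processor,
            **self.get_run_args()
        )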


## StepFactory *Facade*

In [8]:
import os

StepFactoryLookupTableType: TypeAlias = dict[str, type[StepFactoryInterface]]

default_stepfactory_lookup_table: StepFactoryLookupTableType = {
    'FrameworkProcessor': FrameworkProcessingStepFactory,
}

class StepFactoryFacade:
    """
    Relationship between façade and concrete factories: A pipeline will generally have a *single* instance of  this façade, which in turn will create an instance of a concrete factory for every step it needs to create.

    This class serves as a façade for creating steps that abstracts the following tasks from the user:
    - It receives the step name from the user, based on which it retrieves the associated config for that step.
    - From that config, it looks up what kind of step the user wants to create.
    - It looks up which factory it should use for creating that kind of step. To be able to do so, it has a lookup table that maps the `step_factory_class` value from the step config to factory classes. (Note that this lookup table can be provided during instantiation. However, this library will also expose an instance of the StepFactoryFacade that has already been initialized with a default lookup table, which will make the library even easier to use for less advanced users).
    - Create an instance of that specific step factory.
    - Finally, it will delegate the creation of the actual step to that specific factory, and then return the resulting step to the user.
    """
    def __init__(
        self,
        step_config_dicts: list[dict[str, Any]],
        stepfactory_lookup_table: StepFactoryLookupTableType = default_stepfactory_lookup_table
    ):
        self.step_config_dicts = step_config_dicts
        self.stepfactory_lookup_table = stepfactory_lookup_table

    def _create_individual_step(
        self,
        step_config_dict: dict[str, Any]
    ) -> ConfigurableRetryStep:

        # Get the right *class* of step factory for a given step (based on its config)
        factory_cls_name: str = step_config_dict['step_factory_class']
        StepFactory_cls: type[StepFactoryInterface] = self.stepfactory_lookup_table[factory_cls_name]

        # Instantiate factory, using step config. Then create step
        step_factory = StepFactory_cls(step_config_dict=step_config_dict)
        return step_factory.create_step()

    def create_all_steps(self) -> list[ConfigurableRetryStep]:
        steps: list[ConfigurableRetryStep] = []
        for config in self.step_config_dicts:
            step: ConfigurableRetryStep = self._create_individual_step(config)
            steps.append(step)
        return steps
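
The lookup-table dispatch can be tried out without the SageMaker SDK. The sketch below uses simplified stand-ins: the factories return plain strings instead of steps, and `DefaultFactory`, `CustomFactory` and the free function `create_all_steps` are hypothetical names that only mirror the structure of the classes above.

```python
from abc import ABC, abstractmethod
from typing import Any


class StepFactory(ABC):  # simplified stand-in for StepFactoryInterface
    def __init__(self, step_config_dict: dict[str, Any]):
        self.config = step_config_dict

    @abstractmethod
    def create_step(self) -> str: ...


class DefaultFactory(StepFactory):  # stand-in for a factory shipped with the library
    def create_step(self) -> str:
        return f"default:{self.config['step_name']}"


class CustomFactory(StepFactory):  # hypothetical user-defined factory
    def create_step(self) -> str:
        return f"custom:{self.config['step_name']}"


default_lookup: dict[str, type[StepFactory]] = {"Default": DefaultFactory}

# Copy before extending, so the library-level default table is not mutated.
custom_lookup = {**default_lookup, "Custom": CustomFactory}


def create_all_steps(
    step_config_dicts: list[dict[str, Any]],
    lookup: dict[str, type[StepFactory]],
) -> list[str]:
    steps = []
    for config in step_config_dicts:
        # The config names the factory class; the table resolves it to a class.
        factory_cls = lookup[config["step_factory_class"]]
        steps.append(factory_cls(step_config_dict=config).create_step())
    return steps


configs = [
    {"step_name": "preprocess", "step_factory_class": "Default"},
    {"step_name": "train", "step_factory_class": "Custom"},
]
steps = create_all_steps(configs, custom_lookup)
```

A config that names a factory missing from the table raises a `KeyError` at lookup time, so a real implementation may want to turn that into a more descriptive error.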

~~Note that the StepFactoryWrapper is decoupled from the specific StepFactory that will be used to create the step. The latter is determined by a lookup table, which is injected into the StepFactoryWrapper during instantiation.~~

~~The downside is that this is less convenient for simple use cases, where the user is content with choosing only from the default factories that ship with the library. To remediate this disadvantage, we can simply create a facade, which instantiates the StepFactoryWrapper with the default lookup table. More advanced users, by contrast, can directly import this default lookup table and customize it to point to custom StepFactory implementations. In a second step, they then initialize the StepFactoryWrapper directly, passing it the custom lookup table.~~

## Config Loading

In [9]:
class ConfigLoaderInterface(ABC):
    @property
    @abstractmethod
    def shared_config(self) -> SharedConfigInterface:
        ...

    @property
    @abstractmethod
    def step_configs_as_dicts(self) -> list[dict[str, Any]]:
        ...


class ConfigLoader(ConfigLoaderInterface):
    def __init__(self, config_folder: Path, shared_config_model: type[SharedConfigInterface]):
        self._config_folder = config_folder
        self._shared_config_model = shared_config_model

    def _load_config(self, config_file: Path) -> dict[str, Any]:
        ...

    @property
    def shared_config(self) -> SharedConfigInterface:
        path_to_shared_config: Path = self._config_folder / 'shared_config.yaml'
        return self._shared_config_model(
            **self._load_config(config_file=path_to_shared_config),
        )

    @property
    def step_configs_as_dicts(self) -> list[dict[str, Any]]:
        # Traverses the config directory and returns names of all subfolders, each of which will correspond to a step name.
        step_config_paths: list[Path] = ...
        return [
            self._load_config(config_path)
            for config_path in step_config_paths
        ]


<cell>18: error: Missing return statement  [empty-body]
<cell>18: note: If the method is meant to be abstract, use @abc.abstractmethod
<cell>31: error: Incompatible types in assignment (expression has type "ellipsis", variable has type "list[Path]")  [assignment]
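
Both errors come from the placeholder bodies in `_load_config` and `step_configs_as_dicts`. One possible way to fill them in is sketched below, under several assumptions: each step has its own subfolder containing one config file, the file name `step_config.json` is made up, and JSON is used only to keep the sketch free of third-party dependencies. For YAML files, `yaml.safe_load` from PyYAML could be passed as the `parse` argument instead.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_config(
    config_file: Path,
    parse: Callable[[str], Any] = json.loads,
) -> dict[str, Any]:
    """Read one config file and make sure it holds a mapping at the top level."""
    loaded = parse(config_file.read_text())
    if not isinstance(loaded, dict):
        raise ValueError(f"{config_file} must contain a mapping at the top level")
    return loaded


def find_step_config_paths(config_folder: Path, filename: str) -> list[Path]:
    """Return one config file per immediate subfolder, sorted by folder name."""
    return sorted(
        subfolder / filename
        for subfolder in config_folder.iterdir()
        if subfolder.is_dir() and (subfolder / filename).is_file()
    )


# Demo with a throwaway directory; folder and file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step_name in ("training", "preprocessing"):
        step_dir = root / step_name
        step_dir.mkdir()
        (step_dir / "step_config.json").write_text(json.dumps({"step_name": step_name}))
    # A file at the top level is not a step folder and is therefore skipped.
    (root / "shared_config.json").write_text(json.dumps({"project_name": "demo"}))

    paths = find_step_config_paths(root, filename="step_config.json")
    step_configs = [load_config(path) for path in paths]
```

Sorting the paths keeps the step order deterministic, although a real pipeline would probably need an explicit ordering rather than an alphabetical one.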


## *Pipeline* Facade

In [10]:
from sagemaker.workflow.pipeline import Pipeline

from sm_pipelines_oo.aws_connector.implementation import create_aws_connector
from sm_pipelines_oo.shared_config_schema import Environment


class PipelineFacade:
    def __init__(
        self,
        config_folder: Path,
        environment: Environment,
        shared_config_model: type[SharedConfigInterface], # todo: decide how best to get this
    ):

        # Derived attributes
        self._config_loader = ConfigLoader(
            config_folder=config_folder,
            shared_config_model=shared_config_model,
        )
        self.shared_config: SharedConfigInterface = self._config_loader.shared_config
        self.step_factory_facade = StepFactoryFacade(
            step_config_dicts=self._config_loader.step_configs_as_dicts, # todo: pass in method call again?
        )

        self.aws_connector = create_aws_connector(
            run_as_pipeline=True,
            shared_config=self.shared_config,
            environment=environment,
        )


    @property
    def pipeline(self):
        return Pipeline(
            name=self.shared_config.project_name,
            # parameters=[],
            steps=self.create_steps(),
            sagemaker_session=...
        )

    def create_steps(self):
        return self.step_factory_facade.create_all_steps()

    def run(self) -> None:
        ...

# Usage of library code
## Configs

In [11]:
# _config = FrameworkProcessingStepConfig(
#     step_name="preprocessing",
#     processor_init_args={
#         "framework_version": "0.0.1",
#         "estimator_cls_name": "SKLearn",
#         "instance_count": 1,
#         "instance_type": "ml.m5.large",
#     },
#     processor_run_args={
#         "code": "code",
#         "source_dir": "source_dir",
#         "input_files_s3paths": ["input_files"],
#         "output_files_s3paths": ["output_files"],
#     },
#     shared_config=SharedConfigInterface(
#         project_name="project_name",
#         project_version="project_version"
#     )
# )

## Use Facade

In [12]:
# pipeline_facade = PipelineFacade(stepfactory_lookup_table=stepfactory_lookup_table)
# pipeline_facade.run()