Old explanation of design (overhaul/integrate):

In contrast to earlier designs, create_step does not require a step_config to be passed as an argument. This simplifies the interface considerably, because we don't have to worry about getting the right type of step_config for a given step_type. The problem is that while we can simply *return more specific* step types in subclasses (such as a ProcessingStep instead of a ConfigurableRetryStep), we can**not** require a *more specific argument* in subclasses (for example, requiring a ProcessingStepConfig for a ProcessingStepFactory), as this would violate the Liskov Substitution Principle.
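
The asymmetry can be seen in a minimal sketch. All class names below are stand-ins invented for illustration (they are not the SageMaker SDK types), and `memory_gb` is a hypothetical config field. Narrowing the return type in a subclass is safe for callers, whereas narrowing a parameter type is the kind of override a type checker such as mypy reports as incompatible with the supertype.

```python
from abc import ABC, abstractmethod


class Step:  # stand-in for ConfigurableRetryStep
    pass


class SpecialStep(Step):  # stand-in for ProcessingStep
    pass


class BaseConfig:  # hypothetical base config
    pass


class SpecialConfig(BaseConfig):  # hypothetical, stand-in for ProcessingStepConfig
    memory_gb: int = 4


class Factory(ABC):
    @abstractmethod
    def create_step(self, config: BaseConfig) -> Step: ...


class ReturnsNarrower(Factory):
    # Fine: callers that expect a Step still receive one.
    def create_step(self, config: BaseConfig) -> SpecialStep:
        return SpecialStep()


class RequiresNarrower(Factory):
    # Problematic: callers holding a Factory may pass any BaseConfig.
    def create_step(self, config: SpecialConfig) -> SpecialStep:  # type: ignore[override]
        _ = config.memory_gb  # relies on an attribute BaseConfig does not have
        return SpecialStep()


def build(factory: Factory) -> Step:
    # Written only against the base interface
    return factory.create_step(BaseConfig())
```

Calling `build(ReturnsNarrower())` works, while `build(RequiresNarrower())` fails at runtime with an `AttributeError`, even though both calls look identical from the perspective of the base interface.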

This problem would not be easily solved by using generics either, because it is not obvious how to get from a given step type to the associated StepConfig.

In [2]:
%load_ext nb_mypy

Version 1.0.5


In [7]:
from typing import ClassVar, TypeVar, TypeAlias, Any, final, TypedDict
from abc import ABC, abstractmethod
from pathlib import Path

from pydantic_settings import BaseSettings
from sagemaker.processing import Processor
from sagemaker.estimator import EstimatorBase

from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CreateModelStep, TransformStep, \
    TuningStep, ConfigurableRetryStep


## Why no generics?
- The end goal is simply to have an object that satisfies the (ConfigurableRetry)StepInterface. From the perspective of the pipeline, we don't care what type of step it is.
- The initial reason for looking into generics was to make sure that we are passing the right config for a given type of step. However, after a lot of trial and error, I still did not find a good way to create a simple class hierarchy based on what the SageMaker SDK makes available to us. Instead, it looks more promising to create a very minimal interface for step factories, and to let the specific implementations decide what the best way to create that kind of step is.
  - Downside: less reuse of code between different step factories. This makes it somewhat harder to get started with creating new step factories, because there is less structure imposed on how exactly to do it.
  - Upside: more flexibility for creating step factories. This may actually make it easier to create new step factories, and it will make it easier to maintain existing step factories as the interface of the SageMaker SDK changes.
  - Note: neither of these points will affect a basic library user who only uses inbuilt step factories.

# Configuration
Goal: Abstract configuration into a single config class that loads all the configs it needs from the directory (even if this requires traversing it). This will not only make the intent of this method clearer, but it will also make it easier to have a single config façade that abstracts which configs are global and which are step-specific. Step configs simply need a reference to the shared config, so that they can fall back to it if necessary, but the concrete logic can be implemented differently for each step type. Also, having a config façade makes it easy to define methods that compute derived values.
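
The fallback and derived-value ideas can be sketched as follows. This is a minimal illustration using plain dataclasses instead of pydantic models. The `instance_type` fields and the `job_prefix` property are hypothetical and only serve to show the mechanism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SharedConfigSketch:
    project_name: str
    project_version: str
    instance_type: str = "ml.m5.large"  # hypothetical global default


@dataclass(frozen=True)
class StepConfigSketch:
    step_type: str
    shared_config: SharedConfigSketch  # reference used for fallbacks
    instance_type_override: Optional[str] = None  # hypothetical step-specific value

    @property
    def instance_type(self) -> str:
        # Fall back to the shared value when the step does not set its own
        if self.instance_type_override is not None:
            return self.instance_type_override
        return self.shared_config.instance_type

    @property
    def job_prefix(self) -> str:
        # Derived value computed from shared and step-specific settings
        shared = self.shared_config
        return f"{shared.project_name}-{shared.project_version}-{self.step_type}"
```

A step config created without an override reports the shared `instance_type`, while one created with `instance_type_override` set reports its own value.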

## Attempt 1: *Overarching* ConfigFacade

In [None]:
class ConfigFacade:
    def __init__(self, config_dir: Path):
        # load all the yaml config files
        shared_config_dict: dict[str, Any] = ...
        step_configs_dicts: dict[str, dict[str, Any]] = ...

        # Convert the dictionaries to pydantic models
        self.shared_config: SharedConfig = SharedConfig(**shared_config_dict)
        self.step_configs: dict[str, BaseSettings] = {}
        for step_name, step_config_dict in step_configs_dicts.items():
            self.step_configs[step_name] = StepConfig(**step_config_dict)


Problem: how do we get the pydantic model for a given config? While it would be possible to have another lookup table, similar to how we find the right specific step factory, it makes more sense for each step factory to own the associated config model. This is because the key challenge is making sure that the config model matches the specific factory.

As a result, it is better not to load all the configs upfront (except possibly into dictionaries).
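
A minimal sketch of this idea, under stated assumptions: the config files have already been read into plain dictionaries, and a dataclass stands in for the pydantic model (so, unlike pydantic, it does not validate types). The class names and the `instance_count` field are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass
class ProcessingConfigSketch:  # hypothetical config model
    step_type: str
    instance_count: int = 1


class ProcessingFactorySketch:
    # The factory owns its config model, so the two cannot drift apart
    _config_model: ClassVar[type] = ProcessingConfigSketch

    def __init__(self, step_config_dict: dict[str, Any]) -> None:
        # The raw dict is converted to the model only here, inside the factory
        self.step_config = self._config_model(**step_config_dict)


# Raw dictionaries, as they might come out of the YAML files
step_config_dicts: dict[str, dict[str, Any]] = {
    "preprocess": {"step_type": "FrameworkProcessor", "instance_count": 2},
}

factory = ProcessingFactorySketch(step_config_dicts["preprocess"])
```

With this arrangement, no second lookup table from step type to config model is needed, because choosing the factory already determines the model.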

## Attempt 2: *Separate* Configs w/o facade, but *reference* to shared config

In [None]:
from dataclasses import dataclass

class SharedConfigInterface(BaseSettings):
    """
    This interface defines all the configs that our library code expects to be present in the shared_config.
    """
    project_name: str
    project_version: str  # Versions data (and probably more in the future)


class StepConfigInterface(BaseSettings):
    """
    This ensures every step config has a step_type (required to determine the step factory), as well as a reference to the shared_config.

    Note: If the concrete step_config depends on any specific config values being set in the shared_config (in addition to the ones defined in the SharedConfigInterface), we should redefine the type of shared_config to this more specific type.
    """
    step_type: str  # Identifies the factory, which in turn identifies the StepConfig model
    shared_config: SharedConfigInterface  # So that we have access to the shared config

# Simple factory
This makes better use of the factory pattern, because the factory creates a different type of step depending on the argument passed to it. Otherwise, we may as well use the strategy pattern: the only remaining use of a factory would be to defer construction of the step until the configs etc. are known, but a given factory would always produce the same kind of step, apart from its configuration.
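
As a minimal sketch of what "simple factory" means here, with stand-in classes and hypothetical `step_type` keys: a single entry point receives an argument and decides which class to instantiate.

```python
class StepSketch:  # stand-in for ConfigurableRetryStep
    def __init__(self, name: str) -> None:
        self.name = name


class ProcessingStepSketch(StepSketch):
    pass


class TrainingStepSketch(StepSketch):
    pass


def create_step_sketch(step_type: str, name: str) -> StepSketch:
    # Simple factory: the argument decides which class is instantiated
    step_classes: dict[str, type[StepSketch]] = {
        "processing": ProcessingStepSketch,
        "training": TrainingStepSketch,
    }
    try:
        step_class = step_classes[step_type]
    except KeyError:
        raise ValueError(f"Unknown step_type: {step_type!r}") from None
    return step_class(name)
```

The caller only ever sees the base type, which matches the pipeline's perspective that any object satisfying the step interface is acceptable.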

In [8]:
class StepFactoryInterface(ABC):
    """
    In addition to the required methods defined below, it is recommended to implement the following attributes and methods in order to make implementation of the required methods easiest:
    - _config_model: ClassVar[type[BaseSettings]] (class used to convert the config_dict to a pydantic model, in order to validate types and potentially compute derived attributes).
    """
    @abstractmethod
    def __init__(self, step_config_dict: dict[str, Any]) -> None:
        """
        We need a shared interface for instantiating factories for specific steps, so that we can instantiate any specific step factory in the StepFactoryFacade, without the façade knowing which kind of specific factory is used.
        """
        # self.step_config = self._config_model(**step_config_dict)
        ...

    @property
    @abstractmethod
    def step_config(self) -> BaseSettings:
        ...

    # ~~todo: Consider making this a classmethod - unless this class needs to hold any state? That way, we simplify create_step, and we don't have to include the __init__ in the interface definition.~~
    @abstractmethod
    def create_step(self) -> ConfigurableRetryStep:
        ...

<cell>20: note: "create_step" of "StepFactoryInterface" defined here


In [None]:
class StepFactoryFacade:
    """
    An application will generally have a *single* instance of this StepFactoryFacade class, plus an instance of each concrete factory for every type of step it may need to create.

    This class serves as a façade for creating steps that abstracts the following tasks from the user:
    - It receives the step name from the user, based on which it retrieves the associated config for that step.
    - From that config, it looks up what kind of step the user wants to create.
    - It looks up which factory it should use for creating that kind of step. To be able to do so, it has a lookup table that maps step types to factory classes. (Note that this lookup table needs to be provided during instantiation. However, this library will also expose an instance of the StepFactoryFacade that has already been initialized with a default lookup table, which will make the library even easier to use for less advanced users.)
    - Create an instance of that specific step factory.
    - Finally, it will delegate the creation of the actual step to that specific factory, and then return the resulting step to the user.
    """
    def __init__(
        self,
        stepfactory_lookup_table: dict[str, type[StepFactoryInterface]],
    ):
        self._stepfactory_lookup_table = stepfactory_lookup_table

    def create_step(self, step_config: StepConfigInterface) -> ConfigurableRetryStep:
        # Get the right *class* of step factory, based on the step_type in the config
        StepFactory: type[StepFactoryInterface] = self._stepfactory_lookup_table[step_config.step_type]
        # Create an *instance* of that factory class. The factory expects a plain dict,
        # so the config model is dumped back into one here.
        step_factory = StepFactory(step_config_dict=step_config.model_dump())
        # Use that factory to create the step (the factory already holds its config)
        return step_factory.create_step()

In [None]:
class _FrameworkProcessingStepFactory(StepFactoryInterface):
    def __init__(self, step_config_dict: dict[str, Any]) -> None:
        ...

    @property
    def step_config(self) -> BaseSettings:
        ...

    def create_step(self) -> ProcessingStep:
        ...

Note that the StepFactoryFacade is decoupled from the specific StepFactory that will be used to create the step. The latter is determined by a lookup table, which is injected into the StepFactoryFacade during instantiation.

The downside is that this is less convenient for simple use cases, where the user is content with choosing only from the default factories that ship with the library. To remedy this, the library can expose a ready-made StepFactoryFacade instance that is initialized with the default lookup table. More advanced users, by contrast, can import this default lookup table directly and customize it to point to custom StepFactory implementations. In a second step, they then instantiate the StepFactoryFacade themselves, passing it the customized lookup table.

In [None]:
# higher-level-interface
# ======================

default_stepfactory_lookup_table: dict[str, type[StepFactoryInterface]] = {
    'FrameworkProcessor': _FrameworkProcessingStepFactory,
}

# This is what user will import
stepfactory_wrapper = StepFactoryFacade(
    stepfactory_lookup_table=default_stepfactory_lookup_table,
)

In [None]:
# lower-level interface (if customization of factories is needed)
# ===============================================================

# Implement custom stepfactory
class _CustomProcessingStepFactory(StepFactoryInterface):
    ...

# add it to the lookup table
default_stepfactory_lookup_table.update(
    {
        'CustomProcessor': _CustomProcessingStepFactory,
    },
)

# Instantiate StepFactoryFacade with the customized lookup table
customized_step_factory = StepFactoryFacade(
    stepfactory_lookup_table=default_stepfactory_lookup_table
)
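
One caveat with the cell above: `update` modifies `default_stepfactory_lookup_table` in place, so any object that was built from the same dictionary (such as `stepfactory_wrapper`) also sees the custom entry. If that is not intended, the table can be copied instead. The sketch below uses stand-in classes with hypothetical names to show the difference.

```python
class FacadeSketch:  # stand-in for StepFactoryFacade
    def __init__(self, lookup_table: dict[str, type]) -> None:
        # Keeps a reference to the dict it was given, like the original class
        self._lookup_table = lookup_table

    def knows(self, step_type: str) -> bool:
        return step_type in self._lookup_table


class DefaultFactorySketch:
    pass


class CustomFactorySketch:
    pass


default_table: dict[str, type] = {"FrameworkProcessor": DefaultFactorySketch}
default_facade = FacadeSketch(default_table)

# Build a new dict instead of calling default_table.update(...)
custom_table = {**default_table, "CustomProcessor": CustomFactorySketch}
custom_facade = FacadeSketch(custom_table)
```

Here `custom_facade` resolves both step types, while `default_facade` is unaffected by the customization.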

# Usage of library code

In [None]:
class PipelineFacade:
    def __init__(self):
        # Read configs
        self._shared_config: SharedConfigInterface = ...
        self._step_configs: dict[str, StepConfigInterface] = ...
        # Note this is a public attribute, so that the user can change the lookup table if necessary
        # todo: ~~consider making a property? Or~~ make it an optional argument in the __init__ method, which would allow making the attribute private?
        self.stepfactory_lookup_table={
            'FrameworkProcessor': _FrameworkProcessingStepFactory,
        }

    @property
    def step_factory_facade(self):
        """
        This is not set in the __init__ method, because it depends on the stepfactory_lookup_table, which is a public attribute that the user can change.
        """
        return StepFactoryFacade(
            stepfactory_lookup_table=self.stepfactory_lookup_table,
        )

    def _create_steps(self) -> list[ConfigurableRetryStep]:
        steps: list[ConfigurableRetryStep] = []
        for step_config in self._step_configs.values():
            step: ConfigurableRetryStep = self.step_factory_facade.create_step(step_config=step_config)
            steps.append(step)

        return steps
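
To see the pieces working together, here is a runnable end-to-end sketch that uses only the standard library. All classes are simplified stand-ins with hypothetical names, the step configs are plain dictionaries, and the lookup table is passed as a constructor argument (one of the options raised in the todo above). It is an illustration of the flow, not the library's implementation.

```python
from abc import ABC, abstractmethod
from typing import Any


class StepSketch:  # stand-in for ConfigurableRetryStep
    def __init__(self, name: str) -> None:
        self.name = name


class FactorySketch(ABC):  # mirrors StepFactoryInterface
    def __init__(self, step_config_dict: dict[str, Any]) -> None:
        self.step_config_dict = step_config_dict

    @abstractmethod
    def create_step(self) -> StepSketch: ...


class ProcessingFactorySketch(FactorySketch):
    def create_step(self) -> StepSketch:
        return StepSketch(name=self.step_config_dict["step_name"])


class PipelineSketch:  # mirrors PipelineFacade
    def __init__(
        self,
        step_config_dicts: dict[str, dict[str, Any]],
        lookup_table: dict[str, type[FactorySketch]],
    ) -> None:
        self._step_config_dicts = step_config_dicts
        self._lookup_table = lookup_table

    def create_steps(self) -> list[StepSketch]:
        steps: list[StepSketch] = []
        for step_name, config in self._step_config_dicts.items():
            # step_type selects the factory class; the factory builds the step
            factory_class = self._lookup_table[config["step_type"]]
            factory = factory_class({**config, "step_name": step_name})
            steps.append(factory.create_step())
        return steps


pipeline = PipelineSketch(
    step_config_dicts={
        "preprocess": {"step_type": "FrameworkProcessor"},
        "postprocess": {"step_type": "FrameworkProcessor"},
    },
    lookup_table={"FrameworkProcessor": ProcessingFactorySketch},
)
step_names = [step.name for step in pipeline.create_steps()]
```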