Idea: Leverage **Factory Method Pattern**: 
- StepFactory will be an abstract class with two abstract methods: one for instantiating the step actor (processor, etc.) and one for constructing the step's run args.
- StepFactory provides the implementation that creates the step (the factory method). It does so by calling the two abstract methods internally.
- The user has to implement these abstract methods (how exactly to instantiate the step's actor, and how to construct its run args).

Why I **discarded this design**:
- While this design makes it easier for the user to define different kinds of ProcessingSteps by breaking the task down into two simpler problems, it does not generalize to other steps. For example, rather than calling a "run" method, training and human steps instead require calling a "fit" method. While it would probably be possible to work around lists, a ConditionStep does not follow this pattern at all.

**Lesson: Don't construct the step using `ProcessingStep(step_args=actor.run(inputs=..., ...))`. Rather, pass an instance of the actor and the individual run args separately, i.e. `ProcessingStep(processor=my_processor, inputs=..., ...)`.**
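
To make the lesson concrete, here is a minimal sketch that does not use the SageMaker SDK. `FakeProcessor`, `FakeEstimator` and `StepSketch` are hypothetical stand-ins introduced only for illustration. The point it tries to show is that when the actor and its arguments are kept separate, the step itself can decide which method to call (`run` or `fit`), and nothing is invoked at construction time.

```python
from dataclasses import dataclass, field
from typing import Any


class FakeProcessor:
    """Hypothetical stand-in for a processor-like actor (not the SageMaker class)."""

    def run(self, **kwargs: Any) -> dict[str, Any]:
        return {"method": "run", **kwargs}


class FakeEstimator:
    """Hypothetical stand-in for an estimator-like actor that exposes `fit`, not `run`."""

    def fit(self, **kwargs: Any) -> dict[str, Any]:
        return {"method": "fit", **kwargs}


@dataclass
class StepSketch:
    """Hypothetical step that stores the actor and its call arguments separately."""

    actor: Any
    method_name: str
    call_args: dict[str, Any] = field(default_factory=dict)

    def build_args(self) -> dict[str, Any]:
        # The actor's method is looked up and called only here, not when the step is constructed.
        return getattr(self.actor, self.method_name)(**self.call_args)


processing = StepSketch(FakeProcessor(), "run", {"inputs": ["raw"]})
training = StepSketch(FakeEstimator(), "fit", {"inputs": ["clean"]})

print(processing.build_args())
print(training.build_args())
```

Because the step owns the call, adding an actor with a different method name only changes how the step is configured, not how callers assemble its arguments.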

In [15]:
%load_ext nb_mypy

from typing import Generic, TypeVar, TypeAlias, Any, final
from abc import abstractmethod

from sagemaker.processing import Processor, FrameworkProcessor
from sagemaker.estimator import EstimatorBase
from sagemaker.sklearn.estimator import SKLearn

from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CreateModelStep, TransformStep, \
    TuningStep, ConfigurableRetryStep

The nb_mypy extension is already loaded. To reload it, use:
  %reload_ext nb_mypy


In [35]:
from abc import ABC  # needed so that @abstractmethod is actually enforced

# todo: decide how stringent this should be while still allowing user to add new step types
StepType = TypeVar("StepType", bound=ConfigurableRetryStep)
StepActor: TypeAlias = Processor | EstimatorBase  # todo: add more types as needed


class BaseStepFactory(ABC, Generic[StepType]):
    # @abstractmethod
    # def __init__(self, step: StepType):
    #     self.step_cls = StepType

    @abstractmethod
    def instantiate_step_actor(self) -> StepActor:
        """
        This method is used internally by the factory method. It can also be called directly to instantiate a step actor (e.g., a processor) for quicker iteration during development.

        Note: It is consistent with the Liskov substitution principle (LSP) for the implementation to return a more *specific* type.
        """
        ...

    @abstractmethod
    def _construct_run_args(self) -> dict[str, Any]:  # todo: create dataclass for return types
        """
        Note: It is consistent with LSP for the implementation to return a more *specific* type.
        """
        ...

    @final
    def create_step(self) -> StepType:
        """
        This is the factory method. It is not meant to be overridden. Instead, subclasses should implement the two abstract methods, which in turn specify what exactly this factory method will do.
        """
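        # Instantiate the actor (e.g., processor) for the step
        step_actor: StepActor = self.instantiate_step_actor()
        # Build the keyword arguments for the actor's run method
        run_args: dict[str, Any] = self._construct_run_args()
        # Caveat: this assumes the actor exposes `run` (true for processors, not for estimators,
        # which use `fit`). The returned value is what would be passed as `step_args`,
        # so it would still have to be wrapped in the step class to yield a StepType.
        return step_actor.run(**run_args)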


# Create concrete factory class by implementing abstract methods
class FrameworkProcessingStepFactory(BaseStepFactory[ProcessingStep]):
    def instantiate_step_actor(self) -> FrameworkProcessor:  # Note the more specific return type
        return FrameworkProcessor(
            estimator_cls=SKLearn, framework_version='0.23-1', role='role', instance_type=''
        )

    def _construct_run_args(self) -> dict[str, Any]:
        return {'inputs': [], 'outputs': [], 'source_dir': '', 'code': ''}


# instantiate factory and create step
fwp_factory = FrameworkProcessingStepFactory()
fwp_factory.create_step()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/thomas-22/.config/sagemaker/config.yaml


KeyError: 'sagemaker_submit_directory'

Note: the error above is expected. It occurs because we have not provided all the configuration needed to instantiate the step.

# Why not the abstract factory pattern?
On the surface it may seem like the abstract factory pattern would be a natural fit, as it creates **families of *related* products**. In our case, by contrast, **we want the user to be able to pick and choose whichever step factory fits their use case.** This is better served by having factories that each create only a single type of step. There is no reason to force the user to pick a single overarching factory that constrains the steps to a family of related implementations. All we care about is that all the produced steps satisfy the common StepInterface.

For example, the user may want to use two different types of ProcessingSteps in the same pipeline, one based on a FrameworkProcessor and one on a different kind of Processor. To allow this, our abstract factory would have to define an interface for each specific step type, which is at too low a level of abstraction and would require a change to the interface every time we want to support a new kind of processor. Since interfaces are supposed to be stable, this is a huge downside.
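
The following sketch illustrates the argument with plain Python and no SageMaker dependency. All class names are hypothetical stand-ins: `StepInterface` plays the role of the common step interface mentioned above, and the two concrete factories represent two different kinds of processing step. Each factory produces a single kind of step, and the pipeline code depends only on the shared interface, so the user can mix factories freely.

```python
from abc import ABC, abstractmethod


class StepInterface(ABC):
    """Hypothetical common interface that every produced step must satisfy."""

    @abstractmethod
    def describe(self) -> str: ...


class FrameworkProcessingStepSketch(StepInterface):
    def describe(self) -> str:
        return "processing step backed by a framework processor"


class OtherProcessingStepSketch(StepInterface):
    def describe(self) -> str:
        return "processing step backed by another kind of processor"


class SingleStepFactory(ABC):
    """Hypothetical factory interface: one method, one product."""

    @abstractmethod
    def create_step(self) -> StepInterface: ...


class FrameworkStepFactorySketch(SingleStepFactory):
    def create_step(self) -> StepInterface:
        return FrameworkProcessingStepSketch()


class OtherStepFactorySketch(SingleStepFactory):
    def create_step(self) -> StepInterface:
        return OtherProcessingStepSketch()


def build_pipeline(factories: list[SingleStepFactory]) -> list[str]:
    # The pipeline only relies on the common interfaces, never on concrete step types.
    return [factory.create_step().describe() for factory in factories]


pipeline = build_pipeline([FrameworkStepFactorySketch(), OtherStepFactorySketch()])
print(pipeline)
```

Supporting a new kind of processor here means adding one more small factory class. An abstract factory would instead need a new creation method on its interface, which every existing implementation would then have to provide.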