# AWS Step Functions

<img src="../_assets/aws_service_icons/step_functions.svg" width="80" alt="AWS Step Functions">

## Goals
- Understand what **AWS Step Functions** are.
- Know common practical use-cases.
- See a minimal **AWS SDK** pseudo-code workflow (no execution).

---


## Prerequisites
- Basic AWS concepts (regions, IAM, ARNs).
- Comfort with JSON (Step Functions workflows are defined in JSON).

> This notebook includes **pseudo-code only**. It does not run any AWS SDK calls.


## What Step Functions are
**AWS Step Functions** is a managed **workflow orchestration** service. You define a **state machine** (a workflow) using **Amazon States Language (ASL)**, and Step Functions coordinates the work across other services.

Key ideas:
- A workflow is a directed graph of **states** (e.g., `Task`, `Choice`, `Parallel`, `Map`, `Wait`, `Succeed`, `Fail`).
- Step Functions passes a JSON document between states and can shape it via `InputPath`, `Parameters`, `ResultPath`, `OutputPath`.
- It provides built-in **timeouts**, **retries/backoff**, and **error handling**, so orchestration logic is not scattered across scripts.
- Two main types:
  - **Standard**: durable, long-running workflows (up to ~1 year).
  - **Express**: very high throughput, short-duration workflows (up to ~5 minutes).
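To make the data-shaping idea concrete, here is a deliberately simplified local simulation of how `InputPath`, `ResultPath`, and `OutputPath` are applied around a `Task`. It is a sketch, not the real engine: the helper names (`get_path`, `set_path`, `run_task_state`) are hypothetical, only plain `$.a.b` paths are handled, and `Parameters` / `ResultSelector` are left out.

```python
import copy


def get_path(doc, path):
    """Resolve a simple '$.a.b' path; '$' alone selects the whole document."""
    if path == "$":
        return doc
    node = doc
    for key in path[2:].split("."):
        node = node[key]
    return node


def set_path(doc, path, value):
    """Return a copy of doc with value placed at path; '$' replaces the document."""
    if path == "$":
        return value
    result = copy.deepcopy(doc)
    node = result
    keys = path[2:].split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


def run_task_state(state_input, task, input_path="$", result_path="$", output_path="$"):
    """Simplified order of operations: InputPath -> task -> ResultPath -> OutputPath."""
    effective_input = get_path(state_input, input_path)   # what the task sees
    task_result = task(effective_input)                   # stand-in for the service call
    combined = set_path(state_input, result_path, task_result)  # merged into the raw input
    return get_path(combined, output_path)                # what the next state receives


# Illustrative data and task (assumptions, not from a real workflow).
state_input = {"order": {"id": 7, "items": 3}, "meta": {"source": "api"}}


def price_order(order):
    return {"total": order["items"] * 10}


merged = run_task_state(state_input, price_order, input_path="$.order", result_path="$.pricing")
print(merged)
# {'order': {'id': 7, 'items': 3}, 'meta': {'source': 'api'}, 'pricing': {'total': 30}}
```

Note that `ResultPath` merges the result into the original state input, not into the `InputPath`-filtered view, which is why `meta` survives in the output.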

Step Functions does not “do the compute” itself; `Task` states typically invoke or integrate with services like Lambda, ECS, Batch, SageMaker, Glue, DynamoDB, SNS/SQS, EventBridge, and more.
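Because `Task` states call out to other services, transient failures are expected, and the retry fields decide how long the workflow waits between attempts. The sketch below computes the nominal wait schedule for a retrier, assuming the basic rule that each wait is the previous one multiplied by `BackoffRate`. The function name `retry_wait_schedule` is hypothetical, and optional fields such as `MaxDelaySeconds` and `JitterStrategy` are ignored.

```python
def retry_wait_schedule(interval_seconds, max_attempts, backoff_rate):
    """Nominal waits (in seconds) before each retry.

    MaxAttempts counts retries, so the task may run up to max_attempts + 1 times.
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Same values as the Retry block in the SDK sketch later in this notebook.
retrier = {"IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0}

waits = retry_wait_schedule(
    retrier["IntervalSeconds"], retrier["MaxAttempts"], retrier["BackoffRate"]
)
print(waits)       # [2.0, 4.0, 8.0]
print(sum(waits))  # 14.0
```

So with these settings a persistently failing task would add roughly 14 seconds of waiting before the state finally fails, on top of the time taken by the attempts themselves.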


## What Step Functions are practically used for
Common patterns:
- **Orchestrating ML/data pipelines**: ingest → validate → preprocess → train → evaluate → register model → deploy.
- **Fan-out / fan-in**: run many tasks in parallel with `Map` (e.g., per-file processing), then aggregate results.
- **Reliable service coordination**: call multiple services with retries and compensating steps on failure.
- **Human-in-the-loop / approvals**: pause a workflow until an external system responds (callback token patterns).
- **Long-running jobs**: track async work (Batch/ECS/SageMaker) without keeping a worker process alive.
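As an illustration of the fan-out / fan-in pattern, the following sketch builds an ASL definition with an inline `Map` state that processes each entry of `$.files` and then hands the collected results to an aggregation step. The Lambda ARNs, state names, and the `files` / `results` keys are placeholders chosen for this example, and `unknown_targets` is a hypothetical sanity check, not part of any SDK.

```python
import json

# Placeholder ARNs (assumptions for illustration only).
PROCESS_FN_ARN = "arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:<PROCESS_FUNCTION>"
AGGREGATE_FN_ARN = "arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:<AGGREGATE_FUNCTION>"

definition = {
    "Comment": "Fan-out / fan-in sketch",
    "StartAt": "ProcessFiles",
    "States": {
        "ProcessFiles": {
            "Type": "Map",
            "ItemsPath": "$.files",      # array to iterate over
            "MaxConcurrency": 5,         # cap on parallel iterations
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "ProcessOneFile",
                "States": {
                    "ProcessOneFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {"FunctionName": PROCESS_FN_ARN, "Payload.$": "$"},
                        "OutputPath": "$.Payload",  # keep only the function's return value
                        "End": True,
                    }
                },
            },
            "ResultPath": "$.results",   # fan-in: array of per-item outputs
            "Next": "Aggregate",
        },
        "Aggregate": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": AGGREGATE_FN_ARN, "Payload.$": "$.results"},
            "End": True,
        },
    },
}


def unknown_targets(states):
    """Return `Next` targets that do not name a state at the same level."""
    return [s["Next"] for s in states.values() if "Next" in s and s["Next"] not in states]


asl_json = json.dumps(definition, indent=2)
print(unknown_targets(definition["States"]))  # []
```

The inline mode shown here suits modest fan-out; very large item counts are usually a reason to look at the distributed Map mode instead.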

Why not just a script/cron?
- Step Functions gives you a visual workflow graph (which may include loops, so it is not strictly a DAG), a per-execution history, and first-class retry/error semantics.
- Workflows can be triggered from EventBridge schedules/events, API calls, or other AWS services.


## Using Step Functions with the AWS SDK (pseudo-code)
Below is a minimal, **non-executable** sketch of creating a state machine and starting an execution.

Notes:
- In real projects, you often **deploy the state machine via IaC** (CDK/Terraform/CloudFormation) and use the SDK mainly for `start_execution`.
- The state machine needs an IAM **execution role** (trusted by Step Functions) with permissions to invoke the integrated services.
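- `create_state_machine` raises `StateMachineAlreadyExists` if the name is already taken by a state machine with a different definition or type (an identical repeat request is treated as idempotent); use `update_state_machine` for updates.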
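Before the main sketch, the execution role from the notes above can be illustrated with two small policy documents: a trust policy that lets the Step Functions service assume the role, and a permissions policy that allows invoking one Lambda function. This is a minimal sketch; the ARN is a placeholder, and a real role would need additional statements for any other integrated services (and typically for logging).

```python
import json

# Placeholder ARN (assumption for illustration only).
lambda_function_arn = "arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:<FUNCTION_NAME>"

# Who may assume the role: the Step Functions service principal.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "states.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# What the role may do: invoke the one function the workflow calls.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": lambda_function_arn,
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

Scoping `Resource` to the specific function ARN, rather than `*`, keeps the role limited to what this workflow actually needs.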

```python
# PSEUDO-CODE (do not run)

import json
import boto3

region = "us-east-1"
sf = boto3.client("stepfunctions", region_name=region)

state_machine_name = "demo-step-functions"
role_arn = "arn:aws:iam::<ACCOUNT_ID>:role/<STEP_FUNCTIONS_EXECUTION_ROLE>"
lambda_function_arn = "arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:<FUNCTION_NAME>"

# 1) Define the workflow (Amazon States Language / ASL).
# This example: invoke Lambda -> succeed.
definition = {
    "Comment": "Minimal Step Functions example",
    "StartAt": "InvokeLambda",
    "States": {
        "InvokeLambda": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": lambda_function_arn, "Payload.$": "$"},
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "Success",
        },
        "Success": {"Type": "Succeed"},
    },
}

# 2) Create the state machine (or update it if it already exists).
# try:
#     resp = sf.create_state_machine(
#         name=state_machine_name,
#         definition=json.dumps(definition),
#         roleArn=role_arn,
#         type="STANDARD",  # or "EXPRESS"
#     )
#     state_machine_arn = resp["stateMachineArn"]
# except sf.exceptions.StateMachineAlreadyExists:
#     state_machine_arn = "<STATE_MACHINE_ARN>"  # e.g., from IaC outputs
#     sf.update_state_machine(
#         stateMachineArn=state_machine_arn,
#         definition=json.dumps(definition),
#         roleArn=role_arn,
#     )

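# 3) Start an execution with some JSON input.
# Execution names must be unique per state machine (for Standard workflows, for 90 days);
# omit `name` to let Step Functions generate one.
# start = sf.start_execution(
#     stateMachineArn=state_machine_arn,
#     name="demo-exec-001",
#     input=json.dumps({"message": "hello"}),
# )
# execution_arn = start["executionArn"]

# 4) Observe status (or emit events to EventBridge and react there).
# Standard executions run asynchronously, so the status is usually "RUNNING" right after start.
# desc = sf.describe_execution(executionArn=execution_arn)
# print(desc["status"])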
```
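A single `describe_execution` call right after starting a Standard execution will usually report `RUNNING`, so callers that need the final result typically poll. Below is a minimal polling sketch. `wait_for_execution` and `FakeStepFunctionsClient` are hypothetical names introduced here; the fake client stands in for `boto3.client("stepfunctions")` so the example makes no AWS calls, and the set of terminal statuses reflects my understanding of the API rather than an exhaustive list.

```python
import time

# Statuses after which a Standard execution no longer changes on its own.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(client, execution_arn, poll_seconds=5.0,
                       timeout_seconds=300.0, sleep=time.sleep):
    """Poll describe_execution until a terminal status or until the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = client.describe_execution(executionArn=execution_arn)
        if desc["status"] in TERMINAL_STATUSES:
            return desc
        if time.monotonic() >= deadline:
            raise TimeoutError(f"execution still {desc['status']} after {timeout_seconds}s")
        sleep(poll_seconds)


class FakeStepFunctionsClient:
    """Test double that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_execution(self, executionArn):
        return {"executionArn": executionArn, "status": next(self._statuses)}


fake = FakeStepFunctionsClient(["RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_execution(fake, "<EXECUTION_ARN>", poll_seconds=0.0, sleep=lambda s: None)
print(final["status"])  # SUCCEEDED
```

For workflows that run for hours, reacting to execution status-change events in EventBridge is usually a better fit than holding a polling loop open.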
