# Google Colab Version: [Open this notebook in Google Colab](https://colab.research.google.com/github/starfishdata/starfish/blob/main/examples/data_factory.ipynb)

#### Dependencies 

In [11]:
%pip install starfish-core

Note: you may need to restart the kernel to use updated packages.


In [12]:
## Fix for Jupyter notebooks only; do NOT use in production
## Enables async code execution in notebooks, but may cause problems when mixing sync and async code
## For production, run in standard .py files without this workaround
## See https://github.com/erdewit/nest_asyncio for more details
import nest_asyncio
nest_asyncio.apply()

from starfish import StructuredLLM, data_factory
from starfish.llm.utils import merge_structured_outputs

from starfish.common.env_loader import load_env_file ## Load environment variables from .env file
load_env_file()

In [13]:
# setup your openai api key if not already set
# import os
# os.environ["OPENAI_API_KEY"] = "your_key_here"

# If you don't have an API key, skip ahead to the local model section

In [14]:
## Helper function: mock LLM call
# When developing data pipelines with LLMs, making thousands of real API calls
# can be expensive. Using mock LLM calls lets you test your pipeline's reliability,
# failure handling, and recovery without spending money on API calls.
from starfish.data_factory.utils.mock import mock_llm_call

#### 1. Your First Data Factory: Simple Scaling

The @data_factory decorator transforms any async function into a scalable data processing pipeline.
It handles:
- Parallel execution 
- Automatic batching
- Error handling & retries
- Progress tracking
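
To make "parallel execution" concrete, here is a minimal sketch of bounded concurrency in plain `asyncio`. It is not how starfish implements `data_factory`; `run_bounded` and `fake_llm_call` are hypothetical names used only for illustration, and the sketch leaves out batching, retries, and progress tracking.

```python
import asyncio

async def run_bounded(worker, items, max_concurrency=10):
    """Run worker(item) for every item, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def guarded(item):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await worker(item)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the inputs
    results = await asyncio.gather(*(guarded(item) for item in items))
    return results, peak

async def fake_llm_call(city_name):
    # Hypothetical stand-in for a real LLM request
    await asyncio.sleep(0.01)
    return {"fact": f"placeholder fact about {city_name}"}

demo_cities = ["New York", "London", "Tokyo", "Paris", "Sydney"]
bounded_results, peak_in_flight = asyncio.run(
    run_bounded(fake_llm_call, demo_cities, max_concurrency=2)
)
```

The semaphore plays the role that `max_concurrency` plays in the decorator: all five tasks are scheduled at once, but only two run at any moment.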

Let's start with a single LLM call and then show how easy it is to scale it.


In [15]:
# First, create a StructuredLLM instance for generating facts about cities
json_llm = StructuredLLM(
    model_name = "openai/gpt-4o-mini",
    prompt = "Funny facts about city {{city_name}}.",
    output_schema = [{'name': 'fact', 'type': 'str'}],
    model_kwargs = {"temperature": 0.7},
)

json_llm_response = await json_llm.run(city_name='New York')
json_llm_response.data

[{'fact': "New York City has its own 'official' pizza, but good luck finding a slice that isn't loved by at least ten other people who will argue about it passionately!"}]

In [16]:
# Now, scale to multiple cities using data_factory
# Just add the @data_factory decorator to process many cities in parallel

from datetime import datetime
@data_factory(max_concurrency=10)
async def process_json_llm(city_name: str):
    ## Adding a print statement to indicate the start of the processing
    print(f"Processing {city_name} at {datetime.now()}")
    json_llm_response = await json_llm.run(city_name=city_name)
    return json_llm_response.data

# This is all it takes to scale from one city to many cities!
process_json_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"])

2025-04-17 10:48:39 | INFO     | [JOB START] Master Job ID: b46ec055-1103-469d-ba82-d681f348d7a5 | Logging progress every 3 seconds
2025-04-17 10:48:39 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
Processing New York at 2025-04-17 10:48:39.272932
Processing London at 2025-04-17 10:48:39.273200
Processing Tokyo at 2025-04-17 10:48:39.273915
Processing Paris at 2025-04-17 10:48:39.274188
Processing Sydney at 2025-04-17 10:48:39.274343
2025-04-17 10:48:42 | INFO     | [JOB PROGRESS] Completed: 4/5 | Running: 1 | Attempted: 4    (Completed: 4, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:43 | INFO     | [JOB FINISHED] Final Status: Completed: 

[{'fact': "Sydney's famous Bondi Beach is so popular that it has its own 'Bondi Rescue' reality TV show, which means lifeguards in Sydney are basically celebrities!"},
 {'fact': 'New York City is home to more than 8.6 million people, but if you count all the pigeons, the population could easily double! It’s estimated there are over a million pigeons in the city, making it a real feathered metropolis!'},
 {'fact': "Paris has a 'secret' underground city that includes a network of tunnels, catacombs, and even a hidden restaurant called 'Le Train Bleu' where diners can enjoy gourmet meals while feeling like they're in a French train station from the 1900s!"},
 {'fact': 'In Tokyo, there are more vending machines than there are people! With over 5 million vending machines, you can buy everything from hot meals to umbrellas at any time of day or night.'},
 {'fact': "London has a 'Talking' statue! The statue of Sir Winston Churchill in Parliament Square is known to have a conversation with ano

#### 2. Works with any async function

Data Factory works with any async function, not just LLM calls. You can build complex pipelines involving multiple LLMs, data processing, and more.

Here is an example that chains two structured LLMs.

In [17]:
# Example of a more complex function that chains multiple LLM calls
# Adapted from the structured LLM examples

@data_factory(max_concurrency=10)
async def complex_process_cities(topic: str):
    ## topic → generator_llm → rating_llm → merged results
    # First LLM to generate question/answer pairs
    generator_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt="Generate question/answer pairs about {{topic}}.",
        output_schema=[
            {"name": "question", "type": "str"},
            {"name": "answer", "type": "str"}
        ],
    )

    # Second LLM to rate the generated pairs
    rater_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt='''Rate the following Q&A pairs based on accuracy and clarity (1-10).
        Pairs: {{generated_pairs}}''',
        output_schema=[
            {"name": "accuracy_rating", "type": "int"},
            {"name": "clarity_rating", "type": "int"}
        ],
        model_kwargs={"temperature": 0.5}
    )

    generation_response = await generator_llm.run(topic=topic, num_records=5)
    rating_response = await rater_llm.run(generated_pairs=generation_response.data)
    
    # Merge the results
    return merge_structured_outputs(generation_response.data, rating_response.data)


### To save on tokens, we use only 3 topics in this example
complex_process_cities_data = complex_process_cities.run(topic=['Science', 'History', 'Technology'])

2025-04-17 10:48:43 | INFO     | [JOB START] Master Job ID: 781359f7-189e-47e9-83ff-54b807bd2ecf | Logging progress every 3 seconds
2025-04-17 10:48:43 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:46 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:49 | INFO     | [JOB PROGRESS] Completed: 1/3 | Running: 2 | Attempted: 1    (Completed: 1, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:52 | INFO     | [JOB PROGRESS] Completed: 1/3 | Running: 2

In [18]:
### Each topic has 5 question/answer pairs, so 3 topics yield 15 pairs
print(len(complex_process_cities_data))
print(complex_process_cities_data)

15
[{'question': 'Who was the first President of the United States?', 'answer': 'George Washington', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What year did the Titanic sink?', 'answer': '1912', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'Which ancient civilization is known for building the pyramids?', 'answer': 'The Ancient Egyptians', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What significant event started on July 28, 1914?', 'answer': 'World War I', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'Who was the first woman to fly solo across the Atlantic Ocean?', 'answer': 'Amelia Earhart', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the primary function of a web browser?', 'answer': 'A web browser is primarily used to access and display content on the World Wide Web, allowing users to view websites, interact with online services, and navigate between different web pages.', 'accuracy_rating': 10, 'c
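
Each merged record above carries the fields of a generated pair together with its ratings. One way to picture that step is a position-by-position merge of the two record lists. The sketch below assumes that behavior; it is not the source of `merge_structured_outputs`, and `merge_by_position` is a hypothetical name.

```python
def merge_by_position(*record_lists):
    """Combine the i-th dict of every list into one dict per position."""
    merged = []
    for records in zip(*record_lists):
        combined = {}
        for record in records:
            combined.update(record)  # later lists win on duplicate keys
        merged.append(combined)
    return merged

# Sample values taken from the output above
generated = [{"question": "What year did the Titanic sink?", "answer": "1912"}]
ratings = [{"accuracy_rating": 10, "clarity_rating": 10}]
merged_records = merge_by_position(generated, ratings)
```

Because `zip` stops at the shortest list, this sketch silently drops unmatched records; a real implementation may handle length mismatches differently.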

#### 3. Working with Different Input Formats


Data Factory is flexible with how you provide inputs. Let's demonstrate different ways to pass parameters to data_factory functions.

'data' is a reserved keyword that expects a list or tuple of dicts. This design makes it easy to pass large datasets and to support Hugging Face datasets and pandas DataFrames.
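
The three input styles shown in this section can all be reduced to one list of per-task keyword dicts. The sketch below illustrates that idea under stated assumptions: `expand_inputs` is a hypothetical helper, not part of starfish, and the length check on list arguments is a guess about reasonable behavior rather than documented library behavior.

```python
def expand_inputs(data=None, **kwargs):
    """Turn the three input styles into one list of per-task keyword dicts."""
    if data is not None:
        # 'data' style: already a sequence of records
        return [dict(record) for record in data]

    list_lengths = {len(v) for v in kwargs.values() if isinstance(v, (list, tuple))}
    if len(list_lengths) > 1:
        raise ValueError("list arguments must all have the same length")
    n_tasks = list_lengths.pop() if list_lengths else 1

    tasks = []
    for i in range(n_tasks):
        task = {}
        for name, value in kwargs.items():
            # Lists are consumed position by position; single values are repeated
            task[name] = value[i] if isinstance(value, (list, tuple)) else value
        tasks.append(task)
    return tasks

zipped = expand_inputs(city_name=["New York", "London"], num_records_per_city=[2, 1])
broadcast = expand_inputs(city_name=["New York", "London"], num_records_per_city=1)
records = expand_inputs(data=[{"city_name": "New York", "num_records_per_city": 2}])
```

A list of records is also what `pandas.DataFrame.to_dict(orient="records")` produces, which is why the 'data' style fits tabular sources well.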

In [19]:
## We will use the mock LLM call for the rest of the examples to save on tokens
## The mock LLM call simulates an LLM API call with random delays (controlled by sleep_time) and occasional failures (controlled by fail_rate)
await mock_llm_call(city_name="New York", num_records_per_city=3)

[{'answer': 'New York_3'}, {'answer': 'New York_2'}, {'answer': 'New York_2'}]

In [20]:
@data_factory(max_concurrency=100)
async def input_format_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.01)

In [21]:
# Format 1: Multiple lists that get zipped together
input_format_data1 = input_format_mock_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"], num_records_per_city=[2, 1, 1, 1, 1])

2025-04-17 10:48:54 | INFO     | [JOB START] Master Job ID: 404be27d-938a-478a-9aaa-647f3fe42957 | Logging progress every 3 seconds
2025-04-17 10:48:54 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:55 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


In [22]:
# Format 2: List + single value (the single value is broadcast to every task)
input_format_data2 = input_format_mock_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"], num_records_per_city=1)

2025-04-17 10:48:55 | INFO     | [JOB START] Master Job ID: bd85fe93-b67a-43c7-9ba0-ffff4fcd9f27 | Logging progress every 3 seconds
2025-04-17 10:48:55 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:56 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


In [23]:
# Format 3: Special 'data' parameter
# 'data' is a reserved keyword expecting list(dict) or tuple(dict)
# Makes integration with various data sources easier
input_format_data3 = input_format_mock_llm.run(data=[{"city_name": "New York", "num_records_per_city": 2}, {"city_name": "London", "num_records_per_city": 1}, {"city_name": "Tokyo", "num_records_per_city": 1}, {"city_name": "Paris", "num_records_per_city": 1}, {"city_name": "Sydney", "num_records_per_city": 1}])

2025-04-17 10:48:56 | INFO     | [JOB START] Master Job ID: f41ea24c-3d40-45c6-81aa-8d04e9cb494a | Logging progress every 3 seconds
2025-04-17 10:48:56 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:48:57 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


#### 4. Resilient error retry
Data Factory automatically handles errors and retries, making your pipelines robust.
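
As a rough picture of what retrying a failed task involves, here is a minimal sketch in plain `asyncio`. It is not starfish's retry logic: `call_with_retries` and `make_flaky_call` are hypothetical helpers, and the failure pattern is deterministic (fail a fixed number of times, then succeed) so the behavior is easy to follow.

```python
import asyncio

async def call_with_retries(task, max_attempts=3):
    """Await task() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = await task()
            return result, attempt
        except RuntimeError as error:
            last_error = error  # remember the failure and try again
    raise last_error

def make_flaky_call(failures_before_success):
    """Build a coroutine function that fails a fixed number of times first."""
    remaining = failures_before_success

    async def flaky_call():
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RuntimeError("simulated LLM failure")
        return [{"answer": "ok"}]

    return flaky_call

retry_result, attempts_used = asyncio.run(call_with_retries(make_flaky_call(2)))
```

If every attempt fails, the last error is re-raised, so the caller can record the task as failed and move on to the others.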

Let's demonstrate with a high failure rate example.

In [24]:
@data_factory(max_concurrency=100)
async def high_error_rate_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3) # Hardcode to 30% chance of failure

# Process all cities - some will fail, but data_factory keeps going
cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 5  # 25 cities
high_error_rate_mock_llm_data = high_error_rate_mock_llm.run(city_name=cities, num_records_per_city=1)

print(f"\nSuccessfully completed {len(high_error_rate_mock_llm_data)} out of {len(cities)} tasks")
print("Data Factory automatically handled the failures and continued processing")
print("The results only include successful tasks")

2025-04-17 10:48:57 | INFO     | [JOB START] Master Job ID: d7aa224a-d0e1-4e3d-b564-3f013b8cbdeb | Logging progress every 3 seconds
2025-04-17 10:48:57 | INFO     | [JOB PROGRESS] Completed: 0/25 | Running: 25 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:00 | INFO     | [JOB PROGRESS] Completed: 15/25 | Running: 10 | Attempted: 15    (Completed: 15, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:00 | ERROR    | Error running task: Mock LLM failed to process city: Paris
2025-04-17 10:49:00 | ERROR    | Error running task: Mock LLM failed to process city: Tokyo
2025-04-17 10:49:00 | ERROR    | Error running task: Mo

#### 5. Re-run

If a job is interrupted, you can pick up where you left off by resuming it with its master job ID.

This is essential for long-running jobs with thousands of tasks.

We're simulating an interruption here. In a real scenario, this might happen if your notebook kernel crashes or you stop execution manually.
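
The idea behind resuming can be sketched as a checkpoint that records which tasks have already finished, so a second run skips them. The sketch below is illustrative only: `run_with_checkpoint` is a hypothetical helper, the checkpoint is an in-memory dict rather than persistent storage keyed by job ID, and tasks run sequentially for simplicity.

```python
import asyncio

async def run_with_checkpoint(worker, items, checkpoint, stop_after=None):
    """Process items not yet in checkpoint; record each result as it finishes."""
    processed_now = 0
    for index, item in enumerate(items):
        if index in checkpoint:
            continue  # finished in an earlier run
        if stop_after is not None and processed_now >= stop_after:
            break  # simulate an interruption
        checkpoint[index] = await worker(item)
        processed_now += 1
    return processed_now

async def echo_llm_call(city_name):
    # Hypothetical stand-in for a real LLM request
    return {"answer": city_name}

resume_cities = ["New York", "London", "Tokyo", "Paris", "Sydney"]
checkpoint = {}  # a real system would persist this state

first_run = asyncio.run(
    run_with_checkpoint(echo_llm_call, resume_cities, checkpoint, stop_after=3)
)
second_run = asyncio.run(run_with_checkpoint(echo_llm_call, resume_cities, checkpoint))
```

The second call does only the work the first call left undone, which is what matters for long jobs with thousands of tasks.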

In [25]:
@data_factory(max_concurrency=10)
async def re_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 20  # 100 cities
re_run_mock_llm_data_1 = re_run_mock_llm.run(city_name=cities, num_records_per_city=1)

2025-04-17 10:49:08 | INFO     | [JOB START] Master Job ID: a8af7f8f-d440-4cc6-8d4f-836a0f71b32f | Logging progress every 3 seconds
2025-04-17 10:49:08 | INFO     | [JOB PROGRESS] Completed: 0/100 | Running: 10 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:11 | INFO     | [JOB PROGRESS] Completed: 27/100 | Running: 10 | Attempted: 27    (Completed: 27, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:11 | ERROR    | Error running task: Mock LLM failed to process city: London
2025-04-17 10:49:11 | ERROR    | Error running task: Mock LLM failed to process city: Paris
2025-04-17 10:49:14 | INFO     | [JOB PROGRESS] Completed

In [26]:
print("When a job is interrupted, you'll see a message like:")
print("'🚨 Job stopped unexpectedly. You can resume the job using master_job_id'")
print("'by resume_from_checkpoint(\"9a590f0f-994d-4d3b-b2db-30e75143ff14\")'")

print("\nTo resume an interrupted job, simply call:")
print("interrupted_job_mock_llm.resume_from_checkpoint(\"YOUR_JOB_ID\")")
print('')
print(f"For this example we have {len(re_run_mock_llm_data_1)}/{len(cities)} data generated and not finished yet!")

When a job is interrupted, you'll see a message like:
'🚨 Job stopped unexpectedly. You can resume the job using master_job_id'
'by resume_from_checkpoint("9a590f0f-994d-4d3b-b2db-30e75143ff14")'

To resume an interrupted job, simply call:
interrupted_job_mock_llm.resume_from_checkpoint("YOUR_JOB_ID")

For this example we have 64/100 data generated and not finished yet!


In [28]:
## Let's continue the rest of the run by resuming from the checkpoint
re_run_mock_llm_data_2 = re_run_mock_llm.resume("a8af7f8f-d440-4cc6-8d4f-836a0f71b32f")

2025-04-17 10:49:33 | INFO     | [JOB RE-RUN START] PICKING UP FROM WHERE THE JOB WAS LEFT OFF...
2025-04-17 10:49:33 | INFO     | [RE-RUN PROGRESS]STATUS AT THE TIME OF RE-RUN: Completed: 64 / 100 | Failed: 3 | Duplicate: 0 | Filtered: 0
2025-04-17 10:49:33 | INFO     | [JOB PROGRESS] Completed: 64/100 | Running: 10 | Attempted: 67    (Completed: 64, Failed: 3, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:36 | INFO     | [JOB PROGRESS] Completed: 77/100 | Running: 10 | Attempted: 80    (Completed: 77, Failed: 3, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:36 | ERROR    | Error running task: Mock LLM failed to process city: New York
2025-04-17 10:49:36 | 

In [29]:
print(f"We were still able to finish what was left: {len(re_run_mock_llm_data_2)} records generated!")

We were still able to finish what was left: 100 records generated!


#### 6. Dry run
Before running a large job, you can do a "dry run" to test your pipeline. This only processes a single item and doesn't save state to the database.
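
Conceptually, a dry run takes the full input, executes only the first task, and persists nothing. The sketch below shows that idea in plain `asyncio`; `dry_run_sketch` and `recording_worker` are hypothetical names, and the return shape is an assumption rather than what starfish's `dry_run` returns.

```python
import asyncio

async def dry_run_sketch(worker, tasks):
    """Run only the first task; keep nothing beyond the returned result."""
    first_task = tasks[0]
    return await worker(**first_task)

calls = []

async def recording_worker(city_name, num_records_per_city):
    # Hypothetical worker that records each call so we can count them
    calls.append(city_name)
    return [{"answer": f"{city_name}_{i}"} for i in range(num_records_per_city)]

demo_tasks = [
    {"city_name": city, "num_records_per_city": 1}
    for city in ["New York", "London", "Tokyo"]
]
dry_run_output = asyncio.run(dry_run_sketch(recording_worker, demo_tasks))
```

Only one call reaches the worker no matter how many tasks are passed in, which keeps a pipeline check cheap before a large job.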

In [30]:
@data_factory(max_concurrency=10)
async def dry_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

dry_run_mock_llm_data = dry_run_mock_llm.dry_run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"]*20, num_records_per_city=1)

2025-04-17 10:49:45 | INFO     | [JOB START] Master Job ID: None | Logging progress every 3 seconds
2025-04-17 10:49:45 | INFO     | [JOB PROGRESS] Completed: 0/1 | Running: 1 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:49:46 | INFO     | [JOB FINISHED] Final Status: Completed: 1/0 | Attempted: 1 (Failed: 0, Filtered: 0, Duplicate: 0)


#### 7. Advanced Usage
Data Factory offers more advanced capabilities for pipeline customization, including hooks that execute at key stages and shareable state for coordinating between tasks. These features enable complex workflows and fine-grained control. Dedicated examples for advanced data_factory usage are coming soon!