# Google Colab Version: [Open this notebook in Google Colab](https://colab.research.google.com/github/starfishdata/starfish/blob/main/examples/data_factory.ipynb)

### Dependencies 

In [None]:
%pip install starfish-core

In [1]:
## Fix for Jupyter Notebook only - do NOT use in production
## Enables async code execution in notebooks, but may cause conflicts between sync and async code
## For production, please run in standard .py files without this workaround
## See: https://github.com/erdewit/nest_asyncio for more details
import nest_asyncio
nest_asyncio.apply()

from starfish import StructuredLLM, data_factory
from starfish.llm.utils import merge_structured_outputs

from starfish.common.env_loader import load_env_file ## Load environment variables from .env file
load_env_file()

In [2]:
# Set your OpenAI API key if it is not already set
# import os
# os.environ["OPENAI_API_KEY"] = "your_key_here"

# If you don't have an API key, please navigate to the local model section

In [3]:
## Helper function: mock LLM call
# When developing data pipelines with LLMs, making thousands of real API calls
# can be expensive. Using mock LLM calls lets you test your pipeline's reliability,
# failure handling, and recovery without spending money on API calls.
from starfish.data_factory.utils.mock import mock_llm_call

#### 1. Your First Data Factory: Simple Scaling

The @data_factory decorator transforms any async function into a scalable data processing pipeline.
It handles:
- Parallel execution 
- Automatic batching
- Error handling & retries
- Progress tracking
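
As a rough mental model of the first bullet, bounded parallel execution can be sketched with plain `asyncio`. This is not how starfish implements `@data_factory`; `run_bounded` and `fake_llm` are hypothetical names used only for illustration. Inside a notebook you would `await run_bounded(...)` directly instead of calling `asyncio.run`.

```python
import asyncio


async def run_bounded(func, inputs, max_concurrency=10):
    """Run func once per kwargs dict, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(kwargs):
        # The semaphore caps how many calls run at the same time
        async with semaphore:
            return await func(**kwargs)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(kwargs) for kwargs in inputs))


async def fake_llm(city_name):
    # Hypothetical stand-in for a real LLM call
    await asyncio.sleep(0.01)
    return {"fact": f"placeholder fact about {city_name}"}


cities = ["New York", "London", "Tokyo"]
results = asyncio.run(
    run_bounded(fake_llm, [{"city_name": c} for c in cities], max_concurrency=2)
)
print(results)
```

The decorator shown in the rest of this notebook adds batching, retries, and progress logging on top of this basic idea.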

Let's start with a single LLM call and then show how easy it is to scale it.


In [None]:
# First, create a StructuredLLM instance for generating facts about cities
json_llm = StructuredLLM(
    model_name = "openai/gpt-4o-mini",
    prompt = "Funny facts about city {{city_name}}.",
    output_schema = [{'name': 'fact', 'type': 'str'}],
    model_kwargs = {"temperature": 0.7},
)

json_llm_response = await json_llm.run(city_name='New York')
json_llm_response.data

[{'fact': "In New York City, it's illegal to honk your horn unless it's an emergency - so if you hear a lot of honking, someone must be in a real rush!"}]

In [5]:
# Now, scale to multiple cities using data_factory
# Just add the @data_factory decorator to process many cities in parallel

from datetime import datetime
@data_factory(max_concurrency=10)
async def process_json_llm(city_name: str):
    ## Adding a print statement to indicate the start of the processing
    print(f"Processing {city_name} at {datetime.now()}")
    json_llm_response = await json_llm.run(city_name=city_name)
    return json_llm_response.data

# This is all it takes to scale from one city to many cities!
process_json_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"])

2025-04-17 09:50:06 | INFO     | Job START: Master Job ID: b417d8f3-c8cb-4a21-b6c4-15f862a7db31 | Logging progress every 3 seconds
2025-04-17 09:50:06 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
Processing New York at 2025-04-17 09:50:06.012056
Processing London at 2025-04-17 09:50:06.012331
Processing Tokyo at 2025-04-17 09:50:06.013621
Processing Paris at 2025-04-17 09:50:06.013810
Processing Sydney at 2025-04-17 09:50:06.013972
2025-04-17 09:50:09 | INFO     | [JOB PROGRESS] Job Finished: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


[{'fact': "Sydney's famous Bondi Beach is so popular that it even has its own social media influencer - a dog named 'Hugo' who has over 100,000 followers on Instagram!"},
 {'fact': 'In Paris, there are more dogs than children! With around 300,000 pups frolicking around the city, it seems like the dogs might just be winning the battle for the best city pet.'},
 {'fact': "London has a 'Pigeon Pie' shop where you can buy pies made with actual pigeon meat, which might explain why the city's pigeons are so well-fed and sassy!"},
 {'fact': "Tokyo has a 'Cat Island' called Aoshima, where more cats outnumber humans 6 to 1, making it a purr-fect getaway for feline lovers!"},
 {'fact': "New Yorkers are so busy that they often walk faster than the speed limit, which is usually 25 mph - making them the only pedestrians in the world who might get a ticket for being 'too fast'!"}]

#### 2. Works with any async function

Data Factory works with any async function, not just LLM calls. You can build complex pipelines involving multiple LLMs, data processing steps, and more.

Here is an example that chains two structured LLMs:

In [29]:
# Example of a more complex function that chains multiple LLM calls
# Adapted from the StructuredLLM examples

@data_factory(max_concurrency=10)
async def complex_process_cities(topic: str):
    ## topic → generator_llm → rater_llm → merged results
    # First LLM to generate question/answer pairs
    generator_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt="Generate question/answer pairs about {{topic}}.",
        output_schema=[
            {"name": "question", "type": "str"},
            {"name": "answer", "type": "str"}
        ],
    )

    # Second LLM to rate the generated pairs
    rater_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt='''Rate the following Q&A pairs based on accuracy and clarity (1-10).
        Pairs: {{generated_pairs}}''',
        output_schema=[
            {"name": "accuracy_rating", "type": "int"},
            {"name": "clarity_rating", "type": "int"}
        ],
        model_kwargs={"temperature": 0.5}
    )

    generation_response = await generator_llm.run(topic=topic, num_records=5)
    rating_response = await rater_llm.run(generated_pairs=generation_response.data)
    
    # Merge the results
    return merge_structured_outputs(generation_response.data, rating_response.data)


### To save on tokens, we only use 3 topics in this example
complex_process_cities_data = complex_process_cities.run(topic=['Science', 'History', 'Technology'])

2025-04-17 10:08:33 | INFO     | Job START: Master Job ID: 5061522d-ff62-46ce-87fe-39c59cc6227c | Logging progress every 3 seconds
2025-04-17 10:08:33 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:08:36 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:08:39 | INFO     | [JOB PROGRESS] Completed: 2/3 | Running: 1 | Attempted: 2    (Completed: 2, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:08:41 | INFO     | [JOB PROGRESS] Job Finished: Completed: 3/3

In [31]:
### Each topic yields 5 question/answer pairs, so 3 topics yield 15 pairs
print(len(complex_process_cities_data))
print(complex_process_cities_data)

15
[{'question': 'What is the chemical symbol for water?', 'answer': 'H2O', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What planet is known as the Red Planet?', 'answer': 'Mars', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the process by which plants make their food using sunlight?', 'answer': 'Photosynthesis', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the center of an atom called?', 'answer': 'Nucleus', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What gas do living beings need to breathe?', 'answer': 'Oxygen', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the largest planet in our solar system?', 'answer': 'Jupiter is the largest planet in our solar system.', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the process by which plants make their food?', 'answer': 'Plants make their food through a process called photosynthesis.', 'accuracy_rating': 10, 'clarity_ra
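
Each record in the output above carries fields from both LLM calls. A minimal sketch of a position-by-position merge is shown here, assuming that is roughly what `merge_structured_outputs` does; its actual behavior is not shown in this notebook, and `merge_by_index` is a hypothetical name.

```python
def merge_by_index(*record_lists):
    """Merge lists of dicts position by position into one list of dicts."""
    lengths = {len(records) for records in record_lists}
    if len(lengths) > 1:
        # A choice made for this sketch: refuse to merge misaligned lists
        raise ValueError(f"record lists differ in length: {sorted(lengths)}")

    merged = []
    for rows in zip(*record_lists):
        combined = {}
        for row in rows:
            combined.update(row)  # later lists win on key collisions
        merged.append(combined)
    return merged


generated = [{"question": "What planet is known as the Red Planet?", "answer": "Mars"}]
ratings = [{"accuracy_rating": 10, "clarity_rating": 10}]
merged = merge_by_index(generated, ratings)
print(merged)
```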

#### 3. Working with Different Input Formats


Data Factory is flexible about how you provide inputs. Let's demonstrate different ways to pass parameters to data_factory functions.

`data` is a reserved keyword that expects a list or tuple of dicts. This design makes it easy to pass large datasets and to feed in Hugging Face datasets or pandas DataFrames.
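
The input formats demonstrated in this section (zipped lists, a list plus a broadcast single value, and the `data` list of dicts) all reduce to one set of keyword arguments per task. The sketch below illustrates that normalization in plain Python. It is an assumption about the general idea, not starfish's code, and `normalize_inputs` is a hypothetical helper.

```python
def normalize_inputs(data=None, **kwargs):
    """Turn any of the three input formats into a list of per-task kwargs dicts."""
    if data is not None:
        # Format 3: already one dict per task
        return [dict(row) for row in data]

    list_lengths = {len(v) for v in kwargs.values() if isinstance(v, (list, tuple))}
    if len(list_lengths) > 1:
        # A choice made for this sketch: mismatched list lengths are an error
        raise ValueError("list arguments must all have the same length")
    n_tasks = list_lengths.pop() if list_lengths else 1

    tasks = []
    for i in range(n_tasks):
        # Formats 1 and 2: index into lists, broadcast single values
        tasks.append(
            {k: (v[i] if isinstance(v, (list, tuple)) else v) for k, v in kwargs.items()}
        )
    return tasks


fmt1 = normalize_inputs(city_name=["New York", "London"], num_records_per_city=[2, 1])
fmt2 = normalize_inputs(city_name=["New York", "London"], num_records_per_city=1)
fmt3 = normalize_inputs(data=[
    {"city_name": "New York", "num_records_per_city": 2},
    {"city_name": "London", "num_records_per_city": 1},
])
print(fmt1 == fmt3, fmt2)
```

A pandas DataFrame can be converted to the list-of-dicts shape with `df.to_dict("records")`, which is one way to feed tabular data into the `data` parameter.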

In [8]:
## We will use the mock LLM call for the rest of the examples to save on tokens
## The mock LLM call is a function that simulates an LLM API call with random delays (controlled by sleep_time) and occasional failures (controlled by fail_rate)
await mock_llm_call(city_name="New York", num_records_per_city=3)

[{'answer': 'New York_5'}, {'answer': 'New York_4'}, {'answer': 'New York_5'}]
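
For readers without starfish installed, a stand-in with similar behavior might look like the sketch below. It is modeled only on the description and output above; `fake_llm_call`, its default values, and its error type are assumptions, not the real `mock_llm_call`.

```python
import asyncio
import random


async def fake_llm_call(city_name, num_records_per_city, fail_rate=0.05, sleep_time=0.01):
    """Hypothetical mock: random delay, occasional failure, placeholder records."""
    # Random delay of up to sleep_time seconds
    await asyncio.sleep(random.uniform(0, sleep_time))

    # Fail with probability fail_rate
    if random.random() < fail_rate:
        raise ValueError(f"Fake LLM failed to process city: {city_name}")

    return [
        {"answer": f"{city_name}_{random.randint(1, 5)}"}
        for _ in range(num_records_per_city)
    ]


records = asyncio.run(fake_llm_call("New York", 3, fail_rate=0.0))
print(records)
```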

In [32]:
@data_factory(max_concurrency=100)
async def input_format_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.01)

In [None]:
# Format 1: Multiple lists that get zipped together
input_format_data1 = input_format_mock_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"], num_records_per_city=[2, 1, 1, 1, 1])

In [10]:
# Format 2: List + single value (single value gets broadcasted)
input_format_data2 = input_format_mock_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"], num_records_per_city=1)

2025-04-17 09:50:17 | INFO     | Job START: Master Job ID: 7d742262-e631-4e8f-9da1-fe755cfdd112 | Logging progress every 3 seconds
2025-04-17 09:50:17 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 09:50:18 | INFO     | [JOB PROGRESS] Job Finished: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


In [11]:
# Format 3: Special 'data' parameter
# 'data' is a reserved keyword expecting list(dict) or tuple(dict)
# Makes integration with various data sources easier
input_format_data3 = input_format_mock_llm.run(data=[{"city_name": "New York", "num_records_per_city": 2}, {"city_name": "London", "num_records_per_city": 1}, {"city_name": "Tokyo", "num_records_per_city": 1}, {"city_name": "Paris", "num_records_per_city": 1}, {"city_name": "Sydney", "num_records_per_city": 1}])

2025-04-17 09:50:18 | INFO     | Job START: Master Job ID: db640456-827f-4df2-8463-b4f1fd9d6f9c | Logging progress every 3 seconds
2025-04-17 09:50:18 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 09:50:19 | INFO     | [JOB PROGRESS] Job Finished: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


#### 4. Resilient error retry
Data Factory automatically handles errors and retries, making your pipelines robust.
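
The general retry idea can be sketched in a few lines of `asyncio`. This is an illustration only: the retry count, the absence of backoff, and the name `call_with_retries` are assumptions, and the notebook does not show how starfish configures or performs retries internally.

```python
import asyncio


async def call_with_retries(func, max_attempts=3, **kwargs):
    """Call func until it succeeds or max_attempts is reached, then re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func(**kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error


attempts = {"count": 0}


async def flaky(city_name):
    # Hypothetical task that fails on its first two calls, then succeeds
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError(f"failed to process city: {city_name}")
    return [{"answer": f"{city_name}_ok"}]


result = asyncio.run(call_with_retries(flaky, max_attempts=3, city_name="Tokyo"))
print(result, attempts["count"])
```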

Let's demonstrate with a high failure rate example.

In [33]:
@data_factory(max_concurrency=100)
async def high_error_rate_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3) # Hardcode to 30% chance of failure

# Process all cities - some will fail, but data_factory keeps going
cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 5  # 25 tasks (5 cities x 5)
high_error_rate_mock_llm_data = high_error_rate_mock_llm.run(city_name=cities, num_records_per_city=1)

print(f"\nSuccessfully completed {len(high_error_rate_mock_llm_data)} out of {len(cities)} tasks")
print("Data Factory automatically handled the failures and continued processing")
print("The results only include successful tasks")

2025-04-17 10:12:04 | INFO     | Job START: Master Job ID: 3ac13811-28b9-4a95-b819-102d9a616040 | Logging progress every 3 seconds
2025-04-17 10:12:04 | INFO     | [JOB PROGRESS] Completed: 0/25 | Running: 25 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:12:07 | INFO     | [JOB PROGRESS] Completed: 20/25 | Running: 5 | Attempted: 20    (Completed: 20, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:12:07 | ERROR    | Error running task: Mock LLM failed to process city: Tokyo
2025-04-17 10:12:07 | ERROR    | Error running task: Mock LLM failed to process city: Sydney
2025-04-17 10:12:07 | ERROR    | Error running task: Moc

#### 5. Re-run

If a job is interrupted, you can pick up where you left off using re_run().

This is essential for long-running jobs with thousands of tasks.

We're simulating an interruption here. In a real scenario, this might happen if your notebook kernel crashes or you stop execution manually.
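
Conceptually, resuming means keeping a record of finished tasks and skipping them on the next pass. The sketch below uses an in-memory dict as a stand-in for persisted job state and a `stop_after` argument to simulate the interruption. All names here are hypothetical and this is not how `re_run()` is implemented.

```python
import asyncio


async def run_pending(func, tasks, completed, stop_after=None):
    """Run tasks not yet in completed; optionally stop early to mimic an interruption."""
    processed = 0
    for index, kwargs in enumerate(tasks):
        if index in completed:
            continue  # finished in an earlier pass, do not repeat
        if stop_after is not None and processed >= stop_after:
            break  # simulated interruption
        completed[index] = await func(**kwargs)
        processed += 1
    return completed


calls = []


async def work(city_name):
    calls.append(city_name)  # track every real invocation
    return f"{city_name}_done"


tasks = [{"city_name": c} for c in ["New York", "London", "Tokyo", "Paris", "Sydney"]]
state = {}  # stand-in for persisted job state

asyncio.run(run_pending(work, tasks, state, stop_after=2))  # interrupted after 2 tasks
first_pass = len(state)
asyncio.run(run_pending(work, tasks, state))  # resume: only the remaining 3 run
print(first_pass, len(state), len(calls))
```

Because completed tasks are skipped, the total number of calls equals the number of tasks even though the job ran in two passes.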

In [34]:
@data_factory(max_concurrency=10)
async def re_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 20  # 100 cities
re_run_mock_llm_data_1 = re_run_mock_llm.run(city_name=cities, num_records_per_city=1)

2025-04-17 10:13:41 | INFO     | Job START: Master Job ID: dc2a08a0-3267-482d-a817-3d1a5c637ef1 | Logging progress every 3 seconds
2025-04-17 10:13:41 | INFO     | [JOB PROGRESS] Completed: 0/100 | Running: 10 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:13:44 | INFO     | [JOB PROGRESS] Completed: 21/100 | Running: 10 | Attempted: 21    (Completed: 21, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:13:44 | ERROR    | Error running task: Mock LLM failed to process city: New York
2025-04-17 10:13:45 | ERROR    | Error running task: Mock LLM failed to process city: New York
2025-04-17 10:13:45 | ERROR    | Error occurred:

In [37]:
print("When a job is interrupted, you'll see a message like:")
print("'🚨 Job stopped unexpectedly. You can resume the job using master_job_id'")
print("'by re_run(\"9a590f0f-994d-4d3b-b2db-30e75143ff14\")'")

print("\nTo resume an interrupted job, simply call:")
print("interrupted_job_mock_llm.re_run(\"YOUR_JOB_ID\")")
print('')
print(f"For this example, {len(re_run_mock_llm_data_1)}/{len(cities)} records were generated before the job stopped.")

When a job is interrupted, you'll see a message like:
'🚨 Job stopped unexpectedly. You can resume the job using master_job_id'
'by re_run("9a590f0f-994d-4d3b-b2db-30e75143ff14")'

To resume an interrupted job, simply call:
interrupted_job_mock_llm.re_run("YOUR_JOB_ID")

For this example, 27/100 records were generated before the job stopped.


In [43]:
## Let's continue the rest of the run with re_run
re_run_mock_llm_data_2 = re_run_mock_llm.re_run("dc2a08a0-3267-482d-a817-3d1a5c637ef1")

2025-04-17 10:15:03 | INFO     | Job START: PICKING UP FROM WHERE THE JOB WAS LEFT OFF...
STATUS AT THE TIME OF RE-RUN: Completed: 95 / 100 | Failed: 12 | Duplicate: 0 | Filtered: 0
2025-04-17 10:15:03 | INFO     | [JOB PROGRESS] Completed: 95/100 | Running: 5 | Attempted: 107    (Completed: 95, Failed: 12, Filtered: 0, Duplicate: 0)
2025-04-17 10:15:04 | ERROR    | Error occurred: KeyboardInterrupt
2025-04-17 10:15:04 | INFO     | 🚨 Job stopped unexpectedly. You can resume the job using master_job_id by re_run("dc2a08a0-3267-482d-a817-3d1a5c637ef1")
2025-04-17 10:15:04 | INFO     | [JOB PROGRESS] Job Finished: Completed: 100/100 | Attempted: 112 (Failed: 12, Filtered: 0, Duplicate: 0)


In [44]:
print(f"We were still able to finish what was left! {len(re_run_mock_llm_data_2)} records generated!")

We were still able to finish what was left! 100 records generated!


#### 6. Dry run
Before running a large job, you can do a "dry run" to test your pipeline. This only processes a single item and doesn't save state to the database.
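
The idea behind a dry run can be sketched as "take the first input, run it once, keep nothing". The helper below is a hypothetical illustration of that idea in plain `asyncio`, not the library's `dry_run` implementation, and the shape of its return value is an assumption.

```python
import asyncio


async def dry_run_first(func, tasks):
    """Run only the first task and return its result in a list; nothing is persisted."""
    if not tasks:
        return []
    return [await func(**tasks[0])]


calls = []


async def work(city_name, num_records_per_city):
    calls.append(city_name)  # track every real invocation
    return [{"answer": f"{city_name}_{i}"} for i in range(num_records_per_city)]


cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 20  # 100 tasks
tasks = [{"city_name": c, "num_records_per_city": 1} for c in cities]
preview = asyncio.run(dry_run_first(work, tasks))
print(preview, len(calls))
```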

In [28]:
@data_factory(max_concurrency=10)
async def dry_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

dry_run_mock_llm_data = dry_run_mock_llm.dry_run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"]*20, num_records_per_city=1)

2025-04-17 10:01:02 | INFO     | Job START: Master Job ID: None | Logging progress every 3 seconds
2025-04-17 10:01:02 | INFO     | [JOB PROGRESS] Completed: 0/1 | Running: 1 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-17 10:01:03 | INFO     | [JOB PROGRESS] Job Finished: Completed: 1/0 | Attempted: 1 (Failed: 0, Filtered: 0, Duplicate: 0)


#### 7. Advanced Usage
Data Factory offers more advanced capabilities for complete pipeline customization, including hooks that execute at key stages and shareable state to coordinate between tasks. These powerful features enable complex workflows and fine-grained control. Our dedicated examples for advanced data_factory usage will be coming soon!