# **Homework: Speed up your pipeline**

### **Goal**

Use the public **Jaffle Shop API** to build a `dlt` pipeline and apply everything you've learned about performance:

- Chunking
- Parallelism
- Buffer control
- File rotation
- Worker tuning

Your task is to **make the pipeline as fast as possible**, while keeping the results correct.
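The first item on the list, chunking, can be sketched without any library. The helper below is hypothetical (the name `chunked` and the chunk size are assumptions, not part of `dlt`); it shows the idea of handing rows downstream in lists instead of one at a time, which reduces per-item overhead.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Group an iterable of rows into lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fake_rows(n: int) -> Iterator[dict]:
    """Hypothetical row source standing in for an API or database cursor."""
    for i in range(n):
        yield {"id": i}


# 2500 rows in chunks of 1000 give two full chunks and one partial chunk.
chunk_sizes = [len(c) for c in chunked(fake_rows(2500), 1000)]
print(chunk_sizes)
```

Inside a resource function, the equivalent move is to `yield` each chunk (or each API page) rather than each row.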



### **What you’ll need**

- API base: `https://jaffle-shop.scalevector.ai/api/v1`
- Docs: [https://jaffle-shop.scalevector.ai/docs](https://jaffle-shop.scalevector.ai/docs)
- Start with these endpoints:
  - `/customers`
  - `/orders`
  - `/products`

Each of them returns **paged responses** — so you'll need to handle pagination.
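As a rough illustration of what page-number pagination involves, here is a minimal standard-library sketch. `fetch_page`, `FAKE_ROWS`, and the page size of 100 are assumptions made for the example; a real fetcher would send an HTTP request with a `page` query parameter, and `RESTClient` with a paginator handles this loop for you.

```python
from typing import Callable, Iterator

PAGE_SIZE = 100  # assumed page size, for illustration only

# Stand-in for the API: 250 fake rows served in pages of PAGE_SIZE.
FAKE_ROWS = [{"id": i} for i in range(250)]


def fetch_page(page: int) -> list[dict]:
    """Hypothetical fetcher; a real one would request ?page=<page> over HTTP."""
    start = (page - 1) * PAGE_SIZE
    return FAKE_ROWS[start:start + PAGE_SIZE]


def paginate(fetch: Callable[[int], list[dict]], first_page: int = 1) -> Iterator[list[dict]]:
    """Yield whole pages until the server returns an empty one."""
    page = first_page
    while True:
        rows = fetch(page)
        if not rows:
            break
        yield rows
        page += 1


pages = list(paginate(fetch_page))
print([len(p) for p in pages])
```

Stopping on the first empty page is one common convention; APIs that report a total page count or a next-page link allow the loop to stop without the extra request.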



### **What to implement**

1. **Extract** from the API using `dlt`
   - Use `dlt.resource` and [`RESTClient`](https://dlthub.com/docs/devel/general-usage/http/rest-client) with proper pagination

2. **Apply all performance techniques**
   - Group resources into sources
   - Yield **chunks/pages**, not single rows
   - Use `parallelized=True`
   - Set `EXTRACT__WORKERS`, `NORMALIZE__WORKERS`, and `LOAD__WORKERS`
   - Tune buffer sizes and enable **file rotation**

3. **Measure performance**
   - Time the extract, normalize, and load stages separately
   - Compare a naive version vs. optimized version
   - Log thread info or `pipeline.last_trace` if helpful
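For the measurement step, one possible approach is a small timing context manager built on `time.perf_counter`. The three stage functions below are placeholders that only sleep; in the notebook they would be replaced by the real `pipeline.extract(...)`, `pipeline.normalize()`, and `pipeline.load()` calls.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Placeholder stages (assumptions for this sketch); they only simulate work.
def extract() -> None:
    time.sleep(0.02)


def normalize() -> None:
    time.sleep(0.01)


def load() -> None:
    time.sleep(0.01)


for name, step in [("extract", extract), ("normalize", normalize), ("load", load)]:
    with timed(name):
        step()

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
```

Each stage is timed exactly once here. A stage that consumes its input, such as normalize or load, will likely have nothing left to do on a repeated call, so averaging over many loops can report the cost of an idle step instead of the real work.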


### **Deliverables**

Share your code as a Google Colab notebook or [GitHub Gist](https://gist.github.com/) in the Homework Google Form. **This step is required for certification.**


It should include:
- Working pipeline for at least 2 endpoints
- Before/after timing comparison
- A short explanation of which changes made the biggest difference, if any

In [28]:
import os
import time

import dlt

from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

In [37]:
# Set this before running the pipeline steps; a stage that has already run is not affected.
os.environ["DATA_WRITER__BUFFER_MAX_ITEMS"] = "5000"
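The worker counts and file rotation named in the task can be set the same way. The sketch below is a configuration fragment with purely illustrative values. The three `*__WORKERS` names come from the task description; `DATA_WRITER__FILE_MAX_ITEMS` follows my understanding of the `file_max_items` writer setting in the dlt performance docs and should be checked against the version you run.

```python
import os

# Illustrative values only; tune them against your own measurements.
performance_settings = {
    "EXTRACT__WORKERS": "3",
    "NORMALIZE__WORKERS": "2",
    "LOAD__WORKERS": "4",
    # Assumed setting name: rotate to a new intermediary file after this many items.
    "DATA_WRITER__FILE_MAX_ITEMS": "20000",
}

# Environment variables must be strings, and must be set before the stages run.
for key, value in performance_settings.items():
    os.environ[key] = value

print({key: os.environ[key] for key in performance_settings})
```

Rotating files matters for the later stages because one large intermediary file gives the normalize and load workers nothing to split between them.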

In [29]:
base_url = 'https://jaffle-shop.scalevector.ai/api/v1'

In [30]:
@dlt.source(name='jaffle_shop')
def jaffle_shop_source():
    # One client shared by all three resources.
    client = RESTClient(
        base_url=base_url,
        paginator=PageNumberPaginator(
            page_param='page',
            base_page=1,
            # No total page count is read from the response,
            # so the paginator relies on an empty page to stop.
            total_path=None
        )
    )

    # Each resource yields whole pages (lists of rows), not single rows.
    @dlt.resource(name='customers')
    def get_customers():
        for page in client.paginate('customers'):
            yield page

    @dlt.resource(name='orders')
    def get_orders():
        for page in client.paginate('orders'):
            yield page

    @dlt.resource(name='products')
    def get_products():
        for page in client.paginate('products'):
            yield page

    return [get_customers(), get_orders(), get_products()]
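The source above does not yet use `parallelized=True`, which the task lists as one of the techniques to apply. To show why concurrency helps when the work is dominated by waiting on HTTP responses, here is a conceptual standard-library sketch. It is not how `dlt` implements parallel extraction; `fake_endpoint`, the page counts, and the latency are assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_endpoint(name: str, pages: int, latency: float = 0.05) -> tuple[str, int]:
    """Hypothetical stand-in for paginating one endpoint; sleeps instead of doing HTTP."""
    rows = 0
    for _ in range(pages):
        time.sleep(latency)  # simulated network wait per page
        rows += 100          # assumed 100 rows per page
    return name, rows


endpoints = {"customers": 2, "orders": 4, "products": 1}

# Sequential: total time is roughly the sum of all page latencies.
start = time.perf_counter()
sequential = dict(fake_endpoint(name, pages) for name, pages in endpoints.items())
sequential_s = time.perf_counter() - start

# Threaded: total time is roughly that of the slowest endpoint.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_endpoint, name, pages) for name, pages in endpoints.items()]
    parallel = dict(future.result() for future in futures)
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, threaded: {parallel_s:.2f}s")
```

Both runs collect the same rows; only the waiting overlaps. The gain is bounded by the slowest endpoint, which in this homework is likely `orders`, since the extract metrics show it carrying far more items than the other two.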

In [31]:
pipeline = dlt.pipeline(
    destination='duckdb',
    dataset_name='jaffle_shop',
    dev_mode=True,  # replaces the deprecated full_refresh flag
    progress='log'
)

In [32]:
%%time

pipeline.extract(jaffle_shop_source())

----------------------------- Extract jaffle_shop ------------------------------
Resources: 0/3 (0.0%) | Time: 0.00s | Rate: 0.00/s
Memory usage: 51.38 MB (85.90%) | CPU usage: 0.00%

----------------------------- Extract jaffle_shop ------------------------------
Resources: 0/3 (0.0%) | Time: 1.07s | Rate: 0.00/s
customers: 100  | Time: 0.00s | Rate: 32263876.92/s
Memory usage: 64.98 MB (85.70%) | CPU usage: 0.00%

----------------------------- Extract jaffle_shop ------------------------------
Resources: 0/3 (0.0%) | Time: 2.50s | Rate: 0.00/s
customers: 100  | Time: 1.43s | Rate: 69.88/s
orders: 100  | Time: 0.00s | Rate: 19065018.18/s
Memory usage: 66.30 MB (85.30%) | CPU usage: 0.00%

----------------------------- Extract jaffle_shop ------------------------------
Resources: 0/3 (0.0%) | Time: 3.47s | Rate: 0.00/s
customers: 100  | Time: 2.40s | Rate: 41.66/s
orders: 100  | Time: 0.97s | Rate: 103.16/s
products: 10  | Time: 0.00s | Rate: 2467237.65/s
Memory usage: 64.92 MB (85.30%

ExtractInfo(pipeline=<dlt.pipeline.pipeline.Pipeline object at 0x12174a110>, metrics={'1748263925.1556349': [{'started_at': DateTime(2025, 5, 26, 12, 52, 5, 159860, tzinfo=Timezone('UTC')), 'finished_at': DateTime(2025, 5, 26, 13, 3, 52, 614968, tzinfo=Timezone('UTC')), 'schema_name': 'jaffle_shop', 'job_metrics': {'customers.f5fe9da844.typed-jsonl': DataWriterMetrics(file_path='/Users/redperiabras/.dlt/pipelines/dlt_ipykernel_launcher/normalize/627c2760bdd3c33a/1748263925.1556349/new_jobs/customers.f5fe9da844.0.typed-jsonl', items_count=935, file_size=64816, created=1748263926.227572, last_modified=1748263950.141396), 'orders.cd787b0610.typed-jsonl': DataWriterMetrics(file_path='/Users/redperiabras/.dlt/pipelines/dlt_ipykernel_launcher/normalize/627c2760bdd3c33a/1748263925.1556349/new_jobs/orders.cd787b0610.0.typed-jsonl', items_count=61948, file_size=25523121, created=1748263927.658583, last_modified=1748264632.548816), 'products.a9e957e93e.typed-jsonl': DataWriterMetrics(file_path='

In [33]:
%%timeit

pipeline.normalize()

----------------- Normalize jaffle_shop in 1748263925.1556349 ------------------
Files: 0/4 (0.0%) | Time: 0.00s | Rate: 0.00/s
Memory usage: 77.28 MB (84.70%) | CPU usage: 0.00%

----------------- Normalize jaffle_shop in 1748263925.1556349 ------------------
Files: 0/4 (0.0%) | Time: 0.00s | Rate: 0.00/s
Items: 0  | Time: 0.00s | Rate: 0.00/s
Memory usage: 77.28 MB (84.70%) | CPU usage: 0.00%

----------------- Normalize jaffle_shop in 1748263925.1556349 ------------------
Files: 5/4 (125.0%) | Time: 5.48s | Rate: 0.91/s
Items: 0  | Time: 5.48s | Rate: 0.00/s
Memory usage: 196.34 MB (84.00%) | CPU usage: 0.00%

----------------- Normalize jaffle_shop in 1748263925.1556349 ------------------
Files: 5/4 (125.0%) | Time: 5.49s | Rate: 0.91/s
Items: 153794  | Time: 5.49s | Rate: 27997.67/s
Memory usage: 201.72 MB (84.00%) | CPU usage: 0.00%

8.14 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [38]:
%%timeit

load_info = pipeline.load()

8.26 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [40]:
print(pipeline.last_trace)

Run started at 2025-05-26 12:52:05.142543+00:00 and COMPLETED in 34 minutes and 4.45 seconds with 100 steps.
Step load COMPLETED in 0.01 seconds.
Pipeline dlt_ipykernel_launcher load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////Users/redperiabras/Developer/dlt-advanced/dlt_ipykernel_launcher.duckdb location to store data

Step load COMPLETED in 0.01 seconds.
Pipeline dlt_ipykernel_launcher load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////Users/redperiabras/Developer/dlt-advanced/dlt_ipykernel_launcher.duckdb location to store data

Step load COMPLETED in 0.01 seconds.
Pipeline dlt_ipykernel_launcher load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////Users/redperiabras/Developer/dlt-advanced/dlt_ipykernel_launcher.duckdb 