# Getting Started

## Installation

```
pip install pypekit
```

## Define Tasks

To define a task, subclass `Task`, set the `input_types` and `output_types` attributes, and implement the `run` method.

`input_types` and `output_types` are lists of strings naming the types of input a task accepts and the types of output it produces.
The `run` method contains the logic that turns the input into the output.

When these tasks are later assembled into pipelines, each pipeline runs from a task with the input type `"source"` to a task with the output type `"sink"`.
All other input and output types are intermediate types that you can use to define how tasks connect to each other.

For two tasks to be connected, at least one of the output types of the first task must match one of the input types of the second task.
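
The connection rule can be sketched as a set intersection. The helper name `can_connect` is hypothetical and not part of pypekit; it only illustrates the matching logic described above.

```python
def can_connect(first_output_types, second_input_types):
    """Return True if any output type of the first task is an input type of the second."""
    return bool(set(first_output_types) & set(second_input_types))


# Type lists taken from the tasks defined in this guide.
print(can_connect(["a"], ["a", "b"]))  # Source -> Transform2
print(can_connect(["a"], ["b"]))       # Source -> Sink1
```

The first call finds the shared type `"a"`, so the two tasks can be chained. The second call finds no shared type, so a `Source` cannot feed a `Sink1` directly.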

```python
from pypekit import Task

class Source(Task):
    input_types = ["source"]
    output_types = ["a"]

    def run(self, _):
        print("Running Source")
        return "source"

class Transform1(Task):
    input_types = ["a"]
    output_types = ["b"]

    def run(self, x):
        print("Running Transform1")
        return x + "_transformed-1"

class Transform2(Task):
    input_types = ["a", "b"]
    output_types = ["b"]

    def run(self, x):
        print("Running Transform2")
        return x + "_transformed-2"

class Sink1(Task):
    input_types = ["b"]
    output_types = ["sink"]

    def run(self, x):
        print("Running Sink1")
        return x + "_sink-1"

class Sink2(Task):
    input_types = ["b"]
    output_types = ["sink"]

    def run(self, x):
        print("Running Sink2")
        return x + "_sink-2"
```


## Build Repository

To create all possible pipelines from a list of tasks, we now create a `Repository`.
We can then use the `build_tree()` method to build a tree of all possible pipelines from the tasks defined in the repository.

```python
from pypekit import Repository

repository = Repository([
    Source,
    Transform1,
    Transform2,
    Sink1,
    Sink2
])

root = repository.build_tree()
```

## Inspect Tree

To inspect the tree structure, we can use the `build_tree_string()` method, which will return a string representation of the tree.

```python
tree_representation = repository.build_tree_string()
print(tree_representation)
```

```
└── Root()
    └── Source()
        ├── Transform1()
        │   ├── Transform2()
        │   │   ├── Sink1()
        │   │   └── Sink2()
        │   ├── Sink1()
        │   └── Sink2()
        └── Transform2()
            ├── Sink1()
            └── Sink2()
```



## Build Pipelines

With the `build_pipelines()` method, we can build all possible pipelines from the tree.

```python
pipelines = repository.build_pipelines()
for p in pipelines:
    print(p)
```

```
Pipeline(tasks=[Source(), Transform1(), Transform2(), Sink1()])
Pipeline(tasks=[Source(), Transform1(), Transform2(), Sink2()])
Pipeline(tasks=[Source(), Transform1(), Sink1()])
Pipeline(tasks=[Source(), Transform1(), Sink2()])
Pipeline(tasks=[Source(), Transform2(), Sink1()])
Pipeline(tasks=[Source(), Transform2(), Sink2()])
```
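
To make the enumeration concrete, here is a minimal sketch of how source-to-sink chains could be derived from type matching alone. It is not pypekit's implementation: the tuples are hypothetical stand-ins for tasks, and the rule that a task appears at most once per chain is an assumption inferred from the tree shown earlier.

```python
# Hypothetical stand-ins: (name, input_types, output_types), not pypekit objects.
TASKS = [
    ("Source", ["source"], ["a"]),
    ("Transform1", ["a"], ["b"]),
    ("Transform2", ["a", "b"], ["b"]),
    ("Sink1", ["b"], ["sink"]),
    ("Sink2", ["b"], ["sink"]),
]


def enumerate_pipelines(tasks):
    """Return every source-to-sink chain, using each task at most once per chain (assumed)."""
    pipelines = []

    def extend(path, available_types):
        for name, inputs, outputs in tasks:
            # Skip tasks already in the chain or whose inputs do not match.
            if name in path or not set(inputs) & set(available_types):
                continue
            new_path = path + [name]
            if "sink" in outputs:
                pipelines.append(new_path)  # chain is complete
            else:
                extend(new_path, outputs)   # keep growing the chain

    extend([], ["source"])
    return pipelines


found = enumerate_pipelines(TASKS)
for chain in found:
    print(" -> ".join(chain))
```

For the five tasks in this guide, the depth-first search yields six chains, in the same order as the pipelines listed above.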


## Execute Pipelines with Caching

Running many similar pipelines is wasteful when they share sub-chains, because every shared task would be recomputed for each pipeline.
To cache intermediate results, we can use the `CachedExecutor`.
Pass a list of pipelines to the `CachedExecutor` and call its `run()` method.
As the output shows, the executor runs only the tasks whose results are not already cached: `Source` and `Transform1`, for example, are each executed once.

```python
from pypekit import CachedExecutor

executor = CachedExecutor(pipelines, verbose=True)
results = executor.run()
```

```
Running Source
Running Transform1
Running Transform2
Running Sink1
Pipeline 1/6 completed. Runtime: 0.00s.
Running Sink2
Pipeline 2/6 completed. Runtime: 0.00s.
Running Sink1
Pipeline 3/6 completed. Runtime: 0.00s.
Running Sink2
Pipeline 4/6 completed. Runtime: 0.00s.
Running Transform2
Running Sink1
Pipeline 5/6 completed. Runtime: 0.00s.
Running Sink2
Pipeline 6/6 completed. Runtime: 0.00s.
```
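
The effect of caching shared sub-chains can be illustrated with a small stand-alone sketch. It assumes a cache keyed by the tuple of task names executed so far. Whether `CachedExecutor` keys its cache this way is not stated in this guide, and `run_with_prefix_cache` is a hypothetical helper.

```python
def run_with_prefix_cache(pipelines, cache=None):
    """Run pipelines of (name, function) pairs, caching the result of every chain prefix."""
    cache = {} if cache is None else cache
    executed, outputs = [], []
    for pipeline in pipelines:
        value, prefix = None, ()
        for name, func in pipeline:
            prefix += (name,)
            if prefix not in cache:       # only compute prefixes not seen before
                cache[prefix] = func(value)
                executed.append(name)
            value = cache[prefix]
        outputs.append(value)
    return outputs, executed, cache


# Plain functions standing in for the tasks of this guide.
source = ("Source", lambda _: "source")
t1 = ("Transform1", lambda x: x + "_transformed-1")
t2 = ("Transform2", lambda x: x + "_transformed-2")
s1 = ("Sink1", lambda x: x + "_sink-1")
s2 = ("Sink2", lambda x: x + "_sink-2")

pipelines = [
    [source, t1, t2, s1], [source, t1, t2, s2],
    [source, t1, s1], [source, t1, s2],
    [source, t2, s1], [source, t2, s2],
]

outputs, executed, cache = run_with_prefix_cache(pipelines)
print(len(executed), "task runs instead of", sum(len(p) for p in pipelines))

# A second pass with the filled cache executes nothing.
_, executed_again, _ = run_with_prefix_cache(pipelines, cache)
print(len(executed_again), "task runs on the second pass")
```

In this sketch the six pipelines need 10 task executions instead of 20, and the order of executed tasks mirrors the `Running ...` lines in the output above.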


## Inspect Results

After `run()` finishes, `executor.results` is a nested dict whose keys are pipeline IDs.
Each entry records:

* **`output`** – the final value produced,
* **`runtime`** – cumulative seconds spent (for cached tasks, the runtime is also taken from the cache),
* **`tasks`** – the human-readable task list that formed the pipeline.

```python
import json

for r in results.values():
    print(json.dumps(r, indent=2))
```

```
{
  "output": "source_transformed-1_transformed-2_sink-1",
  "runtime": 7.836699933250202e-05,
  "tasks": [
    "Source()",
    "Transform1()",
    "Transform2()",
    "Sink1()"
  ]
}
{
  "output": "source_transformed-1_transformed-2_sink-2",
  "runtime": 8.094199893093901e-05,
  "tasks": [
    "Source()",
    "Transform1()",
    "Transform2()",
    "Sink2()"
  ]
}
{
  "output": "source_transformed-1_sink-1",
  "runtime": 7.130399899324402e-05,
  "tasks": [
    "Source()",
    "Transform1()",
    "Sink1()"
  ]
}
{
  "output": "source_transformed-1_sink-2",
  "runtime": 7.185499907791382e-05,
  "tasks": [
    "Source()",
    "Transform1()",
    "Sink2()"
  ]
}
{
  "output": "source_transformed-2_sink-1",
  "runtime": 6.648499856964918e-05,
  "tasks": [
    "Source()",
    "Transform2()",
    "Sink1()"
  ]
}
{
  "output": "source_transformed-2_sink-2",
  "runtime": 6.663499880232848e-05,
  "tasks": [
    "Source()",
    "Transform2()",
    "Sink2()"
  ]
}
```
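
Because each entry is a plain dict, ordinary Python tools are enough to compare pipelines. The sketch below ranks entries by runtime. The `sample_results` dict copies three entries from the output above, and its keys (`"id-1"` and so on) are hypothetical placeholders, since the actual pipeline IDs are not shown in this guide.

```python
# Three entries copied from the printed results; the keys are made-up placeholders.
sample_results = {
    "id-1": {
        "output": "source_transformed-1_transformed-2_sink-1",
        "runtime": 7.836699933250202e-05,
        "tasks": ["Source()", "Transform1()", "Transform2()", "Sink1()"],
    },
    "id-3": {
        "output": "source_transformed-1_sink-1",
        "runtime": 7.130399899324402e-05,
        "tasks": ["Source()", "Transform1()", "Sink1()"],
    },
    "id-5": {
        "output": "source_transformed-2_sink-1",
        "runtime": 6.648499856964918e-05,
        "tasks": ["Source()", "Transform2()", "Sink1()"],
    },
}

# Sort the entries from fastest to slowest.
ranked = sorted(sample_results.values(), key=lambda r: r["runtime"])
fastest = ranked[0]

for entry in ranked:
    print(f"{entry['runtime']:.2e}s  {' -> '.join(entry['tasks'])}")
```

The same pattern works with any other key function, for example a score computed from `output`.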


## Reusing Cache

If you already have a cache from a previous run, you can reuse it by passing the `cache` argument to the `CachedExecutor`.

```python
new_executor = CachedExecutor(pipelines, cache=executor.cache, verbose=True)
new_executor.run()
```

```
Pipeline 1/6 completed. Runtime: 0.00s.
Pipeline 2/6 completed. Runtime: 0.00s.
Pipeline 3/6 completed. Runtime: 0.00s.
Pipeline 4/6 completed. Runtime: 0.00s.
Pipeline 5/6 completed. Runtime: 0.00s.
Pipeline 6/6 completed. Runtime: 0.00s.
```


## Instances, Parameters and Pipelines

A repository can mix and match several flavours of “tasks”:

| What you pass                                | How the repository treats it                          |
| -------------------------------------------- | ----------------------------------------------------- |
| A **class**                                  | Instantiated at every node of the tree.               |
| An **instance** (`Task()`)                   | The same instance is reused everywhere it’s needed.   |
| A **tuple** (`(Task, kwargs)`)               | The class is instantiated with the given kwargs.      |
| An existing **Pipeline**                     | Used as an instance of a task.                        |

This flexibility lets you:

* reuse heavyweight objects (e.g. a loaded ML model),
* scan hyper-parameters by specifying multiple `(class, kwargs)` tuples,
* embed pre-fabricated sub-pipelines inside larger graphs.

```python
from pypekit import Pipeline

# New task that takes keyword arguments
class Transform3(Task):
    input_types = ["a"]
    output_types = ["b"]

    def __init__(self, **kwargs):
        self.test = kwargs.get("test", False)

    def run(self, x):
        print("Running Transform3 with test =", self.test)
        return x + "_transformed-3" + ("-test" if self.test else "")

pipeline = Pipeline([
    Transform1(),
    Sink2()
])

repository = Repository([
    Source,
    Transform1,
    (Transform3, {"test": True}),   # Transform3 will be instantiated with the argument test=True every time it is used
    Sink1(),                        # Every node with the task Sink1 will have the same instance
    pipeline                        # Pipeline instance as task
])

repository.build_tree()
print(repository.build_tree_string())
```

```
└── Root()
    └── Source()
        ├── Transform1()
        │   └── Sink1()
        ├── Transform3(test=True)
        │   └── Sink1()
        └── Pipeline(tasks=[Transform1(), Sink2()])
```
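
For a hyper-parameter scan, the list of `(class, kwargs)` tuples can be generated rather than written by hand. The sketch below uses `itertools.product` for this. The `Transform3` class here is a plain stand-in so that the example runs without pypekit, and the `scale` parameter is hypothetical.

```python
from itertools import product


class Transform3:
    """Stand-in for the Task subclass shown above; it only stores its kwargs."""

    def __init__(self, **kwargs):
        self.test = kwargs.get("test", False)
        self.scale = kwargs.get("scale", 1)  # hypothetical extra parameter


# Every combination of these values becomes one (class, kwargs) entry.
grid = {"test": [True, False], "scale": [1, 10]}

entries = [
    (Transform3, dict(zip(grid, values)))
    for values in product(*grid.values())
]

for cls, kwargs in entries:
    print(cls.__name__, kwargs)
```

Following the table above, each generated tuple could then be placed in the list passed to `Repository`, alongside the other tasks.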

