Update more parts of the documentation. (#441)

tobiasraabe committed Oct 7, 2023
1 parent fbda956 commit b191559
Showing 18 changed files with 178 additions and 169 deletions.
3 changes: 2 additions & 1 deletion docs/source/changes.md
@@ -5,7 +5,7 @@ chronological order. Releases follow [semantic versioning](https://semver.org/)
releases are available on [PyPI](https://pypi.org/project/pytask) and
[Anaconda.org](https://anaconda.org/conda-forge/pytask).

## 0.4.0 - 2023-xx-xx
## 0.4.0 - 2023-10-07

- {pull}`323` remove Python 3.7 support and use a new Github action to provide mamba.
- {pull}`384` allows to parse dependencies from every function argument if `depends_on`
@@ -56,6 +56,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
{func}`pytask.is_task_function`.
- {pull}`438` clarifies some types.
- {pull}`440` refines more types.
- {pull}`441` updates more parts of the documentation.
- {pull}`442` allows users to import `from pytask import mark` and use `@mark.skip`.

## 0.3.2 - 2023-06-07
112 changes: 0 additions & 112 deletions docs/source/how_to_guides/bp_scalable_repetitions_of_tasks.md

This file was deleted.

103 changes: 103 additions & 0 deletions docs/source/how_to_guides/bp_scaling_tasks.md
@@ -0,0 +1,103 @@
# Scaling tasks

In any larger project, you quickly reach the point where you stack multiple repetitions
of tasks on top of each other.

For example, you have one dataset, four different ways to prepare it, and three
statistical models to analyze the data. The Cartesian product of all steps comprises
twelve differently fitted models.

Here are some tips on how to set up your tasks so that you can easily modify the
Cartesian product of steps.

## Scalability

Let us dive right into the aforementioned example. We start with one dataset `data.csv`.
Then, we will create four different specifications of the data and, finally, fit three
different models to each specification.

This is the structure of the project.

```
my_project
├───pyproject.toml
├───src
│   └───my_project
│       ├────config.py
│       │
│       ├───data
│       │   └────data.csv
│       │
│       ├───data_preparation
│       │   ├────__init__.py
│       │   ├────config.py
│       │   └────task_prepare_data.py
│       │
│       └───estimation
│           ├────__init__.py
│           ├────config.py
│           └────task_estimate_models.py
├───setup.py
├───.pytask.sqlite3
└───bld
```

The folder structure, the main `config.py` which holds `SRC` and `BLD`, and the tasks
follow the same structure advocated throughout the tutorials.

New are the local configuration files in each subfolder of `my_project`, which contain
objects shared across tasks. For example, `config.py` holds the paths to the processed
data and the names of the data sets.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_1.py
```
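
A condensed sketch of such a local `config.py` looks roughly like this; the `BLD`
import and the body of `path_to_processed_data` are illustrative assumptions.

```python
# Sketch of my_project/data_preparation/config.py.
from pathlib import Path

from my_project.config import BLD, SRC

# One entry per data specification; the inner dictionaries become task arguments.
DATA = {
    "data_0": {"subset": "subset_1"},
    "data_1": {"subset": "subset_2"},
    "data_2": {"subset": "subset_3"},
    "data_3": {"subset": "subset_4"},
}


def path_to_input_data(name: str) -> Path:
    # Every specification starts from the same raw file.
    return SRC / "data" / "data.csv"


def path_to_processed_data(name: str) -> Path:
    # Illustrative: store one processed dataset per specification.
    return BLD / "data" / f"processed_{name}.pkl"
```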

The task file `task_prepare_data.py` uses these objects to build the repetitions.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_2.py
```

All arguments for the loop and the {func}`@task <pytask.task>` decorator are built
within a function to keep the logic in one place and the module's namespace clean.
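
Condensed, the pattern looks roughly like the following sketch; the module included
above is the authoritative version.

```python
# Sketch of my_project/data_preparation/task_prepare_data.py.
from __future__ import annotations

from pathlib import Path

import pandas as pd
from my_project.data_preparation.config import DATA
from my_project.data_preparation.config import path_to_input_data
from my_project.data_preparation.config import path_to_processed_data
from pytask import Product
from pytask import task
from typing_extensions import Annotated


def _create_parametrization(
    data: dict[str, dict[str, str]],
) -> dict[str, dict[str, Path | str]]:
    """Map each task id to the keyword arguments of one repetition."""
    id_to_kwargs = {}
    for data_name, kwargs in data.items():
        id_to_kwargs[data_name] = {
            "path_to_input_data": path_to_input_data(data_name),
            "path_to_processed_data": path_to_processed_data(data_name),
            **kwargs,
        }
    return id_to_kwargs


ID_TO_KWARGS = _create_parametrization(DATA)

for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_prepare_data(
        path_to_input_data: Path,
        subset: str,
        path_to_processed_data: Annotated[Path, Product],
    ) -> None:
        df = pd.read_csv(path_to_input_data)
        processed = df.loc[df["subset"].eq(subset)]
        processed.to_pickle(path_to_processed_data)
```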

Explicit ids make the task {ref}`ids <ids>` more descriptive and simplify their
selection with {ref}`expressions <expressions>`. Here is an example of a task id built
from an explicit id.

```
# With id
.../my_project/data_preparation/task_prepare_data.py::task_prepare_data[data_0]
```

Next, we move to the estimation to see how we can build another repetition on top.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_3.py
```
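
A condensed sketch of this local configuration follows; the model names, the `BLD`
import, and the result-path helper are illustrative assumptions.

```python
# Sketch of my_project/estimation/config.py.
from itertools import product
from pathlib import Path

from my_project.config import BLD
from my_project.data_preparation.config import DATA

_MODELS = ["linear_probability", "logistic_model", "probit_model"]

# One entry per model-data combination; the key doubles as the task id.
ESTIMATIONS = {
    f"{model}_{data_name}": {"model": model, "data": data_name}
    for model, data_name in product(_MODELS, DATA)
}


def path_to_estimation_result(name: str) -> Path:
    return BLD / "estimation" / f"estimation_{name}.pkl"
```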

In the local configuration, we define `ESTIMATIONS`, which combines the information on
data and models. The dictionary's keys serve as task ids wherever an estimation is
involved, which allows triggering all tasks related to one estimation - estimation,
figures, tables - with one command.

```console
pytask -k linear_probability_data_0
```

And here is the task file.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_4.py
```
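
A sketch of how the task file can mirror the same pattern; the helpers follow the
sketches above, and the model fitting itself is omitted.

```python
# Sketch of my_project/estimation/task_estimate_models.py.
from pathlib import Path

import pandas as pd
from my_project.data_preparation.config import path_to_processed_data
from my_project.estimation.config import ESTIMATIONS
from my_project.estimation.config import path_to_estimation_result
from pytask import Product
from pytask import task
from typing_extensions import Annotated

for name, config in ESTIMATIONS.items():

    @task(id=name)
    def task_estimate_models(
        path_to_data: Path = path_to_processed_data(config["data"]),
        model: str = config["model"],
        path_to_results: Annotated[Path, Product] = path_to_estimation_result(name),
    ) -> None:
        df = pd.read_pickle(path_to_data)
        ...  # Fit the model to the data and store the result at path_to_results.
```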

Replicating this pattern across a project provides a clean way to define repetitions.

## Extending repetitions

Some parametrized tasks are costly to run, whether in terms of computing power, memory,
or time. Extending the repetitions of such tasks triggers all of them to be rerun. To
avoid this, use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which
is explained in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
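
A minimal sketch of applying the decorator to an expensive task; the task name and the
path are illustrative.

```python
from pathlib import Path

import pytask
from pytask import Product
from typing_extensions import Annotated


@pytask.mark.persist
def task_estimate_expensive_model(
    path_to_results: Annotated[Path, Product] = Path("expensive_results.pkl"),
) -> None:
    ...  # A long-running estimation that should not be rerun accidentally.
```
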
33 changes: 22 additions & 11 deletions docs/source/how_to_guides/bp_structure_of_task_files.md
@@ -1,16 +1,20 @@
# Structure of task files

This section provides advice on how to structure task files.
This guide presents some best practices for structuring your task files. You do not
have to follow them to use pytask or to create a reproducible research project. But, if
you are looking for orientation or inspiration, here are some tips.

## TL;DR

- There might be multiple task functions in a task module, but only if the code is still
readable and not too complex and if runtime for all tasks is low.
- Use task modules to separate task functions from each other. Separating tasks by the
stages of a research project, like data management, analysis, and plotting, is a good
start. Separate further when task modules become crowded.

- A task function should be the first function in a task module.
- Task functions should be at the top of a task module to easily identify what the
module is for.

:::{seealso}
The only exception might be for {doc}`repetitions <bp_scalable_repetitions_of_tasks>`.
The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
:::

- The purpose of the task function is to handle IO operations like loading and saving
@@ -20,25 +24,32 @@ This section provides advice on how to structure task files.
- Non-task functions in the task module are {term}`private functions <private function>`
and only used within this task module. The functions should not have side-effects.

- Functions used to accomplish tasks in multiple task modules should have their own
module.
- It should never be necessary to import from task modules. So if you need a function in
multiple task modules, put it in a separate module (which does not start with
`task_`).

## Best Practices

### Number of tasks in a module

There are two reasons to split tasks across several modules.

The first reason concerns readability and complexity. Multiple tasks deal with
(slightly) different concepts and, thus, should be split content-wise. Even if tasks
deal with the same concept, they might be very complex on its own and separate modules
help the reader (most likely you or your colleagues) to focus on one thing.
The first reason concerns readability and complexity. Tasks deal with different concepts
and, thus, should be split. Even if tasks deal with the same concept, they might become
very complex and separate modules help the reader (most likely you or your colleagues)
to focus on one thing.

The second reason is about runtime. If a task module is changed, all tasks within the
module are re-run. If the runtime of all tasks in the module is high, you wait longer
for your tasks to finish or for an error to occur, which prolongs your feedback loops
and hurts your productivity.

:::{seealso}
Use {func}`@pytask.mark.persist <pytask.mark.persist>` if you want to avoid accidentally
triggering an expensive task. It is also explained in [this
tutorial](../tutorials/making_tasks_persist).
:::

### Structure of the module

For the following example, let us assume that the task module contains one task.
2 changes: 1 addition & 1 deletion docs/source/how_to_guides/index.md
@@ -38,5 +38,5 @@ maxdepth: 1
bp_structure_of_a_research_project
bp_structure_of_task_files
bp_templates_and_projects
bp_scalable_repetitions_of_tasks
bp_scaling_tasks
```
6 changes: 3 additions & 3 deletions docs/source/how_to_guides/writing_custom_nodes.md
@@ -10,8 +10,8 @@ your own to improve your workflows.
## Use-case

A typical task operation is to load data like a {class}`pandas.DataFrame` from a pickle
file, transform it, and store it on disk. The usual way would be to use paths to point to
inputs and outputs and call {func}`pandas.read_pickle` and
file, transform it, and store it on disk. The usual way would be to use paths to point
to inputs and outputs and call {func}`pandas.read_pickle` and
{meth}`pandas.DataFrame.to_pickle`.
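
Roughly, that usual approach looks like the following sketch with illustrative file
names; the example included below is the documented version.

```python
from pathlib import Path

import pandas as pd
from pytask import Product
from typing_extensions import Annotated


def task_transform_data(
    path_to_input: Path = Path("data.pkl"),
    path_to_output: Annotated[Path, Product] = Path("transformed.pkl"),
) -> None:
    df = pd.read_pickle(path_to_input)
    transformed = df.assign(new_column=1)  # Stand-in for the real transformation.
    transformed.to_pickle(path_to_output)
```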

```{literalinclude} ../../../docs_src/how_to_guides/writing_custom_nodes_example_1.py
@@ -54,7 +54,7 @@ A custom node needs to follow an interface so that pytask can perform several ac
- Load and save values when tasks are executed.

This interface is defined by protocols [^structural-subtyping]. A custom node must
follow at least the protocol {class}`pytask.Node` or, even better,
follow at least the protocol {class}`pytask.PNode` or, even better,
{class}`pytask.PPathNode` if it is based on a path. The common node for paths,
{class}`pytask.PathNode`, follows the protocol {class}`pytask.PPathNode`.

@@ -8,8 +8,8 @@ We reuse the task from the previous {doc}`tutorial <write_a_task>`, which genera
random data and repeat the same operation over several seeds to receive multiple,
reproducible samples.

Apply the {func}`@task <pytask.task>` decorator, loop over the function
and supply different seeds and output paths as default arguments of the function.
Apply the {func}`@task <pytask.task>` decorator, loop over the function and supply
different seeds and output paths as default arguments of the function.
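
For orientation, the pattern condenses to roughly this sketch with illustrative seeds
and paths; the tabs below show the full versions.

```python
from pathlib import Path

from pytask import Product
from pytask import task
from typing_extensions import Annotated

for seed in range(10):

    @task
    def task_create_random_data(
        seed: int = seed,
        path_to_data: Annotated[Path, Product] = Path(f"data_{seed}.pkl"),
    ) -> None:
        ...  # Draw a reproducible sample using the seed and store it at path_to_data.
```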

::::{tab-set}

@@ -355,7 +355,7 @@ for id_, kwargs in ID_TO_KWARGS.items():
```

The
{doc}`best-practices guide on parametrizations <../how_to_guides/bp_scalable_repetitions_of_tasks>`
{doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
goes into even more detail on how to scale parametrizations.

## A warning on globals
@@ -5,11 +5,16 @@
from my_project.config import SRC


DATA = ["data_0", "data_1", "data_2", "data_3"]
DATA = {
"data_0": {"subset": "subset_1"},
"data_1": {"subset": "subset_2"},
"data_2": {"subset": "subset_3"},
"data_3": {"subset": "subset_4"},
}


def path_to_input_data(name: str) -> Path:
return SRC / "data" / f"{name}.csv"
return SRC / "data" / "data.csv"


def path_to_processed_data(name: str) -> Path:
Expand Up @@ -4,17 +4,19 @@
from my_project.data_preparation.config import DATA
from my_project.data_preparation.config import path_to_input_data
from my_project.data_preparation.config import path_to_processed_data
import pandas as pd
from pytask import Product
from pytask import task
from typing_extensions import Annotated


def _create_parametrization(data: dict[str, dict[str, str]]) -> dict[str, dict[str, Path | str]]:
id_to_kwargs = {}
for data_name in data:
for data_name, kwargs in data.items():
id_to_kwargs[data_name] = {
"path_to_input_data": path_to_input_data(data_name),
"path_to_processed_data": path_to_processed_data(data_name),
**kwargs,
}

return id_to_kwargs
@@ -27,6 +29,11 @@ def _create_parametrization(data: list[str]) -> dict[str, Path]:

@task(id=id_, kwargs=kwargs)
def task_prepare_data(
path_to_input_data: Path, path_to_processed_data: Annotated[Path, Product]
path_to_input_data: Path,
subset: str,
path_to_processed_data: Annotated[Path, Product],
) -> None:
df = pd.read_csv(path_to_input_data)
...
subset = df.loc[df["subset"].eq(subset)]
subset.to_pickle(path_to_processed_data)
