Update more parts of the documentation. (#441)

tobiasraabe committed Oct 7, 2023
1 parent fbda956 commit b191559
Showing 18 changed files with 178 additions and 169 deletions.
3 changes: 2 additions & 1 deletion docs/source/changes.md
@@ -5,7 +5,7 @@ chronological order. Releases follow [semantic versioning](https://semver.org/)
releases are available on [PyPI](https://pypi.org/project/pytask) and
[Anaconda.org](https://anaconda.org/conda-forge/pytask).

## 0.4.0 - 2023-xx-xx
## 0.4.0 - 2023-10-07

- {pull}`323` remove Python 3.7 support and use a new Github action to provide mamba.
- {pull}`384` allows to parse dependencies from every function argument if `depends_on`
@@ -56,6 +56,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
{func}`pytask.is_task_function`.
- {pull}`438` clarifies some types.
- {pull}`440` refines more types.
- {pull}`441` updates more parts of the documentation.
- {pull}`442` allows users to import `from pytask import mark` and use `@mark.skip`.

## 0.3.2 - 2023-06-07
112 changes: 0 additions & 112 deletions docs/source/how_to_guides/bp_scalable_repetitions_of_tasks.md

This file was deleted.

103 changes: 103 additions & 0 deletions docs/source/how_to_guides/bp_scaling_tasks.md
@@ -0,0 +1,103 @@
# Scaling tasks

In any larger project, you quickly reach the point where you stack multiple repetitions
of tasks on top of each other.

For example, you have one dataset, four different ways to prepare it, and three
statistical models to analyze the data. The Cartesian product of all steps comprises
twelve differently fitted models.

Here are some tips on how to set up your tasks so that you can easily modify the
Cartesian product of steps.

## Scalability

Let us dive right into the aforementioned example. We start with one dataset `data.csv`.
Then, we will create four different specifications of the data and, finally, fit three
different models to each specification.

This is the structure of the project.

```
my_project
├───pyproject.toml
├───src
│   └───my_project
│       ├────config.py
│       │
│       ├───data
│       │   └────data.csv
│       │
│       ├───data_preparation
│       │   ├────__init__.py
│       │   ├────config.py
│       │   └────task_prepare_data.py
│       │
│       └───estimation
│           ├────__init__.py
│           ├────config.py
│           └────task_estimate_models.py
├───setup.py
├───.pytask.sqlite3
└───bld
```

The folder structure, the main `config.py` which holds `SRC` and `BLD`, and the tasks
follow the same structure advocated throughout the tutorials.

New are the local configuration files in each subfolder of `my_project`, which contain
objects shared across tasks. For example, `config.py` holds the paths to the processed
data and the names of the data sets.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_1.py
```
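
A condensed sketch of such a local `config.py` looks roughly like this; the `BLD`
import and the body of `path_to_processed_data` are illustrative assumptions.

```python
# Sketch of my_project/data_preparation/config.py.
from pathlib import Path

from my_project.config import BLD, SRC

# One entry per data specification; the inner dictionaries become task arguments.
DATA = {
    "data_0": {"subset": "subset_1"},
    "data_1": {"subset": "subset_2"},
    "data_2": {"subset": "subset_3"},
    "data_3": {"subset": "subset_4"},
}


def path_to_input_data(name: str) -> Path:
    # Every specification starts from the same raw file.
    return SRC / "data" / "data.csv"


def path_to_processed_data(name: str) -> Path:
    # Illustrative: store one processed dataset per specification.
    return BLD / "data" / f"processed_{name}.pkl"
```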

The task file `task_prepare_data.py` uses these objects to build the repetitions.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_2.py
```

All arguments for the loop and the {func}`@task <pytask.task>` decorator are built
within a function to keep the logic in one place and the module's namespace clean.
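
Condensed, the pattern looks roughly like the following sketch; the module included
above is the authoritative version.

```python
# Sketch of my_project/data_preparation/task_prepare_data.py.
from __future__ import annotations

from pathlib import Path

import pandas as pd
from my_project.data_preparation.config import DATA
from my_project.data_preparation.config import path_to_input_data
from my_project.data_preparation.config import path_to_processed_data
from pytask import Product
from pytask import task
from typing_extensions import Annotated


def _create_parametrization(
    data: dict[str, dict[str, str]],
) -> dict[str, dict[str, Path | str]]:
    """Map each task id to the keyword arguments of one repetition."""
    id_to_kwargs = {}
    for data_name, kwargs in data.items():
        id_to_kwargs[data_name] = {
            "path_to_input_data": path_to_input_data(data_name),
            "path_to_processed_data": path_to_processed_data(data_name),
            **kwargs,
        }
    return id_to_kwargs


ID_TO_KWARGS = _create_parametrization(DATA)

for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_prepare_data(
        path_to_input_data: Path,
        subset: str,
        path_to_processed_data: Annotated[Path, Product],
    ) -> None:
        df = pd.read_csv(path_to_input_data)
        processed = df.loc[df["subset"].eq(subset)]
        processed.to_pickle(path_to_processed_data)
```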

Explicit ids make the task {ref}`ids <ids>` more descriptive and simplify their
selection with {ref}`expressions <expressions>`. Here is an example of a task id built
from an explicit id.

```
# With id
.../my_project/data_preparation/task_prepare_data.py::task_prepare_data[data_0]
```

Next, we move to the estimation to see how we can build another repetition on top.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_3.py
```
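
A condensed sketch of this local configuration follows; the model names, the `BLD`
import, and the result-path helper are illustrative assumptions.

```python
# Sketch of my_project/estimation/config.py.
from itertools import product
from pathlib import Path

from my_project.config import BLD
from my_project.data_preparation.config import DATA

_MODELS = ["linear_probability", "logistic_model", "probit_model"]

# One entry per model-data combination; the key doubles as the task id.
ESTIMATIONS = {
    f"{model}_{data_name}": {"model": model, "data": data_name}
    for model, data_name in product(_MODELS, DATA)
}


def path_to_estimation_result(name: str) -> Path:
    return BLD / "estimation" / f"estimation_{name}.pkl"
```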

In the local configuration, we define `ESTIMATIONS`, which combines the information on
data and models. The dictionary's keys serve as task ids wherever an estimation is
involved, which allows triggering all tasks related to one estimation - estimation,
figures, tables - with one command.

```console
pytask -k linear_probability_data_0
```

And here is the task file.

```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_4.py
```
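
A sketch of how the task file can mirror the same pattern; the helpers follow the
sketches above, and the model fitting itself is omitted.

```python
# Sketch of my_project/estimation/task_estimate_models.py.
from pathlib import Path

import pandas as pd
from my_project.data_preparation.config import path_to_processed_data
from my_project.estimation.config import ESTIMATIONS
from my_project.estimation.config import path_to_estimation_result
from pytask import Product
from pytask import task
from typing_extensions import Annotated

for name, config in ESTIMATIONS.items():

    @task(id=name)
    def task_estimate_models(
        path_to_data: Path = path_to_processed_data(config["data"]),
        model: str = config["model"],
        path_to_results: Annotated[Path, Product] = path_to_estimation_result(name),
    ) -> None:
        df = pd.read_pickle(path_to_data)
        ...  # Fit the model to the data and store the result at path_to_results.
```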

Replicating this pattern across a project provides a clean way to define repetitions.

## Extending repetitions

Some parametrized tasks are costly to run, whether in terms of computing power, memory,
or time. Extending the repetitions of such tasks triggers all of them to be rerun. To
avoid this, use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which
is explained in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
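
A minimal sketch of applying the decorator to an expensive task; the task name and the
path are illustrative.

```python
from pathlib import Path

import pytask
from pytask import Product
from typing_extensions import Annotated


@pytask.mark.persist
def task_estimate_expensive_model(
    path_to_results: Annotated[Path, Product] = Path("expensive_results.pkl"),
) -> None:
    ...  # A long-running estimation that should not be rerun accidentally.
```
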
33 changes: 22 additions & 11 deletions docs/source/how_to_guides/bp_structure_of_task_files.md
@@ -1,16 +1,20 @@
# Structure of task files

This section provides advice on how to structure task files.
This guide presents some best practices for structuring your task files. You do not
have to follow them to use pytask or to create a reproducible research project. But, if
you are looking for orientation or inspiration, here are some tips.

## TL;DR

- There might be multiple task functions in a task module, but only if the code is still
readable and not too complex and if runtime for all tasks is low.
- Use task modules to separate task functions from each other. Separating tasks by the
stages of a research project, like data management, analysis, and plotting, is a good
start. Separate further when task modules become crowded.

- A task function should be the first function in a task module.
- Task functions should be at the top of a task module to easily identify what the
module is for.

:::{seealso}
The only exception might be for {doc}`repetitions <bp_scalable_repetitions_of_tasks>`.
The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
:::

- The purpose of the task function is to handle IO operations like loading and saving
@@ -20,25 +24,32 @@ This section provides advice on how to structure task files.
- Non-task functions in the task module are {term}`private functions <private function>`
and only used within this task module. The functions should not have side-effects.

- Functions used to accomplish tasks in multiple task modules should have their own
module.
- It should never be necessary to import from task modules. So if you need a function in
multiple task modules, put it in a separate module (which does not start with
`task_`).

## Best Practices

### Number of tasks in a module

There are two reasons to split tasks across several modules.

The first reason concerns readability and complexity. Multiple tasks deal with
(slightly) different concepts and, thus, should be split content-wise. Even if tasks
deal with the same concept, they might be very complex on its own and separate modules
help the reader (most likely you or your colleagues) to focus on one thing.
The first reason concerns readability and complexity. Tasks deal with different concepts
and, thus, should be split. Even if tasks deal with the same concept, they might become
very complex and separate modules help the reader (most likely you or your colleagues)
to focus on one thing.

The second reason is about runtime. If a task module is changed, all tasks within the
module are re-run. If the runtime of all tasks in the module is high, you wait longer
for your tasks to finish or for an error to occur, which prolongs your feedback loops
and hurts your productivity.

:::{seealso}
Use {func}`@pytask.mark.persist <pytask.mark.persist>` if you want to avoid accidentally
triggering an expensive task. It is also explained in [this
tutorial](../tutorials/making_tasks_persist).
:::

### Structure of the module

For the following example, let us assume that the task module contains one task.
2 changes: 1 addition & 1 deletion docs/source/how_to_guides/index.md
@@ -38,5 +38,5 @@ maxdepth: 1
bp_structure_of_a_research_project
bp_structure_of_task_files
bp_templates_and_projects
bp_scalable_repetitions_of_tasks
bp_scaling_tasks
```
6 changes: 3 additions & 3 deletions docs/source/how_to_guides/writing_custom_nodes.md
@@ -10,8 +10,8 @@ your own to improve your workflows.
## Use-case

A typical task operation is to load data like a {class}`pandas.DataFrame` from a pickle
file, transform it, and store it on disk. The usual way would be to use paths to point to
inputs and outputs and call {func}`pandas.read_pickle` and
file, transform it, and store it on disk. The usual way would be to use paths to point
to inputs and outputs and call {func}`pandas.read_pickle` and
{meth}`pandas.DataFrame.to_pickle`.
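
Roughly, that usual approach looks like the following sketch with illustrative file
names; the example included below is the documented version.

```python
from pathlib import Path

import pandas as pd
from pytask import Product
from typing_extensions import Annotated


def task_transform_data(
    path_to_input: Path = Path("data.pkl"),
    path_to_output: Annotated[Path, Product] = Path("transformed.pkl"),
) -> None:
    df = pd.read_pickle(path_to_input)
    transformed = df.assign(new_column=1)  # Stand-in for the real transformation.
    transformed.to_pickle(path_to_output)
```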

```{literalinclude} ../../../docs_src/how_to_guides/writing_custom_nodes_example_1.py
@@ -54,7 +54,7 @@ A custom node needs to follow an interface so that pytask can perform several ac
- Load and save values when tasks are executed.

This interface is defined by protocols [^structural-subtyping]. A custom node must
follow at least the protocol {class}`pytask.Node` or, even better,
follow at least the protocol {class}`pytask.PNode` or, even better,
{class}`pytask.PPathNode` if it is based on a path. The common node for paths,
{class}`pytask.PathNode`, follows the protocol {class}`pytask.PPathNode`.

@@ -8,8 +8,8 @@ We reuse the task from the previous {doc}`tutorial <write_a_task>`, which genera
random data and repeat the same operation over several seeds to receive multiple,
reproducible samples.

Apply the {func}`@task <pytask.task>` decorator, loop over the function
and supply different seeds and output paths as default arguments of the function.
Apply the {func}`@task <pytask.task>` decorator, loop over the function and supply
different seeds and output paths as default arguments of the function.
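
For orientation, the pattern condenses to roughly this sketch with illustrative seeds
and paths; the tabs below show the full versions.

```python
from pathlib import Path

from pytask import Product
from pytask import task
from typing_extensions import Annotated

for seed in range(10):

    @task
    def task_create_random_data(
        seed: int = seed,
        path_to_data: Annotated[Path, Product] = Path(f"data_{seed}.pkl"),
    ) -> None:
        ...  # Draw a reproducible sample using the seed and store it at path_to_data.
```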

::::{tab-set}

@@ -355,7 +355,7 @@ for id_, kwargs in ID_TO_KWARGS.items():
```

The
{doc}`best-practices guide on parametrizations <../how_to_guides/bp_scalable_repetitions_of_tasks>`
{doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
goes into even more detail on how to scale parametrizations.

## A warning on globals
@@ -5,11 +5,16 @@
from my_project.config import SRC


DATA = ["data_0", "data_1", "data_2", "data_3"]
DATA = {
"data_0": {"subset": "subset_1"},
"data_1": {"subset": "subset_2"},
"data_2": {"subset": "subset_3"},
"data_3": {"subset": "subset_4"},
}


def path_to_input_data(name: str) -> Path:
return SRC / "data" / f"{name}.csv"
return SRC / "data" / "data.csv"


def path_to_processed_data(name: str) -> Path:
Expand Up @@ -4,17 +4,19 @@
from my_project.data_preparation.config import DATA
from my_project.data_preparation.config import path_to_input_data
from my_project.data_preparation.config import path_to_processed_data
import pandas as pd
from pytask import Product
from pytask import task
from typing_extensions import Annotated


def _create_parametrization(data: dict[str, dict[str, str]]) -> dict[str, dict[str, Path | str]]:
id_to_kwargs = {}
for data_name in data:
for data_name, kwargs in data.items():
id_to_kwargs[data_name] = {
"path_to_input_data": path_to_input_data(data_name),
"path_to_processed_data": path_to_processed_data(data_name),
**kwargs,
}

return id_to_kwargs
@@ -27,6 +29,11 @@ def _create_parametrization(data: list[str]) -> dict[str, Path]:

@task(id=id_, kwargs=kwargs)
def task_prepare_data(
path_to_input_data: Path, path_to_processed_data: Annotated[Path, Product]
path_to_input_data: Path,
subset: str,
path_to_processed_data: Annotated[Path, Product],
) -> None:
df = pd.read_csv(path_to_input_data)
...
subset = df.loc[df["subset"].eq(subset)]
subset.to_pickle(path_to_processed_data)
