# Caching

Caching is an important feature of any machine learning pipeline. It lets the artifacts that a step produces be reused in a subsequent pipeline run, saving time and compute. To use it well, you need to understand how caching works, how to disable it, and when it is invalidated.

Let's start by getting the imports out of the way:

In [1]:
%%capture
!pip install zenml rich

In [2]:
import random
import time

from rich.console import Console

from zenml.pipelines import pipeline
from zenml.steps import BaseStepConfig, Output, step

console = Console()

To learn more about caching, we will create a very basic pipeline that generates one random number, subtracts a user-configured number from it, and prints the result. Simple as can be.

## The first step will simply create a random float and return it

In [3]:
@step
def create_data() -> Output(random_data=float):
    """Create a random float x with 0 <= x < 1.
    
    Returns:
        Random float between 0 and 1
    """
    random_data = random.random()
    return random_data

## The second step subtracts a user-defined float from the first number

See the tutorial on [StepConfigurations](tbd) to learn more about how to use step configurations.

In [4]:
class TransformerConfig(BaseStepConfig):
    """Transformer params

    Params:
        subtrahend - Amount to be subtracted from the input number
    """

    subtrahend: float = 0.5


@step
def transform_data(
    config: TransformerConfig,
    random_data: float
) -> Output(transformed_data=float):
    """Subtract the subtrahend from random_data.

    The sleep simulates a complex, slow data transformation step.
    
    Args:
        config - TransformerConfig that specifies the subtrahend for the transformer
        random_data - random data to be transformed
        
    Returns:
        The random_data minus the subtrahend specified in the config
    """
    time.sleep(10)

    transformed_data = random_data - config.subtrahend
    return transformed_data

## Finally the third step prints the result

In [5]:
@step
def print_data(
    transformed_data: float
) -> None:
    """Print the resulting number."""
    console.print(f"The pipeline produced the following number: {transformed_data}", style="bold red")

## Let's connect our steps with a ZenML pipeline and instantiate it

In [6]:
# Declare the pipeline and the steps it expects
@pipeline
def transformer_pipeline(
    create_data,
    transform_data,
    print_data
):
    # Define how data flows through the steps of the pipeline
    random_data = create_data()
    transformed_data = transform_data(random_data)
    print_data(transformed_data)

In [7]:
config_subtrahend = random.random()

console.print('[u]First Pipeline Run[/u] \n \n')
# Define which step functions implement the pipeline steps and create
#  a pipeline instance
pipeline_instance = transformer_pipeline(
    create_data=create_data(),
    transform_data=transform_data(
        TransformerConfig(subtrahend=config_subtrahend)),
    print_data=print_data()
)
pipeline_instance.run()

console.print('\n \n [u]Second Pipeline Run[/u] \n \n')
# Define which step functions implement the pipeline steps and create
#  a pipeline instance
pipeline_instance_2 = transformer_pipeline(
    create_data=create_data(),
    transform_data=transform_data(
        TransformerConfig(subtrahend=config_subtrahend)),
    print_data=print_data()
)
pipeline_instance_2.run()

Creating run for pipeline: `transformer_pipeline`
Cache enabled for pipeline `transformer_pipeline`
Using stack `secrets_stack2` to run pipeline `transformer_pipeline`...
Step `create_data` has started.
Using cached version of `create_data` [`create_data`] from pipeline_run_id `transformer_pipeline-01_Apr_22-12_03_25_054987`.
Step `create_data` has finished in 0.070s.
Step `transform_data` has started.
Step `transform_data` has finished in 10.157s.
Step `print_data` has started.


Step `print_data` has finished in 0.158s.
Pipeline run `transformer_pipeline-01_Apr_22-12_03_25_054987` has finished in 10.401s.


Creating run for pipeline: `transformer_pipeline`
Cache enabled for pipeline `transformer_pipeline`
Using stack `secrets_stack2` to run pipeline `transformer_pipeline`...
Step `create_data` has started.
Using cached version of `create_data` [`create_data`] from pipeline_run_id `transformer_pipeline-01_Apr_22-12_03_36_029006`.
Step `create_data` has finished in 0.079s.
Step `transform_data` has started.
Using cached version of `transform_data` [`transform_data`] from pipeline_run_id `transformer_pipeline-01_Apr_22-12_03_36_029006`.
Step `transform_data` has finished in 0.094s.
Step `print_data` has star

If this is the first time you ran this notebook, the first pipeline run should have taken significantly longer than the second. In the second run, none of the steps were executed because cached outputs were available for every step.

!!! Note: The last step also didn't run, so the resulting value was not printed. This is not what we want: we need to disable caching for the last step for the pipeline to behave as expected. The sections below show how to do this and more.

## Try it for yourself - here are five ways to disable/invalidate the cache

Caching is a powerful tool, but it's important to know when it's happening and to have fine-grained control over it.

Here are some ways in which you can disable caching. Feel free to apply them above and rerun the pipeline instances to see the effects for yourself.


### 1. Disable caching on a step level using the **step decorator**:

```python
@step(enable_cache=False)
```

### 2. Disable caching for the whole pipeline through the **pipeline decorator**:

```python
@pipeline(enable_cache=False)
```

### 3. Invalidate cache by **changing the code** within a step

Make a change in any piece of code within a step and the cache for that step will be invalidated.

For example, replace the `create_data` code with this version:

```python
@step
def create_data() -> Output(random_data=float):
    """Create a random float x with 0 <= x < 1.

    Returns:
        Random float between 0 and 1
    """
    different_var_name = random.random()
    return different_var_name
```
    

### 4. Invalidate cache by changing a parameter within the **step configuration**

For example, you could change the subtrahend in the `TransformerConfig` of the second pipeline instance:

```python
pipeline_instance_2 = transformer_pipeline(
    create_data=create_data(),
    transform_data=transform_data(TransformerConfig(subtrahend=0.8)),
    print_data=print_data()
)
```
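
Options 3 and 4 come down to the same idea: a cached result is only reused when everything that defines the step is unchanged. The following is a hedged sketch of that idea in plain Python, not ZenML's actual mechanism. The helper `cache_key` is hypothetical, and hashing the step's source text together with its configuration is an assumption made for illustration.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(step_source: str, config: Dict[str, Any]) -> str:
    """Hash a step's source text together with its configuration values."""
    payload = json.dumps({"source": step_source, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two versions of a step body that differ only in a variable name
original_source = "random_data = random.random()\nreturn random_data"
renamed_source = "different_var_name = random.random()\nreturn different_var_name"

base = cache_key(original_source, {"subtrahend": 0.5})
same = cache_key(original_source, {"subtrahend": 0.5})
code_changed = cache_key(renamed_source, {"subtrahend": 0.5})
config_changed = cache_key(original_source, {"subtrahend": 0.8})

print(base == same)            # identical code and config give the same key
print(base == code_changed)    # a renamed variable gives a different key
print(base == config_changed)  # a different subtrahend gives a different key
```

Under this scheme, even a purely cosmetic code edit changes the key, which matches the behaviour described in option 3: any change to the step's code invalidates its cache.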

### 5. Disable cache explicitly in the **runtime configuration**

```python
pipeline_instance_2.run(enable_cache=False)
```