<div align="center">
  <img src="http://vlpavlov.org/Pythagoras-Logo3.svg"><br>
</div>

# Speed Up Your Data Science Project Using Persistent Caching Tools from the Pythagoras Package

## Introductory Tutorial

Working with large dataframes can be slow. This notebook demonstrates how Pythagoras
can save you seconds, minutes, and sometimes hours every time you need
to load a large csv file or execute a complex data-processing function.

In [1]:
import numpy as np
import pandas as pd
import sys
import time
import logging

np.random.seed(42)

## How To Cache Function Outputs

### A slow function

Let's assume we have an important function that 
takes some unpleasantly long time to run:

In [2]:
# Let's create a sample DataFrame to experiment with

a_dataframe = pd.DataFrame(
    data = {
        'COL_1': [1.1, 2.2, 3.3]
        ,'COL_2': [4.4, 5.5, None]
        ,'COL_3': [7.7, 8.8, 9.9]
        ,'COL_4': [None, 11.11, 12.12]
    })   

In [3]:
a_dataframe

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,1.1,4.4,7.7,
1,2.2,5.5,8.8,11.11
2,3.3,,9.9,12.12


In [4]:
# Now, let's create a slow function.
# In real life, such a slow function could be part of
# a feature-engineering pipeline

def slowly_process_dataframe(df:pd.core.frame.DataFrame, a:float):
    result = df + a
    time.sleep(3)
    return result

In [5]:
%%time
demo_result = slowly_process_dataframe(a_dataframe,3.14)

CPU times: user 2.06 ms, sys: 1.17 ms, total: 3.23 ms
Wall time: 3 s


In [6]:
demo_result

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,4.24,7.54,10.84,
1,5.34,8.64,11.94,14.25
2,6.44,,13.04,15.26


### A slow function + Pythagoras

It took 3+ seconds to execute **slowly_process_dataframe()** in the cell above.
Let's see how Pythagoras helps speed this up:

In [7]:
import Pythagoras # this is the library which will provide us with the
                  # advanced caching tools

demo_cache_obj = Pythagoras.PickleCache(
    cache_dir = "./cache_files"   # Here Pythagoras will store cached data,
                                  # if/when it needs to.
    ,input_dir = "."  # From here Pythagoras will read .csv files, 
                      # if/when asked.
    )




In [8]:
print(demo_cache_obj) # print the status of the cache 

PickleCache in directory <./cache_files> contains 0 files, with total size 0 B. There are 742 Gb of free space available in the directory. Cache files are expected to have <.pkl> extension. Input files should be located in <.> folder, which contains 270 files, with total size 850 Mb. Cache READER is ACTIVE: cached versions of objects are loaded from disk if they are available there. Cache WRITER is ACTIVE: new objects get saved to disk as they are created. Names of cache files can not be longer than 250 characters. Parent logger name is 'Pythagoras'; object REVEALS self-identity while logging. 


In [9]:
# This time we add a decorator while creating our function;
# everything else is exactly the same as above
@demo_cache_obj
def slowly_process_dataframe(df:pd.core.frame.DataFrame, a:float):
    result = df + a
    time.sleep(3)
    return result

In [10]:
%%time
demo_result = slowly_process_dataframe(a_dataframe,3.14)

04:52:52 Pythagoras.PickleCache.demo_cache_obj INFO: Starting generating data using slowly_process_dataframe() ...
04:52:55 Pythagoras.PickleCache.demo_cache_obj INFO: ...finished generating data using slowly_process_dataframe(). The process took 3.01 seconds. 

04:52:55 Pythagoras.PickleCache.demo_cache_obj INFO: Created 920 B file ./cache_files/Func__slowly_process_dataframe__bdca2/bdca2__DataFrame(3x4nans2)__Real3.14__b155d1097bc76a.pkl.



CPU times: user 24.5 ms, sys: 5.97 ms, total: 30.4 ms
Wall time: 3.04 s


In [11]:
demo_result

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,4.24,7.54,10.84,
1,5.34,8.64,11.94,14.25
2,6.44,,13.04,15.26


The first call above took 3+ seconds to execute. Let's call the function again with exactly the same parameters:

In [12]:
%%time
demo_result = slowly_process_dataframe(a_dataframe,3.14)

04:52:55 Pythagoras.PickleCache.demo_cache_obj INFO: Finished reading 920 B file ./cache_files/Func__slowly_process_dataframe__bdca2/bdca2__DataFrame(3x4nans2)__Real3.14__b155d1097bc76a.pkl. The process took 0.00392 seconds now, while in the past it costed 3.01 seconds to generate the same data using function slowly_process_dataframe().



CPU times: user 10 ms, sys: 2.34 ms, total: 12.4 ms
Wall time: 11.6 ms


In [13]:
demo_result

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,4.24,7.54,10.84,
1,5.34,8.64,11.94,14.25
2,6.44,,13.04,15.26


The second and all subsequent calls to **slowly_process_dataframe()**
are now much faster: we went down from 3 seconds to roughly 12 milliseconds.

### Calling the same function with different parameters

In [14]:
a_dataframe

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,1.1,4.4,7.7,
1,2.2,5.5,8.8,11.11
2,3.3,,9.9,12.12


In [15]:
a_dataframe.iat[1,1] = -100

In [16]:
a_dataframe

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,1.1,4.4,7.7,
1,2.2,-100.0,8.8,11.11
2,3.3,,9.9,12.12


In [17]:
%%time
demo_result = slowly_process_dataframe(a_dataframe,3.14)

04:52:55 Pythagoras.PickleCache.demo_cache_obj INFO: Starting generating data using slowly_process_dataframe() ...
04:52:58 Pythagoras.PickleCache.demo_cache_obj INFO: ...finished generating data using slowly_process_dataframe(). The process took 3.00 seconds. 

04:52:58 Pythagoras.PickleCache.demo_cache_obj INFO: Created 920 B file ./cache_files/Func__slowly_process_dataframe__bdca2/bdca2__DataFrame(3x4nans2)__Real3.14__4ca1211fe40051.pkl.



CPU times: user 25.1 ms, sys: 5.91 ms, total: 31 ms
Wall time: 3.03 s


In [18]:
demo_result

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,4.24,7.54,10.84,
1,5.34,-96.86,11.94,14.25
2,6.44,,13.04,15.26


In [19]:
%%time
demo_result = slowly_process_dataframe(a_dataframe,3.14)

04:52:58 Pythagoras.PickleCache.demo_cache_obj INFO: Finished reading 920 B file ./cache_files/Func__slowly_process_dataframe__bdca2/bdca2__DataFrame(3x4nans2)__Real3.14__4ca1211fe40051.pkl. The process took 0.00401 seconds now, while in the past it costed 3.00 seconds to generate the same data using function slowly_process_dataframe().



CPU times: user 9.73 ms, sys: 2.44 ms, total: 12.2 ms
Wall time: 11.2 ms


In [20]:
demo_result

Unnamed: 0,COL_1,COL_2,COL_3,COL_4
0,4.24,7.54,10.84,
1,5.34,-96.86,11.94,14.25
2,6.44,,13.04,15.26


The first time we called the function with slightly different parameters (one value in the input dataframe was changed), it again took 3 seconds to execute. However, the second call with the same new parameters was much faster.

In [21]:
print(demo_cache_obj)  # print the status of the cache 

PickleCache in directory <./cache_files> contains 2 files, with total size 2 Kb. There are 742 Gb of free space available in the directory. Cache files are expected to have <.pkl> extension. Input files should be located in <.> folder, which contains 272 files, with total size 850 Mb. Cache READER is ACTIVE: cached versions of objects are loaded from disk if they are available there. Cache WRITER is ACTIVE: new objects get saved to disk as they are created. Names of cache files can not be longer than 250 characters. Parent logger name is 'Pythagoras'; object REVEALS self-identity while logging. 


### How does it work?
The first time we ran **slowly_process_dataframe()**, Pythagoras cached the output of the function.
The next time the function was called, Pythagoras re-used that output without actually executing the original function. The output is stored as a pickle file, so it will save you time even when you run your notebook again next month.

If we pass some other arguments to the function, the same process will repeat.
For each distinct combination of values passed as function arguments, Pythagoras will create a new cache file.
If you have enough disk space, it will save you a lot of time.
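
The mechanism described above can be sketched with the standard library alone. The decorator below is a hypothetical illustration, not the code Pythagoras actually uses: the name `pickle_cached`, the file-naming scheme, and the choice to fingerprint arguments by hashing their pickled bytes are all assumptions made for this sketch.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def pickle_cached(cache_dir):
    """Hypothetical decorator: keep each result in a .pkl file keyed by the arguments."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Fingerprint the arguments by hashing their pickled bytes.
            # This is a simplification; a real library needs a more robust scheme.
            payload = pickle.dumps((args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()[:14]
            path = os.path.join(cache_dir, f"{func.__name__}__{key}.pkl")

            if os.path.exists(path):          # cache hit: skip the function
                with open(path, "rb") as f:
                    return pickle.load(f)

            result = func(*args, **kwargs)    # cache miss: compute and save
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator


calls = []                      # only used to count real executions in this demo
demo_dir = tempfile.mkdtemp()


@pickle_cached(demo_dir)
def add_offset(values, a):
    calls.append((tuple(values), a))
    return [v + a for v in values]


first = add_offset([1, 2, 3], 10)    # computed and written to disk
second = add_offset([1, 2, 3], 10)   # loaded from the .pkl file
third = add_offset([1, 2, 4], 10)    # different argument, so a new cache file
```

After these three calls the function body has run only twice, and the temporary directory holds two `.pkl` files, one per distinct set of arguments.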

> Pythagoras works not only with simple types of function arguments
> (such as int or str), but also with many others,
> including DataFrames, dicts, lists, sets,
> and all their possible combinations.
> **Most other persistent caching libraries cannot do this today;
>  they only work with a limited group of basic argument types.**

> If some existing or new (created by you) datatype
> is not supported by Pythagoras out of the box,
> Pythagoras provides a simple extensibility mechanism that allows you to
> add support for any new type with just a few lines of code.
> **Other persistent caching libraries cannot easily recognize new types today.**

> How can Pythagoras offer such powerful flexibility
> while almost all the other caching libraries only support
> a limited number of datatypes for function arguments?

> Most of the existing caching libraries in Python use
> easy-to-implement algorithms that only work
> with immutable values. Pythagoras uses a different approach
> that can work with both mutable and immutable values.
> This allows Pythagoras to cache functions whose parameters
> can be of virtually any existing type/class.

So, the next time you need to do complex feature engineering
by transforming a large dataframe into another, even larger dataframe,
put your feature engineering code into a function and decorate it with a PickleCache object.
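
To make the point about mutable arguments concrete, the sketch below first shows the standard library's `functools.lru_cache` rejecting a list argument, and then a content-based fingerprint that accepts nested dicts, lists and sets. The `fingerprint` helper is hypothetical and written only for this illustration; how Pythagoras computes its fingerprints is not described in this tutorial.

```python
import functools
import hashlib


@functools.lru_cache(maxsize=None)
def total(values):
    return sum(values)


try:
    total([1, 2, 3])            # a list is mutable, hence unhashable
    lru_error = None
except TypeError as exc:
    lru_error = exc             # lru_cache can only key on hashable arguments


def fingerprint(obj):
    """Hypothetical helper: hash an object by its content, not by its identity."""
    if isinstance(obj, dict):
        # Sort the items so that insertion order does not matter.
        parts = sorted((fingerprint(k), fingerprint(v)) for k, v in obj.items())
        canonical = "dict:" + repr(parts)
    elif isinstance(obj, (set, frozenset)):
        # Sort the element fingerprints so that iteration order does not matter.
        canonical = "set:" + repr(sorted(fingerprint(x) for x in obj))
    elif isinstance(obj, (list, tuple)):
        canonical = type(obj).__name__ + ":" + repr([fingerprint(x) for x in obj])
    else:
        canonical = type(obj).__name__ + ":" + repr(obj)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:14]


same = fingerprint({"a": [1, 2], "b": {3, 4}}) == fingerprint({"b": {4, 3}, "a": [1, 2]})
different = fingerprint([1, 2, 3]) != fingerprint([1, 2, 4])
```

Equal content yields the same fingerprint regardless of dict or set ordering, while changing a single list element changes it.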

### Important note

This approach only works with functions that create their output
using exclusively the input arguments, without accessing any outside data.
If a function reads from global variables, files, or the Internet,
or uses the current time, etc.,
that function is not compatible with PickleCache.
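
The short example below illustrates why such functions are a problem for any result cache. It uses the standard library's in-memory `functools.lru_cache` as a stand-in for a persistent cache (an assumption made to keep the example self-contained); the function and variable names are invented for the demonstration.

```python
import functools

threshold = 10


@functools.lru_cache(maxsize=None)   # in-memory stand-in for any result cache
def count_above(values):
    # Impure: the result depends on the global `threshold`,
    # which is not part of the cache key.
    return sum(1 for v in values if v > threshold)


before = count_above((5, 15, 25))    # computed while threshold == 10
threshold = 20
after = count_above((5, 15, 25))     # served from the cache: still the old answer
fresh = count_above.__wrapped__((5, 15, 25))   # what the uncached function returns now
```

Because the cache only sees the arguments, it keeps returning the stale result after the global value changes.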

### What is under the hood?


For every function that we modify with the PickleCache decorator,
Pythagoras creates a sub-folder within the main cache_dir folder.
The name of the sub-folder consists of the keyword "Func" plus the function's name
plus its digital fingerprint.

For example, for our slow function above, the subfolder was named
**Func\_\_slowly\_process\_dataframe\_\_bdca2**.


In [22]:
!ls cache_files/

Func__slowly_process_dataframe__bdca2


Inside each subfolder, Pythagoras puts .pkl files 
with different versions of function output.

The name of each file consists of the function's digital fingerprint,
plus a slim human-readable summary of the function arguments,
plus a digital fingerprint of the function arguments.

For example, for our slow function the subfolder now includes 2 files:

In [23]:
!ls cache_files/Func__slowly_process_dataframe__bdca2

bdca2__DataFrame(3x4nans2)__Real3.14__4ca1211fe40051.pkl
bdca2__DataFrame(3x4nans2)__Real3.14__b155d1097bc76a.pkl


Slim human-readable parameter representations in the file names let you visually inspect the caching directory and help you better understand the behavior of your PickleCache-enabled code. Digital fingerprints in the file names allow PickleCache to uniquely identify and distinguish different parameter values passed to the function when it was called.
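
As a rough imitation of the naming pattern visible in the listings above, the sketch below combines a short function tag, a readable argument summary, and an argument fingerprint into one file name. Everything here is an assumption made for illustration: `summarize` and `cache_file_name` are hypothetical helpers, and the hashing inputs and summary rules that Pythagoras really uses are not documented in this tutorial.

```python
import hashlib


def summarize(arg):
    """Hypothetical short, human-readable summary of one argument."""
    if isinstance(arg, str):
        return "Str" + arg.replace(" ", "_")[:20]
    if isinstance(arg, (int, float)):
        return "Real" + repr(arg)
    return type(arg).__name__


def cache_file_name(func_name, args):
    # Short tag identifying the function (here: a hash of its name only).
    func_tag = hashlib.sha256(func_name.encode("utf-8")).hexdigest()[:5]
    # Readable part: helps a human browsing the cache directory.
    summary = "__".join(summarize(a) for a in args)
    # Fingerprint part: distinguishes calls whose summaries look similar.
    args_tag = hashlib.sha256(repr(args).encode("utf-8")).hexdigest()[:14]
    return f"{func_tag}__{summary}__{args_tag}.pkl"


name_bob = cache_file_name("fast_function", ("Message for", "Bob"))
name_amy = cache_file_name("fast_function", ("Message for", "Amy"))
```

Both names share the same leading function tag, stay readable, and differ in their summaries and fingerprints.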

### A fast function + Pythagoras

But what if a function is fast, and the extra step of saving its output into
a .pkl file actually slows down the process instead of speeding it up?

Pythagoras will still do what we told it to do, but it will give us a warning:

In [24]:
@demo_cache_obj
def fast_function(x,y):
    return x+" "+y

In [25]:
%%time
result = fast_function("Message for","Bob")

04:52:59 Pythagoras.PickleCache.demo_cache_obj INFO: Starting generating data using fast_function() ...
04:52:59 Pythagoras.PickleCache.demo_cache_obj INFO: ...finished generating data using fast_function(). The process took 3.81e-06 seconds. 

04:52:59 Pythagoras.PickleCache.demo_cache_obj INFO: Created 93 B file ./cache_files/Func__fast_function__2b376/2b376__StrMessage_for__StrBob__dee1e74f6f5448.pkl.



CPU times: user 19.7 ms, sys: 5.58 ms, total: 25.3 ms
Wall time: 24.8 ms


In [26]:
%%time
result = fast_function("Message for","Bob")




CPU times: user 8.21 ms, sys: 2.06 ms, total: 10.3 ms
Wall time: 9.76 ms


In [27]:
print(demo_cache_obj)  # print the status of the cache 

PickleCache in directory <./cache_files> contains 3 files, with total size 2 Kb. There are 742 Gb of free space available in the directory. Cache files are expected to have <.pkl> extension. Input files should be located in <.> folder, which contains 273 files, with total size 850 Mb. Cache READER is ACTIVE: cached versions of objects are loaded from disk if they are available there. Cache WRITER is ACTIVE: new objects get saved to disk as they are created. Names of cache files can not be longer than 250 characters. Parent logger name is 'Pythagoras'; object REVEALS self-identity while logging. 


## How To Cache .read_csv() 

Reading *.csv* files can be slow. Let's create one and play with it:

In [28]:
large_dataframe = pd.DataFrame(data = 1000*np.random.rand(7000,7000))

In [29]:
large_dataframe.to_csv("example.csv", index=False)

In [30]:
%%time
new_dataframe = pd.read_csv("example.csv")

CPU times: user 9.7 s, sys: 384 ms, total: 10.1 s
Wall time: 10.1 s


In [31]:
new_dataframe.shape

(7000, 7000)

Now, let's try a version of **.read_csv()** offered by Pythagoras:

In [32]:
%%time
new_dataframe = demo_cache_obj.read_csv("example.csv")

04:54:08 Pythagoras.PickleCache.demo_cache_obj INFO: Finished reading 849 Mb file ./example.csv. The process took 10.0 seconds.

04:54:09 Pythagoras.PickleCache.demo_cache_obj INFO: Created 374 Mb file ./cache_files/Data__example.csv__44e43/44e43__size_889972651__mtime_y2020_m07_d05_h16_m53_s48__e44e5703b9edb3.pkl.



CPU times: user 9.76 s, sys: 518 ms, total: 10.3 s
Wall time: 10.4 s


In [33]:
%%time
new_dataframe = demo_cache_obj.read_csv("example.csv")

04:54:09 Pythagoras.PickleCache.demo_cache_obj INFO: Finished reading 374 Mb file ./cache_files/Data__example.csv__44e43/44e43__size_889972651__mtime_y2020_m07_d05_h16_m53_s48__e44e5703b9edb3.pkl. The process took 0.073 seconds now, while in the past it costed 10.0 seconds to read the same data from the original file ./example.csv.



CPU times: user 9.55 ms, sys: 86.8 ms, total: 96.4 ms
Wall time: 95.4 ms


In [34]:
new_dataframe.shape

(7000, 7000)

The second and all subsequent calls to Pythagoras' **.read_csv()** are much faster: we went down from 10.4 seconds to about 96 milliseconds, which is roughly 108 times faster. For larger .csv files the difference is even more drastic.

### How does it work?

The first time we ran **PickleCache.read_csv()**, Pythagoras cached the output of the function.
The next time **PickleCache.read_csv()** was called with the same filename as an argument,
Pythagoras re-used that output without actually reading the original **.csv**.
The cached output is stored as a pickle file.

If we modify the **.csv** file outside of our notebook, or if we add more arguments
to the **PickleCache.read_csv()** call, the same process will repeat.
For each new version of the **.csv** file, and for each combination of additional parameters,
Pythagoras will create a new cache file.
If you have enough disk space, it will save you a lot of time.

A pickle file can be loaded much faster than a .csv file,
so this approach saves substantial time while working with large .csv files.

**PickleCache.read_csv()** accepts all the *keyword arguments*
that **pandas.read_csv()** accepts,
such as *sep*, *names*, *index_col*, *dtype*, *na_values*, etc.
You can use them to fine-tune the behaviour of **read_csv()**.
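
The idea of tying a cached copy to one particular version of a file can be sketched with the standard library. In the example below the `csv` module stands in for pandas, and the helper `cached_read_rows` and its key format are hypothetical, loosely modelled on the size and modification-time fields visible in the cache file names; this is not the Pythagoras implementation.

```python
import csv
import os
import pickle
import tempfile


def cached_read_rows(csv_path, cache_dir):
    """Hypothetical helper: parse a CSV once, then reuse a pickled copy."""
    os.makedirs(cache_dir, exist_ok=True)
    info = os.stat(csv_path)
    # The key changes whenever the file's size or modification time changes.
    key = f"{os.path.basename(csv_path)}__size_{info.st_size}__mtime_{info.st_mtime_ns}"
    pkl_path = os.path.join(cache_dir, key + ".pkl")

    if os.path.exists(pkl_path):
        with open(pkl_path, "rb") as f:
            return pickle.load(f), "cache"

    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    with open(pkl_path, "wb") as f:
        pickle.dump(rows, f)
    return rows, "csv"


work_dir = tempfile.mkdtemp()
cache_dir = os.path.join(work_dir, "cache_files")
csv_path = os.path.join(work_dir, "example.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([["a", "b"], ["1", "2"]])

rows_1, source_1 = cached_read_rows(csv_path, cache_dir)   # parsed from the CSV
rows_2, source_2 = cached_read_rows(csv_path, cache_dir)   # loaded from the pickle

with open(csv_path, "a", newline="") as f:                 # modify the file
    csv.writer(f).writerow(["3", "4"])
rows_3, source_3 = cached_read_rows(csv_path, cache_dir)   # key changed: parsed again
```

The second call is served from the pickle, while the third call notices the changed file size and parses the CSV again, leaving two cache files behind.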

### What is under the hood?

For every csv file that we read with the **PickleCache.read_csv()** function,
Pythagoras creates a sub-folder within the main cache_dir folder.
The name of the sub-folder consists of the keyword "Data"
plus the original file's name plus a digital fingerprint of the filename.

For instance, for our file **example.csv** the subfolder was named **Data\_\_example.csv\_\_44e43**.


In [35]:
!ls cache_files

Data__example.csv__44e43              Func__slowly_process_dataframe__bdca2
Func__fast_function__2b376


Inside each subfolder, Pythagoras puts .pkl files with different versions of
the data from the original .csv file.

The name of each .pkl file consists of a digital fingerprint of the original .csv filename,
plus the size of the original .csv file, plus the datetime of its last modification,
plus encoded information about any additional arguments that were passed to the function.

For instance, for our **example.csv** file, the subfolder should include only one file now:

In [36]:
!ls cache_files/Data__example.csv__44e43

44e43__size_889972651__mtime_y2020_m07_d05_h16_m53_s48__e44e5703b9edb3.pkl


But let's call **read_csv()** with an extra argument, 
and then look into our subfolder again:

In [37]:
new_dataframe = demo_cache_obj.read_csv(
    "example.csv", usecols=[2,3,4,5,6,7,8,9,10,11,12,12,14,15,16,17,18])

04:54:12 Pythagoras.PickleCache.demo_cache_obj INFO: Finished reading 849 Mb file ./example.csv. The process took 2.56 seconds.

04:54:12 Pythagoras.PickleCache.demo_cache_obj INFO: Created 876 Kb file ./cache_files/Data__example.csv__44e43/44e43__size_889972651__mtime_y2020_m07_d05_h16_m53_s48__usecols=List(17)__cc7433ba843fa4.pkl.



In [38]:
new_dataframe.shape

(7000, 16)

In [39]:
!ls cache_files/Data__example.csv__44e43

44e43__size_889972651__mtime_y2020_m07_d05_h16_m53_s48__e44e5703b9edb3.pkl
44e43__size_889972651__mtime_y2020_m07_d05_h16_m53_s48__usecols=List(17)__cc7433ba843fa4.pkl


Pythagoras has created a second file. We can see from its name that **read_csv()** was called with an additional parameter named **usecols**, and the value of that parameter was a list with 17 elements. The exact values of these 17 elements are encoded into the digital fingerprint **cc7433ba843fa4** at the very end of the filename. (The list contains the index 12 twice, which is why the resulting dataframe has 16 columns rather than 17.)

## F.A.Q.

### How to temporarily disable caching functionality?

**PickleCache.__init__()** method accepts many parameters. 
You need to set two of them to **False**
if you want to completely disable
caching functionality provided by your PickleCache object.
These parameters are **read_from_cache** and **write_to_cache**.

When **write_to_cache** is set to **False**,
PickleCache will never save new objects to disk,
but it may still use existing cached objects when they are available
(depends on **read_from_cache** ).

When **read_from_cache** is set to **False**,
PickleCache will never return cached values stored in .pkl files.
It will always generate the value by re-running the original method.
However, PickleCache may still create new cached objects on disk in .pkl
files if the files do not exist (depending on the **write_to_cache** flag).

If you set both of these parameters to **False**,
PickleCache will transparently forward your calls
to the original data generation/reading functions,
as if PickleCache did not exist;
all caching functionality will be disabled.

If you set **write_to_cache** to **True**,
and **read_from_cache** to **False**,
it will force PickleCache to
overwrite existing cache entries, if they exist.

The **read_from_cache** and **write_to_cache** parameters
can also be passed to **read_csv()** or to a decorated
function at the time the function is called.
In this case the parameters will only affect
the behaviour of that individual call,
and will temporarily override the values
set for the PickleCache object via the **__init__()** method.

The **read_from_cache** and **write_to_cache** parameters
accept values of **True**, **False** and **None**.
Passing **None** is equivalent to
not passing any value at all.
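
The interaction of the two switches can be modelled in a few lines. The sketch below uses a plain dict in place of the .pkl files; `resolve_flag` and `get_value` are hypothetical helpers that follow the behaviour described in this section, not functions provided by Pythagoras.

```python
def resolve_flag(call_value, object_value):
    """A per-call value of None means 'use the setting of the cache object'."""
    return object_value if call_value is None else call_value


def get_value(compute, cache, key, read_from_cache=True, write_to_cache=True):
    """Hypothetical model of the two switches, using a dict as the cache."""
    if read_from_cache and key in cache:
        return cache[key]                 # reuse the stored value
    value = compute()                     # otherwise regenerate it
    if write_to_cache:
        cache[key] = value                # store it, overwriting any old entry
    return value


cache = {"report": "old"}

# Both switches on: the existing entry is reused.
reused = get_value(lambda: "new", cache, "report")

# Both switches off: the call goes straight to the function, cache untouched.
bypassed = get_value(lambda: "new", cache, "report",
                     read_from_cache=False, write_to_cache=False)
entry_after_bypass = cache["report"]

# Reading off, writing on: the value is regenerated and the entry overwritten.
refreshed = get_value(lambda: "new", cache, "report", read_from_cache=False)
entry_after_refresh = cache["report"]

# A per-call None falls back to the object's setting; an explicit value wins.
inherited = resolve_flag(None, object_value=True)
overridden = resolve_flag(False, object_value=True)
```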

### How to invalidate my cache?

To invalidate the cache, simply delete the cache directory or the appropriate sub-directories on your disk.
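
If you prefer to do this from Python rather than from a file manager, the standard `shutil.rmtree` function is enough. The sketch below builds a throwaway directory tree that mimics the layout shown earlier in this tutorial (the temporary location is an assumption made so that the example does not touch a real cache) and then removes first one sub-directory and then the whole cache.

```python
import os
import shutil
import tempfile

# Build a throwaway directory tree that mimics the layout shown earlier.
cache_dir = os.path.join(tempfile.mkdtemp(), "cache_files")
func_dir = os.path.join(cache_dir, "Func__slowly_process_dataframe__bdca2")
data_dir = os.path.join(cache_dir, "Data__example.csv__44e43")
os.makedirs(func_dir)
os.makedirs(data_dir)

# Invalidate the cached results of one function only.
shutil.rmtree(func_dir)
remaining = sorted(os.listdir(cache_dir))

# Invalidate the whole cache.
shutil.rmtree(cache_dir, ignore_errors=True)
```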

### How to make PickleCache more or less verbose?

PickleCache actively uses the logging engine
provided by the standard Python **logging** module.

Depending on the current logging level, PickleCache will print more or fewer messages.
PickleCache specifically recognizes 3 levels:
logging.DEBUG (the most verbose), logging.INFO
and logging.WARNING (the least verbose).

One way to change the current level is to put 
the following code at the very beginning of your notebook:
    
    import logging
    logging.basicConfig(level=logging.WARNING)
    
This code will change the logging level for the root logger,
which means for your entire program.
See the Python documentation to learn about the constraints
associated with basicConfig() usage.

Another way to change the current logging level is to 
use parameter **new_logging_level** 
while creating your PickleCache object:

    demo_cache_obj = Pythagoras.PickleCache(
    
        cache_dir = "./cache_files"   # Here Pythagoras will store cached data,
                                      # if/when it needs to.
                                    
        ,input_dir = "."  # From here Pythagoras will read .csv files, 
                          # if/when asked.
                          
        ,new_logging_level = logging.WARNING # Print only warnings,
                                             # discard less important messages.
                                             
        )
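
A third possibility relies only on the standard **logging** module: set the level on the named parent logger instead of the root logger. The sketch below assumes that the library logs through children of a logger called `Pythagoras` (as the status message printed earlier suggests) and that those children do not set a level of their own; if either assumption does not hold, this will have no effect.

```python
import logging

# Assumption: the library's loggers are descendants of the "Pythagoras" logger.
library_logger = logging.getLogger("Pythagoras")
library_logger.setLevel(logging.WARNING)

# A descendant without its own level inherits the level of its nearest
# configured ancestor, so INFO messages are now filtered out for it.
child = logging.getLogger("Pythagoras.PickleCache.demo_cache_obj")
info_enabled = child.isEnabledFor(logging.INFO)
warning_enabled = child.isEnabledFor(logging.WARNING)
```

Unlike `basicConfig()`, this leaves the root logger, and therefore the rest of your program, untouched.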

### Is persistent caching the only functionality offered by Pythagoras?

No. Pythagoras is a Python library which provides various tools to help
data scientists be more efficient. Persistent Caching is just one of many tools (to come).

## Next Steps

Congratulations! You've just finished the **introductory level** 
of the Guide to persistent caching with Pythagoras. 

Please proceed to the **[Advanced level](Pythagoras_caching_advanced_tutorial.ipynb)** of the Tutorial to learn how you can make PickleCache work with new classes and types.