<div align="center">
  <img src="http://vlpavlov.org/Pythagoras-Logo3.svg"><br>
</div>

# Speed Up Your Data Science Project Using Caching Tools from Pythagoras Package
## Advanced Tutorial

In the [Introductory Tutorial](Pythagoras_caching_introductory_tutorial.ipynb)
we reviewed the basic scenarios
of using Pythagoras for persistent caching. We demonstrated how Pythagoras can
speed up functions that take basic datatypes as their arguments.

But what if you want to cache a function which needs to work with a custom class?
Or, what if you need to cache a method of a custom class?
Pythagoras offers several extensibility mechanisms to address such cases.
These mechanisms will be explained below.

In [1]:
import numpy as np
import pandas as pd
import sys
import time
import logging

np.random.seed(42)

## Your Custom Classes

### A Naive Example

In [2]:
# Suppose, you have a class in your program which stores a US address:

class USAddress:
    street_address: str
    city: str
    zip_code: int
    state: str

    def __init__(self
            ,street_address: str
            , city: str
            , zip_code: int
            , state: str) -> None:
        self.street_address = street_address
        self.city = city
        self.zip_code = zip_code
        self.state = state

In [3]:
# Now, let's create a function which takes an instance of the class USAddress 
# as an argument, and let's PickleCache this function

from pythagoras import * # this is the library which will provide us with the
                         # advanced caching tools

my_cache_obj = PickleCache(
    cache_dir = "./pickle_cache_files" # Here Pythagoras will store cached data
    )

@my_cache_obj
def slowly_check_if_California(an_address:USAddress) -> bool:
    state_name = an_address.state.upper()
    time.sleep(3)
    if state_name == "CA" or state_name == "CALIFORNIA":
        return True
    else:
        return False





In [4]:
an_address = USAddress("1024 Lytton Ave", "Palo Alto", 94301, "CA")
slowly_check_if_California(an_address)



True

As you can see above, PickleCache did not know how to deal with your new class, so it generated an error.

How to fix it?

### Two _repr() methods

If you want PickleCache to be able to deal with your new class, you must implement two methods in the class:

        fingerprint_repr(self, fprepr_builder: FingerprintReprBuilder) -> str

        slim_repr(self, srepr_builder: SlimReprBuilder) -> str

The first method must return a unique text string. If two objects hold different values,
their **fingerprint_repr()** methods should return different strings. It's a digital fingerprint of your object.

The second method must return a short human-readable string. If two objects hold different values,
their **slim_repr()** methods may (or may not) return the same string. It's a concise text summary of the object.

* **fingerprint_repr()** returns a UNIQUE string, not necessarily short and not necessarily human-readable.
* **slim_repr()** returns a SHORT and HUMAN-READABLE string, not necessarily unique.

Let's implement these methods for our class. For now, we will ignore their second arguments,
as they provide access to advanced functionality beyond the scope of this tutorial.

In [5]:
class USAddress:
    street_address: str
    city: str
    zip_code: int
    state: str

    def __init__(self
            ,street_address: str
            , city: str
            , zip_code: int
            , state: str) -> None:
        self.street_address = street_address
        self.city = city
        self.zip_code = zip_code
        self.state = state

    def fingerprint_repr(self, _: FingerprintReprBuilder) -> str:
        fingerprint_str = type(self).__name__
        fingerprint_str += self.street_address
        fingerprint_str += self.city
        fingerprint_str += str(self.zip_code)
        fingerprint_str += self.state
        assert len(fingerprint_str), "fingerprint_repr cannot return an empty string"
        return fingerprint_str

    def slim_repr(self , srepr_builder: SlimReprBuilder) -> str:
        return "USAddr_"+self.state[:2].upper()
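
One caveat about the uniqueness requirement: joining fields without a separator, as the cell above does, loses the boundaries between fields, so two different objects can produce the same string. The sketch below illustrates this with a deliberately contrived pair of addresses. `Addr`, `naive_fingerprint`, and `delimited_fingerprint` are hypothetical names made up for this illustration; they are not part of Pythagoras.

```python
from dataclasses import dataclass


@dataclass
class Addr:
    """Hypothetical stand-in for USAddress, used only in this illustration."""
    street_address: str
    city: str
    zip_code: int
    state: str


def naive_fingerprint(a: Addr) -> str:
    # Plain concatenation: the boundaries between fields are lost.
    return "Addr" + a.street_address + a.city + str(a.zip_code) + a.state


def delimited_fingerprint(a: Addr) -> str:
    # repr() of a tuple quotes and separates each field, so boundaries survive.
    return "Addr" + repr((a.street_address, a.city, a.zip_code, a.state))


first = Addr("12 Elm", "wood Park", 60707, "IL")
second = Addr("12 Elmwood", " Park", 60707, "IL")

print(naive_fingerprint(first) == naive_fingerprint(second))          # True
print(delimited_fingerprint(first) == delimited_fingerprint(second))  # False
```

For real addresses such collisions are unlikely, but a separator (or a tuple `repr`) is a cheap way to keep the fingerprint unambiguous.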

In [6]:
@my_cache_obj
def slowly_check_if_California(an_address:USAddress) -> bool:
    state_name = an_address.state.upper()
    time.sleep(3)
    if state_name == "CA" or state_name == "CALIFORNIA":
        return True
    else:
        return False

In [7]:
an_address = USAddress("1024 Lytton Ave", "Palo Alto", 94301, "CA")
slowly_check_if_California(an_address)

True

Take a look at the log message above. Specifically, take a look at the name of the cache file.
As you can see, Pythagoras uses the output of the **slim_repr()** method to compose the name of the cache file.
It uses the output of **fingerprint_repr()** in a similar way, but indirectly - the returned string
was used to calculate the final fingerprint string 16132cef23a875, which then became a part of the file name.
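
To give a rough idea of how a long fingerprint string can be reduced to a short token such as 16132cef23a875, here is a minimal sketch. It assumes SHA-256 and a 14-character cut purely for illustration: the tutorial does not say which hash function Pythagoras uses, and `short_digest` is a hypothetical helper, so the value it prints is not expected to match the one in the log.

```python
import hashlib


def short_digest(fingerprint_text: str, length: int = 14) -> str:
    # Hypothetical helper: SHA-256 and the default length are assumptions
    # made for this illustration, not Pythagoras's documented algorithm.
    digest = hashlib.sha256(fingerprint_text.encode("utf-8")).hexdigest()
    return digest[:length]


text = "USAddress1024 Lytton AvePalo Alto94301CA"
token = short_digest(text)
print(token)       # a string of 14 hexadecimal characters
print(len(token))  # 14
```

The same input always yields the same token, which is what lets a cache find the file again on the next call.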

Now, let's see how our cache works when a function is called again with the same arguments:

In [8]:
slowly_check_if_California(an_address)

True

If you want to PickleCache methods of your class, you must always implement **fingerprint_repr()** and **slim_repr()**.
Why? The reason is simple: your methods always receive *self* as the first argument. As we discussed above,
this means *self* must support these two methods; otherwise PickleCache will refuse to work.

Let's take a look at the example.

In [9]:
class USAddress:
    street_address: str
    city: str
    zip_code: int
    state: str

    def __init__(self
            ,street_address: str
            , city: str
            , zip_code: int
            , state: str) -> None:
        self.street_address = street_address
        self.city = city
        self.zip_code = zip_code
        self.state = state

    
    def fingerprint_repr(self, _: FingerprintReprBuilder) -> str:
        fingerprint_str = type(self).__name__
        fingerprint_str += self.street_address
        fingerprint_str += self.city
        fingerprint_str += str(self.zip_code)
        fingerprint_str += self.state
        assert len(fingerprint_str), "fingerprint_repr cannot return an empty string"
        return fingerprint_str


    def slim_repr(self , srepr_builder: SlimReprBuilder) -> str:
        slim_str =  "USAddr_"+self.state[:2].upper()
        assert len(slim_str), "slim_repr can not return an empty string"
        return slim_str

  
    @my_cache_obj
    def average_household_income_in_the_neighbourhood(self, year:int) -> int:
        time.sleep(2)
        return 59000

In [10]:
an_address = USAddress("1024 Lytton Ave", "Palo Alto", 94301, "CA")

an_address.average_household_income_in_the_neighbourhood(2019)

59000

Please inspect the logging output above. Although the method **average_household_income_in_the_neighbourhood()**
was technically called with just one argument, the name of the cache file clearly shows that two arguments were processed: *self* and *year*.
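
The reason a decorated method sees two arguments is plain Python behavior and can be reproduced without Pythagoras. In the sketch below, `recording_decorator` and `Thing` are hypothetical names used only for illustration: the decorator records whatever positional arguments its wrapper receives.

```python
import functools

seen_arguments = []  # each entry is the tuple of positional args of one call


def recording_decorator(func):
    # Hypothetical stand-in for a caching decorator: it only records arguments.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        seen_arguments.append(args)
        return func(*args, **kwargs)
    return wrapper


class Thing:
    @recording_decorator
    def double(self, x: int) -> int:
        return 2 * x


thing = Thing()
result = thing.double(21)

print(result)                         # 42
print(len(seen_arguments[0]))         # 2
print(seen_arguments[0][0] is thing)  # True
```

The wrapper receives the instance as its first positional argument, which is why a caching decorator has to be able to fingerprint *self* as well as the explicit arguments.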

Now, let's try again to see if our cache saves time:

In [11]:
an_address.average_household_income_in_the_neighbourhood(2019)

59000

Indeed, it does!

### Disabling caching for individual objects

In the Introductory Tutorial we explained how to temporarily disable caching functionality using boolean **read_from_cache** and **write_to_cache** parameters. We demonstrated how to use these parameters on 2 levels:

* On the level of the PickleCache object, by passing the parameters while creating the cache. In this scenario the **read_from_cache** and **write_to_cache** parameters change the behavior of the entire cache.
* On the level of individual calls of cache-decorated functions, by passing the parameters while calling these functions. In this scenario the **read_from_cache** and **write_to_cache** parameters only affect caching behavior during those individual calls.

There is also a third way to use **read_from_cache** and **write_to_cache**: on the level of individual objects that have cache-decorated methods. In this scenario, the parameters only affect the behavior of the methods of those specific objects.

Check the example below and inspect the logging messages. Notice that caching has been disabled for the 2nd address:

In [12]:
address_1 = USAddress("1024 Lytton Ave", "Palo Alto", 94301, "CA")
address_2 = USAddress("2048 Everett Ave", "Palo Alto", 94301, "CA")

address_2.write_to_cache = False
address_2.read_from_cache = False

ahi_1 = address_1.average_household_income_in_the_neighbourhood(1980)
ahi_2 = address_2.average_household_income_in_the_neighbourhood(1980)

In [13]:
ahi_1 = address_1.average_household_income_in_the_neighbourhood(1980)
ahi_2 = address_2.average_household_income_in_the_neighbourhood(1980)

There are 3 different levels where you can define optional boolean **read_from_cache** and **write_to_cache** parameters. You can pass them to PickleCache object when you create it, you can set them as attributes of individual objects that have PickleCache-decorated methods, and you can also pass them as arguments to PickleCache-decorated methods when you call the methods. 

What is the order of precedence if conflicting values of these parameters are defined on different levels?

**read_from_cache** and **write_to_cache** parameters, passed to methods during the individual calls, take the highest precedence and define PickleCache behavior for those individual calls. If **read_from_cache** and **write_to_cache** parameters were not passed (or if None values were passed) as method arguments, then object attributes define the caching behavior. Finally, if an object does not have  **read_from_cache** and **write_to_cache** attributes (or it does, but their values are None), then the caching behavior is defined by what was set when PickleCache constructor was called. 

This 3-level model allows for fine-grained control of the caching mechanism.
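
The precedence rule described above amounts to "the most specific non-None value wins". A minimal sketch of that rule follows; `resolve_flag` is a hypothetical helper written for this illustration, not a function exposed by Pythagoras, and it would be applied separately to **read_from_cache** and to **write_to_cache**.

```python
from typing import Optional


def resolve_flag(call_value: Optional[bool],
                 object_value: Optional[bool],
                 cache_value: bool) -> bool:
    """Hypothetical helper: return the first non-None value, most specific first."""
    for value in (call_value, object_value):
        if value is not None:
            return value
    # Neither the call nor the object set the flag: fall back to the cache-wide value.
    return cache_value


print(resolve_flag(None, None, True))    # True:  cache-wide setting applies
print(resolve_flag(None, False, True))   # False: object attribute overrides the cache
print(resolve_flag(True, False, False))  # True:  call argument overrides both
```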

## Third-Party Custom Classes

Now, suppose you are using a third-party library. It was written by someone else, and
you have no way to force its authors to implement the **fingerprint_repr()** and **slim_repr()** methods in their classes.

        class InternationalAddress:
            street_address: str
            city: str
            province: str
            postal_code: int
            country: str

Still, your function uses these classes and you need to cache the function. What to do?

The easiest way is to implement custom representation handlers and pass them to the PickleCache constructor.
These are two functions, each of which takes an arbitrary object (plus a builder argument, which we ignore here as before) and returns either None or a string.
A handler returns None if it does not know how to build a slim/fingerprint string for the object,
and it returns a string with the slim/fingerprint representation when it does know how to create one.

Below is an example.

In [14]:
############## Third-Party Code, You Can Not Modify It ############################

class InternationalAddress:
    street_address: str
    city: str
    province: str
    postal_code: int
    country: str

    def __init__(self
            , street_address: str
            , city: str
            , postal_code: int
            , province: str
            , country:str) -> None:
        self.street_address = street_address
        self.city = city
        self.postal_code = postal_code
        self.province = province
        self.country = country

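############## Your Code, You Write It ############################

from typing import Any, Optional # imported explicitly, in case the star import above does not provide them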

def my_slim_repr_handler(an_object:Any,_:SlimReprBuilder) -> Optional[str]:
    if type(an_object) == InternationalAddress:
        slim_str = "IntAddr_" + an_object.country[:3].upper()
        return slim_str
    else:
        return None

def my_fingerprint_repr_handler(an_object:Any,_:FingerprintReprBuilder) -> Optional[str]:
    if type(an_object) == InternationalAddress:
        fingerprint_str = type(an_object).__name__
        fingerprint_str += an_object.street_address
        fingerprint_str += an_object.city
        fingerprint_str += an_object.province
        fingerprint_str += str(an_object.postal_code)
        fingerprint_str += an_object.country
        return fingerprint_str
    else:
        return None

my_custom_cache_obj = PickleCache(
    cache_dir = "./pickle_cache_files"   
    , custom_slim_repr_handler = my_slim_repr_handler
    , custom_fingerprint_repr_handler = my_fingerprint_repr_handler
    )

@my_custom_cache_obj
def slowly_check_if_Canada(an_address:InternationalAddress) -> bool:
    country_name = an_address.country.upper()
    time.sleep(3)
    return country_name == "CANADA"

an_address = InternationalAddress("1024 Lytton Ave", "Palo Alto", 94301, "California","USA")

slowly_check_if_Canada(an_address)

False

In [15]:
slowly_check_if_Canada(an_address)

False

Please inspect the logging output from the two cells above to better understand what happened.
Pay attention to the name of the cache file.
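
The handler contract used in this section (return a string for types you recognize, None for everything else) lets a cache try your handler first and fall back to its built-in logic otherwise. The sketch below shows one way such a dispatch could work. It is an assumption about the general pattern, not Pythagoras's actual implementation: `describe`, `Point`, and `point_handler` are hypothetical names, and the builder argument is omitted for brevity.

```python
from typing import Any, Callable, Optional

Handler = Callable[[Any], Optional[str]]


def describe(an_object: Any, custom_handler: Optional[Handler] = None) -> str:
    """Hypothetical dispatcher: custom handler first, built-in types second."""
    if custom_handler is not None:
        text = custom_handler(an_object)
        if text is not None:  # the handler recognized the object
            return text
    if isinstance(an_object, (int, float, str, bool)):
        return repr(an_object)  # built-in fallback for basic datatypes
    raise TypeError(f"No representation available for {type(an_object).__name__}")


class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


def point_handler(an_object: Any) -> Optional[str]:
    if type(an_object) is Point:
        return f"Point_{an_object.x}_{an_object.y}"
    return None  # not our type: let the caller fall back to its defaults


print(describe(Point(1, 2), point_handler))  # Point_1_2
print(describe(7, point_handler))            # 7
```

Returning None rather than raising an exception is what makes the handler composable: it can stay silent about every type it does not own.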

## Next Steps

Congratulations! You are now well equipped to speed up your .csv file-loading and subsequent feature engineering code using Pythagoras PickleCache.

In these 2 tutorials you have learned: 
* how to use PickleCache with popular datatypes (such as floats, lists, and Pandas DataFrames);
* how to write your own classes compatible with PickleCache;
* how to make PickleCache work with third-party classes.

In the 1880s, psychologist Hermann Ebbinghaus discovered what he called "the forgetting curve": roughly 56% of new information is forgotten in one hour, 66% after a day, and 75% after six days. Most probably, you will forget about PickleCache in a week.

But there is a simple strategy to fight this phenomenon: practice. Please ***SPEND ANOTHER HOUR TODAY PRACTICING USING PYTHAGORAS PICKLECACHE IN ONE OF YOUR REAL-LIFE PROJECTS***. It will improve the project's performance and help you remember how to use PickleCache.