# Caching for ML Model Deployments

In a [previous blog post]() we introduced the decorator pattern for ML model deployments and then showed how to use the pattern to build extensions to a normal deployment. For example, in [this blog post]() we added data enrichment, in [this blog post]() we added logging, in [this blog post]() we added metrics, and in [this blog post]() we added distributed tracing. All of these extensions were added without having to modify the machine learning model code at all, we were able to do it by using the decorator pattern. In this blog post we’ll add caching functionality to a model in the same way.

This blog post is written in a Jupyter notebook, some of the code and commands found in it reflects this.

## Introduction

In a software system, a [cache](https://en.wikipedia.org/wiki/Cache_(computing)) is a data store that is used to temporarily store computation results or frequently-accessed data. When accessing the results of a computation from a cache, we are able to avoid paying the cost of recomputing the result. When accessing a frequently accessed piece of data we are able to avoid paying the cost of accessing the data from a slower data store, this type of caching is used when accessing data from a slower data store than the cache. When a cache hit occurs, the data being sought is found and returned to the caller. When a “miss” occurs, the data is not found and must be recomputed or accessed from the slower data store by the caller. A data cache is generally built using storage that has low latency, which means that it is more expensive to run. 

Machine learning model deployments can benefit from caching because making predictions with a model is usually a CPU-bound process, especially for large and complex models. Predictions that take a long time to make can be cached and returned later when the same prediction is requested. This type of caching is also known as [memoization](https://en.wikipedia.org/wiki/Memoization).

In order to enable prediction caching possible from ML models, we need to make sure that the model can produce deterministic predictions. Determinism is a property of algorithms that says that the algorithm will always return the same output for the same input. If the model for which we want to cache predictions returns a different prediction for the same inputs, then we wouldn’t be able to cache the predictions at all since we wouldn’t be able to guarantee that the model would return the same prediction that we had cached.

The effectiveness of a cache is measured by the cache hit ratio, which is defined as the number of times a piece of needed data is found in the cache, divided by the total number of times a piece of data is requested from the cache. A bigger cache can hold more data which means it can have a higher hit ratio, but it will cost more money to run. The size of a cache must be balanced with the performance benefits that it provides to make the economics make sense.

Another aspect of cache performance is the amount of time that is saved when using the cache to access data. A cache is only a net benefit if the time saved during cache hits exceeds the time lost from the additional overhead of recalculating the needed results.

## Software Architecture

![Software Architecture](software_architecture_cfmlm.png)
![Software Architecture]({attach}software_architecture_cfmlm.png){ width=100% }

For caching predictions, we’ll be using [Redis](https://en.wikipedia.org/wiki/Redis). Redis is a data structure store that allows users to save and modify data structures in a remote service. This allows many clients to safely access the same data from a centralized service. Redis supports many different data structures, but we’ll be using the key-value store functionality to save our predictions.

## Installing the Model

To make this blog post a little shorter we won't train a completely new model. Instead we'll install a model that we've [built in a previous blog post](https://www.tekhnoal.com/regression-model.html). The code for the model is in [this github repository](https://github.com/schmidtbri/regression-model).

To install the model, we can use the pip command and point it at the github repo of the model.

In [1]:
from IPython.display import clear_output
from IPython.display import Markdown as md

!pip install -e git+https://github.com/schmidtbri/regression-model#egg=insurance_charges_model

clear_output()

To make a prediction with the model, we'll import the model's class.

In [2]:
from insurance_charges_model.prediction.model import InsuranceChargesModel

Now we can instantiate the model:

In [3]:
model = InsuranceChargesModel()

clear_output()

To make a prediction, we'll need to use the model's input schema class.

In [4]:
from insurance_charges_model.prediction.schemas import InsuranceChargesModelInput, \
    SexEnum, RegionEnum

model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.0,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

The model's input schema is called InsuranceChargesModelInput and it encompasses all of the features required by the model to make a prediction.

Now we can make a prediction with the model by calling the predict() method with an instance of the InsuranceChargesModelInput class.

In [5]:
prediction = model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=8640.78)

The model predicts that the charges will be $8640.78.

When deploying the model we’ll pretend that the age, sex, bmi, children, smoker, and region fields are not available from the client system that is calling the model. Because of this, we’ll need to add it to the model input by loading the data from the database.

We can view input schema of the model as a JSON schema document by calling the .schema() method on the instance.

In [6]:
model.input_schema.schema()

{'title': 'InsuranceChargesModelInput',
 'description': "Schema for input of the model's predict method.",
 'type': 'object',
 'properties': {'age': {'title': 'Age',
   'description': 'Age of primary beneficiary in years.',
   'minimum': 18,
   'maximum': 65,
   'type': 'integer'},
  'sex': {'title': 'Sex',
   'description': 'Gender of beneficiary.',
   'allOf': [{'$ref': '#/definitions/SexEnum'}]},
  'bmi': {'title': 'Body Mass Index',
   'description': 'Body mass index of beneficiary.',
   'minimum': 15.0,
   'maximum': 50.0,
   'type': 'number'},
  'children': {'title': 'Children',
   'description': 'Number of children covered by health insurance.',
   'minimum': 0,
   'maximum': 5,
   'type': 'integer'},
  'smoker': {'title': 'Smoker',
   'description': 'Whether beneficiary is a smoker.',
   'type': 'boolean'},
  'region': {'title': 'Region',
   'description': 'Region where beneficiary lives.',
   'allOf': [{'$ref': '#/definitions/RegionEnum'}]}},
 'definitions': {'SexEnum': {'titl

## Profiling the Model

In order to get an idea of how much time it takes for our model to make a prediction, we'll profile it by making predictions with ranom data. To do this, we'll use the [Faker package](https://faker.readthedocs.io/en/master/). We can install it with this command:

In [7]:
!pip install Faker

clear_output()

We'll create a function that can generate a random sample that meets the model's input schema:

In [8]:
from faker import Faker

faker = Faker()

def generate_record() -> InsuranceChargesModelInput:
    record = {
        "age": faker.random_int(min=18, max=65),
        "sex": faker.random_choices(elements=("male", "female"), length=1)[0],
        "bmi": faker.random_int(min=15000, max=50000)/1000.0,
        "children": faker.random_int(min=0, max=5),
        "smoker": faker.boolean(),
        "region": faker.random_choices(elements=("southwest", "southeast", "northwest", "northeast"), length=1)[0]
    }
    return InsuranceChargesModelInput(**record)

The function returns an instance of the InsuranceChargesModelInput class, which is the type required by the model's predict() method. We'll use this function to profile the predict() method of the model.

It's really hard to see a performance difference with one sample, so we'll perform a test with many random samples to see the difference. To start, we'll generate 1000 samples and save them:

In [9]:
samples = []

for _ in range(1000):
    samples.append(generate_record())

By using the timeit module from the standard library, we can measure how much time it takes to call the model's predict method with a random sample generated by the generate_record() function. We'll call the method 1000 times.

In [10]:
import timeit

total_seconds = timeit.timeit("[model.predict(sample) for sample in samples]", number=1, globals=globals())

In [11]:
seconds_per_sample = total_seconds / 1000.0
milliseconds_per_sample = seconds_per_sample * 1000.0

md("The model took {} seconds to perform 1000 predictions, therefore it took "
   "{} seconds to make a single prediction. \n\nThe model takes about {} milliseconds to "
   "make a prediction".format(round(total_seconds, 3), 
                              round(seconds_per_sample, 4), 
                              round(milliseconds_per_sample, 3)))

The model took 33.307 seconds to perform 1000 predictions, therefore it took 0.0333 seconds to make a single prediction. 

The model takes about 33.307 milliseconds to make a prediction

## Hashing Model Inputs

Before we can build a caching decorator, we'll need to understand a little bit about hashing and how to use it for caching. A hashing operation is an operation takes in data of arbritrary size as input and returns data of a fixed size. A "hash" value refers to the fixed-size data that is returned from a hashing operation. Hashing has many uses in computer science, in this application we'll us hashing to uniquely identify some inputs that are provided to the ML model that we are decorating.

Hashing is already built into the Python standard library through the hash() function, but it is only supported on certain types of objects. We can try it out using an instance of the model's input schema:

In [12]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.0,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

model_input_dict = model_input.dict()
frozen_dict = frozenset(model_input_dict.keys()), frozenset(model_input_dict.values())

hash(frozen_dict)

1297609423663202376

To try out hashing, we converted an instance of the model's input schema into a dictionary, and then converted the keys and values of the dictionary into [frozensets](https://docs.python.org/3/library/stdtypes.html#frozenset). We then used the frozensets with the hash() function to create an integer value. The integer is the hashed value that we need to uniquely identify the inputs to the model.

To see how hashing works, we'll create a separate input instance for the model that has the exact same values and hash it:

In [13]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.0,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

model_input_dict = model_input.dict()
frozen_dict = frozenset(model_input_dict.keys()), frozenset(model_input_dict.values())

hash(frozen_dict)

1297609423663202376

The hashed values are exactly the same, as we expected. The hashes value should be different if any of the values in the model input change:

In [14]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.2,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

model_input_dict = model_input.dict()
frozen_dict = frozenset(model_input_dict.keys()), frozenset(model_input_dict.values())

hash(frozen_dict)

-7143663760078629168

The "bmi" field changed from 24.0 to 24.2, so we got a completely different hashed value.

Hashing is a quick and easy way to identify inputs which will allow us to store the predictions of the model in the cache and retrieve them later. 

## Creating the Redis Cache Decorator

We'll be using Redis to hold the cached predictions of the model. To access the Redis instance, we'll use the redis python package, which we'll install with this command:

In [15]:
!pip install redis

clear_output()

Now we can implement the decorator class:

In [16]:
import os
from typing import List, Optional
from ml_base.decorator import MLModelDecorator
import redis
import json


class RedisCachingDecorator(MLModelDecorator):
    """Decorator for caching around an MLModel instance."""

    def __init__(self, host: str, port: str, database: str, prefix: Optional[str] = None, 
                 hashing_fields: Optional[List[str]] = None) -> None:
        
        super().__init__(host=host, port=port, database=database, prefix=prefix, 
                         hashing_fields=hashing_fields)
        
        self.__dict__["_redis_client"] = redis.Redis(host=host, port=port, db=database)

    def predict(self, data):
        if self._configuration["prefix"] is not None:
            prefix = "{}/{}/{}/".format(self._configuration["prefix"], 
                                        self._model.qualified_name, 
                                        self._model.version)
        else:
            prefix = "{}/{}/".format(self._model.qualified_name, 
                                     self._model.version)

        # select hashing fields from input
        if self._configuration["hashing_fields"] is not None:
            data_dict = {key: data.dict()[key] for key in self._configuration["hashing_fields"]}
        else:
            data_dict = data.dict()
        
        # creating a key for the prediction inputs provided
        frozen_data = frozenset(data_dict.keys()), frozenset(data_dict.values())
        key = prefix + str(hash(frozen_data))
       
        # check if the prediction is in the cache
        prediction = self.__dict__["_redis_client"].get(key)
        
        # if the prediction is present in the cache, then deserialize it and return the prediction
        if prediction is not None:
            prediction = json.loads(prediction)
            prediction = self._model.output_schema(**prediction)
            return prediction
        # if the prediction is not present in the cache, then make a prediction, save it to the cache, and return the prediction
        else:
            prediction = self._model.predict(data)
            serialized_prediction = json.dumps(prediction.dict())
            self.__dict__["_redis_client"].set(key, serialized_prediction)
            return prediction

The caching decorator works very simply, when it receives inputs for the model it:

- creates a key for the data structures
- checks if the key is present in the cache
- if the key is present:
    - retrieves the prediction for that key 
    - deserializes the contents of the cache into the output type of the model
    - returns the prediction to the caller
- if the key is not present:
    - makes a prediction with the model it is decorating
    - serializes the prediction to a JSON string
    - saves the prediction to the cache with the key generated
    - returns the prediction to the caller

The key created for each cache entry is made up of the model's qualified name, the model version and an optional prefix. The prefix is optional and is used to differentiate the predictions that are cached in a more flexible way. The caching decorator uses JSON as a serialization format to store information in the cache. 

## Using the Redis Cache Decorator

In order to try out the decorator, we'll need to run a local Redis instance. We can start one using Docker with this command:

In [17]:
!docker run -d -p 6379:6379 --name local-redis redis/redis-stack-server:latest

aa2cfbefa2290fcc314aeaeab9935ce92dabcc650becb9807319ef90473a292e


To test out the decorator we first need to instantiate the model object that we want to use with the decorator.

In [18]:
model = InsuranceChargesModel()

Next, we’ll instantiate the decorator with the parameters.

In [19]:
caching_decorator = RedisCachingDecorator(host="localhost", 
                                          port=6379,
                                          database=0,
                                          prefix="prefix")

We can add the model instance to the decorator after it’s been instantiated like this:

In [20]:
decorated_model = caching_decorator.set_model(model)

We can see the decorator and the model objects by printing the reference to the decorator:

In [21]:
decorated_model

RedisCachingDecorator(InsuranceChargesModel)

The decorator object is printing out it's own type along with the type of the model that it is decorating.

Now we’ll try to use the decorator and the model together by making a few predictions.

In [22]:
model_input = InsuranceChargesModelInput(
    age=46,
    sex=SexEnum.female,
    bmi=24.0,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

prediction = decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=9612.64)

The first time we make a prediction with a given input, we'll get the prediction made by the model and the decorator will store the prediction in the cache. 

We can view the key in the redis database to see how it is stored.

In [23]:
!docker exec local-redis redis-cli SCAN 0 

0
prefix/insurance_charges_model/0.1.0/6204525449924069161


There is a single key in the redis database. We'll access they key like this:

In [24]:
!docker exec local-redis redis-cli GET prefix/insurance_charges_model/0.1.0/198181040256854193




The prediction is stored in the key as a JSON string.

We'll try the same prediction again:

In [25]:
model_input = InsuranceChargesModelInput(
    age=46, 
    sex=SexEnum.female,
    bmi=24.0,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

prediction = decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=9612.64)

This time the prediction was not made by the model, it was found in the Redis cache and returned by the decorator instead of being made again.

Next, we'll use the samples to make predictions with the decorated model:

In [26]:
decorated_total_seconds = timeit.timeit("[decorated_model.predict(sample) for sample in samples]", number=1, globals=globals())

In [27]:
decorated_seconds_per_sample = decorated_total_seconds / 1000.0
decorated_milliseconds_per_sample = decorated_seconds_per_sample * 1000.0

md("The decorated model took {} seconds to perform 1000 predictions the first time that it saw "
   "the prediction inputs, therefore it took {} seconds to make a single prediction. "
   "\n\nThe decorated model takes about {} milliseconds to make a prediction.".
   format(round(decorated_total_seconds, 3), 
          round(decorated_seconds_per_sample, 4), 
          round(decorated_milliseconds_per_sample, 3)))

The decorated model took 35.53 seconds to perform 1000 predictions the first time that it saw the prediction inputs, therefore it took 0.0355 seconds to make a single prediction. 

The decorated model takes about 35.53 milliseconds to make a prediction.

We'll run the same samples through again:

In [28]:
decorated_total_seconds = timeit.timeit("[decorated_model.predict(sample) for sample in samples]", number=1, globals=globals())

In [29]:
decorated_seconds_per_sample = decorated_total_seconds / 1000.0
decorated_milliseconds_per_sample = decorated_seconds_per_sample * 1000.0

md("The decorated model took {} seconds to perform 1000 predictions the second time that it saw "
   "the prediction inputs, therefore it took {} seconds to make a single prediction. "
   "\n\nThe model takes about {} milliseconds to make a prediction.".
   format(round(decorated_total_seconds, 3), 
          round(decorated_seconds_per_sample, 4), 
          round(decorated_milliseconds_per_sample, 3)))

The decorated model took 0.894 seconds to perform 1000 predictions the second time that it saw the prediction inputs, therefore it took 0.0009 seconds to make a single prediction. 

The model takes about 0.894 milliseconds to make a prediction.

It took less time because the cached predictions were returned more quickly because we requested the same predictions from the model.

We can get the amount of memory used by the cache by accessing the keys and summing up the length of the byte array.

In [30]:
r = redis.StrictRedis(host='localhost', port=6379, db=0)

decorated_number_of_bytes = 0
decorated_total_entries = 0
for key in r.scan_iter("prefix*"):
    decorated_number_of_bytes += len(r.get(key))
    decorated_total_entries = decorated_total_entries + 1
    
decorated_average_number_of_bytes = decorated_number_of_bytes / decorated_total_entries
    
md("The keys in the cache take up a total of {} bytes. "
   "The average number of bytes per cache entry is {}."
   .format(decorated_number_of_bytes, 
           round(decorated_average_number_of_bytes, 2)))

The keys in the cache take up a total of 20630 bytes. The average number of bytes per cache entry is 20.61.

We'll clear the redis database to make sure the contents don't intefere with the next things we want to try.

In [31]:
!docker exec local-redis redis-cli FLUSHDB

OK


## Selecting Fields For Hashing

In certain situations, not all of the fields in the model's input should be used to create a hash. This may be because not all of the model's input fields are actually used for making a prediction. Some fields may be used for logging or debugging and do not actually affect the prediction created by the model. If changing the value of a field does not affect the value of the prediction created by the model, it should not be used to create the hashed key for the cache.

The caching decorator supports selecting specific fields from the input to create the cache key. The option is called "hashing_fields" and is provided to the decorator instance like this:

In [32]:
caching_decorator = RedisCachingDecorator(host="localhost", 
                                          port=6379,
                                          database=0,
                                          prefix="prefix",
                                          hashing_fields=["age", "sex", "bmi", "children", "smoker"])

decorated_model = caching_decorator.set_model(model)

The decorator now uses all of the input fields except for the "region" field to create the key.

To try out the functionality, we'll create a prediction with the decorated model. The prediction will get saved in the cache.

In [33]:
model_input = InsuranceChargesModelInput(
    age=52, 
    sex=SexEnum.female,
    bmi=24.0,
    children=3,
    smoker=False,
    region=RegionEnum.northwest)

prediction = decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15219.19)

We'll now make the same prediction, but this time the prediction will come from the cache because it was saved there previously.

In [34]:
model_input = InsuranceChargesModelInput(
    age=52, 
    sex=SexEnum.female,
    bmi=24.0,
    children=3,
    smoker=False,
    region=RegionEnum.northwest)

prediction = decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15219.19)

We'll make the prediction one more time, but this time we'll change the value of the "region" field.

In [35]:
model_input = InsuranceChargesModelInput(
    age=52, 
    sex=SexEnum.female,
    bmi=24.0,
    children=3,
    smoker=False,
    region=RegionEnum.southeast)

prediction = decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15219.19)

The predicted value should have changed because the region changed. It didn't change because we accessed the prediction from the cache instead of creating a new one. This happened because we ignored the value of the "region" field when creating the hashed key in the cache.

In [36]:
!docker exec local-redis redis-cli FLUSHDB

OK


## Improving the Performance of the Decorator

When a prediction is stored in the cache, it is currently serialized using the JSON format. This format is simple and easy to understand, but it is not the most efficient format for serialization in terms of the size of the data and the time it takes to do the serialization.

To try to improve the efficiency of the caching decorator we'll add options for other serialization formats and also try to use compression. Another way to reduce the memory usage of the cache is to reduce the precision of the numbers given to the model. These approaches will be fully explained below.

We'll be using [MessagePack](https://msgpack.org/index.html) to to serialization, so we need to install the package:

In [37]:
!pip install msgpack
!pip install python-snappy

clear_output()

We'll recreate the RedisCachingDecorator class with the code needed to support the new features we want to work with.

In [38]:
import msgpack
import snappy


class RedisCachingDecorator(MLModelDecorator):
    """Decorator for caching around an MLModel instance."""

    def __init__(self, host: str, port: str, database: str, prefix: Optional[str] = None, 
                 hashing_fields: Optional[List[str]] = None, serder: str = "JSON", 
                 use_compression: bool = False, 
                 reduced_precision_fields: Optional[List[str]] = None,
                 number_of_places: Optional[int] = None
                ) -> None:
        
        if serder not in ["JSON", "MessagePack"]:
            raise ValueError("Serder option not supported.")
            
        if reduced_precision_fields is None and number_of_places is not None:
            raise ValueError("number_of_places must be provided when reduced_precision_fields is provided.")
            
        if number_of_places is None and reduced_precision_fields is not None:
            raise ValueError("reduced_precision_fields must be provided when number_of_places is provided.")
        
        super().__init__(host=host, port=port, database=database, prefix=prefix, 
                         hashing_fields=hashing_fields, serder=serder, 
                         use_compression=use_compression, 
                         reduced_precision_fields=reduced_precision_fields,
                         number_of_places=number_of_places)
        
        self.__dict__["_redis_client"] = redis.Redis(host=host, port=port, db=database)

    def predict(self, data):
        if self._configuration["prefix"] is not None:
            prefix = "{}/{}/{}/".format(self._configuration["prefix"], 
                                        self._model.qualified_name, 
                                        self._model.version)
        else:
            prefix = "{}/{}/".format(self._model.qualified_name,
                                     self._model.version)
        #print(data)
        if self._configuration["reduced_precision_fields"] is not None:
                for field_name, field_value in data.dict().items():
                    if field_name in self._configuration["reduced_precision_fields"]:
                        setattr(data, field_name, round(field_value, self._configuration["number_of_places"]))
        #print(data)

        # select hashing fields from input
        if self._configuration["hashing_fields"] is not None:
            data_dict = {key: data.dict()[key] for key in self._configuration["hashing_fields"]}
        else:
            data_dict = data.dict()
        
        # creating a key for the prediction inputs provided
        frozen_data = frozenset(data_dict.keys()), frozenset(data_dict.values())
        key = prefix + str(hash(frozen_data))
       
        # check if the prediction is in the cache
        prediction = self.__dict__["_redis_client"].get(key)
        
        # if the prediction is present in the cache
        if prediction is not None:

            # optionally decompressing the bytes
            if self._configuration["use_compression"]:
                decompressed_prediction = snappy.decompress(prediction)
            else:
                decompressed_prediction = prediction
            
            # deserializing to bytes
            if self._configuration["serder"] == "JSON":
                deserialized_prediction = json.loads(decompressed_prediction.decode())
            elif self._configuration["serder"] == "MessagePack":
                deserialized_prediction = msgpack.loads(decompressed_prediction)
            else: 
                raise ValueError("Serder option not supported.")
                
            # creating the output instance
            prediction = self._model.output_schema(**deserialized_prediction)

            return prediction

        # if the prediction is not present in the cache
        else:
            # making a prediction with the model
            prediction = self._model.predict(data)

            # serializing to bytes
            if self._configuration["serder"] == "JSON":
                serialized_prediction = str.encode(json.dumps(prediction.dict()))
            elif self._configuration["serder"] == "MessagePack":
                serialized_prediction = msgpack.dumps(prediction.dict())
            else: 
                raise ValueError("Serder option not supported.")
                
            # optionally compressing the bytes
            if self._configuration["use_compression"]:
                serialized_prediction = snappy.compress(serialized_prediction)
                
            # saving the prediction to the cache
            self.__dict__["_redis_client"].set(key, serialized_prediction)

            return prediction

The new implementation above includes options to enable MessagePack for serialization/deserialization, snappy for compression, and the ability to reduce the precision of numerical fields in the model input. We'll try out each option individually.

### MessagePack Serialization

An alternative serialization format is [MessagePack](https://msgpack.org/index.html). This format is a binary serialization format designed for small and efficient and flexible serialization. 

To enable MessagePack, we'll instantiate the decorator setting the "serder" option to "MessagePack". We'll use a prefix to separate the cache entries that use MessagePack from the other cache entries.

In [39]:
msgpack_caching_decorator = RedisCachingDecorator(host="localhost", 
                                                  port=6379,
                                                  database=0,
                                                  prefix="msgpack",
                                                  serder="MessagePack")

mspgpack_decorated_model = msgpack_caching_decorator.set_model(model)

The first time we make a prediction, the model will be used and the prediction will get serialized to MessagePack and saved to the cache.

In [40]:
model_input = InsuranceChargesModelInput(
    age=55, 
    sex=SexEnum.female,
    bmi=25.0,
    children=4,
    smoker=False,
    region=RegionEnum.northwest)

prediction = mspgpack_decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15113.29)

The second time we make a prediction, the cache entry will be used instead.

In [41]:
model_input = InsuranceChargesModelInput(
    age=55, 
    sex=SexEnum.female,
    bmi=25.0,
    children=4,
    smoker=False,
    region=RegionEnum.northwest)

prediction = mspgpack_decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15113.29)

The MessagePack format works, now we'll do some testing to see if it improves the serialization/deserialization performance.

As before, we'll make the predictions on the samples to fill in the cache with predictions. We'll be using the 1000 samples generated above to keep the comparison fair.

In [42]:
msgpack_total_seconds = timeit.timeit("[mspgpack_decorated_model.predict(sample) for sample in samples]", number=1, globals=globals())

In [43]:
msgpack_seconds_per_sample = msgpack_total_seconds / 1000.0
msgpack_milliseconds_per_sample = msgpack_seconds_per_sample * 1000.0

md("The model that uses MessagePack took {} seconds to perform 1000 predictions the first time that it saw "
   "the prediction inputs. The model takes about {} milliseconds to make a prediction.".
   format(round(msgpack_total_seconds, 3), 
          round(msgpack_seconds_per_sample, 4), 
          round(msgpack_milliseconds_per_sample, 3)))

The model that uses MessagePack took 35.817 seconds to perform 1000 predictions the first time that it saw the prediction inputs. The model takes about 0.0358 milliseconds to make a prediction.

Most of the time for this step is taken up by the model's prediction algorithm, this is the reason why its a similar amount of time as the JSON serder we used before.

Now we can try the same predictions again. This time, they'll be accessed from the cache and returned more quickly.

In [44]:
msgpack_total_seconds = timeit.timeit("[mspgpack_decorated_model.predict(sample) for sample in samples]", number=1, globals=globals())

In [45]:
msgpack_seconds_per_sample = msgpack_total_seconds / 1000.0
msgpack_milliseconds_per_sample = msgpack_seconds_per_sample * 1000.0

md("The model that uses MessagePack took {} seconds to perform 1000 predictions the second time that it saw "
   "the prediction inputs. The model takes about {} milliseconds to make a prediction.".
   format(round(msgpack_total_seconds, 3), 
          round(msgpack_seconds_per_sample, 3), 
          round(msgpack_milliseconds_per_sample, 3)))

The model that uses MessagePack took 0.927 seconds to perform 1000 predictions the second time that it saw the prediction inputs. The model takes about 0.001 milliseconds to make a prediction.

In [46]:
md("The MessagePack serder performs slightly better than the JSON serder. "
   "The test we did with JSON above took about {} ms for each sample, "
   "the MessagePack serder took {} ms per sample."
   .format(round(decorated_milliseconds_per_sample, 3), 
           round(msgpack_milliseconds_per_sample, 3)))

The MessagePack serder performs slightly better than the JSON serder. The test we did with JSON above took about 0.8935623240000012 ms for each sample, the MessagePack serder took 0.9270776179999984 ms per sample.

We can see how much space the cache entries is taking up by querying each key and summing up the number of bytes:

In [47]:
msgpack_number_of_bytes = 0
msgpack_total_entries = 0
for key in r.scan_iter("msgpack*"):
    msgpack_number_of_bytes += len(r.get(key))
    msgpack_total_entries = msgpack_total_entries + 1
    
msgpack_average_number_of_bytes = msgpack_number_of_bytes / msgpack_total_entries

md("The keys in the original JSON cache took up a total of {} bytes. "
   "The keys in the MessagePack cache take up a total of {} bytes and the "
   "average number of bytes per MessagePack cache entry is {}."
   .format(decorated_number_of_bytes,
           msgpack_number_of_bytes, 
           msgpack_average_number_of_bytes))

The keys in the original JSON cache took up a total of 20630 bytes. The keys in the MessagePack cache take up a total of 18018 bytes and the average number of bytes per MessagePack cache entry is 18.0.

By using MessagePack serialization we were able to save memory in the cache.

In [48]:
!docker exec local-redis redis-cli FLUSHDB

OK


### Snappy Compression

[Snappy](https://github.com/google/snappy) is a compression algorithm built by Google that targets high compression ratios and high compressions speed. We can try to reduce the memory used by the cache by compressing the cache entries with the Snappy algorithm. This approach was inspired by [another blog post](https://doordash.engineering/2019/01/02/speeding-up-redis-with-compression/).

Enabling compression on the decorator is very simple, we'll just set the "use_compression" parameter to "True" when instantiating the caching decorator. In this example we'll use JSON serialization combined with compression.

In [49]:
compressing_caching_decorator = RedisCachingDecorator(host="localhost", 
                                                      port=6379,
                                                      database=0,
                                                      prefix="json+compression",
                                                      serder="JSON",
                                                      use_compression=True)

compressing_decorated_model = compressing_caching_decorator.set_model(model)

The first time we make a prediction, the model will be used and the prediction will get serialized to JSON, then compressed, and saved to the cache.

In [50]:
model_input = InsuranceChargesModelInput(
    age=53, 
    sex=SexEnum.female,
    bmi=25.0,
    children=4,
    smoker=False,
    region=RegionEnum.northwest)

prediction = compressing_decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15207.01)

The second time we make a prediction, the compressed cache entry will be used instead.

In [51]:
model_input = InsuranceChargesModelInput(
    age=53, 
    sex=SexEnum.female,
    bmi=25.0,
    children=4,
    smoker=False,
    region=RegionEnum.northwest)

prediction = compressing_decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=15207.01)

The compression works, now we'll do some testing to see if it improves the serialization/deserialization performance.



In [52]:
compressed_total_seconds = timeit.timeit("[compressing_decorated_model.predict(sample) for sample in samples]", number=1, globals=globals())

In [53]:
compressed_seconds_per_sample = compressed_total_seconds / 1000.0
compressed_milliseconds_per_sample = compressed_seconds_per_sample * 1000.0

md("The decorator that does compressiontook around {} ms to make a prediction and add it to the cache the "
   "first time that it sees the prediction inputs.".
   format(round(compressed_milliseconds_per_sample, 3)))

The decorator that does compressiontook around 35.559 ms to make a prediction and add it to the cache the first time that it sees the prediction inputs.

Most of the time for this step is taken up by the model's prediction algorithm.

Now we can try the same predictions again.

In [54]:
compressed_total_seconds = timeit.timeit("[compressing_decorated_model.predict(sample) for sample in samples]", number=1, globals=globals())

In [55]:
compressed_seconds_per_sample = compressed_total_seconds / 1000.0
compressed_milliseconds_per_sample = compressed_seconds_per_sample * 1000.0

md("The decorator that uses compressed JSON took {} ms the second time that it saw the prediction inputs.".
   format(round(compressed_milliseconds_per_sample, 3)))

The decorator that uses compressed JSON took 0.853 ms the second time that it saw the prediction inputs.

In [56]:
md("The serder that uses JSON serialization and compression performs slightly better than the JSON serder. "
   "The test we did with uncompressed JSON above took about {} ms for each sample.".
   format(decorated_milliseconds_per_sample))

The serder that uses JSON serialization and compression performs slightly better than the JSON serder. The test we did with uncompressed JSON above took about 0.8935623240000012 ms for each sample.

We can see how much space the cache entries is taking up by querying each key and summing up the number of bytes:

In [57]:
compressed_number_of_bytes = 0
compressed_total_entries = 0
for key in r.scan_iter("json+compression*"):
    compressed_number_of_bytes += len(r.get(key))
    compressed_total_entries = compressed_total_entries + 1
    
compressed_average_number_of_bytes = compressed_number_of_bytes / compressed_total_entries

md("The keys in the original JSON cache took up a total of {} bytes. "
   "The keys in the MessagePack cache take up a total of {} bytes. "
   "The keys in the compressed JSON cache take up a total of {} bytes, "
   "and the average number of bytes per cache entry is {}.".
   format(decorated_number_of_bytes, 
          msgpack_number_of_bytes,
          compressed_number_of_bytes,
          round(compressed_average_number_of_bytes, 2)))

The keys in the original JSON cache took up a total of 20630 bytes. The keys in the MessagePack cache take up a total of 18018 bytes. The keys in the compressed JSON cache take up a total of 22633 bytes, and the average number of bytes per cache entry is 22.61.

The keys that were serialized with JSON and compressed were a few bytes bigger than the keys serialized and not compressed. It seems that compression is not saving memory in the cache, this is probably due to the small size of the entries and the fact that information was not repeated inside of the serialized data structures.

In [58]:
!docker exec local-redis redis-cli FLUSHDB

OK


### Reducing the Precision of the Inputs

We can also try to limit the size of the cache by reducing the number of possible inputs to the hashing function. We'll demonstrate this with a few examples.

We'll start by hashing a single sample of the input of the model:

In [59]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.12345,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

model_input_dict = model_input.dict()
frozen_dict = frozenset(model_input_dict.keys()), frozenset(model_input_dict.values())
hash(frozen_dict)

4604654438722747517

Next, we'll hash a very similar model input:

In [60]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.12346,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

model_input_dict = model_input.dict()
frozen_dict = frozenset(model_input_dict.keys()), frozenset(model_input_dict.values())
hash(frozen_dict)

3599773909132942364

The hash value produced is the second time is completely different even though the "bmi" field only changed by 0.00001. This means that these two predictions will have two different cache entries even though they are very lilely to be exactly the same prediction. Just to make sure, we'll make the predictions using these inputs:

In [61]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.12345,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

prediction = model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=8640.78)

Let's try the prediction and hash with a different value for the "bmi" field:

In [62]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.12346,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

prediction = model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=8640.78)

The prediction came out to be the same for both values of "bmi". However, the hashed value of the input was completely different. These predictions would be saved separately from each other in the cache, event though they are exactly the same. We can cut down on the number of entries in the cache by reducing the precision of floating point numbers so that these predictions can be cached one time instead of many. By rounding down the number we'll be reducing the number of cache entries that will be placed in the cache but also affecting the accuracy of the model's predictions. 

The caching decorator supports this feature, we'll just enable it by adding the "reduced_precision_fields" and "number_of_places" options to the configuration:

In [63]:
low_precision_caching_decorator = RedisCachingDecorator(host="localhost", 
                                                        port=6379,
                                                        database=0,
                                                        prefix="power_precision",
                                                        reduced_precision_fields=["bmi"],
                                                        number_of_places=0)

low_precision_decorated_model = low_precision_caching_decorator.set_model(model)

The first time we make a prediction, the model will be used and the prediction input will get the precision of the "bmi" field reduced to one decimal place, then the prediction will get serialized to JSON, and saved to the cache.

In [64]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.12345,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

prediction = low_precision_decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=8640.78)

The second time the prediction is requested, the precision of the "bmi" field is reduced again in the same way, making the prediction input the same as before even though the values are not exactly the same. This will create the same hashed value which will retrieve the prediction from the cache and return it to the user.

In [65]:
model_input = InsuranceChargesModelInput(
    age=42, 
    sex=SexEnum.female,
    bmi=24.4321,
    children=2,
    smoker=False,
    region=RegionEnum.northwest)

prediction = low_precision_decorated_model.predict(model_input)

prediction

InsuranceChargesModelOutput(charges=8640.78)

The predictions are the same even though the inputs were different. 

We can check on the performance by making 10000 predictions with the decorator:

In [66]:
!docker exec local-redis redis-cli FLUSHDB

OK


In [68]:
for _ in range(10000):
    low_precision_decorated_model.predict(generate_record())

We can expect to find around 10,000 predictions in the cache.

We can view the number of entries in the cache with this command:

In [69]:
!docker exec local-redis redis-cli DBSIZE

9430


The fact that we reduced the precision of the "bmi" field cause the decorator to save around 500 fewer predictions in the cache.

Although this is not always an ideal way to save memory, there are some model deployments that can benefit from this approach. All that is needed is to analyze how much precision the model needs from its numerical inputs. It rarely makes sense to store predictions with an unlimited precision in their inputs in the cache.

In [70]:
!docker exec local-redis redis-cli FLUSHDB

OK


Now that we're done with the local redis instance we'll stop the docker container.

In [71]:
!docker kill local-redis
!docker rm local-redis

local-redis
local-redis


## Using a Cache Decorator in Production

Adding a highly available strategy …

What is the cache eviction policy?


## Creating the Model Service

In order to deploy the Redis cache decorator in a real-world scenario, we’ll deploy it along with the model inside of a RESTful model service. In order to do this we’ll use the rest_model_service package that we developed in this blog post.

## Creating a Docker Image

## Deploying the Solution
To show the system in action, we’ll deploy the service and the Redis instance to a Kubernetes cluster. 

## Create a deployment and service for Redis …

## Create a configuration for REST model service …

## Create a docker image for the service, model, and decorator…

## Create a service for REST model service …

## Create a deployment for REST model service … 

## Closing

In this blog post, we showed how to build a decorator class that is able to cache predictions made by a machine learning model. Caching is a simple way to speed up predictions that we know can be reused and are requested often from a model. 

The cache decorator classes can be applied to any model that uses the MLModel base class without having to modify the model class at all. The caching functionality is contained completely in the RedisCacheDecorator class. The same thing is true for the RESTful model service, the cache functionality did not need to be added to the service because we separated the concerns of the service and the cache decorator. We were able to add caching to the deployed model by modifying the configuration. 

By using decorators we’re able to create software components that can be reused in many different contexts. For example, if we chose to deploy the cache decorator in gRPC service we should be able to do so as long as we instantiate and manage the decorator instance correctly.

In the implementation of the decorators that we presented in this blog post each prediction is identified by hashing all of the input fields of the prediction request. This might not be necessary or advisable in different scenarios. In order to support different scenarios, we can provide a way to select the fields that are hashed together to create the identifier for the prediction. In this way, certain inputs can be ignored when selecting a prediction that is saved in the cache. Another way of doing this is to use a specific identifier field that is provided by the client in order to identify the prediction, however this puts the responsibility for determining how caching is done on the client that is using the predictions.

Caching is often done to speed up operations that rely on I/O heavily, in the model deployment that we did in this blog post the model did not rely on I/O to make a prediction so it did not benefit from caching as much. An example of model deployments that rely on I/O is a model that needs to do data enrichment in order to make a prediction. Although most caching is done to speed up operations that rely on I/O, there are some types of models that are CPU intensive, such models would also benefit for 

Combining the caching decorator with other decorators that require I/O like data enrichment...