# SimFin Tutorial 06 - Performance Tips

[Original repository on GitHub](https://github.com/simfin/simfin-tutorials)

This tutorial was originally written by [Hvass Labs](https://github.com/Hvass-Labs).

----

"Are you employed, Sir? You don't go out looking for a job dressed like that on a week-day, do you? Is this a ... what day is this?" &ndash; [The Big Lebowski](https://www.youtube.com/watch?v=xJjCnWm5cvE)

## Introduction

This is a collection of tips on how to improve performance when using the simfin package. It is assumed you are already familiar with the previous tutorials on the basics of simfin.

## Imports

In [1]:
%matplotlib inline
import pandas as pd

# Import the main functionality from the SimFin Python API.
import simfin as sf

# Import names used for easy access to SimFin's data-columns.
from simfin.names import *

In [2]:
# SimFin Python API version.
sf.__version__

'0.8.3'

In [3]:
# Pandas version.
pd.__version__

'1.1.0'

## Config

In [4]:
# SimFin data-directory.
sf.set_data_dir('~/simfin_data/')

In [5]:
# SimFin load API key or use free data.
sf.load_api_key(path='~/simfin_api_key.txt', default_key='free')

## Load Datasets

In these examples, we will use the following datasets:

In [6]:
%%time
# Data for USA.
market = 'us'

# Daily Share-Prices.
df_prices = sf.load_shareprices(variant='daily', market=market)

Dataset "us-shareprices-daily" on disk (0 days old).
- Loading from disk ... Done!
CPU times: user 11.6 s, sys: 1.04 s, total: 12.6 s
Wall time: 12.7 s


In [7]:
df_prices.head()

| Ticker | Date | SimFinId | Open | Low | High | Close | Adj. Close | Dividend | Volume | Shares Outstanding |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 2007-01-03 | 45846 | 34.99 | 34.05 | 35.48 | 34.3 | 22.66 | NaN | 2574600 | NaN |
| A | 2007-01-04 | 45846 | 34.3 | 33.46 | 34.6 | 34.41 | 22.73 | NaN | 2073700 | NaN |
| A | 2007-01-05 | 45846 | 34.3 | 34.0 | 34.4 | 34.09 | 22.52 | NaN | 2676600 | NaN |
| A | 2007-01-08 | 45846 | 33.98 | 33.68 | 34.08 | 33.97 | 22.44 | NaN | 1557200 | NaN |
| A | 2007-01-09 | 45846 | 34.08 | 33.63 | 34.32 | 34.01 | 22.47 | NaN | 1386200 | NaN |


## Adding Columns To DataFrame

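For some reason, it is extremely slow to add new columns to an empty Pandas DataFrame that was created without an index. In the timings below it takes about 21 seconds, compared to about 62 milliseconds when the index is set when the DataFrame is created.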

In [8]:
# Names of the new columns.
FOO = 'Foo'
BAR = 'Bar'
QUX = 'Qux'

Create an empty DataFrame without setting its index. This is extremely slow:

In [9]:
%%time
df2 = pd.DataFrame()
df2[FOO] = df_prices[CLOSE] / df_prices[ADJ_CLOSE]
df2[BAR] = df_prices[CLOSE] * df_prices[ADJ_CLOSE]
df2[QUX] = df_prices[CLOSE] * df_prices[VOLUME]

CPU times: user 20.8 s, sys: 582 ms, total: 21.3 s
Wall time: 21.4 s


If we know what the index should be and we set it when creating the new DataFrame, it is much faster to add new columns to the DataFrame. This is the preferred method because it is fast and elegant:

In [10]:
%%timeit
df3 = pd.DataFrame(index=df_prices.index)
df3[FOO] = df_prices[CLOSE] / df_prices[ADJ_CLOSE]
df3[BAR] = df_prices[CLOSE] * df_prices[ADJ_CLOSE]
df3[QUX] = df_prices[CLOSE] * df_prices[VOLUME]

62.1 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can use `df.values` to do the computation directly using numpy arrays, but it is not faster:

In [11]:
%%timeit
df4 = pd.DataFrame(index=df_prices.index)
df4[FOO] = df_prices[CLOSE].values / df_prices[ADJ_CLOSE].values
df4[BAR] = df_prices[CLOSE].values * df_prices[ADJ_CLOSE].values
df4[QUX] = df_prices[CLOSE].values * df_prices[VOLUME].values

65 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can also construct the new DataFrame from several Pandas Series, which is usually about the same speed as the `df3` solution above, but not as elegant:

In [12]:
%%timeit
df_foo = df_prices[CLOSE] / df_prices[ADJ_CLOSE]
df_bar = df_prices[CLOSE] * df_prices[ADJ_CLOSE]
df_qux = df_prices[CLOSE] * df_prices[VOLUME]
data = {FOO: df_foo, BAR: df_bar, QUX: df_qux}
df5 = pd.DataFrame(data=data)

68.6 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
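
The three construction methods can also be tried without downloading any data. The following is only a minimal sketch on a small synthetic DataFrame: the name `df_demo`, the ticker symbols and the three helper functions are made up for illustration and are not part of simfin. It checks that the methods give identical results. Any timings you measure on it will depend on your Pandas version and the size of the data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_prices (hypothetical data, not from SimFin):
# 3 tickers x 4 dates with a (Ticker, Date) MultiIndex.
tickers = ['AAA', 'BBB', 'CCC']
dates = pd.date_range('2020-01-01', periods=4, freq='D')
index = pd.MultiIndex.from_product([tickers, dates],
                                   names=['Ticker', 'Date'])

rng = np.random.default_rng(seed=0)
df_demo = pd.DataFrame({
    'Close': rng.uniform(10.0, 100.0, size=len(index)),
    'Adj. Close': rng.uniform(10.0, 100.0, size=len(index)),
    'Volume': rng.integers(1000, 100000, size=len(index)),
}, index=index)


def build_without_index(df):
    # Empty DataFrame without an index (the slow method on large data).
    out = pd.DataFrame()
    out['Foo'] = df['Close'] / df['Adj. Close']
    out['Bar'] = df['Close'] * df['Adj. Close']
    out['Qux'] = df['Close'] * df['Volume']
    return out


def build_with_index(df):
    # Empty DataFrame that already has the final index.
    out = pd.DataFrame(index=df.index)
    out['Foo'] = df['Close'] / df['Adj. Close']
    out['Bar'] = df['Close'] * df['Adj. Close']
    out['Qux'] = df['Close'] * df['Volume']
    return out


def build_from_dict(df):
    # Build all Series first, then construct the DataFrame in one go.
    data = {'Foo': df['Close'] / df['Adj. Close'],
            'Bar': df['Close'] * df['Adj. Close'],
            'Qux': df['Close'] * df['Volume']}
    return pd.DataFrame(data=data)


df_a = build_without_index(df_demo)
df_b = build_with_index(df_demo)
df_c = build_from_dict(df_demo)

# All three methods produce the same DataFrame.
print(df_a.equals(df_b), df_b.equals(df_c))
```

To compare speed, increase the number of tickers and dates and wrap each helper call in `%timeit`.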


## Disk Cache

Some functions take a long time to process data, such as the signal-functions in the simfin package. Every time you rerun a Notebook that uses such functions, all the slow functions are run again, even though the results are exactly the same if the data has not changed.

A simple solution is to cache the results of slow functions, by writing the results to a cache-file on disk. The next time the function is called, it automatically checks if a recent cache-file exists on disk and then loads it, otherwise the slow function will be computed and the results saved in the cache-file for future use.

This is implemented by using a so-called decorator or wrapper-function `@sf.cache` on the slow function. This is used in simfin's signal-functions, and you can also use this wrapper on your own functions (see below).

A few things should be noted:

1. The wrapper adds three more arguments to the original function: `cache_name`, which lets you distinguish cache-files from each other; `cache_refresh`, which determines when the slow function should be called and the results saved to the cache-file on disk; and `cache_format`, which is the format of the cache-file. See below for details.

2. Because of these new arguments, you **MUST** use keyword arguments when calling the wrapped function, otherwise the arguments will get passed to the cache-wrapper instead of the original function. This typically raises a confusing exception about a missing argument.

### Cache Refresh Conditions

There are several ways of specifying the conditions for when the slow function must be called again and the results saved to the cache-file on disk. These conditions are specified by passing different `cache_refresh` arguments to the wrapped function:

- If `cache_refresh=None` then the cache-file is never used and the wrapped function is always called as normal.

- If `cache_refresh=True` then the wrapped function is called and the results are saved to the cache-file on disk.

- If `cache_refresh=False` then the cache-file is always used, unless it does not exist, in which case the wrapped function is called and the cache-file saved.

- If `cache_refresh` is an integer which is lower than the cache-file's age in days, then the wrapped function is called and the results are saved to the cache-file. The cache is also refreshed if the integer is 0 (zero).

- If `cache_refresh` is a string or list of strings, these are considered to be file-paths e.g. for dataset-files. If the cache-file is older than any one of those files, then the wrapped function is called and the results are saved to the cache-file.
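
The conditions in this list can be summarised as a single decision function. The following is only a sketch of the rules as described here, not simfin's actual implementation: the function name `must_call_function` and its `now` parameter are hypothetical, and it assumes that a missing cache-file always forces a call and that the age is measured in fractional days from the file's modification time.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def must_call_function(cache_path, cache_refresh, now=None):
    """Return True if the wrapped function must be called.

    Hypothetical helper that mirrors the rules listed above.
    """
    # None disables caching, so the function is always called
    # (and nothing is saved to disk in that case).
    if cache_refresh is None:
        return True

    # Assumption: a missing cache-file always forces a call.
    if not os.path.exists(cache_path):
        return True

    cache_mtime = os.path.getmtime(cache_path)

    # Check bool before int, because bool is a subclass of int.
    if isinstance(cache_refresh, bool):
        return cache_refresh

    # Integer: refresh if it is 0 or lower than the age in days.
    if isinstance(cache_refresh, int):
        now = time.time() if now is None else now
        age_days = (now - cache_mtime) / SECONDS_PER_DAY
        return cache_refresh == 0 or cache_refresh < age_days

    # String or list of strings: paths of files the cache depends on.
    # Refresh if the cache-file is older than any one of them.
    if isinstance(cache_refresh, str):
        paths = [cache_refresh]
    else:
        paths = list(cache_refresh)
    return any(os.path.getmtime(path) > cache_mtime for path in paths)
```

For example, `must_call_function('price_signals-us-all.pickle', 1)` would return `True` under these assumptions if that file is missing or more than one day old.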

### Cache Format

By default `cache_format='pickle'` so the cache-files are saved as an uncompressed pickle-file, which is very fast to save and load, but also takes a lot of disk-space. The default `'pickle'` file-format should support all Pandas DataFrames and Series and properly save all meta-data such as which columns are used as indices, etc.

You may compress the pickle-files using `cache_format='pickle.gz'` which can compress DataFrames with much repetitive data (e.g. forward-filled daily signals) by a factor of 100 or more, but this requires a little more computation time.

Other file-formats such as `'parquet'` and `'feather'` are also supported, but these have some restrictions on the DataFrames they can save. The Parquet file-format only supports Pandas DataFrames (not Series). The column-names must also start with a letter. There may be other requirements imposed by the Parquet file-format, and you will get an exception if the DataFrame violates the requirements. The Feather file-format is even more basic and cannot save DataFrames with MultiIndex. So it is generally best to use the default pickle-format or the compressed pickle-format.
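
To get a feeling for the trade-off between raw and compressed pickle-files, here is a small sketch that uses only Python's standard `pickle` and `gzip` modules on a plain list with much repetitive data, instead of a DataFrame. It is not how simfin writes its cache-files, and the compression ratio you get on real DataFrames will differ.

```python
import gzip
import pickle

# Repetitive data, loosely resembling a forward-filled signal:
# each value is repeated 250 times in a row.
values = [float(i // 250) for i in range(100_000)]

# Uncompressed pickle: fast to create but large.
raw_bytes = pickle.dumps(values)

# Gzip-compressed pickle: smaller, but needs extra computation.
gz_bytes = gzip.compress(raw_bytes)

# The compressed data can be restored without loss.
restored = pickle.loads(gzip.decompress(gz_bytes))

print('Raw size in bytes:       ', len(raw_bytes))
print('Compressed size in bytes:', len(gz_bytes))
print('Restored correctly:      ', restored == values)
```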

### Caching a SimFin Function

Here is an example of a function from the simfin package for calculating share-price signals. This took about 15 seconds to compute in the run below:

In [13]:
%%time
df_price_signals = sf.price_signals(df_prices=df_prices)

CPU times: user 15.1 s, sys: 230 ms, total: 15.3 s
Wall time: 15.3 s


The function `sf.price_signals` is actually wrapped with `@sf.cache` so the caching-feature is automatically enabled if we pass an argument `cache_refresh` different from `None`. For example, this is how we would instruct the cache-file to get updated once per day:

In [14]:
# Name for the cache e.g. 'us-all'
cache_name = market + '-all'

# Refresh the cache once a day.
cache_refresh_days = 1

In [15]:
%%time
df_price_signals2 = \
    sf.price_signals(df_prices=df_prices,
                     cache_name=cache_name,
                     cache_refresh=cache_refresh_days)

Cache-file 'price_signals-us-all.pickle' not on disk.
- Running function price_signals() ... Done!
- Saving cache-file to disk ... Done!
CPU times: user 15.1 s, sys: 522 ms, total: 15.6 s
Wall time: 15.8 s


The first time the function is called, it will compute the signals and save the resulting DataFrame to a cache-file on disk. When the function is called again, the cached DataFrame will be loaded instead. When the cache-file is too old, the function is called again and a new cache-file is saved to disk.

Note that the cache-file is named `price_signals-us-all.pickle` which is constructed from the function's name `price_signals`, the cache-name we have supplied `us-all`, and the file-extension `.pickle`. This keeps the cache-files neatly organized on disk, while still allowing us to designate different cache-names for different calls of the same function, for example if we want to process different markets or stocks.

If you want to pass the same cache-arguments to several functions, then it is more convenient to create a dict with the arguments:

In [16]:
cache_args = {'cache_name': cache_name,
              'cache_refresh' : cache_refresh_days}

In [17]:
%%time
df_price_signals3 = \
    sf.price_signals(df_prices=df_prices, **cache_args)

Cache-file 'price_signals-us-all.pickle' on disk (0 days old).
- Loading from disk ... Done!
CPU times: user 96.7 ms, sys: 148 ms, total: 245 ms
Wall time: 244 ms


We can check that the results are all identical:

In [18]:
df_price_signals.equals(df_price_signals2)

True

In [19]:
df_price_signals.equals(df_price_signals3)

True

### Caching Your Own Functions

You can also use the caching-feature on your own functions simply by adding the decorator `@sf.cache` to your function declaration. Here is an example of a function that calculates the sum of each row:

In [20]:
@sf.cache
def my_function(df):
    return df.sum(axis=1)

Remember that you **MUST** call `my_function()` with named arguments! Otherwise you will get a strange exception. The reason is that the decorator has actually created a new function which takes the arguments: `cache_name`, `cache_refresh`, `cache_format` and `**kwargs`, as we can see from this slightly cryptic specification of the function:

In [21]:
import inspect
inspect.getfullargspec(my_function)

FullArgSpec(args=['cache_name', 'cache_refresh', 'cache_format'], varargs=None, varkw='kwargs', defaults=(None, None, 'pickle'), kwonlyargs=[], kwonlydefaults=None, annotations={})

So if you call `my_function()` with unnamed arguments, they are bound to `cache_name`, `cache_refresh` and `cache_format` in that order, and only keyword arguments are passed on to the original function. In the call below, `df_prices` is therefore taken as `cache_name`, and the original function is called without its `df` argument, which raises a confusing exception:

In [22]:
try:
    df_result = my_function(df_prices)
except Exception as e:
    print(e)

my_function() missing 1 required positional argument: 'df'
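
The same behaviour can be reproduced with a tiny stand-alone decorator that has the same kind of signature. This is only a sketch to show why the positional argument gets lost: `cache_signature_sketch` and `row_total` are hypothetical names, and the caching itself is left out.

```python
import functools


def cache_signature_sketch(func):
    @functools.wraps(func)
    def wrapper(cache_name=None, cache_refresh=None,
                cache_format='pickle', **kwargs):
        # Caching logic omitted. Only keyword arguments that are
        # not cache-arguments are passed on to the original function.
        return func(**kwargs)
    return wrapper


@cache_signature_sketch
def row_total(values):
    return sum(values)


# Positional call: the list is bound to cache_name, so the
# original function is called without its 'values' argument.
try:
    row_total([1, 2, 3])
    message = None
except TypeError as error:
    message = str(error)

print(message)

# Keyword call: the list reaches the original function.
total = row_total(values=[1, 2, 3])
print(total)
```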


So you **MUST** use keyword-arguments when calling a function wrapped with `@sf.cache`. You can still omit the `cache_refresh` argument, in which case caching is disabled and the wrapped function is called as normal, but again you **MUST** use named arguments such as `df=df_prices` instead of just `df_prices`:

In [23]:
%%time
df_result = my_function(df=df_prices)

CPU times: user 420 ms, sys: 88 ms, total: 508 ms
Wall time: 507 ms


In [24]:
df_result.head()

Ticker  Date      
A       2007-01-03    2620607.48
        2007-01-04    2119705.50
        2007-01-05    2722605.31
        2007-01-08    1603204.15
        2007-01-09    1432204.51
dtype: float64

If we pass the cache-arguments then the caching is automatically enabled:

In [25]:
# Cache arguments.
cache_name = market + '-all'
cache_refresh = 1

In [26]:
%%time
df_result2 = my_function(df=df_prices,
                         cache_name=cache_name,
                         cache_refresh=cache_refresh)

Cache-file 'my_function-us-all.pickle' not on disk.
- Running function my_function() ... Done!
- Saving cache-file to disk ... Done!
CPU times: user 484 ms, sys: 160 ms, total: 644 ms
Wall time: 643 ms


We may also create a dict with the cache-arguments, which is convenient if we want to use the same arguments in several functions:

In [27]:
cache_args = {'cache_name': cache_name,
              'cache_refresh' : cache_refresh}

In [28]:
%%time
df_result3 = my_function(df=df_prices, **cache_args)

Cache-file 'my_function-us-all.pickle' on disk (0 days old).
- Loading from disk ... Done!
CPU times: user 61.1 ms, sys: 48.3 ms, total: 109 ms
Wall time: 108 ms


Even for such a fairly quick function, caching with the raw pickle-format cut the wall time from about 507 ms to about 108 ms. But normally you would only use the caching-feature on functions that are very slow to compute.

We can check that the results are all identical:

In [29]:
df_result.equals(df_result2)

True

In [30]:
df_result.equals(df_result3)

True

## License (MIT)

This is published under the
[MIT License](https://github.com/simfin/simfin-tutorials/blob/master/LICENSE.txt)
which allows very broad use for both academic and commercial purposes.

You are very welcome to modify and use this source-code in your own project. Please keep a link to the [original repository](https://github.com/simfin/simfin-tutorials).
