# Using ArcticDB to read/write market data with findatapy

Apr 2024 - Saeed Amen - https://www.cuemacro.com - saeed@cuemacro.com

## What is findatapy

I've been working on findatapy (and its associated libraries), which I open sourced nearly 10 years ago. The basic idea of findatapy is that it can be used to download market/economic data from many sources using the same interface. There are also ways to store this data, using Parquet files and now ArcticDB, which I'll talk about shortly. I use findatapy regularly in my teaching, and we also use it a lot for downloading data at Turnleaf Analytics (https://www.turnleafanalytics.com), where we forecast macroeconomic data, alongside many proprietary libraries. As a result, it's been maintained quite regularly!

## What is ArcticDB?

ArcticDB is a serverless database engine developed by the quant hedge fund Man-AHL, which makes it easy to store Pandas DataFrames. It replaces Arctic (which used a MongoDB backend). What are the main reasons to use ArcticDB?

* Fast
    * It is super fast and can process millions of rows a second on disk
* Flexible
    * You don't need to specify a schema to start, and it also supports versioning (ie. it is bitemporal), so you can see different vintages of the same data.
    * This is particularly useful when it comes to storing point-in-time data which is frequently revised (eg. macroeconomic data)
    * It supports many different disk storage backends including `lmdb` (local disk backend), `mem` (in memory mostly for testing) and various buckets (including `s3`, `azure` etc.)
* Familiar
    * If you already know Python and Pandas then it's fairly straightforward

Note that it isn't a transactional database, so it isn't a replacement for databases like PostgreSQL or MySQL.

The full documentation for ArcticDB can be found at https://docs.arcticdb.io/latest

We've added support for ArcticDB in findatapy to make it easy to download market/economic data with findatapy and then store/retrieve in ArcticDB.

## Installing ArcticDB and findatapy/associated libraries

If you want to try out ArcticDB, one easy way to do this is to create a new conda environment with Anaconda using the commands below in your Anaconda Prompt (note, you don't have to use Anaconda though!).

`conda create -n py310arcticdb python=3.10`

`conda activate py310arcticdb`

`conda install anaconda`

`pip install arcticdb finmarketpy chartpy findatapy`
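
Once everything is installed, it can help to see what talking to ArcticDB directly looks like, before adding findatapy on top. The sketch below is only an illustration under a few assumptions: the library name `demo`, the symbol `toy.USDJPY` and the prices are all made up, and it uses the in-memory `mem://` backend so nothing is written to disk.

```python
def arcticdb_roundtrip():
    """Write, append and read back a toy DataFrame using ArcticDB directly."""
    # Imported inside the function, so it can be defined even before
    # the packages from the commands above have been installed
    import pandas as pd
    import arcticdb as adb

    # "mem://" is the in-memory backend, handy for quick experiments
    ac = adb.Arctic("mem://")
    lib = ac.get_library("demo", create_if_missing=True)  # hypothetical name

    # Made-up prices, purely for illustration
    first = pd.DataFrame(
        {"bid": [103.247, 103.246]},
        index=pd.to_datetime(["2021-01-04 00:00", "2021-01-04 01:00"]),
    )
    later = pd.DataFrame(
        {"bid": [103.135]},
        index=pd.to_datetime(["2021-01-06 00:00"]),
    )

    lib.write("toy.USDJPY", first)   # creates the first version (version 0)
    lib.append("toy.USDJPY", later)  # creates a new version with the extra row

    latest = lib.read("toy.USDJPY").data
    first_vintage = lib.read("toy.USDJPY", as_of=0).data

    return first_vintage, latest


# Usage, once arcticdb and pandas are installed:
# first_vintage, latest = arcticdb_roundtrip()
```

The rest of this notebook does the same kind of thing, but lets findatapy handle the symbol naming and the `write`/`append`/`update` calls for us.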

## Using ArcticDB with findatapy to store tick market data from Dukascopy

In this notebook I'm going to show how to use ArcticDB to easily store market data using findatapy.

Let's use `findatapy` to download some tick data for USDJPY spot from `dukascopy`, which is a free data source. Findatapy provides a uniform wrapper to download from many different data sources. We can predefine ticker mappings from our own nicknames for tickers to the vendor tickers. It comes with `dukascopy` ticker mappings predefined out of the box, but these are all customisable. Note that we leave the `data_engine` property as `None`. If it isn't set, findatapy will download from our data source directly.

In [1]:
# We can disable the log so the output is neater
import logging, sys
logging.disable(sys.maxsize)

In [2]:
import datetime

from findatapy.market import Market, MarketDataRequest

# In this case we are saving predefined tick tickers to disk, and then reading back
from findatapy.market.ioengine import IOEngine

md_request_download = MarketDataRequest(
    start_date="04 Jan 2021",
    finish_date="05 Jan 2021",
    category="fx",
    data_source='dukascopy',
    freq="tick",
    tickers=["USDJPY"],
    fields=["bid", "ask", "bidv", "askv"],
    data_engine=None
)

market = Market()

df_tick = market.fetch_market(md_request=md_request_download)

Let's print the output...

In [3]:
print(df_tick)

                                  USDJPY.bid  USDJPY.ask  USDJPY.bidv  \
Date                                                                    
2021-01-04 00:00:00.247000+00:00  103.247002  103.250000          1.0   
2021-01-04 00:00:00.349000+00:00  103.247002  103.249001          1.0   
2021-01-04 00:00:00.715000+00:00  103.246002  103.250999          1.0   
2021-01-04 00:00:00.816000+00:00  103.247002  103.249001          1.0   
2021-01-04 00:00:00.917000+00:00  103.247002  103.250000          1.0   
...                                      ...         ...          ...   
2021-01-04 23:59:51.574000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:56.131000+00:00  103.135002  103.136002          1.0   
2021-01-04 23:59:57.569000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:57.771000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:59.314000+00:00  103.135002  103.138000          1.0   


Next we set our ArcticDB connection string, which you'll need to change below. The use of `lmdb://` means we are using the local disk to manage our ArcticDB storage. We will be using a folder called `tempdatabase` in our working directory, although in practice you can set a full path. To enable `findatapy` to identify it as an `arcticdb` instance, we need to write `arcticdb:` at the start. The rest of the connection string after that is precisely what you'd input into ArcticDB if you were calling it directly. The official documentation at https://docs.arcticdb.io/latest/ explains how to construct connection strings for the different ArcticDB backends. We've also illustrated how to put together a connection string for s3 buckets.

In [4]:
local_storage = True

# To switch between local storage or s3, it's a matter of changing the
# connection string (also you need to make sure your AWS S3 authentication is set etc.)
if local_storage:
    arcticdb_conn_str = "arcticdb:lmdb://tempdatabase?map_size=2GB"
else:
    # https://docs.arcticdb.io/latest/#s3-configuration gives more details
    # Note we need to prefix arcticdb: to the front so findatapy
    # knows what backend engine to use
    region = "eu-west-2"
    bucket_name = "burger-king-whopper" # Not sure if this name is taken :-) (S3 bucket names can't contain underscores)
    path_prefix = "test"

    arcticdb_conn_str = f"arcticdb:s3s://s3.{region}.amazonaws.com:{bucket_name}?path_prefix={path_prefix}&aws_auth=true"
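
To make the role of the `arcticdb:` prefix concrete, here is a minimal sketch of how such a prefix could be separated from the part that is handed to ArcticDB. The helper `split_engine_prefix` is hypothetical and is not findatapy's own code; it only illustrates the convention described above.

```python
def split_engine_prefix(conn_str, prefix="arcticdb:"):
    """Return (is_arcticdb, uri) for a findatapy-style connection string.

    Hypothetical helper: the uri is whatever follows the prefix, ie. the
    string you would give to ArcticDB if you were calling it directly.
    """
    if conn_str.startswith(prefix):
        return True, conn_str[len(prefix):]

    return False, conn_str


is_arcticdb, uri = split_engine_prefix(
    "arcticdb:lmdb://tempdatabase?map_size=2GB")

print(is_arcticdb, uri)  # True lmdb://tempdatabase?map_size=2GB
```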

Below we set some of the parameters for writing/appending to ArcticDB.

In [5]:
# Set various parameters to govern how we write to ArcticDB, to use
# versioning
arcticdb_dict = {
    # If this is set to true removes previous versions (so we only record
    # the final version). Not pruning versions will take more disk space.
    "prune_previous_versions": False,

    # Do we want to append to existing records or write
    # If you attempt to append with an overlapping chunk, you'll
    # get an assertion failure, "update" allows you to change existing data
    "write_style": "write", # "write" / "append" / "update"

    # If set to true will remove any existing library, before writing (careful with this!!)
    "force_create_library": False,

    # This enables us to take advantage of ArcticDB's filtering of columns/dates
    # otherwise we would download the full dataset, and then filter
    # in Pandas
    "allow_on_disk_filter": True,

    # You can also specify your own custom queries for ArcticDB
    "query_builder": None
}

# Set the ArcticDB parameters for our MarketDataRequest
md_request_download.arcticdb_dict = arcticdb_dict
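
As a rough mental model of the `write_style` and `prune_previous_versions` settings, the sketch below shows how a wrapper might dispatch them to the `write`, `append` and `update` methods of an ArcticDB library. Both `write_with_style` and the `RecordingLibrary` stand-in are hypothetical names made up for this illustration, and this is not how findatapy is necessarily implemented internally.

```python
class RecordingLibrary:
    """Hypothetical stand-in for an ArcticDB library: it only records calls."""

    def __init__(self):
        self.calls = []

    def _record(self, method, symbol, prune_previous_versions):
        self.calls.append((method, symbol, prune_previous_versions))

    def write(self, symbol, data, prune_previous_versions=False):
        self._record("write", symbol, prune_previous_versions)

    def append(self, symbol, data, prune_previous_versions=False):
        self._record("append", symbol, prune_previous_versions)

    def update(self, symbol, data, prune_previous_versions=False):
        self._record("update", symbol, prune_previous_versions)


def write_with_style(lib, symbol, data, arcticdb_dict):
    """Call lib.write / lib.append / lib.update depending on write_style."""
    write_style = arcticdb_dict.get("write_style", "write")

    if write_style not in ("write", "append", "update"):
        raise ValueError(f"Unknown write_style: {write_style!r}")

    prune = arcticdb_dict.get("prune_previous_versions", False)

    # Look up the method with the same name as the write style
    getattr(lib, write_style)(symbol, data, prune_previous_versions=prune)

    return write_style


lib = RecordingLibrary()
write_with_style(lib, "backtest.fx.dukascopy.tick.NYC.USDJPY", [],
                 {"write_style": "append"})

print(lib.calls)
```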

We can now write our tick data DataFrame into ArcticDB. If we pass the `MarketDataRequest` we used for fetching the data, findatapy constructs the symbol/table name from it, in the format of `environment.category.data_source.freq.cut.tickers` for high frequency data or in the format of `environment.category.data_source.freq` for daily data. This will enable us to more easily fetch the data using the same `MarketDataRequest` and `Market` interface. The whole point of using findatapy is that it can store ticker mappings for us, and retrieve from many different market data sources using the same interface.

In this case, the symbol/table we will use for storing in ArcticDB is listed below.

* `backtest.fx.dukascopy.tick.NYC.USDJPY`
* ie. the environment of our data is `backtest`
* the `category` is `fx`
* the `data_source` is `dukascopy`
* the `freq` is `tick`
* the `cut` (or time of close) is `NYC`
* the `tickers` is `USDJPY`
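
The symbol is just these parts joined with dots. A minimal sketch of that naming convention is below; `build_symbol` is a hypothetical helper written for illustration and is not findatapy's own function. It reproduces the symbol we use later when reading back directly from ArcticDB.

```python
def build_symbol(environment, category, data_source, freq, cut, ticker):
    """Join the parts of a high frequency symbol name with dots."""
    return ".".join([environment, category, data_source, freq, cut, ticker])


symbol = build_symbol("backtest", "fx", "dukascopy", "tick", "NYC", "USDJPY")

print(symbol)  # backtest.fx.dukascopy.tick.NYC.USDJPY
```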

## `write` to ArcticDB

The Jupyter notebook [market_data_example.ipynb](../market_data_example.ipynb) explains this ticker format and the concept of a `MarketDataRequest` in more detail. We dump the DataFrame to disk using the `IOEngine` class. Note that, given an ArcticDB connection string, `write_time_series_cache_to_disk` and `read_time_series_cache_from_disk` write to/read from ArcticDB. We need to make sure that when we're writing to disk, we have a data licence to do so (this will clearly vary between data vendors), and in particular, that only those who read from the disk are authorised to use that data. Note that we do not need to fill the `fname` parameter, because it is automatically constructed from the `MarketDataRequest`. We are doing a `write` to ArcticDB in this instance.

In [6]:
IOEngine().write_time_series_cache_to_disk(data_frame=df_tick, engine=arcticdb_conn_str, md_request=md_request_download)

# Snap the time, so we can fetch this vintage later
earlier_download_time = datetime.datetime.utcnow()

We could fetch the data directly from ArcticDB with `IOEngine`, using the symbol/table name, ie. `backtest.fx.dukascopy.tick.NYC.USDJPY`.

In [7]:
symbol = "backtest.fx.dukascopy.tick.NYC.USDJPY" 
df_read_tick = IOEngine().read_time_series_cache_from_disk(symbol, engine=arcticdb_conn_str)

print(df_read_tick)

                                  USDJPY.bid  USDJPY.ask  USDJPY.bidv  \
Date                                                                    
2021-01-04 00:00:00.247000+00:00  103.247002  103.250000          1.0   
2021-01-04 00:00:00.349000+00:00  103.247002  103.249001          1.0   
2021-01-04 00:00:00.715000+00:00  103.246002  103.250999          1.0   
2021-01-04 00:00:00.816000+00:00  103.247002  103.249001          1.0   
2021-01-04 00:00:00.917000+00:00  103.247002  103.250000          1.0   
...                                      ...         ...          ...   
2021-01-04 23:59:51.574000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:56.131000+00:00  103.135002  103.136002          1.0   
2021-01-04 23:59:57.569000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:57.771000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:59.314000+00:00  103.135002  103.138000          1.0   


## `append` to ArcticDB

Instead of reading by symbol name, we could also read the data back using the `MarketDataRequest` object we populated earlier: to make it fetch from ArcticDB instead of Dukascopy, we just need to set the `data_engine` property to the ArcticDB connection string. Before doing that, let's download a second chunk of data and `append` it to what is already stored. Note that this chunk is strictly after the initial data we wrote, so it won't overlap.

In [8]:
# Now download a second set of data and write it for "append"
# Note: we'll get an assertion error, if we try to append before the end
# of the existing time series on disk
md_request_download.start_date = "06 Jan 2021"
md_request_download.finish_date = "07 Jan 2021"
md_request_download.arcticdb_dict["write_style"] = "append"

df_tick_later = market.fetch_market(md_request=md_request_download)
IOEngine().write_time_series_cache_to_disk(data_frame=df_tick_later,
                                           engine=arcticdb_conn_str,
                                           md_request=md_request_download)

# Snap the time after the append, so we can later fetch this vintage
# explicitly using the "as_of" property
later_download_time = datetime.datetime.utcnow()
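
The constraint on `append` can be pictured with a small check on the timestamps. This is only a sketch of the rule described above (a new chunk must not start before the end of the stored series); `check_append` is a hypothetical helper and the real validation happens inside ArcticDB.

```python
from datetime import datetime


def check_append(existing_index, new_index):
    """Raise if the new chunk starts before the stored series ends.

    Both arguments are lists of timestamps sorted in ascending order.
    """
    if existing_index and new_index and new_index[0] < existing_index[-1]:
        raise ValueError(
            f"Cannot append: new data starts at {new_index[0]}, "
            f"before the stored series ends at {existing_index[-1]}")

    return True


stored = [datetime(2021, 1, 4, 0, 0), datetime(2021, 1, 4, 23, 59, 59)]
second_chunk = [datetime(2021, 1, 6, 0, 0), datetime(2021, 1, 6, 23, 59)]

print(check_append(stored, second_chunk))  # True
```

Trying `check_append(stored, [datetime(2021, 1, 4, 12, 0)])` would raise a `ValueError` instead, because that chunk overlaps what is already stored.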

We will now fetch back the data, using the `Market` interface. Note, that this data straddles our initial download and the second one which we appended.

In [9]:
md_request_local_cache = MarketDataRequest(
    md_request=md_request_download
)

md_request_local_cache.start_date = "04 Jan 2021 10:00"
md_request_local_cache.finish_date = "06 Jan 2021 14:00"
md_request_local_cache.data_engine = arcticdb_conn_str
md_request_local_cache.cache_algo = "cache_algo_return"

df_read_tick = Market().fetch_market(md_request=md_request_local_cache)

In [10]:
# We should see the 1st write and 2nd append combined, ie. latest write
print("No as_of specified, so we'll get the latest write!")
print(df_read_tick)

No as_of specified, so we'll get the latest write!
                                  USDJPY.bid  USDJPY.ask  USDJPY.bidv  \
Date                                                                    
2021-01-04 10:00:00.092000+00:00  102.742996  102.745003         1.00   
2021-01-04 10:00:00.198000+00:00  102.741997  102.745003         1.50   
2021-01-04 10:00:00.403000+00:00  102.742996  102.746002         1.00   
2021-01-04 10:00:00.966000+00:00  102.744003  102.747002         1.00   
2021-01-04 10:00:01.220000+00:00  102.746002  102.748001         1.00   
...                                      ...         ...          ...   
2021-01-06 13:59:59.100000+00:00  103.176003  103.178001         1.31   
2021-01-06 13:59:59.252000+00:00  103.174004  103.178001         4.87   
2021-01-06 13:59:59.404000+00:00  103.174004  103.177002         1.87   
2021-01-06 13:59:59.860000+00:00  103.175003  103.178001         1.50   
2021-01-06 13:59:59.961000+00:00  103.174004  103.177002         4.62   


## Using `as_of` parameter to fetch different vintages

We can also try reading from our earlier vintage, by setting the `as_of` parameter.

In [11]:
# Let's instead take the first vintage
md_request_local_cache.as_of = earlier_download_time
df_read_tick = Market().fetch_market(md_request=md_request_local_cache)

# We should only see the earlier vintage
print("See the earlier vintage write!")
print(df_read_tick)

See the earlier vintage write!
                                  USDJPY.bid  USDJPY.ask  USDJPY.bidv  \
Date                                                                    
2021-01-04 10:00:00.092000+00:00  102.742996  102.745003          1.0   
2021-01-04 10:00:00.198000+00:00  102.741997  102.745003          1.5   
2021-01-04 10:00:00.403000+00:00  102.742996  102.746002          1.0   
2021-01-04 10:00:00.966000+00:00  102.744003  102.747002          1.0   
2021-01-04 10:00:01.220000+00:00  102.746002  102.748001          1.0   
...                                      ...         ...          ...   
2021-01-04 23:59:51.574000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:56.131000+00:00  103.135002  103.136002          1.0   
2021-01-04 23:59:57.569000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:57.771000+00:00  103.135002  103.138000          1.0   
2021-01-04 23:59:59.314000+00:00  103.135002  103.138000          1.0   

                   

We can now try to get the later vintage.

In [12]:
# We can also specify the later write time
md_request_local_cache.as_of = later_download_time
df_read_tick = Market().fetch_market(md_request=md_request_local_cache)

# We should only see the latest vintage
print("See the latest vintage write!")
print(df_read_tick)

See the latest vintage write!
                                  USDJPY.bid  USDJPY.ask  USDJPY.bidv  \
Date                                                                    
2021-01-04 10:00:00.092000+00:00  102.742996  102.745003         1.00   
2021-01-04 10:00:00.198000+00:00  102.741997  102.745003         1.50   
2021-01-04 10:00:00.403000+00:00  102.742996  102.746002         1.00   
2021-01-04 10:00:00.966000+00:00  102.744003  102.747002         1.00   
2021-01-04 10:00:01.220000+00:00  102.746002  102.748001         1.00   
...                                      ...         ...          ...   
2021-01-06 13:59:59.100000+00:00  103.176003  103.178001         1.31   
2021-01-06 13:59:59.252000+00:00  103.174004  103.178001         4.87   
2021-01-06 13:59:59.404000+00:00  103.174004  103.177002         1.87   
2021-01-06 13:59:59.860000+00:00  103.175003  103.178001         1.50   
2021-01-06 13:59:59.961000+00:00  103.174004  103.177002         4.62   
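
One way to think about `as_of` with a timestamp is "give me the latest version that had been written by that time". The sketch below illustrates that idea with a sorted list of write times. The function `version_as_of` and the write times are hypothetical, and this is a toy model rather than ArcticDB's actual implementation.

```python
import bisect
from datetime import datetime


def version_as_of(write_times, as_of=None):
    """Index of the latest version written at or before as_of.

    write_times is a list of write timestamps in ascending order, one per
    version. If as_of is None, the latest version is returned.
    """
    if not write_times:
        raise LookupError("Nothing has been written yet")

    if as_of is None:
        return len(write_times) - 1

    # Number of versions written at or before as_of
    position = bisect.bisect_right(write_times, as_of)

    if position == 0:
        raise LookupError("No version existed at the requested time")

    return position - 1


# Made-up write times for the initial write and the later append
write_times = [datetime(2024, 4, 1, 9, 0), datetime(2024, 4, 1, 9, 5)]

earlier = version_as_of(write_times, datetime(2024, 4, 1, 9, 1))  # 0
later = version_as_of(write_times, datetime(2024, 4, 1, 9, 6))    # 1
latest = version_as_of(write_times)                               # 1
```

This is why snapping the time straight after each write is enough to identify each vintage later on.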

                    

## `update` to ArcticDB

Download data for `6 Jan 2021` and then write it to disk as an `update` (after multiplying it by 10), modifying an existing chunk of data already on disk.

In [13]:
# Finally let's try doing an update of an existing continuous chunk
# Note: if part of our update ends up being before/after the existing
# dataset, it will fail
md_request_download.start_date = "06 Jan 2021"
md_request_download.finish_date = "07 Jan 2021"
md_request_download.arcticdb_dict["write_style"] = "update"

df_tick_later = market.fetch_market(md_request=md_request_download)

# Modify the data, so we can see the obvious difference, when reading back later!
df_tick_later = df_tick_later * 10.0
IOEngine().write_time_series_cache_to_disk(data_frame=df_tick_later,
                                           engine=arcticdb_conn_str,
                                           md_request=md_request_download)
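
Conceptually, an `update` replaces everything stored between the first and last timestamp of the new chunk with the new chunk itself, and leaves the rest alone. The toy function below sketches that behaviour on plain lists of `(timestamp, value)` rows. `apply_update` and the prices are made up for illustration, and it does not model the error cases mentioned in the comments above.

```python
from datetime import datetime


def apply_update(stored, chunk):
    """Replace the rows of stored that fall inside chunk's time range.

    Both arguments are lists of (timestamp, value) sorted by timestamp.
    """
    if not chunk:
        return list(stored)

    start, end = chunk[0][0], chunk[-1][0]

    before = [row for row in stored if row[0] < start]
    after = [row for row in stored if row[0] > end]

    # Everything between start and end now comes from the new chunk only
    return before + list(chunk) + after


stored = [
    (datetime(2021, 1, 4, 10, 0), 102.74),
    (datetime(2021, 1, 6, 10, 0), 103.17),
    (datetime(2021, 1, 6, 12, 0), 103.18),
    (datetime(2021, 1, 6, 14, 0), 103.19),
]

chunk = [
    (datetime(2021, 1, 6, 10, 0), 1031.7),
    (datetime(2021, 1, 6, 14, 0), 1031.9),
]

updated = apply_update(stored, chunk)
```

Here the 4 Jan row is untouched, while the 12:00 row on 6 Jan disappears, because it sits inside the updated range but is not part of the new chunk.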

Let's read back a range that starts in an earlier chunk and includes the updated section. Given we don't specify `as_of`, it will just give us the latest version.

In [14]:
md_request_local_cache.start_date = "04 Jan 2021 10:00"
md_request_local_cache.finish_date = "08 Jan 2021 14:00"
md_request_local_cache.as_of = None
df_read_updated_tick = Market().fetch_market(md_request=md_request_local_cache)

print("Updated tick (should be 10x larger!)")
print(df_read_updated_tick)

Updated tick (should be 10x larger!)
                                   USDJPY.bid   USDJPY.ask  USDJPY.bidv  \
Date                                                                      
2021-01-04 10:00:00.092000+00:00   102.742996   102.745003          1.0   
2021-01-04 10:00:00.198000+00:00   102.741997   102.745003          1.5   
2021-01-04 10:00:00.403000+00:00   102.742996   102.746002          1.0   
2021-01-04 10:00:00.966000+00:00   102.744003   102.747002          1.0   
2021-01-04 10:00:01.220000+00:00   102.746002   102.748001          1.0   
...                                       ...          ...          ...   
2021-01-06 23:59:40.194000+00:00  1030.250000  1030.280029         12.5   
2021-01-06 23:59:40.295000+00:00  1030.250000  1030.270020         10.0   
2021-01-06 23:59:40.598000+00:00  1030.250000  1030.270020         10.0   
2021-01-06 23:59:40.700000+00:00  1030.219971  1030.270020         10.0   
2021-01-06 23:59:44.474000+00:00  1030.229980  1030.260010      

## Conclusion

We have seen how we can use findatapy combined with ArcticDB to download market data pretty easily, and also how to write/read this same data to/from ArcticDB. We have seen the different types of writing, notably `write`, `append` and `update`.