data / persistence - make data insert more efficient #18

Closed
westonplatter opened this issue Jul 28, 2020 · 6 comments

Comments

@westonplatter
Owner

Within data.load_and_cache, we have this line:

df.to_sql("quote", con=engine, if_exists="append", index=False, chunksize=1)

While this works, it's inefficient: with chunksize=1, each row is written to the database in its own round trip. Let's come up with a way to insert all the df rows into the DB at once while still respecting the unique index.
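
One possible approach (a sketch, not tested against this repo; it assumes a PostgreSQL backend and pandas >= 0.24, which added the method argument to to_sql) is to batch the whole frame into a single multi-row INSERT ... ON CONFLICT DO NOTHING, so rows that hit the unique index are skipped instead of raising:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert


def insert_do_nothing(table, conn, keys, data_iter):
    # pandas hands us the target table, an open connection, the column names,
    # and an iterator of row tuples; build one multi-row INSERT from them
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    return conn.execute(stmt).rowcount


engine = create_engine("postgresql://localhost/ta_scanner")  # hypothetical DSN
df.to_sql("quote", con=engine, if_exists="append", index=False,
          method=insert_do_nothing)

If the frames get big, a larger chunksize (eg a few thousand) can still be passed to bound the size of each statement.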

@westonplatter westonplatter created this issue from a note in ta_scanner - v0.1 (To do) Jul 28, 2020
@briancappello

Hey @westonplatter,

I've been researching this challenge of storing time series data a bit myself. Not sure if you've considered any of these approaches or not, but I figured it couldn't hurt to chime in with some of the stuff I've come across.

This is the setup I'm currently using:

See also https://github.com/ranaroussi/pystore and pikers/piker#90

@westonplatter westonplatter moved this from To do to In progress in ta_scanner - v0.1 Aug 1, 2020
@westonplatter westonplatter moved this from In progress to To do in ta_scanner - v0.1 Aug 1, 2020
@westonplatter
Owner Author

@briancappello thanks for sharing the persistence strategy. I like the idea of pystore. I did something similar with caching data in parquet files. I think for now, I'll keep things as simple as possible in the pg db (performance is probably 4 versions ahead of me).
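
(Something like this, purely to illustrate the idea; the path scheme and fetch callback here are made up, not the actual code:)

import os
import pandas as pd


def cached_quotes(symbol: str, timeframe: str, fetch) -> pd.DataFrame:
    # fetch is whatever callable pulls fresh data from the provider (hypothetical)
    path = f"cache/{symbol}_{timeframe}.parquet"  # made-up path scheme
    if os.path.exists(path):
        return pd.read_parquet(path)
    df = fetch()
    os.makedirs("cache", exist_ok=True)
    df.to_parquet(path)
    return df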

How much effort was it for you to get the Alpaca Marketstore working for you?

@briancappello

Cool, makes sense.

re: Marketstore, I think they have a Docker setup that should work out of the box? I run it locally, which on *nix is relatively painless (at least assuming you have your Go development environment configured).

Clone it, and add the following to the Makefile:

install: plugins
	GOFLAGS=$(GOFLAGS) go install -ldflags "-s -X $(UTIL_PATH).Tag=$(DOCKER_TAG) -X $(UTIL_PATH).BuildStamp=$(shell date -u +%Y-%m-%d-%H-%M-%S) -X $(UTIL_PATH).GitHash=$(shell git rev-parse HEAD)" .

and then run make install.

Next, run marketstore init and edit the created mkts.yml config file to your needs. (Mainly, create a directory where you want it to store data and set root_directory to point to it.)

Lastly, marketstore start runs the server.

It comes with some plugins for automatically fetching and storing data from various providers (both historical and realtime streaming quotes), which are all optional. I use the polygon websocket plugin for realtime-ish minutely bars, but otherwise haven't experimented with any others.

On the Python side, this snippet should get you started:

import pandas as pd
import pymarketstore as pymkts  # pip install pymarketstore
from pymarketstore.jsonrpc_client import JsonRpcClient

class Marketstore:
    def __init__(self, endpoint: str = 'http://localhost:5993/rpc'):
        self.client = JsonRpcClient(endpoint)

    def get_df(self, symbol: str, timeframe: str) -> pd.DataFrame:
        p = pymkts.Param(symbol.upper(), timeframe, 'OHLCV')
        return self.client.query(p).first().df()

    def write_df(self, df: pd.DataFrame, symbol: str, timeframe: str):
        return self.client.rpc.call('DataService.Write', requests=[
            self._make_write_request(df, f'{symbol.upper()}/{timeframe}/OHLCV'),
        ])

    def _make_write_request(self, df: pd.DataFrame, tbk: str):
        # the DatetimeIndex is in nanoseconds; marketstore wants epoch seconds
        epoch = df.index.to_numpy(dtype='i8') / 10 ** 9
        dataset = dict(length=len(df),
                       startindex={tbk: 0},
                       lengths={tbk: len(df)},
                       # numpy dtype strings minus the byte-order prefix, eg 'f8', 'i8'
                       types=['i8'] + [dtype.str.replace('<', '') for dtype in df.dtypes],
                       names=['Epoch'] + df.columns.to_list(),
                       # one raw byte buffer per column, epoch first
                       data=[bytes(memoryview(epoch.astype('i8')))] + [
                           bytes(memoryview(df[name].to_numpy()))
                           for name in df.columns],
                       )
        return dict(dataset=dataset, is_variable_length=False)

This supports pandas DataFrames with an America/New_York-localized DatetimeIndex named Epoch, and data columns named Open, High, Low, Close, Volume. You can write either 64bit or 32bit float/int values, but once you write them, the onus is on the client to submit future write requests with the same data types for each column.

Supported timeframe strings can be found here, and the attrgroup parameter (hardcoded to OHLCV in my snippet) is, I believe, an arbitrary string (so you can persist, eg, computed technical indicator values if you want).
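
For example, a quick round trip with the class above might look like this (the symbol and numbers are made up; it assumes a marketstore server on the default local port):

import pandas as pd

idx = pd.date_range('2020-07-28 09:30', periods=3, freq='1min',
                    tz='America/New_York', name='Epoch')
df = pd.DataFrame({
    'Open':   [321.0, 321.5, 322.0],
    'High':   [321.8, 322.1, 322.4],
    'Low':    [320.9, 321.2, 321.7],
    'Close':  [321.5, 322.0, 322.2],
    'Volume': [10500, 9800, 11200],
}, index=idx)

store = Marketstore()
store.write_df(df, 'aapl', '1Min')   # symbol gets upper-cased internally
print(store.get_df('AAPL', '1Min'))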

@westonplatter
Owner Author

@briancappello thanks for leaving notes. Sounds like there's a learning curve to getting things set up. What are the reasons to switch to marketstore over reading/writing to a local postgres instance? Curious to learn when/where it enables a higher level of features/functionality that's hard to otherwise obtain.

For now, my focus is to do indicator cohort analysis (eg, https://docs.google.com/spreadsheets/d/1CqzEjzP0m2XuylhQdHwKZ5qOK_YP7hcdbitEdsSzDWI/edit?usp=sharing).

@briancappello

Yea, definitely not as simple as just using a single table in postgres :)

Mind you, I haven't benchmarked any of this, and I am very much thinking in terms of doing more upfront architectural work on the backtesting side of things, so that backtested algos are more or less directly deployable to live trading once desirable strategies are discovered. But the benefits as I understand them:

  • marketstore is purpose-built for storing timeseries data, and its "tables" (files on disk) are per-symbol, per timeframe, and per dataset (eg OHLCV or pre-calculated TA values)
    • I keep data for about 6000 tickers, cached in 1Min and 1D timeframes, each of which can be queried/updated just as quickly as if I was only storing data for 1 ticker
  • it has a plugin for doing timeframe aggregation and persisting those bars to disk as the higher-frequency data becomes available
  • it has a websocket server for streaming quotes and/or bars as they become available

So from the perspective of indicator analysis,

  1. Streaming data server (either live or simulated by starting from a set date in the past) pushes bars
  2. The strategies-under-test subscribe to said server, and generate signals (send orders) to a paper/live broker
  3. The paper broker does the bookkeeping to store the results of strategies, which can then be analyzed

Obviously this is much more involved than just doing analysis on a dataframe! My hope is that it would also allow much more sophisticated strategies to be tested. But it may honestly be trying to solve a rather different kind of problem than what you're currently focused on :)
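
Roughly, the shape of that loop (all of the names here are hypothetical, just to illustrate the flow; this is not marketstore's or any broker's API):

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Bar:
    symbol: str
    close: float


@dataclass
class PaperBroker:
    # bookkeeping only: record orders so strategy results can be analyzed later
    orders: List[dict] = field(default_factory=list)

    def submit(self, symbol: str, qty: int) -> None:
        self.orders.append({'symbol': symbol, 'qty': qty})


class CrossoverStrategy:
    def __init__(self, broker: PaperBroker):
        self.broker = broker

    def on_bar(self, bar: Bar) -> None:
        # signal logic goes here; orders get sent to the paper (or live) broker
        ...


def replay(bars: List[Bar], handlers: List[Callable[[Bar], None]]) -> None:
    # simulated streaming: push historical bars through the same interface
    # that a live feed would call as new bars arrive
    for bar in bars:
        for handle in handlers:
            handle(bar)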

@briancappello

I guess it would probably help to give an example heh. So sticking with moving average crossovers, is it possible to add more "factors" to the analysis? For instance, also considering the slope of the slow MA. So maybe when the 20 crosses the 50, if the 50 is still trending downwards, it may not be a good cross to take, but if the 50 is flat and/or trending upwards, then maybe it is a good cross to buy on. Or perhaps you want to consider volume, or the value of another (non-MA) indicator, as a "filter" for whether or not to take crossover signals.
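
Something like this rough pandas sketch of a filtered crossover (the 20/50 windows come from the example above; the 5-bar slope lookback is an arbitrary choice for illustration):

import pandas as pd


def filtered_crossovers(close: pd.Series, fast: int = 20, slow: int = 50,
                        slope_lookback: int = 5) -> pd.Series:
    # True on bars where the fast MA crosses above the slow MA
    # and the slow MA is flat or rising over the lookback window
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    crossed_up = (fast_ma > slow_ma) & (fast_ma.shift(1) <= slow_ma.shift(1))
    slow_slope = slow_ma - slow_ma.shift(slope_lookback)
    return crossed_up & (slow_slope >= 0)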

@westonplatter westonplatter moved this from To do to In progress in ta_scanner - v0.1 Aug 25, 2020
ta_scanner - v0.1 automation moved this from In progress to Done Aug 25, 2020