data / persistence - make data insert more efficient #18

Closed
westonplatter opened this issue Jul 28, 2020 · 6 comments

Comments

@westonplatter
Owner

Within data.load_and_cache, we have this line:

df.to_sql("quote", con=engine, if_exists="append", index=False, chunksize=1)

While this works, it's inefficient: with chunksize=1, each row is written to the database in its own round trip. Let's come up with a way to insert all the df rows into the DB at once while still respecting the unique index.
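
One possible approach (a sketch, not tested against this repo; it assumes a PostgreSQL backend and pandas >= 0.24, which added the method argument to to_sql) is to batch the whole frame into a single multi-row INSERT ... ON CONFLICT DO NOTHING, so rows that hit the unique index are skipped instead of raising:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert


def insert_do_nothing(table, conn, keys, data_iter):
    # pandas hands us the target table, an open connection, the column names,
    # and an iterator of row tuples; build one multi-row INSERT from them
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    return conn.execute(stmt).rowcount


engine = create_engine("postgresql://localhost/ta_scanner")  # hypothetical DSN
df.to_sql("quote", con=engine, if_exists="append", index=False,
          method=insert_do_nothing)

If the frames get big, a larger chunksize (eg a few thousand) can still be passed to bound the size of each statement.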

@westonplatter westonplatter created this issue from a note in ta_scanner - v0.1 (To do) Jul 28, 2020
@briancappello

Hey @westonplatter,

I've been researching this challenge of storing time series data a bit myself. Not sure if you've considered any of these approaches or not, but I figured it couldn't hurt to chime in with some of the stuff I've come across.

This is the setup I'm currently using:

See also https://github.com/ranaroussi/pystore and pikers/piker#90

@westonplatter westonplatter moved this from To do to In progress in ta_scanner - v0.1 Aug 1, 2020
@westonplatter westonplatter moved this from In progress to To do in ta_scanner - v0.1 Aug 1, 2020
@westonplatter
Owner Author

@briancappello thanks for sharing the persistence strategy. I like the idea of pystore. I did something similar with caching data in parquet files. I think for now, I'll keep things as simple as possible in the pg db (performance is probably 4 versions ahead of me).
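
(Something like this, purely to illustrate the idea; the path scheme and fetch callback here are made up, not the actual code:)

import os
import pandas as pd


def cached_quotes(symbol: str, timeframe: str, fetch) -> pd.DataFrame:
    # fetch is whatever callable pulls fresh data from the provider (hypothetical)
    path = f"cache/{symbol}_{timeframe}.parquet"  # made-up path scheme
    if os.path.exists(path):
        return pd.read_parquet(path)
    df = fetch()
    os.makedirs("cache", exist_ok=True)
    df.to_parquet(path)
    return df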

How much effort was it for you to get the Alpaca Marketstore working for you?

@briancappello

Cool, makes sense.

re: Marketstore, I think they have a Docker setup that should work out of the box? I run it locally, which on *nix is relatively painless (at least assuming you have your Go development environment configured).

Clone it, and add the following to the Makefile:

install: plugins
	GOFLAGS=$(GOFLAGS) go install -ldflags "-s -X $(UTIL_PATH).Tag=$(DOCKER_TAG) -X $(UTIL_PATH).BuildStamp=$(shell date -u +%Y-%m-%d-%H-%M-%S) -X $(UTIL_PATH).GitHash=$(shell git rev-parse HEAD)" .

and then run make install.

Next, run marketstore init and edit the created mkts.yml config file to your needs. (Mainly, create a directory where you want it to store data and set root_directory to point to it.)

Lastly, marketstore start runs the server.

It comes with some plugins for automatically fetching and storing data from various providers (both historical and realtime streaming quotes), which are all optional. I use the polygon websocket plugin for realtime-ish minutely bars, but otherwise haven't experimented with any others.

On the Python side, this snippet should get you started:

import pandas as pd
import pymarketstore as pymkts  # pip install pymarketstore
from pymarketstore.jsonrpc_client import JsonRpcClient

class Marketstore:
    def __init__(self, endpoint: str = 'http://localhost:5993/rpc'):
        self.client = JsonRpcClient(endpoint)

    def get_df(self, symbol: str, timeframe: str) -> pd.DataFrame:
        p = pymkts.Param(symbol.upper(), timeframe, 'OHLCV')
        return self.client.query(p).first().df()

    def write_df(self, df: pd.DataFrame, symbol: str, timeframe: str):
        return self.client.rpc.call('DataService.Write', requests=[
            self._make_write_request(df, f'{symbol.upper()}/{timeframe}/OHLCV'),
        ])

    def _make_write_request(self, df: pd.DataFrame, tbk: str):
        # the DatetimeIndex is in nanoseconds; marketstore wants epoch seconds
        epoch = df.index.to_numpy(dtype='i8') / 10 ** 9
        dataset = dict(length=len(df),
                       startindex={tbk: 0},
                       lengths={tbk: len(df)},
                       # numpy dtype strings minus the byte-order prefix, eg 'f8', 'i8'
                       types=['i8'] + [dtype.str.replace('<', '') for dtype in df.dtypes],
                       names=['Epoch'] + df.columns.to_list(),
                       # one raw byte buffer per column, epoch first
                       data=[bytes(memoryview(epoch.astype('i8')))] + [
                           bytes(memoryview(df[name].to_numpy()))
                           for name in df.columns],
                       )
        return dict(dataset=dataset, is_variable_length=False)

This supports pandas DataFrames with an America/New_York-localized DatetimeIndex named Epoch, and data columns named Open, High, Low, Close, Volume. You can write either 64bit or 32bit float/int values, but once you write them, the onus is on the client to submit future write requests with the same data types for each column.

Supported timeframe strings can be found here, and the attrgroup parameter (hardcoded to OHLCV in my snippet) is, I believe, an arbitrary string (so you can persist, eg, computed technical indicator values if you want).
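
For example, a quick round trip with the class above might look like this (the symbol and numbers are made up; it assumes a marketstore server on the default local port):

import pandas as pd

idx = pd.date_range('2020-07-28 09:30', periods=3, freq='1min',
                    tz='America/New_York', name='Epoch')
df = pd.DataFrame({
    'Open':   [321.0, 321.5, 322.0],
    'High':   [321.8, 322.1, 322.4],
    'Low':    [320.9, 321.2, 321.7],
    'Close':  [321.5, 322.0, 322.2],
    'Volume': [10500, 9800, 11200],
}, index=idx)

store = Marketstore()
store.write_df(df, 'aapl', '1Min')   # symbol gets upper-cased internally
print(store.get_df('AAPL', '1Min'))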

@westonplatter
Owner Author

@briancappello thanks for leaving notes. Sounds like there's a learning curve to getting things set up. What are the reasons to switch to marketstore over reading/writing to a local postgres instance? Curious to learn when/where it enables a higher level of features/functionality that's hard to otherwise obtain.

For now, my focus is to do indicator cohort analysis (eg, https://docs.google.com/spreadsheets/d/1CqzEjzP0m2XuylhQdHwKZ5qOK_YP7hcdbitEdsSzDWI/edit?usp=sharing).

@briancappello

Yea, definitely not as simple as just using a single table in postgres :)

Mind you, I haven't benchmarked any of this, and I am very much thinking in terms of doing more upfront architectural work on the backtesting side of things, so that backtested algos are more or less directly deployable to live trading once desirable strategies are discovered. But the benefits as I understand them:

  • marketstore is purpose-built for storing timeseries data, and its "tables" (files on disk) are per-symbol, per timeframe, and per dataset (eg OHLCV or pre-calculated TA values)
    • I keep data for about 6000 tickers, cached in 1Min and 1D timeframes, each of which can be queried/updated just as quickly as if I was only storing data for 1 ticker
  • it has a plugin for doing timeframe aggregation and persisting those bars to disk as the higher-frequency data becomes available
  • it has a websocket server for streaming quotes and/or bars as they become available

So from the perspective of indicator analysis,

  1. Streaming data server (either live or simulated by starting from a set date in the past) pushes bars
  2. The strategies-under-test subscribe to said server, and generate signals (send orders) to a paper/live broker
  3. The paper broker does the bookkeeping to store the results of strategies, which can then be analyzed

Obviously this is much more involved than just doing analysis on a dataframe! My hope is that it would also allow much more sophisticated strategies to be tested. But it may honestly be trying to solve a rather different kind of problem than what you're currently focused on :)
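
Roughly, the shape of that loop (all of the names here are hypothetical, just to illustrate the flow; this is not marketstore's or any broker's API):

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Bar:
    symbol: str
    close: float


@dataclass
class PaperBroker:
    # bookkeeping only: record orders so strategy results can be analyzed later
    orders: List[dict] = field(default_factory=list)

    def submit(self, symbol: str, qty: int) -> None:
        self.orders.append({'symbol': symbol, 'qty': qty})


class CrossoverStrategy:
    def __init__(self, broker: PaperBroker):
        self.broker = broker

    def on_bar(self, bar: Bar) -> None:
        # signal logic goes here; orders get sent to the paper (or live) broker
        ...


def replay(bars: List[Bar], handlers: List[Callable[[Bar], None]]) -> None:
    # simulated streaming: push historical bars through the same interface
    # that a live feed would call as new bars arrive
    for bar in bars:
        for handle in handlers:
            handle(bar)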

@briancappello

I guess it would probably help to give an example heh. So sticking with moving average crossovers, is it possible to add more "factors" to the analysis? For instance, also considering the slope of the slow MA. So maybe when the 20 crosses the 50, if the 50 is still trending downwards, it may not be a good cross to take, but if the 50 is flat and/or trending upwards, then maybe it is a good cross to buy on. Or perhaps you want to consider volume, or the value of another (non-MA) indicator, as a "filter" for whether or not to take crossover signals.
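
Something like this rough pandas sketch of a filtered crossover (the 20/50 windows come from the example above; the 5-bar slope lookback is an arbitrary choice for illustration):

import pandas as pd


def filtered_crossovers(close: pd.Series, fast: int = 20, slow: int = 50,
                        slope_lookback: int = 5) -> pd.Series:
    # True on bars where the fast MA crosses above the slow MA
    # and the slow MA is flat or rising over the lookback window
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    crossed_up = (fast_ma > slow_ma) & (fast_ma.shift(1) <= slow_ma.shift(1))
    slow_slope = slow_ma - slow_ma.shift(slope_lookback)
    return crossed_up & (slow_slope >= 0)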

@westonplatter westonplatter moved this from To do to In progress in ta_scanner - v0.1 Aug 25, 2020
ta_scanner - v0.1 automation moved this from In progress to Done Aug 25, 2020