Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for embedded database #128

Closed
drkarl opened this issue Mar 5, 2021 · 8 comments
Closed

Support for embedded database #128

drkarl opened this issue Mar 5, 2021 · 8 comments

Comments

@drkarl
Copy link
Contributor

drkarl commented Mar 5, 2021

I was considering to brute-force parameters with backtesting data on a AWS EC2 instance, one of the instances with 96 cores, or maybe one of the instances with 128 cores (although the latter has almost 2TB of RAM so they're like 3x more expensive than the ones with 96 cores).

Roughly the ones with 96 cores cost like $4-$5 an hour, and the ones with 128 cores like $12/$13 an hour.
This price is very expensive if it's going to run 24/7 for a month, but it's a perfectly acceptable price if you're going to run your simulations for a limited time, @jbax mentioned that "I can run the process described above in 20 minutes on a threadripper 3990x (we're talking about billions of candles processes over and over) ." And TR 3990x has 64 cores, so with 96 or 128 cores would be even better.

I was considering how could I maximize performance and minimize cost. I thought about using docker for a database and bring it up but it would take too much time to import a dump and then get the new data etc. Then I thought about using Amazon RDS but if I only need to run the process for a short time then I'd have to pay for the database to hold my data without using it.

I considered implementing support for reading binary data (as it would be faster to read than text files), and I saw ActiveJ Serializer as a very fast option, but it required adding annotations to the model and it was overhead.

Then I thought that maybe an embedded database could be perfect. I considered Nitrite, a NoSql embedded database based on H2, but probably SQL based was less work since most of the database code would already be the same. Nitrite also requires some annotations on the models.

The best fit is H2, is written in Java and is very fast. It would be great for unit tests to use in-memory, and for heavy simulations on AWS would be easy to just copy the H2 database file with all the backtesting data and the jar of the application. It would be great to run it in-memory in one of the AWS instances with lots of RAM, but then we would have to get all the backtesting data. It's more convenient just uploading a database file.

I tried in a branch and is very easy to setup, I'll create a PR soon for that.

@jbax
Copy link
Member

jbax commented Mar 5, 2021 via email

@drkarl
Copy link
Contributor Author

drkarl commented Mar 5, 2021

Thanks Jbax, I was counting on buying your optimizer jar and using it as well, I don't think both things are mutually exclusive. I'll send you an email about that.

Having support for an embedded database isn't as much for performance (althought we'll get to that later) as convenience. Use cases:

  • Unit tests and examples would benefit from using an embedded database, preferably an in-memory database. That way they don't depend on an external installation of a database and newcomers to the project (like me) can get up and running faster. Yeah, you can always use the FileCandleRepository, but arguably using an embedded database is more versatile.
  • For development you don't really need to use a MySQL database (installed or using docker like with mykolakaraman PR), I think it would be convenient for dev to use an embedded database (you can always use MySQL/MariaDB if you like, it's about choice).
  • Wen deploying that to a server you can install MySQL (either on the OS or using Docker). But when using a high powered server in AWS that you may run for 1 hour or less and then destroy it would be cumbersome to have to install MySQL every time and then import the database dump, so using an embedded database which is literally copying a file which contains the database can't get easier and more convenient, specially if you're going to destroy the server one hour later.

About performance, that is difficult to say, I'd go with it depends. In some cases an embedded database can be much faster than a server/client database like MySQL, specially if the embedded database is using in-memory mode and MySQL uses TCP. I'd go as far as to say that even when using file mode it can be faster as well. I haven't done benchmarks myself but what I have found online seem to point in that direction...

http://www.h2database.com/html/performance.html

I know that link is from h2 itself, I know I've seen benchmarks in other places but cant' find them now.

In summary, H2 looks like it would be faster than MySQL, specially H2 in-memory, but I'm not sure if it could have issues with very large datasets (hundreds of millions of rows of candlestick historical data for multiple tickers). But since it's way more convenient to deploy that to a short-lived server, I think it's worth giving it a try!

@jbax
Copy link
Member

jbax commented Mar 5, 2021 via email

@drkarl
Copy link
Contributor Author

drkarl commented Mar 5, 2021

I was trying that. For H2, you can try these values on application.yml

#
# H2 Database
#
database.jdbc.driver=org.h2.Driver
database.jdbc.url=jdbc:h2:~/data/univocity.db;MODE=MySQL;DATABASE_TO_LOWER=TRUE
database.user=sa
database.password=

For tests I was looking at something like this

	private static final String H2_DRIVER = "org.h2.Driver";
	private static final String JDBC_URL = "jdbc:h2:mem:univocity;DB_CLOSE_DELAY=-1;MODE=MySQL;DATABASE_TO_LOWER=TRUE;INIT=runscript from 'classpath:/db/h2/candle.sql'\\;runscript from 'classpath:/db/h2/gap.sql'";
	private static final String DB_USER = "sa";
	private static final String DB_PASSWORD = "dummy";

        Strategy.Simulator simulator = Strategy.simulator();
        simulator.configure().database()
		.jdbcDriver(H2_DRIVER)
		.jdbcUrl(JDBC_URL)
                .user(DB_USER)
		.password(DB_PASSWORD);

On pom.xml

        <dependency>
            <groupId>com.h2database</groupId>
            <artifactId>h2</artifactId>
            <version>1.4.200</version>
        </dependency>

And finally for the creation scripts, on the resources folder I added a db.h2 with h2 compatible versions of the scripts

CREATE TABLE IF NOT EXISTS candle
(
	symbol     VARCHAR(32)     NOT NULL,
	open_time  BIGINT          NOT NULL,
	close_time BIGINT          NOT NULL,
	open       DECIMAL(20, 10) NOT NULL,
	high       DECIMAL(20, 10) NOT NULL,
	low        DECIMAL(20, 10) NOT NULL,
	close      DECIMAL(20, 10) NOT NULL,
	volume     DECIMAL(20, 10) NOT NULL,
    ts         TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
;

ALTER TABLE candle ADD CONSTRAINT candle_symbol_time_uq UNIQUE (symbol, open_time, close_time);

CREATE INDEX candle_symbol_idx ON candle (symbol) USING HASH;
CREATE INDEX candle_symbol_open_idx ON candle (symbol, open_time) USING BTREE;
CREATE INDEX candle_symbol_close_idx ON candle (symbol, close_time) USING BTREE;
CREATE INDEX candle_open_close_idx ON candle (symbol, open_time, close_time) USING BTREE;

-- Once table has data, you can run this to keep all rows sorted in the database.
ALTER TABLE candle ORDER BY symbol, open_time, close_time;

and

-- Used to store known gaps in history that can't be retrieved from the exchange.
-- Delete rows from table if you want to try filling the gaps again.

CREATE TABLE IF NOT EXISTS gap
(
	symbol     VARCHAR(32)     NOT NULL,
	open_time  BIGINT          NOT NULL,
	close_time BIGINT          NOT NULL,
	
	CONSTRAINT gap_symbol_time_uq UNIQUE (symbol, open_time, close_time)
);

@drkarl
Copy link
Contributor Author

drkarl commented Mar 5, 2021

The creation scripts might need some work as some of the syntax may not be supported by H2

@jbax
Copy link
Member

jbax commented Mar 5, 2021 via email

@hlevel
Copy link
Contributor

hlevel commented Mar 6, 2021

@jbax In fact, I have already changed to H2 database for normal operation, and the memory usage is very good. I will use MySQL database for local test and H2 for online run, MySQL and H2 have their own use scenarios, so you don't need to favor any one

@jbax
Copy link
Member

jbax commented Mar 10, 2021

done in the latest commit

@jbax jbax closed this as completed Mar 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants