Support for embedded database #128
Comments
The embedded database will be even slower. Support for truly fast simulations comes in a separate jar that I sell.
---
Thanks jbax, I was counting on buying your optimizer jar and using it as well; I don't think the two are mutually exclusive. I'll send you an email about that.

Having support for an embedded database isn't so much about performance (although we'll get to that later) as convenience. Use cases:

- Unit tests and examples would benefit from using an embedded database, preferably an in-memory one. That way they don't depend on an external database installation, and newcomers to the project (like me) can get up and running faster. You can always use the FileCandleRepository, but arguably an embedded database is more versatile.
- For development you don't really need a MySQL database (installed locally or via Docker, as in mykolakaraman's PR). It would be convenient to use an embedded database for dev work (you can still use MySQL/MariaDB if you prefer; it's about choice).
- When deploying to a server you can install MySQL (either on the OS or via Docker). But on a high-powered AWS server that you may run for an hour or less and then destroy, having to install MySQL every time and then import a database dump is cumbersome. With an embedded database, deployment is literally copying a single file containing the database; it can't get easier or more convenient, especially if you're going to destroy the server an hour later.

About performance, that is difficult to say; I'd go with "it depends". In some cases an embedded database can be much faster than a client/server database like MySQL, especially if the embedded database runs in-memory while MySQL goes over TCP. I'd go as far as to say that even in file mode it can be faster. I haven't done benchmarks myself, but what I have found online seems to point in that direction: http://www.h2database.com/html/performance.html

I know that link is from H2 itself; I've seen benchmarks in other places but can't find them now.

In summary, H2 looks like it would be faster than MySQL, especially H2 in-memory, but I'm not sure whether it could have issues with very large datasets (hundreds of millions of rows of candlestick historical data for multiple tickers). Since it's far more convenient to deploy to a short-lived server, I think it's worth giving it a try!
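To make the unit-test use case concrete, here's a minimal sketch using plain JDBC and H2's in-memory mode (this isn't the project's API; the table and values are made up just to show the round trip, and it only needs the h2 jar on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InMemoryH2Sketch {
    public static void main(String[] args) throws Exception {
        // DB_CLOSE_DELAY=-1 keeps the in-memory database alive until the JVM exits,
        // so every connection in a test run sees the same data.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:h2:mem:test;DB_CLOSE_DELAY=-1;MODE=MySQL", "sa", "");
             Statement st = conn.createStatement()) {
            // Hypothetical table, just to show there's no external installation involved.
            st.execute("CREATE TABLE candle (symbol VARCHAR(32), open_time BIGINT)");
            st.execute("INSERT INTO candle VALUES ('BTCUSDT', 1614902400000)");
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM candle")) {
                rs.next();
                System.out.println("rows: " + rs.getLong(1)); // prints "rows: 1"
            }
        }
    }
}
```

The whole database lives in the test process and vanishes when the JVM exits, which is exactly what you want for newcomers running the examples.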
---

OK, I got your point about enabling an embedded database for running live. I'll work on enabling that.
---
I was trying that. For H2, you can try these values on application.yml:

```properties
#
# H2 Database
#
database.jdbc.driver=org.h2.Driver
database.jdbc.url=jdbc:h2:~/data/univocity.db;MODE=MySQL;DATABASE_TO_LOWER=TRUE
database.user=sa
database.password=
```

For tests I was looking at something like this:

```java
private static final String H2_DRIVER = "org.h2.Driver";
private static final String JDBC_URL = "jdbc:h2:mem:univocity;DB_CLOSE_DELAY=-1;MODE=MySQL;DATABASE_TO_LOWER=TRUE;INIT=runscript from 'classpath:/db/h2/candle.sql'\\;runscript from 'classpath:/db/h2/gap.sql'";
private static final String DB_USER = "sa";
private static final String DB_PASSWORD = "dummy";

Strategy.Simulator simulator = Strategy.simulator();
simulator.configure().database()
    .jdbcDriver(H2_DRIVER)
    .jdbcUrl(JDBC_URL)
    .user(DB_USER)
    .password(DB_PASSWORD);
```

On pom.xml:

```xml
<dependency>
    <groupId>com.h2database</groupId>
    <artifactId>h2</artifactId>
    <version>1.4.200</version>
</dependency>
```

And finally for the creation scripts, on candle.sql:

```sql
CREATE TABLE IF NOT EXISTS candle
(
    symbol     VARCHAR(32)     NOT NULL,
    open_time  BIGINT          NOT NULL,
    close_time BIGINT          NOT NULL,
    open       DECIMAL(20, 10) NOT NULL,
    high       DECIMAL(20, 10) NOT NULL,
    low        DECIMAL(20, 10) NOT NULL,
    close      DECIMAL(20, 10) NOT NULL,
    volume     DECIMAL(20, 10) NOT NULL,
    ts         TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

ALTER TABLE candle ADD CONSTRAINT candle_symbol_time_uq UNIQUE (symbol, open_time, close_time);

CREATE INDEX candle_symbol_idx ON candle (symbol) USING HASH;
CREATE INDEX candle_symbol_open_idx ON candle (symbol, open_time) USING BTREE;
CREATE INDEX candle_symbol_close_idx ON candle (symbol, close_time) USING BTREE;
CREATE INDEX candle_open_close_idx ON candle (symbol, open_time, close_time) USING BTREE;

-- Once the table has data, you can run this to keep all rows sorted in the database.
ALTER TABLE candle ORDER BY symbol, open_time, close_time;
```

and on gap.sql:

```sql
-- Used to store known gaps in history that can't be retrieved from the exchange.
-- Delete rows from this table if you want to try filling the gaps again.
CREATE TABLE IF NOT EXISTS gap
(
    symbol     VARCHAR(32) NOT NULL,
    open_time  BIGINT      NOT NULL,
    close_time BIGINT      NOT NULL,
    CONSTRAINT gap_symbol_time_uq UNIQUE (symbol, open_time, close_time)
);
```
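And as a rough sketch of the workflow (not the project's loader, just plain JDBC with dummy values): once candle.sql has been run against the file database above, batch inserts populate it, and the resulting single ~/data/univocity.db file is what gets copied to the AWS instance:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CandleLoaderSketch {
    public static void main(String[] args) throws Exception {
        // File-mode URL from the application.yml snippet above;
        // assumes the candle table from candle.sql already exists.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:h2:~/data/univocity.db;MODE=MySQL;DATABASE_TO_LOWER=TRUE", "sa", "")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO candle (symbol, open_time, close_time, open, high, low, close, volume) "
                       + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                // One dummy row; in practice you'd loop over parsed exchange data.
                ps.setString(1, "BTCUSDT");
                ps.setLong(2, 1614902400000L);
                ps.setLong(3, 1614902459999L);
                for (int i = 4; i <= 8; i++) ps.setBigDecimal(i, java.math.BigDecimal.ONE);
                ps.addBatch();
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}
```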
---

The creation scripts might need some work, as some of the syntax may not be supported by H2.

---
Thanks, I'll check it out. It's probably related to the indices and MySQL-specific stuff.
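If it helps, here's my guess at an H2-friendly form (assumptions: H2 accepts plain CREATE INDEX without MySQL's USING HASH/BTREE suffix, and ALTER TABLE ... ORDER BY looks MySQL-only, so it's simply dropped):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class H2FriendlyIndexes {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:ddltest", "sa", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE candle (symbol VARCHAR(32) NOT NULL, "
                     + "open_time BIGINT NOT NULL, close_time BIGINT NOT NULL)");
            // Plain CREATE INDEX; the MySQL "USING HASH/BTREE" suffix is omitted.
            st.execute("CREATE INDEX candle_symbol_open_idx ON candle (symbol, open_time)");
            st.execute("CREATE INDEX candle_symbol_close_idx ON candle (symbol, close_time)");
            // MySQL's "ALTER TABLE ... ORDER BY" has no H2 equivalent, so it's left out.
        }
    }
}
```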
---
@jbax In fact, I have already switched to the H2 database for normal operation, and the memory usage is very good. I will use MySQL for local testing and H2 for the live run; MySQL and H2 each have their own use cases, so there's no need to favor one over the other.

---
Done in the latest commit.

---
Original issue:

I was considering brute-forcing parameters with backtesting data on an AWS EC2 instance, one of the instances with 96 cores, or maybe one with 128 cores (although the latter have almost 2 TB of RAM, so they're about 3x more expensive than the 96-core ones). Roughly, the 96-core instances cost about $4-$5 an hour, and the 128-core ones about $12-$13 an hour.

That price is very expensive if it runs 24/7 for a month, but perfectly acceptable if you only run your simulations for a limited time. @jbax mentioned that "I can run the process described above in 20 minutes on a threadripper 3990x (we're talking about billions of candles processed over and over)." The TR 3990x has 64 cores, so 96 or 128 cores would be even better.

I was considering how to maximize performance and minimize cost. I thought about bringing up a database in Docker, but importing a dump and then fetching the new data would take too much time. Then I thought about Amazon RDS, but if I only need to run the process for a short time, I'd have to pay for the database to hold my data while not using it.

I considered implementing support for reading binary data (it would be faster to read than text files), and I saw ActiveJ Serializer as a very fast option, but it requires adding annotations to the model, which is overhead.

Then I thought an embedded database could be perfect. I considered Nitrite, a NoSQL embedded database based on H2, but an SQL-based one is probably less work, since most of the database code would stay the same. Nitrite also requires annotations on the models.

The best fit is H2: it's written in Java and is very fast. It would be great for unit tests to use it in-memory, and for heavy simulations on AWS it would be easy to just copy the H2 database file with all the backtesting data along with the application jar. Running it in-memory on one of the AWS instances with lots of RAM would be great, but then we'd have to fetch all the backtesting data first; it's more convenient to just upload a database file.

I tried it in a branch and it's very easy to set up; I'll create a PR for that soon.
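As a rough sketch of that in-memory idea (using H2's built-in SCRIPT and RUNSCRIPT commands; I haven't benchmarked this, and the file URL is the one from the application.yml snippet earlier in the thread): the shipped file database could be dumped and replayed into an in-memory database at startup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FileToMemorySketch {
    public static void main(String[] args) throws Exception {
        // Dump the shipped file database (schema + data) to a plain SQL script...
        try (Connection file = DriverManager.getConnection(
                "jdbc:h2:~/data/univocity.db;MODE=MySQL", "sa", "");
             Statement st = file.createStatement()) {
            st.execute("SCRIPT TO 'backtest.sql'");
        }
        // ...then replay it into an in-memory database for the simulation run.
        try (Connection mem = DriverManager.getConnection(
                "jdbc:h2:mem:univocity;DB_CLOSE_DELAY=-1;MODE=MySQL", "sa", "");
             Statement st = mem.createStatement()) {
            st.execute("RUNSCRIPT FROM 'backtest.sql'");
        }
    }
}
```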