We should consider gzipping datum_t objects when we serialize them #1396
That would be awesome! (Unless we saturate the CPU, in which case that would not be awesome.)
Before that we should add variable-length encoding for string lengths (edit: that was done, btw) and find some way to avoid writing keys in all the places.
We don't use protocol buffers to serialize datum_t objects.
Gzipping datum_t's and other forms of wanton compression could also expose some systems to BREACH-style attacks.
The string length thing would be good. What do you mean by "avoid writing keys in all the places"?
We may want to consider offering gzip compression to/from the clients, too. (Also, our own serialization function isn't that much better; we still serialize a whole double for every number, for example.)
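As a concrete illustration of the variable-length encoding mentioned above, here is a minimal LEB128-style varint sketch (the helper names are hypothetical, not RethinkDB's actual serializer); small values such as string lengths or small integers fit in one or two bytes instead of a fixed eight-byte double:

```cpp
#include <cstdint>
#include <vector>

// Illustrative LEB128-style varint: 7 payload bits per byte, high bit
// set on every byte except the last.
void write_varint(std::vector<uint8_t> *out, uint64_t value) {
    while (value >= 0x80) {
        out->push_back(static_cast<uint8_t>(value & 0x7f) | 0x80);
        value >>= 7;
    }
    out->push_back(static_cast<uint8_t>(value));
}

uint64_t read_varint(const uint8_t **in) {
    uint64_t value = 0;
    int shift = 0;
    uint8_t byte;
    do {
        byte = *(*in)++;
        value |= static_cast<uint64_t>(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    return value;
}
```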
Duly noted; we should document that we do this. (This isn't as high on my list of concerns as our performance problems.)
I think he means that if you have a billion documents, each with a field "first_name", we'll end up writing the string "first_name" a billion times.
please don't use gz as it is a CPU drain and slow (relatively speaking). compression can gain you both space and speed performance if you choose an algorithm that has a higher bandwidth than the device from/to which you are reading/writing (either network or storage), and luckily these algorithms are typically very easy on the CPU as well. look at lzo if you want something commonly available and well suited to high speed / low cpu. maybe take a peek at google's snappy as well.
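For reference, Snappy's C++ interface is a one-shot call; a minimal sketch of what compressing a serialized datum buffer could look like (the function names here are hypothetical):

```cpp
#include <string>
#include <snappy.h>

// Sketch: Snappy trades compression ratio for very high throughput,
// which is the bandwidth-over-space tradeoff described above.
std::string compress_datum(const std::string &serialized) {
    std::string compressed;
    snappy::Compress(serialized.data(), serialized.size(), &compressed);
    return compressed;
}

bool decompress_datum(const std::string &compressed, std::string *out) {
    // Returns false if the input is not valid Snappy data.
    return snappy::Uncompress(compressed.data(), compressed.size(), out);
}
```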
I'd like to do this sooner rather than later, because I think it will help with Benchmark X (both speed and actually fitting the goddamn thing on disk). @jdoliner also recommended snappy, so I'm going to look into that. @AtnNn -- snappy doesn't seem to have pre-precise packages, so we'd probably have to include it like with re2. Would that be plausible?
If snappy is as portable and as easy to build as re2, it is plausible.
I've looked into this more. Snappy is really fast (4-5x zlib at zlib's fastest compression setting), but it has really poor compression on some workloads. It can barely compress English text at all (it takes the size down by about 20%, compared to about 50% for zlib). My hypothesis is that zlib is getting good compression on English text because its entropy coding encodes more common characters in fewer bits, which Snappy doesn't do. Snappy does much better on more redundant data; it takes the size of a random GitHub issue I picked through the JSON API down by about 50%, probably because of all the repeated URLs and stuff.
I think we should just use snappy for now. Our original test data for benchmark X was highly redundant because it wasn't written expecting to be compressed; we'll have to revise it with a target snappy compression ratio of about .5 (GitHub seems like as good a standard as anything for how well JSON documents will compress).
Also, snappy would perform much better if it were compressing multiple rows at a time, since it could then de-dup field names.
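A rough sketch of that idea: concatenate several serialized rows into one buffer before compressing, so field names repeated across documents become cheap back-references (the framing is simplified; a real format would also record per-row offsets):

```cpp
#include <string>
#include <vector>
#include <snappy.h>

// Sketch: batch compression lets the compressor de-dup field names
// that appear in every row, which per-document compression cannot.
std::string compress_batch(const std::vector<std::string> &rows) {
    std::string concatenated;
    for (const std::string &row : rows) {
        concatenated += row;  // real code would also store row boundaries
    }
    std::string compressed;
    snappy::Compress(concatenated.data(), concatenated.size(), &compressed);
    return compressed;
}
```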
We should document that our database shouldn't be used for web applications?
Maybe I don't understand BREACH-style attacks, but all the data we'd be compressing is data we've previously been passing around as clear text. How is this going to make things less secure?
A BREACH-style attack is what you get when you compress secret data together with attacker-supplied data. The attacker can then figure out the contents of the secret data by measuring how well his data compresses together with the secret data. For example, suppose a dating website stores users' information in the following format
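A minimal sketch of the length oracle at the heart of such an attack, using zlib (the secret and guesses here are purely illustrative):

```cpp
#include <string>
#include <zlib.h>

// When attacker-supplied data shares a substring with the secret, the
// compressor back-references it and the output shrinks. An attacker who
// can observe only output *sizes* can recover the secret byte by byte.
size_t compressed_size(const std::string &data) {
    uLongf dest_len = compressBound(data.size());
    std::string dest(dest_len, '\0');
    compress2(reinterpret_cast<Bytef *>(&dest[0]), &dest_len,
              reinterpret_cast<const Bytef *>(data.data()),
              data.size(), Z_BEST_COMPRESSION);
    return dest_len;
}

// compressed_size(secret + "token=abc...") comes out smaller when
// "token=abc..." actually occurs inside the secret, so each guessed
// prefix can be confirmed or rejected purely from the observed size.
```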
Or better yet, figure out how much padding makes the document size reach the threshold where a blob needs a different number of levels. Even if all the data is in memory, locking the next level of blocks should take a spin or two around the message loop -- send some other queries that would overload the database server and you'll have half a second difference in response times.
@srh do you happen to know how other databases deal with this?
I know PostgreSQL is willing to compress individual SQL rows if they get large enough. However, people using a relational database aren't going to put those two kinds of data in the same row.
Most mature DB products I know implement compression, but make it optional (there are many obvious reasons why we can't make compression a requirement). I think we should do that -- make it an option at table creation, and document the implications of the choice really well. This would let users decide whether space issues are more important to them than vulnerability risk (many people want to use Rethink as an analytics/reporting type system without outside access; storage space is much more important there than possible vulnerabilities). I don't know if compression should be on by default, but my intuition says it should be off.
compression is an implementation detail to me and if you ship the libs with the product i don't care what you use. as for BREACH... how does using compression in your data storage back end make a web service less secure? wouldn't a hacker need physical access to the compressed bytes on the disk of someone's rethinkdb server for this to matter? my 2 cents says compression as a performance and storage optimization should default to on. if you were sacrificing performance for space (like gzip) it should default to off. but working with big data has taught me that the value of compression is really increasing bandwidth (over the wire and on and off the disk) rather than saving space.
I did some preliminary testing on Benchmark X in JD's
So, I did some more tests, and they're very strange. On a 1-node cluster with no secondary indexes or extra write acks or anything, the following workloads:
all have the exact same behavior: 3-4k writes per second, single-digit CPU utilization by all threads except for 4, and those 4 have 40-80% CPU utilization.
Where are you actually doing the compression?
JD made the
There is some extra copying involved in the compression right now, but I don't think that explains this.
Interestingly enough, when I bump the CPU_SHARDING_FACTOR up to 24 (the number of cores on magneto), I still can't get above like 5k writes/sec (and the web interface starts to become unresponsive around then).
Er, I didn't mean unresponsive; I meant the stat requests start to time out (or something) so that the write graph no longer corresponds closely to reality.
Make that 5-6k. Here's the other interesting thing: that 50% increase in write throughput (from bumping up the CPU_SHARDING_FACTOR) comes along with a 200% increase in CPU usage (from 400% to 1200% for the whole process). @srh -- is there some sort of large per-cpu-shard overhead that might be causing this?
Apologies for this type of question. I'm leaning towards using Cassandra for a new project only because of its compression feature; however, if the compression feature in rethinkdb becomes available in 4 months I'd prefer to use that... just wondering if that should be anticipated or not.
We currently don't have compression planned in the next four months, though I'd be surprised if it doesn't make it in within a year. We'd consider prioritizing it if it's really important. Could you email me at slava@rethinkdb.com with your use case?
use-case -> i have lots of data, and compressing saves me x disk space so i can store 10x more. for me it's mostly around storing high value archival data in volume, with most of the query set against recent data (ie no full scans, either tail cursors or indexed queries). concrete case -> one i'm looking at is cloudtrail data (aws audit log), which is json events dumped with high frequency to s3 as gz compressed keys. in one aws account i work with that's 1gb of data a day uncompressed. i'd like to be able to query out those records on various attributes within a time window, as well as determine api usage rates within a given time window.
This might also give you some headroom on file size bloat.
Moving to subsequent. I think we should reconsider this soon. Per-document compression could be especially effective if it pushes slightly larger documents below the 250-byte btree inlining size limit.
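A sketch of what that decision could look like, assuming the ~250-byte inlining limit mentioned above (the names and exact policy are illustrative):

```cpp
#include <string>
#include <snappy.h>

// Hypothetical threshold from the comment above: documents at or below
// this size are stored inline in the btree node.
const size_t INLINE_LIMIT = 250;

// Store the compressed form only when it shrinks the document. The big
// win is a document that drops below INLINE_LIMIT: it then lives inline
// in the btree, saving an extra block fetch on every read.
std::string maybe_compress(const std::string &doc, bool *compressed_out) {
    std::string compressed;
    snappy::Compress(doc.data(), doc.size(), &compressed);
    *compressed_out = compressed.size() < doc.size();
    return *compressed_out ? compressed : doc;
}
```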
As far as I can tell, multi-document compression is a lot more difficult to implement in our codebase (though it would also be significantly more effective). A static dictionary that can be specified at table creation but can't be changed afterwards should also be fairly easy to do, and can help reduce the overhead of repeating field names in each document if those names are known beforehand.
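zlib already supports this through preset dictionaries; a sketch of compressing one document against a table-wide dictionary of known field names (the dictionary contents and function name are illustrative, and error handling is omitted; the read side would call `inflateSetDictionary` when `inflate` returns `Z_NEED_DICT`):

```cpp
#include <cstring>
#include <string>
#include <zlib.h>

// Sketch: seed the compressor with a static dictionary fixed at table
// creation, so even the *first* occurrence of a known field name in a
// document compresses to a short back-reference.
std::string compress_with_dictionary(const std::string &doc,
                                     const std::string &dict) {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    deflateInit(&strm, Z_DEFAULT_COMPRESSION);
    deflateSetDictionary(&strm,
                         reinterpret_cast<const Bytef *>(dict.data()),
                         static_cast<uInt>(dict.size()));
    std::string out(deflateBound(&strm, doc.size()), '\0');
    strm.next_in = reinterpret_cast<Bytef *>(const_cast<char *>(doc.data()));
    strm.avail_in = static_cast<uInt>(doc.size());
    strm.next_out = reinterpret_cast<Bytef *>(&out[0]);
    strm.avail_out = static_cast<uInt>(out.size());
    deflate(&strm, Z_FINISH);
    out.resize(strm.total_out);
    deflateEnd(&strm);
    return out;
}

// e.g. dict could be "\"timestamp\":\"first_name\":\"last_name\":" for a
// table whose field names are known beforehand.
```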
the mongodb compression in wiredtiger is really quite great imo. it uses snappy by default and it performs really well for me. https://docs.mongodb.com/manual/core/wiredtiger/#compression
Thanks for sharing @polidore. Good data point in favor of snappy, though, in either case.
Is there an ETA on this yet? We are starting to feel the pain of no compression. For storage we use RethinkDB and Elasticsearch. Our Elasticsearch machines use so much less hard drive space than RethinkDB. Needing terabyte disks on RethinkDB when we need 200 GB disks on Elasticsearch is a real pain point.
@monkeyboy64 Sorry, no ETA yet. My guess would be 6-12 months from now.
Has anyone looked at https://github.com/facebook/zstd yet? It's fast, does entropy encoding, supports a static dictionary, and has a tunable speed/compression ratio (which I think is a major boon: you can have very fast CPU-friendly compression or compression with the best ratios, depending on your particular needs). Text (e.g. JSON) compression is superb, though it doesn't perform quite as well on binary data.
Zstd has a non-optional entropy encoder stage, which causes it to require ~3-4 times the CPU resources of LZ4 or Snappy. Zstd's extra compression ratio comes at a much higher risk of creating a CPU bottleneck. I think Zstd makes a good choice for an optional higher compression setting, considering that it beats zlib in both compression and decompression speed by a wide margin.
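For reference, zstd's one-shot API exposes the level directly, so a per-table compression setting could map straight onto it (a sketch; the fall-back-to-uncompressed policy is illustrative):

```cpp
#include <string>
#include <zstd.h>

// Sketch: level 1 is the fast, CPU-friendly end; level 22 is the
// best-ratio end. This is the tunable tradeoff described above.
std::string zstd_compress(const std::string &input, int level) {
    std::string out(ZSTD_compressBound(input.size()), '\0');
    size_t written = ZSTD_compress(&out[0], out.size(),
                                   input.data(), input.size(), level);
    if (ZSTD_isError(written)) {
        return input;  // fall back to storing the data uncompressed
    }
    out.resize(written);
    return out;
}
```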
@mlucy @coffeemug I'm looking at implementing this (as it'd be key to our usage of RethinkDB), can you elaborate on:
Was the only issue the CPU usage or were there other discoveries that made this unworkable? |
@robertjpayne I unfortunately don't remember any details about this. I think that comment was probably based on some kind of implementation-complexity assessment, not CPU usage concerns. But I have no clue anymore what those complexities might have been.
@robertjpayne -- my memory is that:
If all you need is single-document compression, and there's a compression algorithm that has the right CPU/space tradeoff on your data, then I don't see any problem with adding it.
Glad to see that there is still some activity on this topic. Compression is really a must-have for document based databases in order to get rid of the millions of redundant keys within the data. I'd like to draw your attention to this benchmark comparison, including most recent compression algorithms, also from a DB developer's point of view. LZ4 is the clear winner there. https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
@ml31415 there's still no clear path forward on this -- it's a massive undertaking because not only does it need to happen in the log serialiser, it also needs to be able to upgrade existing tables. I almost reckon it's more worthwhile to officially support running on ZFS; if ZFS is configured with the right block sizes it can do the compression on the fly with minimal overhead.
Sounds interesting. I'd love to use rethinkdb for a redesign of our trading engine, due to the amazing streaming features and the great syntax, but I can't have it bloat the majority of the used disk space with endless copies of key data. It's a pity that there is no DB combining the rethink API and streaming features with a more structured way of storing data and ideally columnwise compression.
@ml31415 there are good reasons not to use compression too (CPU overhead, crypto attacks, etc…). I'd recommend you look at ZFSOnLinux and particularly how to set it up for MySQL, as RethinkDB would benefit from a similar setup. Though I believe RethinkDB pages data in 4KB blocks whereas MySQL uses 16KB blocks.
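For anyone taking the ZFS route, a hedged example of a dataset tuned along those lines (the pool/dataset names are illustrative; recordsize should match the database's block size):

```
# illustrative ZFS dataset for a RethinkDB data directory
zfs create -o recordsize=4K -o compression=lz4 tank/rethinkdb
zfs set atime=off tank/rethinkdb
```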
Dropping by to say I'm pro-compression. It could be a connection option or a query option. On the same issue, any idea if the overhead would be too high storing RethinkDB's data in a mounted Swift container?
@Extarys compression on the wire protocol is a totally different issue (much easier); storage compression is what is being discussed here and is much harder.
Oh, I wasn't quite sure, so I took my chance. I prefer performance over disk space. What do similar DB engines do? Does the "competition" compress?
Yes, most document-oriented DBs feature compression in some way. Storing documents as JSON strings just cries for proper compression. E.g. imagine billions of strings like "{timestamp: 123123123, bid: 123, ask: 124, bvol: 100, avol: 200}". There are, let's say, 12 bytes of usable info in there, but the record would use 64 bytes on disk without compression. That's bloat by a factor of 5, and it can get worse if you have longer key strings.
I'm currently working with multiple big objects, so I can imagine how fast this can grow in disk space. I know that for files LZMA offers the best trade-off between CPU time and compression ratio, though I think it requires more memory.
Just stumbled across another possible candidate, which recently got introduced to the linux kernel: https://github.com/facebook/zstd
@ml31415 the reality is that single-document compression isn't going to be worth it for the CPU it costs. Have to get block level across the whole db.
This results in space savings for even smallish documents, for several reasons: