
We should consider gzipping datum_t objects when we serialize them #1396

Open

mlucy opened this issue Aug 30, 2013 · 62 comments

Comments

@mlucy
Member

mlucy commented Aug 30, 2013

This would yield space savings even for smallish documents, for several reasons:

  • English-like text compresses by roughly a factor of 4.
  • We use a whole double to store every number, so arrays of small numbers compress by about a factor of 10 (counting protobuf overhead) -- see the sketch below.
  • Our protobuf format has to use multiple type tags for every datum term -- we have a type for the term, then a type for the datum. Compressing away the protobuf overhead alone gives pretty good savings (although this is harder to quantify).
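To make the second bullet concrete, here is a minimal sketch using zlib's one-shot compress() API (purely illustrative; the exact ratio depends on the data):

```cpp
// Minimal sketch: an array of small numbers, each stored as a full 8-byte
// double, still deflates to a small fraction of its raw size. Link with -lz.
#include <zlib.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<double> nums(1000);
    for (size_t i = 0; i < nums.size(); ++i) {
        nums[i] = rand() % 100;  // small integers, each serialized as a double
    }

    const Bytef *src = reinterpret_cast<const Bytef *>(nums.data());
    uLong src_len = nums.size() * sizeof(double);

    uLongf dst_len = compressBound(src_len);
    std::vector<Bytef> dst(dst_len);
    if (compress(dst.data(), &dst_len, src, src_len) != Z_OK) return 1;

    printf("raw: %lu bytes, deflated: %lu bytes (%.1fx smaller)\n",
           src_len, dst_len, double(src_len) / dst_len);
    return 0;
}
```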
@coffeemug
Contributor

That would be awesome! (Unless we saturate the CPU, in which case that would not be awesome).

@srh
Contributor

srh commented Sep 2, 2013

Before that we should add variable-length encoding for string lengths (edit: that has since been done, btw) and find some way to avoid writing keys in all the places.
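For context, the usual approach is an LEB128-style varint length prefix, so short strings pay one byte instead of a fixed four or eight. A minimal sketch (illustrative only, not RethinkDB's actual serialization code):

```cpp
// Illustrative LEB128-style varint encoder/decoder for length prefixes.
#include <cstdint>
#include <vector>

void write_varint(std::vector<uint8_t> *out, uint64_t value) {
    while (value >= 0x80) {
        out->push_back(static_cast<uint8_t>(value) | 0x80);  // high bit: more bytes follow
        value >>= 7;
    }
    out->push_back(static_cast<uint8_t>(value));  // final byte, high bit clear
}

uint64_t read_varint(const uint8_t **p) {
    uint64_t value = 0;
    int shift = 0;
    while (**p & 0x80) {
        value |= static_cast<uint64_t>(**p & 0x7f) << shift;
        shift += 7;
        ++*p;
    }
    value |= static_cast<uint64_t>(**p) << shift;
    ++*p;
    return value;
}
```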

@srh
Contributor

srh commented Sep 2, 2013

We don't use protocol buffers to serialize datum_t objects.

@srh
Contributor

srh commented Sep 2, 2013

Gzipping datum_t's and other forms of wanton compression could also expose some systems to BREACH-style attacks.

@mlucy
Member Author

mlucy commented Sep 2, 2013

> Before that we should add variable-length encoding for string lengths and find some way to avoid writing keys in all the places.

The string length thing would be good. What do you mean by "avoid writing keys in all the places"?

> We don't use protocol buffers to serialize datum_t objects.

We may want to consider offering gzip compression to/from the clients, too. (Also, our own serialization function isn't that much better; we still serialize a whole double for every number, for example.)

> Gzipping datum_t's and other forms of wanton compression could also expose some systems to BREACH-style attacks.

Duly noted; we should document that we do this. (This isn't as high on my list of concerns as our performance problems.)

@coffeemug
Contributor

> What do you mean by "avoid writing keys in all the places"?

I think he means that if you have a billion documents, each with a field "first_name", we'll end up writing the string "first_name" a billion times.

@underrun

underrun commented Sep 3, 2013

Please don't use gzip, as it is a CPU drain and relatively slow.

Compression can gain you both space and speed if you choose an algorithm whose bandwidth is higher than that of the device you are reading from or writing to (either the network or storage), and luckily such algorithms are typically very easy on the CPU as well.

Look at LZO if you want something commonly available and well suited to high speed / low CPU usage. Maybe take a peek at Google's Snappy as well.

@mlucy
Member Author

mlucy commented Sep 9, 2013

I'd like to do this sooner rather than later, because I think it will help with Benchmark X (both speed and actually fitting the goddamn thing on disk).

@jdoliner also recommended snappy, so I'm going to look into that.

@AtnNn -- snappy doesn't seem to have packages for pre-Precise Ubuntu releases, so we'd probably have to bundle it the way we do re2. Would that be plausible?

@AtnNn
Member

AtnNn commented Sep 9, 2013

If snappy is as portable and as easy to build as re2, it is plausible.

@mlucy
Member Author

mlucy commented Sep 12, 2013

I've looked into this more.

Snappy is really fast (4-5x faster than zlib at zlib's fastest compression setting), but its compression ratio is really poor on some workloads. It can barely compress English text at all (it takes the size down by about 20%, compared to about 50% for zlib). My hypothesis is that zlib does well on English text by encoding the more common characters (like 'e') in fewer than 8 bits; Snappy doesn't do entropy encoding, so it can't do very well on those inputs (about all it can do is eliminate common subsequences of letters).

Snappy does much better on more redundant data; it takes the size of a random GitHub issue I pulled through the JSON API down by about 50%, probably because of all the repeated URLs and such.
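A rough sketch of the kind of harness used for a comparison like this (the input below is a placeholder -- substitute real English text or a JSON document; link with -lsnappy -lz):

```cpp
// Compare snappy vs. zlib (at its fastest level) on one sample input.
#include <snappy.h>
#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Placeholder input; read a real document here to reproduce the numbers.
    std::string input(100000, 'x');

    std::string snappy_out;
    snappy::Compress(input.data(), input.size(), &snappy_out);

    uLongf zlib_len = compressBound(input.size());
    std::vector<Bytef> zlib_out(zlib_len);
    compress2(zlib_out.data(), &zlib_len,
              reinterpret_cast<const Bytef *>(input.data()), input.size(),
              Z_BEST_SPEED);  // zlib's fastest setting, as in the comparison above

    printf("input: %zu bytes, snappy: %zu bytes, zlib (level 1): %lu bytes\n",
           input.size(), snappy_out.size(), zlib_len);
    return 0;
}
```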

@mlucy
Member Author

mlucy commented Sep 12, 2013

I think we should just use snappy for now. Our original test data for Benchmark X was highly redundant because it wasn't written with compression in mind; we'll have to revise it to target a snappy compression ratio of about 0.5 (GitHub seems like as good a standard as anything for how well JSON documents will compress).

@mlucy
Member Author

mlucy commented Sep 12, 2013

Also, snappy would perform much better if it were compressing multiple rows at a time, since it could then de-dup field names.

@srh
Contributor

srh commented Sep 12, 2013

> Duly noted; we should document that we do this. (This isn't as high on my list of concerns as our performance problems.)

We should document that our database shouldn't be used for web applications?

@jdoliner
Contributor

Maybe I don't understand BREACH-style attacks, but all the data we'd be compressing is data we've previously been passing around as cleartext. How would this make things less secure?

@srh
Contributor

srh commented Sep 12, 2013

A BREACH-style attack is what you get when you compress secret data together with attacker-supplied data. The attacker can then figure out the contents of the secret data by measuring how well his data compresses together with it.

For example, suppose a dating website stores users' information in the format { username: ..., email_address: ..., inbox: [...] }. A stalker could then figure out a user's email address by sending messages to (and deleting messages from) the user's inbox and measuring how long the server takes to respond to requests involving that user. You can do this with relatively low timing resolution if you first figure out how much padding a message needs for the document size to reach a block boundary.
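To make the mechanism concrete, here is a toy sketch of the size side channel (the secret and the guesses are made up, and this isn't RethinkDB-specific code; it just shows that compressed size alone leaks information):

```cpp
// Toy illustration of the side channel behind BREACH-style attacks:
// attacker-controlled data that matches the secret compresses better.
#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

static uLongf deflated_size(const std::string &s) {
    uLongf len = compressBound(s.size());
    std::vector<Bytef> out(len);
    compress(out.data(), &len,
             reinterpret_cast<const Bytef *>(s.data()), s.size());
    return len;
}

int main() {
    const std::string secret = "{\"email_address\":\"alice@example.com\"}";
    // Two attacker-supplied messages stored (and compressed) alongside the secret.
    uLongf hit  = deflated_size(secret + "{\"msg\":\"alice@example.com\"}");
    uLongf miss = deflated_size(secret + "{\"msg\":\"mallory@example.net\"}");
    // The guess that shares a substring with the secret compresses smaller, so
    // sizes (or the timing effects they cause) reveal something about the secret.
    printf("matching guess: %lu bytes, non-matching guess: %lu bytes\n", hit, miss);
    return 0;
}
```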

@srh
Contributor

srh commented Sep 12, 2013

Or better yet, figure out how much padding makes the document size reach the threshold where a blob needs a different number of levels. Even if all the data is in memory, locking the next level of blocks should take a spin or two around the message loop -- send some other queries that overload the database server and you'll see half a second of difference in response times.

@jdoliner
Contributor

@srh do you happen to know how other databases deal with this?

@srh
Contributor

srh commented Sep 12, 2013

I know PostgreSQL is willing to compress individual SQL rows if they get large enough. However, people using a relational database aren't going to put those two kinds of data in the same row.

@coffeemug
Contributor

Most mature DB products I know of implement compression, but make it optional (there are many obvious reasons why we can't make compression a requirement). I think we should do the same -- make it an option at table creation, and document the implications of the choice really well. This would let users decide whether space savings matter more to them than the vulnerability risk (many people want to use Rethink as an analytics/reporting-type system without outside access; storage space is much more important there than possible vulnerabilities).

I don't know if compression should be on by default, but my intuition says it should be off.

@underrun

Compression is an implementation detail to me, and if you ship the libs with the product I don't care what you use.

As for BREACH ... how does using compression in your data-storage back end make a web service less secure? Wouldn't a hacker need physical access to the compressed bytes on the disk of someone's RethinkDB server for this to matter?

My 2 cents says that compression as a performance and storage optimization should default to on. If you were sacrificing performance for space (as with gzip), it should default to off. But working with big data has taught me that the value of compression is really in increasing bandwidth (over the wire and on and off the disk) rather than saving space.

@mlucy
Member Author

mlucy commented Sep 14, 2013

I did some preliminary testing on Benchmark X in JD's snappy branch with highly compressible data, and the write throughput was 8-10k, the same as before. This suggests that something else is the bottleneck (perf top didn't produce any obvious suspects). I'll do some more tests on Monday.

@mlucy
Member Author

mlucy commented Sep 14, 2013

So, I did some more tests, and they're very strange. On a 1-node cluster with no secondary indexes or extra write acks or anything, the following workloads:

  • 1 shard, no compression
  • 1 shard, 2x compression
  • 4 shards, no compression
  • 4 shards, 2x compression

all have the exact same behavior: 3-4k writes per second, single-digit CPU utilization by all threads except for 4, and those 4 have 40-80% CPU utilization.

@srh
Contributor

srh commented Sep 14, 2013

Where are you actually doing the compression?

@mlucy
Member Author

mlucy commented Sep 14, 2013

JD made the rdb_blob_wrapper_t compress when it writes to its internal blob_t.
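Roughly the shape of that change, as a hypothetical sketch -- the class and method names below are stand-ins, not the real rdb_blob_wrapper_t/blob_t interface:

```cpp
// Hypothetical compress-on-write wrapper; the real blob_t writes go through
// RethinkDB's buffer/transaction machinery, which is omitted here.
#include <snappy.h>
#include <string>

class fake_blob_t {  // stand-in for the real blob_t
public:
    void append(const std::string &bytes) { storage_ += bytes; }
private:
    std::string storage_;
};

class compressing_blob_wrapper_t {
public:
    explicit compressing_blob_wrapper_t(fake_blob_t *blob) : blob_(blob) { }

    // Compress the serialized datum before handing it to the underlying blob.
    void write(const std::string &serialized_datum) {
        std::string compressed;
        snappy::Compress(serialized_datum.data(), serialized_datum.size(),
                         &compressed);
        blob_->append(compressed);
    }

private:
    fake_blob_t *blob_;
};
```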

@jdoliner
Contributor

There is some extra copying involved in the compression right now, but I don't think that explains this.

@mlucy
Member Author

mlucy commented Sep 17, 2013

Interestingly enough, when I bump the CPU_SHARDING_FACTOR up to 24 (the number of cores on magneto), I still can't get above like 5k writes/sec (and the web interface starts to become unresponsive around then).

@mlucy
Member Author

mlucy commented Sep 17, 2013

Er, I didn't mean unresponsive; I meant the stat requests start to time out (or something) so that the write graph no longer corresponds closely to reality.

@mlucy
Member Author

mlucy commented Sep 17, 2013

Make that 5-6k.

Here's the other interesting thing: that 50% increase in write throughput (from bumping up the CPU_SHARDING_FACTOR) comes along with a 200% increase in CPU usage (from 400% to 1200% for the whole process).

@srh -- is there some sort of large per-cpu-shard overhead that might be causing this?

@gx0r

gx0r commented Sep 15, 2015

Apologies for this type of question. I'm leaning towards using Cassandra for a new project only because of its compression feature; however, if compression in RethinkDB becomes available within 4 months I'd prefer to use that... just wondering whether that should be anticipated or not.

@coffeemug
Contributor

We currently don't have compression planned for the next four months, though I'd be surprised if it doesn't make it in within a year. We'd consider prioritizing it if it's really important to you. Could you email me at slava@rethinkdb.com with your use case?

@kapilt

kapilt commented Feb 29, 2016

Use case -> I have lots of data, and compressing it saves me X disk space so I can store 10x more. For me it's mostly high-volume, high-value archival data, with most of the query set running against recent data (i.e. no full scans; either tail cursors or indexed queries).

Concrete case -> one I'm looking at is CloudTrail data (the AWS audit log), which is JSON events dumped at high frequency to S3 as gzip-compressed keys. In one AWS account I work with, that's 1 GB of data a day uncompressed. I'd like to be able to query those records on various attributes within a time window, as well as determine API usage rates within a given time window.

@srh
Contributor

srh commented Mar 4, 2016

This might also give you some headroom on file size bloat.

@danielmewes danielmewes modified the milestones: subsequent, backlog Mar 4, 2016
@danielmewes
Member

Moving to subsequent. I think we should reconsider this soon.

Per-document compression could be especially effective if it pushes slightly larger documents below the 250-byte btree inlining size limit.
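A sketch of that decision, using the 250-byte figure from the comment above and snappy as a stand-in codec (the real btree inlining logic is more involved; this only shows the shape of the check):

```cpp
// Hypothetical check: does a document fit inline in the btree node once
// its serialization has been compressed?
#include <snappy.h>
#include <string>

const size_t INLINE_THRESHOLD_BYTES = 250;  // limit mentioned in the comment above

bool fits_inline_after_compression(const std::string &serialized_datum) {
    std::string compressed;
    snappy::Compress(serialized_datum.data(), serialized_datum.size(), &compressed);
    return compressed.size() <= INLINE_THRESHOLD_BYTES;
}
```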

@danielmewes
Member

As far as I can tell, multi-document compression is a lot more difficult to implement in our codebase (though it would also be significantly more effective).

A static dictionary that can be specified at table creation but can't be changed afterwards should also be fairly easy to do, and it can help reduce the overhead of repeating field names in each document if those names are known beforehand.
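A minimal sketch of the preset-dictionary idea using zlib's deflateSetDictionary (the dictionary contents here are made up; a real implementation would derive them from the table's known field names):

```cpp
// Sketch: seed deflate with a static dictionary of expected field names so
// even a tiny document doesn't pay full price for its keys. Link with -lz.
#include <zlib.h>
#include <cstring>
#include <string>
#include <vector>

std::vector<Bytef> compress_with_dictionary(const std::string &doc,
                                            const std::string &dict) {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    deflateInit(&strm, Z_DEFAULT_COMPRESSION);
    // Preset dictionary, e.g. "first_name last_name email_address ...".
    deflateSetDictionary(&strm,
                         reinterpret_cast<const Bytef *>(dict.data()),
                         dict.size());

    std::vector<Bytef> out(deflateBound(&strm, doc.size()));
    strm.next_in = reinterpret_cast<Bytef *>(const_cast<char *>(doc.data()));
    strm.avail_in = doc.size();
    strm.next_out = out.data();
    strm.avail_out = out.size();
    deflate(&strm, Z_FINISH);
    out.resize(out.size() - strm.avail_out);
    deflateEnd(&strm);
    return out;
}
```

The decompressor has to be initialized with the same dictionary (inflate() returns Z_NEED_DICT, at which point inflateSetDictionary() is called with the same bytes), which is why the dictionary couldn't change after table creation.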

@polidore

The MongoDB compression in WiredTiger is really quite great, IMO. It uses snappy by default and it performs really well for me.

https://docs.mongodb.com/manual/core/wiredtiger/#compression

@danielmewes
Member

Thanks for sharing, @polidore.

The datum_t compression proposed here might end up being a bit less effective than WiredTiger's block compression when it comes to small documents. For large documents it might work better. This depends a bit on how large the "blocks" are that WiredTiger uses, which I'm not sure about.

Good data point in favor of snappy in either case, though.

@monkeyboy64

Is there an ETA on this yet? We are starting to feel the pain of having no compression. For storage we use RethinkDB and Elasticsearch, and our Elasticsearch machines use much less hard drive space than RethinkDB does. Needing terabyte disks for RethinkDB when 200 GB disks suffice for Elasticsearch is a real pain point.

@danielmewes
Member

@monkeyboy64 Sorry, no ETA yet. My guess would be 6-12 months from now.

@ArtemGr

ArtemGr commented Jan 22, 2017

Has anyone looked at https://github.com/facebook/zstd yet? It's fast, does entropy encoding, supports a static dictionary, and has a tunable speed/compression-ratio trade-off (which I think is a major boon: you can have very fast, CPU-friendly compression or compression with the best ratios, depending on your particular needs).

Text (e.g. JSON) compression is superb, though it doesn't perform quite as well on binary data.
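For reference, zstd's one-shot API with a tunable level looks like this (a sketch; the sample document is made up; link with -lzstd; level 3 is the default, and higher levels trade CPU for ratio):

```cpp
// Sketch of zstd's one-shot API at a few different compression levels.
#include <zstd.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string doc =
        "{\"first_name\":\"Ada\",\"last_name\":\"Lovelace\",\"age\":36}";  // placeholder

    for (int level : {1, 3, 19}) {  // fast ... default ... near-maximum
        std::vector<char> out(ZSTD_compressBound(doc.size()));
        size_t n = ZSTD_compress(out.data(), out.size(),
                                 doc.data(), doc.size(), level);
        if (ZSTD_isError(n)) return 1;
        printf("level %2d: %zu -> %zu bytes\n", level, doc.size(), n);
    }
    return 0;
}
```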

@tkodw

tkodw commented Feb 6, 2017

Zstd has a non-optional entropy encoder stage, which causes it to require ~3-4 times the CPU resources of LZ4 or Snappy. Zstd's extra compression ratio comes at a much higher risk of creating a CPU bottleneck.

I think Zstd makes a good choice for an optional higher compression setting, considering that it beats zlib in both compression and decompression speed by a wide margin.

@robertjpayne
Contributor

@mlucy @coffeemug I'm looking at implementing this (as it'd be key to our usage of RethinkDB); can you elaborate on:

> Talked to @mlucy -- we did some experiments, and there is no quick fix we can merge yet. Moving to backlog.

Was CPU usage the only issue, or were there other discoveries that made this unworkable?

@danielmewes
Member

@robertjpayne I unfortunately don't remember any details about this. I think that comment was probably based on some kind of implementation-complexity assessment rather than CPU usage concerns, but I no longer have any idea what those complexities might have been.

@mlucy
Member Author

mlucy commented Sep 9, 2017

@robertjpayne -- my memory is that:

  1. gzip was too CPU intensive.
  2. snappy didn't give good enough space savings to be worthwhile.
  3. The largest benefits came from compressing multiple documents with a single symbol table (because they often all share the same keys), but this was more complicated than compressing individual documents (see the sketch below).

If all you need is single-document compression, and there's a compression algorithm that has the right CPU/space tradeoff on your data, then I don't see any problem with adding it.
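On the third point, a quick way to see the effect (a sketch using zlib; the documents are made up, but the gap between per-document and batched compression is the part that matters):

```cpp
// Field names only dedupe across documents if they share one compression
// context, which is why batching beats per-document compression. Link with -lz.
#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

static uLongf deflated_size(const std::string &s) {
    uLongf len = compressBound(s.size());
    std::vector<Bytef> out(len);
    compress(out.data(), &len,
             reinterpret_cast<const Bytef *>(s.data()), s.size());
    return len;
}

int main() {
    std::vector<std::string> docs;
    for (int i = 0; i < 100; ++i) {
        docs.push_back("{\"first_name\":\"user" + std::to_string(i) +
                       "\",\"last_name\":\"example\",\"age\":" +
                       std::to_string(20 + i % 50) + "}");
    }

    uLongf per_doc_total = 0;
    std::string batch;
    for (const std::string &d : docs) {
        per_doc_total += deflated_size(d);  // every document pays for its own keys
        batch += d;
    }
    printf("compressed separately: %lu bytes, compressed as one batch: %lu bytes\n",
           per_doc_total, deflated_size(batch));
    return 0;
}
```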

@ml31415

ml31415 commented Sep 13, 2017

Glad to see that there is still some activity on this topic. Compression is really a must-have for document-oriented databases, in order to get rid of the millions of redundant keys within the data. I'd like to draw your attention to the benchmark comparison below, which covers the most recent compression algorithms from a DB developer's point of view. LZ4 is the clear winner there.

https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
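For reference, LZ4's one-shot API is about as simple as it gets (a sketch with a made-up document; link with -llz4):

```cpp
// Sketch of an LZ4 compress/decompress round trip.
#include <lz4.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string doc =
        "{\"timestamp\":123123123,\"bid\":123,\"ask\":124,\"bvol\":100,\"avol\":200}";

    std::vector<char> compressed(LZ4_compressBound(doc.size()));
    int clen = LZ4_compress_default(doc.data(), compressed.data(),
                                    doc.size(), compressed.size());
    if (clen <= 0) return 1;

    std::vector<char> restored(doc.size());
    int dlen = LZ4_decompress_safe(compressed.data(), restored.data(),
                                   clen, restored.size());
    printf("original: %zu, compressed: %d, restored: %d bytes\n",
           doc.size(), clen, dlen);
    return 0;
}
```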

@robertjpayne
Contributor

@ml31415 there's still no clear path forward on this -- it's a massive undertaking, because not only does it need to happen in the log serialiser, it also needs to be able to upgrade existing tables.

I almost reckon it's more worthwhile to officially support running on ZFS; if ZFS is configured with the right block sizes, it can do the compression on the fly with minimal overhead.

@ml31415

ml31415 commented Sep 13, 2017

Sounds interesting. I'd love to use RethinkDB for a redesign of our trading engine, due to the amazing streaming features and the great syntax, but I can't have it bloat the majority of the used disk space with endless copies of key data. It's a pity that there is no DB combining the Rethink API and streaming features with a more structured way of storing data and, ideally, column-wise compression.

@robertjpayne
Contributor

@ml31415 there are good reasons not to use compression, too (CPU overhead, crypto attacks, etc.). I'd recommend you look at ZFS on Linux, and particularly at how to set it up for MySQL, as RethinkDB would benefit from a similar setup.

Though I believe RethinkDB pages data in 4 KB blocks, whereas MySQL uses 16 KB blocks.

@Extarys

Extarys commented Sep 13, 2017

Dropping by to say I'm pro-compression. It could be offered as a connection option or a query option.
It would save bandwidth/time/memory for large queries like "all products of company X", although for querying individual documents I don't see any direct benefit.

On the same subject, any idea whether the overhead would be too high if RethinkDB's data were stored in a mounted Swift container?

@robertjpayne
Contributor

@Extarys compression on the wire protocol is a totally different issue (and much easier); storage compression is what's being discussed here, and it's much harder.

@Extarys

Extarys commented Sep 13, 2017

Oh, I wasn't quite sure, so I took my chances. I prefer performance over disk space. What do similar DB engines do? Does the "competition" compress?

@ml31415

ml31415 commented Sep 13, 2017

Yes, most document-oriented DBs feature compression in some way. Storing documents as JSON strings just cries out for proper compression. E.g., imagine billions of strings like "{timestamp: 123123123, bid: 123, ask: 124, bvol: 100, avol: 200}". There are maybe 12 bytes of usable information in there, but the record would use about 64 bytes on disk without compression. That's bloat by a factor of roughly 5, and it gets worse if you have longer key strings.

@Extarys

Extarys commented Sep 13, 2017

I'm currently working with multiple big objects, so I can imagine how fast this can grow in disk space.

I know that for files LZMA offers the best trade-off between CPU time and compression ratio, though I think it requires more memory.

@ml31415

ml31415 commented Sep 18, 2017

Just stumbled across another possible candidate, which recently got introduced into the Linux kernel: https://github.com/facebook/zstd

@robertjpayne
Contributor

@ml31415 the reality is that single-document compression isn't going to be worth the CPU it costs. You have to go block-level across the whole DB.
