
Added compression NONE #1045

Merged: 11 commits into yandex:master from prog8:nocompression on Aug 1, 2017

Conversation

@prog8 (Contributor) commented Jul 31, 2017

At the moment, no-compression is implemented in the most naive way: we use memcpy to copy an uncompressed block into a compressed buffer. This doesn't differ from the compressed path except that we don't apply any compression algorithm. Ideally we could avoid the memcpy, but that will probably require deeper refactoring.
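For illustration, a minimal self-contained sketch of that naive path, assuming the 9-byte header layout (method byte, then two little-endian UInt32 sizes) visible in the snippets quoted later in this thread; the enum values and the convention that the stored compressed size includes the header are assumptions, not copied from the ClickHouse source:

#include <cstdint>
#include <cstring>
#include <vector>

/// Method byte values are assumptions modeled on the LZ4/ZSTD/NONE scheme
/// discussed in this PR; check CompressionMethodByte in the real source.
enum class CompressionMethodByte : uint8_t { LZ4 = 0x82, ZSTD = 0x90, NONE = 0x02 };

std::vector<char> compressNone(const char * data, uint32_t size)
{
    constexpr uint32_t header_size = 1 + 2 * sizeof(uint32_t);  /// method byte + 2 sizes
    std::vector<char> buf(header_size + size);

    buf[0] = static_cast<char>(CompressionMethodByte::NONE);
    uint32_t size_compressed = header_size + size;  /// assumed to include the header
    uint32_t size_decompressed = size;
    std::memcpy(&buf[1], &size_compressed, sizeof(size_compressed));      /// bytes 1..4
    std::memcpy(&buf[5], &size_decompressed, sizeof(size_decompressed));  /// bytes 5..8

    /// The copy the author wants to avoid: the payload is stored verbatim.
    std::memcpy(&buf[header_size], data, size);
    return buf;
}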

@robot-metrika-test commented Jul 31, 2017

Can one of the admins verify this patch?


if (method == static_cast<UInt8>(CompressionMethodByte::LZ4) || method == static_cast<UInt8>(CompressionMethodByte::ZSTD))
{
    size_compressed = unalignedLoad<UInt32>(&own_compressed_buffer[1]);
    size_decompressed = unalignedLoad<UInt32>(&own_compressed_buffer[5]);
}
else if (method == static_cast<UInt8>(CompressionMethodByte::NONE))

@alexey-milovidov (Member) commented Jul 31, 2017

This does not differ from above.
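Presumably the branches can simply be folded together, since the NONE header stores the sizes at the same offsets; a sketch of the merged condition, extending the quoted snippet (not verified against the final commit):

if (method == static_cast<UInt8>(CompressionMethodByte::LZ4)
    || method == static_cast<UInt8>(CompressionMethodByte::ZSTD)
    || method == static_cast<UInt8>(CompressionMethodByte::NONE))
{
    size_compressed = unalignedLoad<UInt32>(&own_compressed_buffer[1]);
    size_decompressed = unalignedLoad<UInt32>(&own_compressed_buffer[5]);
}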

@@ -1030,8 +1030,11 @@ MergeTreeData::AlterDataPartTransactionPtr MergeTreeData::alterDataPart(
        *this, part, DEFAULT_MERGE_BLOCK_SIZE, 0, 0, expression->getRequiredColumns(), ranges,
        false, nullptr, "", false, 0, DBMS_DEFAULT_BUFFER_SIZE, false);

    auto compression_method = this->context.chooseCompressionMethod(
        this->getTotalActiveSizeInBytes(),

@alexey-milovidov (Member) commented Jul 31, 2017

Looks incorrect. We need the ratio of the data part size to the table size.

@prog8 (Author, Contributor) commented Aug 1, 2017

Does that mean I should take this->getTotalActiveSizeInBytes() and divide it by the total compressed size? I'm asking because I'm not fully aware of which methods stand for which values, and I'm not very familiar with the ClickHouse code.

@@ -146,8 +146,12 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(BlockWithDa
        ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterBlocksAlreadySorted);
    }

    auto compression_method = data.context.chooseCompressionMethod(
        data.getTotalActiveSizeInBytes(),
        static_cast<double>(data.getTotalCompressedSize()) / data.getTotalActiveSizeInBytes());

A later revision of the same hunk inverts the ratio:
@@ -146,8 +146,12 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(BlockWithDa
        ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterBlocksAlreadySorted);
    }

    auto compression_method = data.context.chooseCompressionMethod(
        data.getTotalActiveSizeInBytes(),
        static_cast<double>(data.getTotalActiveSizeInBytes()) / data.getTotalCompressedSize());

@alexey-milovidov (Member) commented Aug 1, 2017

This is a little tricky. You need to pass the data part size, and the ratio of the data part size to the whole table size, as arguments to chooseCompressionMethod. But the data part size is not known in advance, before we write it.

This logic is intended for cases when existing data gets recompressed (as in merges and alters), where we use the data part size to decide whether it needs to be recompressed. When writing a new part, it is meaningless.

I think you need to pass two zeros as arguments. A compression method will be selected only if both min_part_size and min_part_size_ratio are zero or were not specified in the configuration for the corresponding compression method.

You will definitely specify min_part_size and min_part_size_ratio as zeros when choosing the none compression method, because otherwise lz4 will be used.
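For reference, a sketch of what such a configuration could look like in the server's config.xml; the <compression> section layout follows the usual ClickHouse scheme, but the none method value is exactly what this PR introduces, so treat the spelling as an assumption:

<compression>
    <case>
        <min_part_size>0</min_part_size>
        <min_part_size_ratio>0.0</min_part_size_ratio>
        <method>none</method>
    </case>
</compression>

With both thresholds at zero the case always matches, so a chooseCompressionMethod(0, 0) call on the write path would pick none.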

@prog8 (Author, Contributor) commented Aug 1, 2017

Honestly, my local version of the code started out with a single compression method (both min_part_size and min_part_size_ratio set to 0 in the config). But when preparing the pull request, I thought that if it were to be accepted it should respect the existing configuration of the "compression" tag, so I assumed I had to use chooseCompressionMethod properly. If you are OK with chooseCompressionMethod(0, 0), I can add it.

@alexey-milovidov (Member) left a comment

.

@alexey-milovidov merged commit ae8783a into yandex:master on Aug 1, 2017

@alexey-milovidov (Member) commented Aug 1, 2017

Ok. I have added the remaining changes. Thank you!

About copy avoidance: this is definitely worth doing.
For example, look at CompressedReadBufferFromFile::nextImpl.
This method prepares a buffer for decompressed data (memory), sets working_buffer to point to it, and then decompresses into working_buffer.

If you want to avoid the excessive copy, you should just point working_buffer at the range inside the "compressed" data (all the data without the header).
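A rough sketch of that idea for the NONE method; the member and helper names are assumptions modeled on the snippets in this thread, not the actual ClickHouse interfaces:

bool nextImpl()
{
    size_t size_decompressed = 0;
    size_t size_compressed = 0;
    if (!readCompressedBlock(size_decompressed, size_compressed))  /// hypothetical helper
        return false;

    if (own_compressed_buffer[0] == static_cast<char>(CompressionMethodByte::NONE))
    {
        /// Zero-copy: expose the payload of the block already sitting in
        /// own_compressed_buffer, skipping the 9-byte header. No memcpy.
        char * payload = &own_compressed_buffer[9];
        working_buffer = Buffer(payload, payload + size_decompressed);
    }
    else
    {
        /// Existing path: decompress into a separately allocated buffer.
        memory.resize(size_decompressed);
        working_buffer = Buffer(memory.data(), memory.data() + size_decompressed);
        decompress(working_buffer.begin(), size_decompressed, size_compressed);
    }
    return true;
}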

@alexey-milovidov (Member) commented Aug 1, 2017

Also, it would be nice if you could share performance testing results.
Both total numbers (query execution speed) and perf listings are interesting!

@prog8 (Author, Contributor) commented Aug 1, 2017

Yeah, I can do a copy-free version, but only for reading; for writes there will still be a memcpy because of the hash function (checksum).
I don't think I will use the no-compression version in production, because it turns out I would waste too much disk space, and I cannot afford to speed up queries at the cost of storage usage. I think it is better to invest in more CPU cores and get the same query speed.
Since I had started playing with this, I decided to push even a small change instead of abandoning it, so maybe someone will make use of it.
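The write-side constraint mentioned above: the block checksum is computed over the header and payload together, so they must be contiguous in one buffer before hashing, which forces the copy. A sketch, assuming a CityHash128-style checksum over the whole block (as ClickHouse uses), with out as a hypothetical output stream and compressed_buffer as built earlier:

/// Checksum covers the whole block (header + payload), so even for NONE the
/// payload must first be copied next to the header before hashing.
auto checksum = CityHash_v1_0_2::CityHash128(compressed_buffer.data(), compressed_buffer.size());
out.write(reinterpret_cast<const char *>(&checksum), sizeof(checksum));
out.write(compressed_buffer.data(), compressed_buffer.size());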

@alexey-milovidov (Member) commented Aug 1, 2017

Ok.

@prog8 deleted the prog8:nocompression branch on Aug 2, 2017

@prog8 (Author, Contributor) commented Aug 2, 2017

@alexey-milovidov I just remembered why I didn't drop the memcpy for decompression. It is not a low-hanging-fruit task, because CompressedReadBuffer allocates memory, so I would have to refactor CompressedReadBuffer. In addition, I would have to change CompressedReadBufferBase::decompress to accept a pointer to a pointer instead of a char *. It is all doable, but I didn't feel comfortable making bigger changes.
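To make the described change concrete, the signature would go from the first form to the second (hypothetical, simplified; parameter names follow the snippets above), so that the NONE path can retarget the destination instead of copying:

/// Before: the caller owns the destination buffer; decompress always writes into it.
void decompress(char * to, size_t size_decompressed, size_t size_compressed_without_checksum);

/// After: for NONE, decompress could set *to to point into the already-read
/// compressed buffer (the payload after the header) instead of copying.
void decompress(char ** to, size_t size_decompressed, size_t size_compressed_without_checksum);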

@alexey-milovidov (Member) commented Aug 3, 2017

Ok. If you are not going to use this compression method, it's not worth implementing.
