
Added compression NONE #1045

Merged: 11 commits into ClickHouse:master from the nocompression branch, Aug 1, 2017

Conversation

@prog8 (Contributor) commented Jul 31, 2017

At the moment, no-compression is implemented in the most naive way: we use memcpy to copy an uncompressed block into the compressed buffer. This does not differ from compression, except that we don't apply any compression algorithm. Ideally we could avoid the memcpy, but this will probably require deeper refactoring.
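
A minimal sketch of the naive path described here (illustrative only: the 1-byte method tag plus two UInt32 sizes header layout is taken from the LZ4/ZSTD reading code quoted later in this thread; names and the NONE method byte are assumptions, not the actual patch):

#include <cstdint>
#include <cstring>

static const uint8_t METHOD_NONE = 0x02;  /// assumed value of CompressionMethodByte::NONE
static const uint32_t HEADER_SIZE = 9;    /// 1 (method) + 4 (compressed size) + 4 (decompressed size)

size_t compressNone(const char * source, uint32_t source_size, char * dest)
{
    uint32_t size_compressed = HEADER_SIZE + source_size;
    dest[0] = static_cast<char>(METHOD_NONE);
    memcpy(&dest[1], &size_compressed, sizeof(size_compressed));  /// compressed size, incl. header
    memcpy(&dest[5], &source_size, sizeof(source_size));          /// decompressed size
    memcpy(&dest[HEADER_SIZE], source, source_size);              /// the copy the author wants to avoid
    return size_compressed;
}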

@robot-metrika-test commented

Can one of the admins verify this patch?


if (method == static_cast<UInt8>(CompressionMethodByte::LZ4) || method == static_cast<UInt8>(CompressionMethodByte::ZSTD))
{
    size_compressed = unalignedLoad<UInt32>(&own_compressed_buffer[1]);
    size_decompressed = unalignedLoad<UInt32>(&own_compressed_buffer[5]);
}
else if (method == static_cast<UInt8>(CompressionMethodByte::NONE))

Member commented:

This does not differ from above.
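
In other words, since the NONE branch reads the header exactly like LZ4/ZSTD, the cases could share one load; a sketch of what merging the identical branches might look like (not from the actual patch):

if (method == static_cast<UInt8>(CompressionMethodByte::LZ4)
    || method == static_cast<UInt8>(CompressionMethodByte::ZSTD)
    || method == static_cast<UInt8>(CompressionMethodByte::NONE))
{
    size_compressed = unalignedLoad<UInt32>(&own_compressed_buffer[1]);
    size_decompressed = unalignedLoad<UInt32>(&own_compressed_buffer[5]);
}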

@@ -1030,8 +1030,11 @@ MergeTreeData::AlterDataPartTransactionPtr MergeTreeData::alterDataPart(
        *this, part, DEFAULT_MERGE_BLOCK_SIZE, 0, 0, expression->getRequiredColumns(), ranges,
        false, nullptr, "", false, 0, DBMS_DEFAULT_BUFFER_SIZE, false);

    auto compression_method = this->context.chooseCompressionMethod(
        this->getTotalActiveSizeInBytes(),

Member commented:

This looks incorrect. We need the ratio of the data part size to the total table size.

Contributor Author commented:

Does that mean I should take this->getTotalActiveSizeInBytes() and divide it by the total compressed size? I'm asking because I'm not fully aware of which methods stand for which values, and I'm not very familiar with the ClickHouse code.

@@ -146,8 +146,12 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(BlockWithDa
        ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterBlocksAlreadySorted);
    }

    auto compression_method = data.context.chooseCompressionMethod(
        data.getTotalActiveSizeInBytes(),
        static_cast<double>(data.getTotalCompressedSize()) / data.getTotalActiveSizeInBytes());

Member commented:

Same.

@@ -146,8 +146,12 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(BlockWithDa
        ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterBlocksAlreadySorted);
    }

    auto compression_method = data.context.chooseCompressionMethod(
        data.getTotalActiveSizeInBytes(),
        static_cast<double>(data.getTotalActiveSizeInBytes()) / data.getTotalCompressedSize());

Member commented:

This is a little tricky. You need to provide the data part size and the ratio of the data part size to the whole table size as arguments of chooseCompressionMethod. But the data part size is not known in advance, before we write it.

This logic is intended for cases when existing data gets recompressed (as in merges and alters), and we take the data part size to decide whether we need to recompress it.
But when writing a new part, it is meaningless.

I think you need to pass two zeros as arguments. A compression method will be selected only if both min_part_size and min_part_size_ratio are zero or were not specified in the configuration for the corresponding compression method.

You will definitely have to specify min_part_size and min_part_size_ratio as zeros for the none compression method to be chosen, because otherwise lz4 will be used.
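
A rough sketch of the selection logic being described (hypothetical names; the real chooseCompressionMethod lives elsewhere and reads the compression cases from the server configuration):

#include <string>
#include <vector>

struct CompressionCase
{
    std::size_t min_part_size = 0;    /// min_part_size from the config
    double min_part_size_ratio = 0;   /// min_part_size_ratio from the config
    std::string method;               /// "lz4", "zstd" or "none"
};

std::string chooseCompressionMethod(std::size_t part_size, double part_size_ratio,
                                    const std::vector<CompressionCase> & cases)
{
    std::string chosen = "lz4";       /// default when no case matches
    for (const auto & c : cases)
        if (part_size >= c.min_part_size && part_size_ratio >= c.min_part_size_ratio)
            chosen = c.method;
    return chosen;
}

Called with (0, 0), only a case whose thresholds are both zero can match, which is why none has to be configured with zero min_part_size and min_part_size_ratio.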

Contributor Author commented:

Honestly, my local version of the code started out with a single compression method (both min_part_size and min_part_size_ratio set to 0 in the config), but when preparing the pull request I thought that, for it to be accepted, it should respect the existing configuration in the "compression" tag, so I used chooseCompressionMethod as intended. If you are OK with chooseCompressionMethod(0, 0), I can add it.

@alexey-milovidov (Member) left a comment

.

@alexey-milovidov merged commit ae8783a into ClickHouse:master on Aug 1, 2017
@prog8 (Contributor, Author) commented Aug 1, 2017

Thanks @alexey-milovidov

@alexey-milovidov (Member) commented

Ok. I have added the remaining changes. Thank you!

About copy avoidance: this is definitely worth doing.
For example, look at CompressedReadBufferFromFile::nextImpl.
This method prepares a buffer for the decompressed data (memory), sets working_buffer to point to it, and then decompresses into working_buffer.

If you want to avoid the excessive copy, you should just point working_buffer to the range inside the "compressed" data (all data without the header).
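
A hedged sketch of that idea for the NONE method inside nextImpl (buffer and constant names are assumed from the description above, not copied from the actual code):

if (method == static_cast<UInt8>(CompressionMethodByte::NONE))
{
    /// The payload after the 9-byte header is already the raw data, so point
    /// working_buffer at it instead of decompressing/memcpy-ing into `memory`.
    working_buffer = Buffer(
        compressed_buffer + header_size,
        compressed_buffer + size_compressed);
}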

@alexey-milovidov (Member) commented

Also, it would be nice if you could share performance testing results.
Both the total numbers (query execution speed) and perf listings are interesting!

@prog8 (Contributor, Author) commented Aug 1, 2017

Yeah, I can do a copy-free version, but only for reading; for writes there will still be a memcpy because of the hash function (checksum).
I don't think I will use the no-compression version in production, because it turns out I would waste too much disk space; I cannot afford to speed up queries at the cost of storage usage. I think it is better to invest in more CPU cores and get the same results in terms of query speed.
Since I had started playing with this, I decided to push even a small change instead of abandoning it, so maybe someone will make use of it.

@alexey-milovidov (Member) commented

Ok.

@prog8 deleted the nocompression branch on August 2, 2017 09:12
@prog8 (Contributor, Author) commented Aug 2, 2017

@alexey-milovidov I just remembered why I didn't drop the memcpy for decompression. It is not a low-hanging-fruit task, because CompressedReadBuffer allocates the memory, so I'd have to refactor CompressedReadBuffer. In addition, I would have to change CompressedReadBufferBase::decompress to accept a pointer to a pointer instead of a char *. It is all doable, but I didn't feel comfortable making bigger changes.
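
For reference, an illustrative sketch of the signature change being described (hypothetical, not the actual ClickHouse interface): decompress would receive a pointer to the destination pointer, so the NONE method could redirect it instead of filling a caller-allocated buffer.

/// Hypothetical: `char ** to` lets the NONE method redirect the output without a copy.
void decompress(char ** to, size_t size_decompressed)
{
    if (method == static_cast<UInt8>(CompressionMethodByte::NONE))
        *to = compressed_buffer + header_size;  /// point at the raw payload, no memcpy
    else
        decompressImpl(compressed_buffer + header_size, *to, size_decompressed);  /// LZ4/ZSTD path (decompressImpl is a stand-in name)
}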

@alexey-milovidov (Member) commented

Ok. If you are not going to use this compression method, it's not worth implementing.
