Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: Use faster hashing algo for cache key generation #3508

Merged
merged 2 commits into from Mar 27, 2023

Conversation

keulinho
Copy link
Contributor

The serialized array can become quite big and md5 slows down the bigger the input data.

On our test run this saved ~2s: https://blackfire.io/profiles/compare/0abc026f-4855-4bb0-ac84-c4a2eef825e0/graph
image

I guess now we are at the limit of the "quick fixes" in terms of performance improvements without actually touching the architecture of the whole system

@@ -452,7 +452,7 @@ private function shouldRemoveUseName(string $last, string $newNameLastName, bool
private function createOldToNewTypes(Node $node, array $oldToNewClasses): array
{
$oldToNewClasses = $this->resolveOldToNewClassCallbacks($node, $oldToNewClasses);
$cacheKey = md5(serialize($oldToNewClasses));
$cacheKey = \hash('crc32c', \serialize($oldToNewClasses));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the length seems got shorter, ref https://3v4l.org/e6hLl , I think current md5 is just fine.

In other area, we use sha1() and sha1_file() for config key and value and I think that's just fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep the key length got shorter, from 128 bit to 32 bit, but honestly i don't think that 128 bit keys are needed here to prevent collision.

From what i've seen there are not so many different $oldToNewClasses permutations, the chance of collisions in that case are pretty low, refer to https://preshing.com/20110504/hash-collision-probabilities/

And crc32c hashing is a lot faster then md5, the reason i choose that algo was from the performance comparison done here: https://php.watch/articles/php-hash-benchmark

Why do i think we should change this usage of md5? Because it showed up in the traces that while md5 was called overall ~300.000, but the ~55.000 calls from this place accounted for 96% of the cpu time in this method. The reason is that probably for the other calls to md5 the data input was a lot shorter and thus it had basically no effect on the performance, but here we feed it the serialized data of an array that could be quite long

This was in the blackfire trace before this change:
image

After this change neither the md5 calls, nor the hash calls appear in the trace because they were not significant enough in terms of performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I look into it, crc seems not cryptographycally, so I think it is better to use common just works, less performance is fine if we got more confident with data integrity.

Copy link
Contributor

@staabm staabm Mar 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Md5 isn't cryptographic secure either (and sha1 too).

I don't think its relevant for the use-case here though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

md5 is cryptographically, while not secure for password, it is fine for cache use case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched back to a 128bit hash, as i think the point about the collisions is fair enough.

Now it uses the xxh128 algorithm if it is available (as it was only introduced in php 8.1), but i think it matches the needs perfectly:

xxHash is an extremely fast hashing algorithm that is not designed for cryptographic purposes, but provides excellent randomness and dispersion of output, and uniqueness of to minimize collisions.

see: https://php.watch/versions/8.1/xxHash

And i totally agree with @staabm as in this case cryptographically secure hashing is not needed, what we need is fast hashing with a low chance of collision

So for systems running on php 8.1 they see the same performance gains as with the original change, system running on older php versions still see a slight improvement by switching from md5 to the slightly (nearly 2x) faster md4

@TomasVotruba TomasVotruba requested review from staabm and samsonasik and removed request for staabm March 27, 2023 07:29
Copy link
Member

@samsonasik samsonasik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @keulinho , let's give it a try

@samsonasik samsonasik merged commit 0f0b936 into rectorphp:main Mar 27, 2023
36 checks passed
@keulinho keulinho deleted the use-faster-hashing-algo branch March 27, 2023 08:24
samsonasik pushed a commit that referenced this pull request May 8, 2023
* Use faster hashing algo for cache key generation

* Use 128 bit hashing algo again
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants