Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvement #42

Merged
merged 36 commits into from
Aug 1, 2018
Merged

Performance improvement #42

merged 36 commits into from
Aug 1, 2018

Conversation

malloxpb
Copy link
Member

@malloxpb malloxpb commented Jul 26, 2018

This PR aims to improve the performance of SCURL

@codecov
Copy link

codecov bot commented Jul 26, 2018

Codecov Report

Merging #42 into master will decrease coverage by 0.21%.
The diff coverage is 61.86%.

@@            Coverage Diff             @@
##           master      #42      +/-   ##
==========================================
- Coverage   61.58%   61.37%   -0.22%     
==========================================
  Files           2        2              
  Lines         315      321       +6     
==========================================
+ Hits          194      197       +3     
- Misses        121      124       +3
Impacted Files Coverage Δ
scurl/canonicalize.pyx 19.46% <12.5%> (-1.96%) ⬇️
scurl/cgurl.pyx 84.13% <74.46%> (+4.41%) ⬆️

@malloxpb
Copy link
Member Author

Hey @lopuhin , in this PR, I have successfully increased the performance of urljoin and canonicalize_url by doing the folowing:

  • Took care of the double parsing in urljoin
  • Removed unnecessary to_native_str call in canonicalize_url since we know the types of variables
  • made calls to our function canonicalize_component from chromium instead of calling quote from urllib.parse.

These have brought significant increase to the performance of canonicalize_url as well as urljoin. In particular, the rate of link extraction from canonicalize_url increased from 31k links/sec to 44k links/sec. But let me know if there's anything that concerns you :)

@malloxpb malloxpb requested a review from lopuhin July 30, 2018 22:07
@malloxpb
Copy link
Member Author

For some reason the cgurl.cpp is conflicted with the one from master branch...

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nctl144 the changes look good to me, nice job 👍
But the build failed, can you have a look?

@@ -13,7 +13,7 @@ def main():
time = 0
time_canonicalize_url = 0

tar = tarfile.open("sites.tar.gz")
tar = tarfile.open("benchmarks/sites.tar.gz")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely unrelated, but do you recall why we don't we use the same chromium urls here as well?

@@ -85,7 +76,7 @@ cdef bytes unicode_handling(str):
bytes_str = <bytes>str
return bytes_str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how this works: we have cdef bytes bytes_str above and return it from the function, but return type of the function is char * - is this correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, looks like Cython indeed does this conversion automatically.

@malloxpb
Copy link
Member Author

Hey @lopuhin , I have cleaned up the code even further by moving all the canonicalize code to canonocalize.pyx :) I think this PR is ready!

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @nctl144 the changes look good but the build is failing (for a different reason), can you have a look at it?

@malloxpb
Copy link
Member Author

malloxpb commented Aug 1, 2018

Hey @lopuhin , yeah sorry about that... I did not notice yesterday 😄 I just fixed it and the build is green now

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nctl144 , looks good! 👍

@malloxpb malloxpb merged commit 86f6623 into master Aug 1, 2018
@malloxpb malloxpb deleted the performance branch August 1, 2018 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants