Initial functionality. #4
Changes from 14 commits
I wonder if we can estimate the memory usage of this approach somehow. If it's a lot, it may even make sense to have a separate code path when we know the fingerprints are not going to be updated (e.g. learning is not enabled, or learning is finished).
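For a first estimate, here is a back-of-the-envelope sketch, assuming the fingerprints live in an in-memory Python set (an assumption about the implementation; the function name and sample data are hypothetical):

```python
import sys

def estimate_set_ram(items):
    """Rough lower-bound RAM estimate for a set of strings.

    sys.getsizeof() reports shallow sizes only, so this sums the set's
    own hash-table overhead plus each string's size; real usage can be
    higher (allocator overhead) or lower (interning, shared objects).
    """
    s = set(items)
    return sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)

# Hypothetical fingerprint-like strings: 10k 40-char hex digests.
fps = (f"{i:040x}" for i in range(10_000))
print(f"~{estimate_set_ram(fps) / (1024 * 1024):.1f} MiB for 10k fingerprints")
```

Running this for the expected number of requests per job would give a ballpark figure to compare against the SC unit limits.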
My main worry is that with this implementation there is a non-zero chance that zyte-spider-templates RAM usage may blow up above the SC free unit limits (or the 1 SC unit limit) in reasonably common cases.
This skips requests without the meta key, but where will we use that key? Probably for all normal requests?
As for memory usage, I wanted to say it's comparable to that of the fingerprinter, but then I realized that URLs are often longer than fingerprints.
By the way, I still think it shouldn't :) There was a thread in the original proposal about this. cc @BurnzZ
One way we can save on RAM is to store this on disk, e.g. using https://docs.python.org/3/library/shelve.html. Although shelve uses a dict-like interface, we can simply use the keys for uniqueness and leave the values empty.
As a side note, there's a caveat to using shelve's .get() method: it runs in O(n) (I learned this the hard way). A plain membership check is faster.
Having an opt-in approach is indeed tedious for the user to set up, but it allows us to narrow down which types of URLs get stored here, and thus reduce the storage needed.
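A minimal sketch of the on-disk idea above: shelve keys carry the uniqueness information, values stay empty, and lookups use a membership check rather than .get(). The file path and URL here are just for illustration:

```python
import os
import shelve
import tempfile

# Hypothetical on-disk "seen URLs" store: only the keys matter,
# so RAM holds just the shelve cache, not the full URL set.
path = os.path.join(tempfile.mkdtemp(), "seen_urls")

with shelve.open(path) as seen:
    url = "https://example.com/product/1"  # example URL
    is_duplicate = url in seen             # membership check, not .get()
    if not is_duplicate:
        seen[url] = ""                     # empty value; key = uniqueness
    print(is_duplicate)                    # False on the first visit
```

The same pattern drops into a dupefilter-style `request_seen` check; whether the disk round-trip is acceptable per request is exactly the RAM-vs-speed trade-off discussed below.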
What do you think about having an approach similar to scrapy-zyte-api's TRANSPARENT_MODE, while allowing "zyte_api" to be set in the meta? Users can use a setting that turns everything on, but for zyte-spider-templates we can manually set DUD via the meta for optimal usage.
I'd prefer a solution where the overhead is minimal and it's opt-out :) It seems this is achievable: URL matching looks optimized enough, and I'm optimistic RAM usage can also be optimized.
Disk storage would trade speed for RAM; it may not be a good trade here, but I'm not sure.
For URL storage there are also data structures like tries, which can save a lot of memory.
But at first sight, it looks like if we can assume the fingerprints don't change, and have a separate code path for the case where they can change, it can all be pretty optimal. In that case we might even consider sharing the storage with the dupefilter, but that's a separate idea.
When learning is enabled, optimization looks significantly harder. So maybe we can make learning opt-in (perhaps per-request), rather than the whole feature?
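On the trie idea: a toy character-level trie illustrates the prefix sharing (scheme, host, and common path segments are stored once). This is only a sketch of the concept, not a proposed implementation; the class and URLs are made up:

```python
class UrlTrie:
    """Toy character-level trie; shared URL prefixes are stored once."""

    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}     # char -> UrlTrie
        self.terminal = False  # True if a stored URL ends here

    def add(self, url):
        node = self
        for ch in url:
            node = node.children.setdefault(ch, UrlTrie())
        node.terminal = True

    def __contains__(self, url):
        node = self
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

trie = UrlTrie()
trie.add("https://example.com/p/1")
trie.add("https://example.com/p/2")       # shares all but the last char
print("https://example.com/p/1" in trie)  # True
print("https://example.com/p/" in trie)   # False: only a prefix
```

Caveat: a naive per-node dict like this is actually heavier than a plain set in CPython; the memory win comes from compressed tries (radix trees) or succinct structures, e.g. a C-backed trie library.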
By the way, I don't think we must have a final optimized implementation in this PR for it to be merged.
Addressing this (i.e. evaluating how big the issue is, and making the necessary optimizations) is a blocker for using DUD in zyte-spider-templates by default.
Also, it seems we need to understand whether an optimized version is possible in order to decide between enabling the component by default for all requests (plus a way to opt out per request?) and having it opt-in per request.