Crawls duplicate/identical URLs #83
«If there are duplicate URLs on the page they will be crawled and exported as many times as you see the link.» Scrapy already takes care of duplicate URLs by default. Maybe you don't have a properly configured installation of Scrapy (which, by default, avoids visiting duplicate URLs).
If I don't have a properly configured installation of Scrapy, then the default must not take care of duplicate URLs, since I'm using the Vagrant solution. I assure you, the results are duplicated in the export...
Maybe you are confusing duplicate URLs with duplicate items? Slybot can still generate duplicate data, but from different URLs, because URL deduplication is already handled by Scrapy. In fact, if there were no such filter, the crawl would be infinite in most cases, as there are circular links between pages. Duplication of data is avoided by correctly defining the item type (in particular, the Vary flag of each field).
It appears that for each time the URL appears on the index page, an entry appears in the export. For example, if the URL appears 5 times on the page, then it will appear 5 times in the export. The URLs in the hrefs of the anchors are identical, so I can't understand why this is occurring. Where do I find the documentation on the Vary flag? Thanks.
https://github.com/scrapinghub/portia/blob/master/slybot/docs/project.rst#field I cannot help you understand what is happening in your particular case without more data than you are providing. But I can assure you that Scrapy does not visit the same URL multiple times unless it is instructed to do so. Moreover, as I said before, if that were really happening, your crawl would be infinite.
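The point about Scrapy's default URL filtering can be sketched in a few lines. This is only an illustrative mock of the idea behind Scrapy's default duplicate-request filter, not Scrapy's real API; the class name `SketchDupeFilter` and the bare-URL "fingerprint" are simplifications (the real filter fingerprints the method, URL, and body of each request):

```python
# Illustrative sketch of how Scrapy's default dupe filter avoids revisiting
# URLs: it keeps a set of fingerprints of requests already seen and drops
# any request whose fingerprint is already in the set.
class SketchDupeFilter:
    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        """Return True if this URL was already scheduled (and drop it)."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False

f = SketchDupeFilter()
assert f.request_seen("http://example.com/post?id=123") is False  # first visit: crawled
assert f.request_seen("http://example.com/post?id=123") is True   # repeat: filtered out
```

Without this filtering, any two pages that link to each other would keep the crawl going forever, which is the "infinite crawl" point made above.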
I located a description on the scrapinghub.com website here:
However, this does not describe the problem I'm seeing. It isn't that each entry is a slight variation; each duplicate entry is completely identical. The problem is almost the opposite: I need to tell it to look for unique identifiers, to say that these pages are identical if they contain a certain string.
Yes, that doc was written by me. But as I said before, I cannot help without a fuller description of your problem, the log, or a way to reproduce it: which page you are annotating, how you are annotating it, and how you defined the fields. What I can tell you is that Scrapy and slybot already have tools for avoiding duplicates. Scrapy dedupes URLs by default (otherwise you would get infinite loops), and slybot dedupes items when the Vary flag is used correctly (if you have all the fields flagged as Vary, then no item will be deduped).
Are you using the URLs of the pages that you want to extract as the start URLs of your spider? How many duplicates do you get for each item?
There isn't a uniform hierarchy, if that's what you mean, no. The index page looks like this: While the target page looks like this: I'm assuming from your question that this would be problematic?
Looking into this a bit closer, I see what is happening. Although the URL is repeated 5 times on the index page, those links are not the problem. The problem is other links that, when clicked, redirect to the same page. Although I was perhaps a bit hasty in reporting an issue, it has been useful to learn of the "Vary" option. Each page has a meta tag containing a unique ID for the page, regardless of URL; alternatively, there's a canonical link. I assume the solution here is to annotate that value to a field and tick Vary? I have tried that, yet it still seems to export duplicates... Looking closer, I had not set the unique ID as required, and the duplicated entries did not contain the unique ID. It seems it's unable to parse that field from the page, which is why the export is duplicating. Instead I tried the canonical link, ticking it as required and Vary, yet I'm still seeing duplicates in the export...
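For reference, the two page identifiers mentioned above (the unique-ID meta tag and the canonical link) could be pulled out of a page with something like the following sketch. The tag names, attribute order, and the `post-id` meta name are assumptions about a hypothetical target site, not the actual pages in this issue:

```python
import re

# Sketch: extract the canonical URL and a hypothetical unique-ID meta tag,
# the two candidates discussed above for identifying identical pages that
# are reachable through different URLs.
html = """
<head>
  <link rel="canonical" href="http://www.example.com/?post=123"/>
  <meta name="post-id" content="123"/>
</head>
"""

canonical = re.search(r'<link rel="canonical" href="([^"]+)"', html)
post_id = re.search(r'<meta name="post-id" content="([^"]+)"', html)
print(canonical.group(1))  # http://www.example.com/?post=123
print(post_id.group(1))    # 123
```

In a real spider you would use a proper HTML parser or XPath/CSS selectors rather than regexes; the snippet only shows which values are being talked about.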
Hi James, I think you are misinterpreting how the Vary flag works. You have to set the Vary flag only on fields that may vary while the record is still the same item, so things like unique IDs must never be Vary. Two items are considered the same (and thus filtered by the dupes pipeline) if all their non-Vary fields are equal.
On the contrary, the url field, for example, must always be Vary; otherwise the same item at different URLs will be considered distinct.
The correct approach is to start with only url as Vary, and then add the Vary flag to other fields only when you really need it, in order to remove duplicates when the same item may appear with different values in a given field.
I can't understand that. For example, let's assume this is WordPress and there are two posts: The URLs would be: The postID in the meta would be 123 or 234. The URLs may be: I can't understand why the canonical URL may Vary but the postID never would...
Not the canonical URL; I mean the URL from which the post was extracted (which is the 'url' field). If the url field is not flagged as Vary, then the data from www.example.com/?post=123&source=twitter and the data from www.example.com/?post=123 would be considered different, even though you are extracting the same content. If you have a means to extract a canonical URL from somewhere in the page, then it identifies the post uniquely, so you must not mark it as Vary. Note that the url field is added automatically by the slybot spider. However, now that I look, Portia does not define it as Vary by default; I think that is the source of the confusion. A quick workaround is to add a url field to the item specification and mark it as Vary. But that should be properly handled by slybot and Portia; I will raise it for discussion.
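Based on the field spec linked earlier (project.rst#field), the suggested workaround might look roughly like the fragment below in the project's item specification. This is a guess at the shape, not a verified example: the item name `post` is hypothetical, and the exact attribute names should be checked against the linked doc.

```json
{
  "post": {
    "fields": {
      "url": {
        "type": "url",
        "required": false,
        "vary": true
      }
    }
  }
}
```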
Hi Martin, thanks for taking your time on this. I'm looking, but I cannot see a way to select the "url" field in order to tick 'Vary'. That's why I used the canonical link element instead.
I've discovered the troublesome field and set that to Vary to avoid duplication. I had overlooked that this field was changing but reviewing the log made it very clear. Think I've mastered it now... |
Ok, you're welcome. If the issue is no longer valid, please close it :)
If there are duplicate URLs on the page they will be crawled and exported as many times as you see the link.
It would only be in very unusual circumstances that you would need to crawl the same URL more than once.
I have two proposals: