Crawls duplicate/identical URLs #83

Closed · jpswade opened this issue Jul 29, 2014 · 18 comments
@jpswade

jpswade commented Jul 29, 2014

If there are duplicate URLs on the page they will be crawled and exported as many times as you see the link.

It is only in very unusual circumstances that you would need to crawl the same URL more than once.

I have two proposals:

  1. Have a checkbox so you can tick "Avoid visiting duplicate links".
  2. Alternatively, add filtering options in the link crawler to filter by HTML markup too, so that only links with certain classes are followed.
@kalessin
Member

«If there are duplicate URLs on the page they will be crawled and exported as many times as you see the link.»

Scrapy already takes care of duplicated URLs by default. Maybe you don't have a properly configured installation of Scrapy (which, by default, avoids visiting duplicated URLs).
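
For context, the request-level deduplication being referred to here can be sketched as follows. This is a minimal sketch using the current Scrapy API (`response.follow` and `dont_filter` are real Scrapy features); the spider name and URLs are hypothetical:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """Sketch of Scrapy's default behaviour: the scheduler's duplicate
    filter (RFPDupeFilter) drops any request whose fingerprint it has
    already seen, so a URL linked five times is downloaded only once."""

    name = "example"
    start_urls = ["http://www.example.com/"]  # hypothetical start page

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # Duplicate hrefs all yield requests here, but the scheduler
            # downloads only the first request for each unique URL.
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url}
        # Revisiting a URL on purpose requires opting out per request:
        # yield scrapy.Request(response.url, dont_filter=True)
```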

@jpswade
Author

jpswade commented Jul 30, 2014

If I don't have a properly configured installation of Scrapy, it's because the default doesn't take care of duplicate URLs, as I'm using the Vagrant setup.

I assure you, the results are duplicated in the export...

@kalessin
Member

Maybe you are confusing duplicated URLs with duplicated items? Slybot can still generate duplicate data, but from different URLs, because URL deduplication is already handled by Scrapy. In fact, if there were no such filter, the crawl would be infinite in most cases, as there is circular linkage between pages.

Duplication of data is avoided by correctly defining the item type (in particular, the Vary flag of each field).

@jpswade
Author

jpswade commented Jul 30, 2014

It appears that for each time the URL appears on the index page an entry appears in the export.

For example, if the URL appears 5 times on the page, then it will appear 5 times in the export.

The URLs in the href attributes of the anchors are identical, so I can't understand why this is occurring.

Where do I find the documentation on the Vary flag?

Thanks.

@kalessin
Member

https://github.com/scrapinghub/portia/blob/master/slybot/docs/project.rst#field

I cannot work out what is happening in your particular case without more data than you are providing. But I can assure you that Scrapy does not visit the same URL multiple times unless it is instructed to do so. What's more, as I said before, if that were really happening, your crawl would be infinite.

@jpswade
Author

jpswade commented Jul 30, 2014

I located a description on the scrapinghub.com website here:

Vary
Autoscraping has a duplicate item detection system which will reject any item that has already been scraped. In order to accomplish this task the duplicates detector needs to know which fields must be compared in order to effectively find duplicate items. If a field is marked as Vary, it is not included in the checks to detect duplicates. This means that two items that have the same data in all fields except those marked as Vary will be considered identical and, therefore, the second scraped item will be dropped.

Or, to put it another way, when you mark a field as Vary you are declaring that the same item may be found with different values in this field. It is for this reason that the url field must always be marked as Vary (and the user interface does not allow it to be unselected): if it weren't a Vary field, then items from different URLs would always be considered different and the duplicates detector would never work.

Let’s illustrate with an example. Suppose we have an item type with fields name, price, description, category and url, while the fields category and url are marked as Vary. Now suppose that the Autoscraping bot has scraped the following item first:

name: Louis XIV Table
price: 1000.00
description: Very high quality Louis XIV style table
category: Tables
url: http://www.furniture.com/tables/louis-xiv-table.html
Then later it extracts the following item in a different place on the site:

name: Louis XIV Table
price: 1000.00
description: Very high quality Louis XIV style table
category: Living Room
url: http://www.furniture.com/living-room/louis-xiv-table.html
It is, of course, the same product, but the specific map of the site allows it to appear in two different places under different product categories. Because url and category are marked as Vary, only name, price and description are checked by the duplicates detector. Since all of these fields have the same value in both items, the second one is considered a duplicate of the first, and so it is rejected.

Note that if url and category were not marked as Vary, then the duplicates detection system would consider the two items different products, and so both would be generated. The term Vary is used to indicate that fields marked in this way may vary in value while still allowing the items to be treated as identical.
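
The rule the quoted doc describes can be illustrated with a Scrapy item pipeline whose comparison key is built from every field *not* marked as Vary. This is only a sketch of the described behaviour, not slybot's actual implementation:

```python
from scrapy.exceptions import DropItem

# Fields marked Vary in the furniture example above; excluded from the check.
VARY_FIELDS = {"url", "category"}


class DuplicatesPipeline:
    """Illustrative sketch, not slybot's actual code: two items are
    duplicates when all of their non-Vary fields are equal."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Only non-Vary fields take part in the duplicate check.
        key = tuple(sorted(
            (field, value) for field, value in item.items()
            if field not in VARY_FIELDS
        ))
        if key in self.seen:
            # Same name, price and description as an earlier item, so it is
            # dropped even though url and category differ.
            raise DropItem("Duplicate item found: %r" % (item,))
        self.seen.add(key)
        return item
```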

However, this does not describe the problem I'm seeing.

It isn't that each entry is a slight variation; the duplicated entries do not vary at all, they are identical.

The problem is almost the opposite: I need a way to tell it to look for a unique identifier, i.e. to treat pages as identical if they contain a certain string.

@kalessin
Member

Yes, that doc was written by me.

But as I said before, I cannot help without a fuller description of your problem, the log, or a way to reproduce it: which page you are annotating, how you are annotating it, and how you defined the fields. What I can tell you is that Scrapy and slybot already have tools for avoiding duplicates. Scrapy dedupes URLs by default (otherwise you would get infinite loops), and slybot dedupes items through correct use of the Vary flag (if you have all the fields flagged as Vary, then no item will ever be treated as a duplicate).

@kalessin
Member

Are you using the URLs of the pages that you want to extract as the start URLs of your spider? How many duplicates do you get for each item?

@jpswade
Author

jpswade commented Jul 30, 2014

There isn't a uniform hierarchy, if that's what you mean, no.

The index page looks like this:
http://www.example.co.uk/services/...-12345/sort/default

While the target page looks like this:
http://www.example.co.uk/classified/advert/2014.../sort/default/.../dealer/12345/...

I'm assuming from your question that this would be problematic?

@jpswade
Author

jpswade commented Jul 31, 2014

Looking into this a bit closer, I see what is happening...

Although I see the URL repeated 5 times on the index page, those links are not the problem.

The problem is other links that, when clicked, redirect to the same page.

Although I was perhaps a bit hasty in reporting an issue, it has been useful to learn of the "Vary" option.

Each page has a meta tag which contains a unique ID for the page, regardless of URL; alternatively, there's a canonical link.

I assume the solution here is to annotate that value to a field and tick Vary?

I have tried that, yet it still seems to export duplicates...

Looking at this more closely, I had not set the unique ID as required, and the entries did not contain the unique ID...

It seems it's unable to parse that field from the page, which is why the export is duplicating.

Instead I thought I would try the canonical link, ticking it as required and Vary, yet I'm still seeing duplicates in the export...
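
For reference, the two identifiers mentioned here can be pulled out with standard Scrapy selectors. This is only a sketch: the meta tag name "page-id" is a hypothetical stand-in for whatever the site actually uses, and `.get()` is the current selector API:

```python
def extract_identifiers(response):
    """Sketch: extract a canonical URL and a per-page unique ID.
    The meta tag name "page-id" is hypothetical."""
    canonical = response.xpath('//link[@rel="canonical"]/@href').get()
    page_id = response.xpath('//meta[@name="page-id"]/@content').get()
    # Either value is None when its selector matches nothing, which is
    # what an entry that "did not contain the unique ID" looks like.
    return canonical, page_id
```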

@kalessin
Member

Hi James,

I think you are misinterpreting how the Vary flag works. You should set the Vary flag only on fields that may vary while still belonging to the same item. So things like unique IDs must never be Vary.

Two items are considered the same (and so filtered by the dupes pipeline) if all of their non-Vary fields are equal.

@kalessin
Member

On the contrary, a field like url must always be Vary; otherwise, the same item found at different URLs will be considered distinct.

@kalessin
Member

The correct approach is to start with only url as Vary, and then add Vary fields only when you really need them, in order to remove duplicates when the same item may appear with different values in a given field.
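
Sketched as a field specification, that advice looks roughly like this. The attribute names (type, required, vary) follow the slybot field documentation linked above, though the exact spec layout may differ:

```python
# Rough sketch of the recommended starting point: only "url" is Vary.
# Attribute names follow the slybot field docs; exact layout may differ.
product_fields = {
    "url":         {"type": "url",  "required": False, "vary": True},
    "name":        {"type": "text", "required": True,  "vary": False},
    "price":       {"type": "text", "required": True,  "vary": False},
    "description": {"type": "text", "required": False, "vary": False},
}
```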

@jpswade
Author

jpswade commented Jul 31, 2014

I can't understand that. For example, let's assume this is WordPress and there are two posts:

The URLs would be:
www.example.com/?post=123
www.example.com/?post=234

The postID in the meta would be 123 or 234.

The URLs may be:
www.example.com/?post=123&source=twitter
but the canonical URL would be
www.example.com/?post=123
and the postID would be
123

I can't understand why the canonical URL may Vary but the postID would never be...

@kalessin
Member

Not the canonical URL. I mean the URL from which the post was extracted (which is the 'url' field).

If the url field is not flagged as Vary, then the data from www.example.com/?post=123&source=twitter and the data from www.example.com/?post=123 would be considered different, even though you are extracting the same content.

If you have a means of extracting a canonical URL from somewhere in the page, then it identifies the post uniquely, so you must not mark it as Vary.

Observe that the url field is added automatically by the slybot spider. However, now that I look, Portia does not define it as Vary by default. I think that is the source of the confusion. A quick solution is to add the field url to the item specification and mark it as Vary. But that should be properly handled by slybot and Portia. I will discuss it.
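
Applied to the WordPress example above, the effect of marking url as Vary can be sketched like this (illustrative code, not Portia's or slybot's):

```python
VARY_FIELDS = frozenset({"url"})


def dedupe_key(item):
    # The comparison key ignores Vary fields, here only "url".
    return tuple(sorted(
        (field, value) for field, value in item.items()
        if field not in VARY_FIELDS
    ))


post_a = {"url": "www.example.com/?post=123&source=twitter", "postID": "123"}
post_b = {"url": "www.example.com/?post=123", "postID": "123"}

# With url as Vary the two extractions collapse into one item; without
# it, both would be exported, matching the duplicates reported above.
assert dedupe_key(post_a) == dedupe_key(post_b)
```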

@jpswade
Author

jpswade commented Jul 31, 2014

Hi Martin,

Thanks for taking your time on this.

I'm looking, but I cannot see a way to select the "url" item in order to tick the 'Vary' field.

That's why I used the canonical link element instead.

@jpswade
Author

jpswade commented Jul 31, 2014

I've discovered the troublesome field and set that to Vary to avoid duplication.

I had overlooked that this field was changing, but reviewing the log made it very clear.

Think I've mastered it now...

@kalessin
Member

OK, you're welcome. If the issue is no longer valid, please close it :)

@jpswade jpswade closed this as completed Aug 1, 2014