
LinkExtractor with Unique = False doesn't extract fully identical Links #3798

Closed
Ksianka opened this issue May 27, 2019 · 1 comment · Fixed by #5458
Ksianka commented May 27, 2019

Case:
Unexpected behavior identified for LinkExtractor with unique=False when the page contains fully identical links (same URL and same text): the extractor returns a single unique link instead of the two or more identical ones present in the HTML.
For example:

1) Local HTML file:

```html
<html>
<body>
<a href='sample3.html'>sample 3 repetition</a>
<a href='sample3.html'>sample 3 repetition</a>
</body>
</html>
```

2) Scrapy test code (ignore the error about <GET file:///robots.txt>; it only appears because we use a local file):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///<insert_your_local_path>/Test_file.html',
    ]

    def __init__(self, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.le = LinkExtractor(unique=False)

    def parse(self, response):
        links = self.le.extract_links(response)
        yield {'extract_link': links}
```

3) Run the code with `scrapy crawl test` and get the result with one link:

{'extract_link': [Link(url='file:///<your_local_path>/sample3.html', text='sample 3 repetition', fragment='', nofollow=False)]}

Instead of the expected result with both links from the HTML:

{'extract_link': [Link(url='file:///<your_local_path>/sample3.html', text='sample 3 repetition', fragment='', nofollow=False),
Link(url='file:///<your_local_path>/sample3.html', text='sample 3 repetition', fragment='', nofollow=False)]}
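The collapse of the two identical links can be modelled in a few lines. This is a hedged sketch: `Link` is simplified to a namedtuple, and `unique_list` is written to behave like an order-preserving first-wins deduplication (as `scrapy.utils.python.unique` does), not copied from the scrapy source.

```python
from collections import namedtuple

# Stand-in for scrapy.link.Link (simplified: the real class defines
# __eq__/__hash__ over the same fields, which is what makes two
# identical links compare equal and collapse).
Link = namedtuple("Link", ["url", "text", "fragment", "nofollow"])

def unique_list(list_, key=lambda x: x):
    """Order-preserving dedup, first occurrence wins."""
    seen = set()
    result = []
    for item in list_:
        k = key(item)
        if k in seen:
            continue
        seen.add(k)
        result.append(item)
    return result

links = [
    Link("sample3.html", "sample 3 repetition", "", False),
    Link("sample3.html", "sample 3 repetition", "", False),
]
deduped = unique_list(links)
print(len(deduped))  # 1 -- the two identical links survive as one entry
```

Because both `Link` objects compare equal, deduplication by identity of fields drops the second one, which is exactly the behavior reported above.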
@Ksianka Ksianka changed the title #3796 LinkExtractor with Unique = False doesn't extract fully identical Links #3798 LinkExtractor with Unique = False doesn't extract fully identical Links May 27, 2019

Ksianka commented May 27, 2019

Pre-solution:
The issue is driven by the public method extract_links of LxmlLinkExtractor (the same class as LinkExtractor), which returns unique_list(all_links) (see line 130 of the lxmlhtml.py module) instead of all_links, similar to the solution in sgml.py.
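The proposed direction can be sketched with a toy model: return the raw link list when unique=False and deduplicate only when unique=True. Class and method names here mirror the report for illustration only; this is not the actual scrapy source.

```python
def unique_list(items):
    """Order-preserving dedup, first occurrence wins."""
    seen, out = set(), []
    for it in items:
        if it not in seen:
            seen.add(it)
            out.append(it)
    return out

class ToyLinkExtractor:
    """Toy model of the extract_links behavior under discussion."""

    def __init__(self, unique=False):
        self.unique = unique

    def extract_links(self, all_links):
        # Before the fix: always `return unique_list(all_links)`,
        # which silently drops identical links.
        # After the fix: honour the unique flag.
        if self.unique:
            return unique_list(all_links)
        return all_links

le = ToyLinkExtractor(unique=False)
print(le.extract_links(["sample3.html", "sample3.html"]))
# ['sample3.html', 'sample3.html'] -- duplicates preserved
```

With unique=True the same call would return a single entry, matching the documented meaning of the flag.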

However, here we get another issue:
it looks like the original idea of this unique_list(all_links) call was to avoid duplication of links when the user repeats tags in the restrict_css and restrict_xpaths arguments of the LinkExtractor instance.
For example: self.le = LinkExtractor(unique=False, restrict_css=('a', 'a'), restrict_xpaths=('a', 'a')).

The duplication of restrict_css and restrict_xpaths is proposed to be corrected by applying the function unique_list() to the joined/combined list of CSS and XPath expressions in the __init__.py module of the FilteringLinkExtractor class.
Code for the correction:

a) Initial code for the joined CSS/XPath expressions (scrapy/linkextractors/__init__.py):

```python
self.restrict_xpaths = tuple(arg_to_iter(restrict_xpaths))
self.restrict_xpaths += tuple(map(self._csstranslator.css_to_xpath,
                                  arg_to_iter(restrict_css)))
```

b) Adjusted code for the joined CSS/XPath expressions (scrapy/linkextractors/__init__.py):

```python
_joined_xpath_css = tuple(arg_to_iter(restrict_xpaths))
_joined_xpath_css += tuple(map(self._csstranslator.css_to_xpath,
                               arg_to_iter(restrict_css)))
self.restrict_xpaths = tuple(unique_list(_joined_xpath_css))
```

(Note: arg_to_iter may return a non-tuple iterable, so the first line converts it to a tuple before concatenation.)
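The effect of the adjustment above can be shown standalone. This is a hedged sketch: `css_to_xpath` is a crude stub for the CSS-to-XPath translator scrapy delegates to, and `unique_list` is an order-preserving dedup written here for illustration.

```python
def css_to_xpath(css):
    # Crude stand-in for the real CSS-to-XPath translator.
    return "descendant-or-self::" + css

def unique_list(items):
    """Order-preserving dedup, first occurrence wins."""
    seen, out = set(), []
    for it in items:
        if it not in seen:
            seen.add(it)
            out.append(it)
    return out

# Duplicated restriction expressions, as in the example above.
restrict_xpaths = ("//a", "//a")
restrict_css = ("a", "a")

# Join the two kinds of expressions, then dedup once, up front.
_joined = tuple(restrict_xpaths) + tuple(map(css_to_xpath, restrict_css))
restrict_xpaths_final = tuple(unique_list(_joined))
print(restrict_xpaths_final)  # ('//a', 'descendant-or-self::a')
```

Deduplicating the restriction expressions at construction time means extract_links no longer has to deduplicate the resulting links, so unique=False can safely return the raw list.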

Ksianka pushed a commit to Ksianka/scrapy that referenced this issue May 27, 2019
@Gallaecio Gallaecio added the bug label Aug 19, 2019
qwlake added a commit to qwlake/scrapy that referenced this issue Jul 26, 2020
Fix bug of issue scrapy#3798: when LinkExtractor is used with the option 'unique=False', it still returns links with duplicates removed. I changed extract_links to reflect the unique option.
@wRAR wRAR changed the title #3798 LinkExtractor with Unique = False doesn't extract fully identical Links LinkExtractor with Unique = False doesn't extract fully identical Links Jul 27, 2020