LinkExtractor with Unique = False doesn't extract fully identical Links #3798
Pre-solution: However, here we run into another issue: the duplication between restrict_css and restrict_xpaths is proposed to be corrected by applying unique_list() to the joined/combined list of CSS and XPath expressions in the __init__.py module, in class FilteringLinkExtractor(object).
a) Initial code for the joined css/xpaths list: self.restrict_xpaths = tuple(arg_to_iter(restrict_xpaths))
b) Adjusted code for the joined css/xpaths list: _joined_xpath_css = arg_to_iter(restrict_xpaths)
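The de-duplication idea above can be sketched as follows. This is a minimal, hypothetical re-implementation: `arg_to_iter`, `unique_list`, and `join_restrictions` mirror the roles of Scrapy's helpers but are written here from scratch for illustration, and `css_to_xpath` is an identity stand-in for the real CSS-to-XPath conversion.

```python
def arg_to_iter(arg):
    """Wrap a single value (or None) into a list; pass lists/tuples through."""
    if arg is None:
        return []
    if isinstance(arg, (list, tuple)):
        return list(arg)
    return [arg]


def unique_list(seq):
    """Remove duplicates while keeping first-seen order."""
    seen = set()
    out = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def join_restrictions(restrict_xpaths=None, restrict_css=None,
                      css_to_xpath=lambda c: c):
    # Scrapy converts CSS selectors to XPath before combining; here
    # css_to_xpath is an identity stand-in so the sketch stays self-contained.
    joined = arg_to_iter(restrict_xpaths) + [
        css_to_xpath(c) for c in arg_to_iter(restrict_css)
    ]
    return tuple(unique_list(joined))


# The same expression supplied twice via restrict_xpaths and once via
# restrict_css collapses to a single entry:
print(join_restrictions(
    restrict_xpaths=['//div[@id="main"]', '//div[@id="main"]'],
    restrict_css=['//div[@id="main"]'],
))  # → ('//div[@id="main"]',)
```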
Fix for issue scrapy#3798: when LinkExtractor is used with the option unique=False, it still returns links with duplicates removed. I changed extract_links to respect the unique option.
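The essence of the fix can be sketched like this: de-duplicate the extracted links only when unique=True. `Link` and `SimpleLinkExtractor` are simplified stand-ins, not the actual patched Scrapy code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Simplified stand-in for scrapy.link.Link (hashable, so it can be set-deduped)."""
    url: str
    text: str = ""


class SimpleLinkExtractor:
    """Illustrative extractor that gates de-duplication on the unique flag."""

    def __init__(self, unique=True):
        self.unique = unique

    def extract_links(self, links):
        if not self.unique:
            # Keep every occurrence, including fully identical links.
            return list(links)
        seen = set()
        out = []
        for link in links:
            if link not in seen:  # de-dupe by (url, text)
                seen.add(link)
                out.append(link)
        return out


links = [Link("http://example.com/a", "A"), Link("http://example.com/a", "A")]
print(len(SimpleLinkExtractor(unique=False).extract_links(links)))  # → 2
print(len(SimpleLinkExtractor(unique=True).extract_links(links)))   # → 1
```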
Case:
Unexpected behavior identified for LinkExtractor with unique=False when the page contains fully identical links (the same URL and text).
The current result returns one unique link instead of the two or more identical ones.
For example:
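A page exhibiting the reported case can be illustrated with two fully identical anchors (same href and text). This sketch uses the stdlib HTMLParser as a stand-in for Scrapy's extraction; it only shows that the page genuinely contains two links, which is what LinkExtractor with unique=False should return.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag, keeping duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data.strip()))
            self._href = None


page = ('<a href="http://example.com/item">Item</a>'
        '<a href="http://example.com/item">Item</a>')
parser = AnchorCollector()
parser.feed(page)
print(parser.links)
# → [('http://example.com/item', 'Item'), ('http://example.com/item', 'Item')]
# Both identical links are present in the source; with unique=False the
# extractor should likewise return two links, not one.
```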