[Bug] Incorrectly picked URL in `scrapy.linkextractors.regex.RegexLinkExtractor` when there is a `<base>` tag. #1564

Closed
starrify opened this issue Oct 29, 2015 · 1 comment

starrify (Contributor) commented Oct 29, 2015

Issue Description

When a page contains a <base> tag, scrapy.linkextractors.regex.RegexLinkExtractor picks the wrong URL: relative links are resolved against the response URL instead of the <base> href.

How to Reproduce the Issue & Version Used

[pengyu@GLaDOS tmp]$ python2
Python 2.7.10 (default, Sep  7 2015, 13:51:49) 
[GCC 5.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> scrapy.__version__
u'1.0.3'
>>> html_body = '''
... <html>
...     <head>
...         <base href="http://b.com/">
...     </head>
...     <body>
...         <a href="test.html"></a>
...     </body>
... </html>
... '''
>>> response = scrapy.http.TextResponse(url='http://a.com/', body=html_body)
>>> import scrapy.linkextractors.regex
>>> scrapy.linkextractors.regex.RegexLinkExtractor().extract_links(response)
__main__:1: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
[Link(url='http://a.com/test.html', text=u'', fragment='', nofollow=False)]

Expected Result

The URL of the extracted link should start with 'http://b.com/', i.e. it should be 'http://b.com/test.html'.
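
For reference, the two resolutions can be compared with the standard library's `urljoin` (shown here with Python 3's `urllib.parse`; on Python 2 the same function lives in `urlparse`). This only illustrates the expected URL arithmetic for the reproduction above, not Scrapy's internals; the variable names are chosen for illustration.

```python
from urllib.parse import urljoin

response_url = "http://a.com/"   # URL passed to TextResponse in the reproduction
base_href = "http://b.com/"      # value of <base href> in the HTML body
link_href = "test.html"          # value of <a href> in the HTML body

# What the session above returned: the link joined against the response URL.
observed = urljoin(response_url, link_href)

# What is expected: resolve the <base> href against the response URL first,
# then join the link against that effective base.
effective_base = urljoin(response_url, base_href)
expected = urljoin(effective_base, link_href)

print(observed)  # http://a.com/test.html
print(expected)  # http://b.com/test.html
```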

Suggested Fix

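The issue can be fixed by editing a few lines in scrapy/linkextractors/regex.py so that relative links are joined against the <base> URL, when one is present, rather than the response URL.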
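
A minimal sketch of the idea, using only the standard library. This is not the actual patch and not Scrapy's code: `BASE_RE`, `LINK_RE`, and `extract_urls` are hypothetical names, and the regular expressions are deliberately simplified (they ignore comments, unusual quoting, and other edge cases that a real extractor must handle).

```python
import re
from urllib.parse import urljoin

# Hypothetical, simplified patterns for illustration only.
BASE_RE = re.compile(r"""<base\s[^>]*?href\s*=\s*["']?([^"'\s>]+)""", re.I)
LINK_RE = re.compile(r"""<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)""", re.I)


def extract_urls(html, response_url):
    """Return absolute link URLs, honoring a <base href> if the page has one."""
    match = BASE_RE.search(html)
    # The <base> href may itself be relative, so resolve it against the
    # response URL; fall back to the response URL when there is no <base>.
    base_url = urljoin(response_url, match.group(1)) if match else response_url
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]


HTML_BODY = """
<html>
    <head>
        <base href="http://b.com/">
    </head>
    <body>
        <a href="test.html"></a>
    </body>
</html>
"""

urls = extract_urls(HTML_BODY, "http://a.com/")
print(urls)  # ['http://b.com/test.html']
```

Without the `<base>` line in the HTML, the same call falls back to the response URL and yields `http://a.com/test.html`, which matches the behavior reported above.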

starrify added a commit to starrify/scrapy that referenced this issue Oct 29, 2015
…ctors.regex.RegexLinkExtractor` when there is a `<base>` tag.)
starrify added a commit to starrify/scrapy that referenced this issue Oct 29, 2015
…ctors.regex.RegexLinkExtractor` when there is a `<base>` tag. )
dangra added a commit that referenced this issue Dec 4, 2015
[MRG+1] fixed: Issue #1564 (Incorrectly picked URL in `scrapy.linkextractors.regex.RegexLinkExtractor` when there is a `<base>` tag. )
dangra (Member) commented Dec 4, 2015

closed by #1565

dangra closed this Dec 4, 2015