Offsite middleware ignoring port #50
Comments
@redapple, can you reopen this? The problem seems to have resurfaced recently: it happens in 1.0.3, and I didn't see any changes touching this in recent releases.
@immerrr, do you have a reproducible example?
Something like this should do:

```python
# -*- coding: utf-8 -*-
import scrapy


class Localhost8000Spider(scrapy.Spider):
    name = "localhost_8000"
    allowed_domains = ["localhost:8000"]
    start_urls = (
        'http://localhost:8000/',
    )

    def parse(self, response):
        yield scrapy.Request(response.url + 'foobar')
```

Results in:
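For reference, the standard library's `urlparse` (which `urlparse_cached` wraps) drops the port from `hostname` but keeps it in `netloc`, so an `allowed_domains` entry such as `"localhost:8000"` can never match the hostname that the middleware checks:

```python
from urllib.parse import urlparse

parts = urlparse("http://localhost:8000/foobar")
print(parts.hostname)  # 'localhost'       -- port is stripped
print(parts.netloc)    # 'localhost:8000'  -- port is kept

# OffsiteMiddleware matches allowed_domains against the hostname,
# so an entry that includes a port can never match and every
# request ends up filtered as offsite.
```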
@redapple @immerrr I made the following change, and then it worked:

```python
# scrapy/spidermiddlewares/offsite.py
def should_follow(self, request, spider):
    regex = self.host_regex
    # pre: host = urlparse_cached(request).hostname or ''
    host = urlparse_cached(request).netloc or ''
    return bool(regex.search(host))
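```

A rough, stdlib-only sketch of why that one-line change helps for the spider above; the pattern is assumed to mirror what `get_host_regex` builds from `allowed_domains`, based on the snippet quoted later in this thread:

```python
import re
from urllib.parse import urlparse

# Assumed to match the regex OffsiteMiddleware builds from allowed_domains.
allowed_domains = ["localhost:8000"]
host_regex = re.compile(r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains))

url = "http://localhost:8000/foobar"
print(bool(host_regex.search(urlparse(url).hostname or "")))  # False (old hostname check)
print(bool(host_regex.search(urlparse(url).netloc or "")))    # True  (patched netloc check)
```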
Marking this as "Good First Issue", as there's a suggested patch already. For this to be "solved", a PR should include some additional test cases for this issue.
Thank you, that fix worked for me, too. |
@cathalgarvey
"/scrapy/tests/test_engine.py" is failing.
If I apply the suggested patch, # scrapy/spidermiddlewares/offsite.py
def should_follow(self, request, spider):
regex = self.host_regex
# pre: host = urlparse_cached(request).hostname or ''
host = urlparse_cached(request).netloc or ''
return bool(regex.search(host)) it fails to visit some urls with port number like 'http://localhost:37089/item1.html'. My idea is to patch like this. # scrapy/spidermiddlewares/offsite.py
def should_follow(self, request, spider):
regex = self.host_regex
# hostname can be None for wrong urls (like javascript links)
hostname = urlparse_cached(request).hostname or ''
netloc = urlparse_cached(request).netloc or ''
return bool(regex.search(hostname)) or bool(regex.search(netloc)) Any ideas? |
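A small stdlib-only sketch (same assumed regex as above) of the case that breaks the netloc-only check: the URL carries a port but `allowed_domains` does not, which is why the combined hostname-or-netloc check is suggested:

```python
import re
from urllib.parse import urlparse

allowed_domains = ["localhost"]  # no port in allowed_domains
host_regex = re.compile(r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains))

url = "http://localhost:37089/item1.html"
hostname = urlparse(url).hostname or ""  # 'localhost'
netloc = urlparse(url).netloc or ""      # 'localhost:37089'

print(bool(host_regex.search(netloc)))    # False -> netloc-only check filters the request
print(bool(host_regex.search(hostname)))  # True  -> hostname (or combined) check allows it
```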
The problem is that a spider with a port number in allowed_domains is not handled consistently. If we want to allow port numbers in allowed_domains, we have to keep in mind that a domain is used in two contexts. First, it is used to check if a URL is allowed to be followed (OffsiteMiddleware).

(Note: The difference was introduced in b1f011d in an attempt to fix this issue #50.) The following table goes into more detail:
We could change the check to match against the netloc instead of the hostname. Another solution would be to only consider the port in the URL if it is also given in the domain (this is the solution proposed by @TakaakiFuruse). I think that the best solution, however, would be to not allow ports in allowed_domains at all:

```python
def get_host_regex(self, spider):
    # (...)
    url_pattern = re.compile("^https?://.*$")
    for domain in allowed_domains:
        if url_pattern.match(domain):
            message = ("allowed_domains accepts only domains, not URLs. "
                       "Ignoring URL entry %s in allowed_domains." % domain)
            warnings.warn(message, URLWarning)
    domains = [re.escape(d) for d in allowed_domains if d is not None]
    regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
    return re.compile(regex)
```

We could now also issue a warning if a domain with a port number is added and ignore the domain, just like how it is done for URLs. I have prepared PR #4413 to fix this issue.
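For illustration, a minimal sketch of what such a port check inside the domain-regex builder might look like; the `PortWarning` name, the pattern, and the structure are assumptions for this sketch, not the code from PR #4413:

```python
import re
import warnings


class PortWarning(UserWarning):  # hypothetical warning class for this sketch
    pass


def get_host_regex(allowed_domains):
    # Sketch only: warn about and drop allowed_domains entries that carry a
    # port, mirroring how URL entries are warned about and ignored.
    port_pattern = re.compile(r"^(?:\[[^\]]+\]|[^:/\s]+):\d+$")
    domains = []
    for domain in allowed_domains or []:
        if domain is None:
            continue
        if port_pattern.match(domain):
            warnings.warn(
                "allowed_domains accepts only domains without ports. "
                "Ignoring entry %s in allowed_domains." % domain,
                PortWarning,
            )
            continue
        domains.append(re.escape(domain))
    return re.compile(r"^(.*\.)?(%s)$" % "|".join(domains))
```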
I agree with @Lukas0907 on the principle here. Behaviour: allow any ports; don't check them against allowed_domains. What sensible reason is there for OffsiteMiddleware to restrict by port? Allow port 443 but not port 80 links? Is this kind of granularity useful?
In my spider I have the following:

```python
class MySpider(BaseSpider):
```

and in the parse method I do something like:

The result when I run the code is that Scrapy reports:

```
DEBUG: Filtered offsite request to '192.168.0.15': <GET http://192.168.0.15:8080/mypage.html>
```

which is wrong: it seems to be ignoring the port. If I change the allowed_domains to:

then it works as you would expect it to. No big deal, I can work around it, but I think it is a bug. The problem is located in the should_follow method of the OffsiteMiddleware class in contrib/spidermiddleware/offsite.py.
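A minimal sketch of a setup that reproduces the log line above, assuming allowed_domains includes the port; the spider name, URLs, and imports are assumptions, and the old-style BaseSpider import reflects the Scrapy versions that still used the contrib/spidermiddleware path:

```python
from scrapy.spider import BaseSpider  # old-style import (assumption)
from scrapy.http import Request


class MySpider(BaseSpider):
    name = "myspider"
    allowed_domains = ["192.168.0.15:8080"]   # assumed: port included
    start_urls = ["http://192.168.0.15:8080/"]

    def parse(self, response):
        # Assumed: a follow-up request on the same host and port,
        # which then gets filtered by OffsiteMiddleware.
        yield Request("http://192.168.0.15:8080/mypage.html")


# Dropping the port, i.e. allowed_domains = ["192.168.0.15"],
# is the workaround described above.
```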