
Offsite middleware ignoring port #50

Closed
eedeep opened this issue Nov 16, 2011 · 10 comments · Fixed by #4413

eedeep commented Nov 16, 2011

In my spider I have the following:

class MySpider(BaseSpider):

    allowed_domains = ['192.168.0.15:8080']

and in the parse method I do something like:

    yield Request('http://192.168.0.15:8080/mypage.html', self.my_callback_function)

The result when I run the code is that Scrapy reports:

DEBUG: Filtered offsite request to '192.168.0.15': <GET http://192.168.0.15:8080/mypage.html>

Which is wrong - it seems to be ignoring the port. If I change the allowed_domains to:

    allowed_domains = ['192.168.0.15:8080', '192.168.0.15']

Then it works as you would expect it to. No big deal, I can work around it, but I think it is a bug. The problem appears to be located in the should_follow method of the OffsiteMiddleware class in contrib/spidermiddleware/offsite.py.
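
A quick way to see where the port gets lost is to compare what the standard library's URL parser reports for netloc versus hostname; this is a standalone sketch, independent of Scrapy internals:

# Standalone illustration (not Scrapy code): urlparse keeps the port in
# .netloc but strips it from .hostname, so a hostname-based check can never
# match an allowed_domains entry that includes a port.
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

parsed = urlparse('http://192.168.0.15:8080/mypage.html')
print(parsed.netloc)    # 192.168.0.15:8080
print(parsed.hostname)  # 192.168.0.15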

immerrr (Contributor) commented Dec 7, 2016

@redapple can you reopen this? The problem seems to have resurfaced recently: it happens in 1.0.3, and I didn't see any changes touching this in recent releases.

redapple (Contributor) commented Dec 7, 2016

@immerrr, do you have a reproducible example?

redapple reopened this Dec 7, 2016

immerrr (Contributor) commented Dec 7, 2016

Something like this should do:

# -*- coding: utf-8 -*-
import scrapy


class Localhost8000Spider(scrapy.Spider):
    name = "localhost_8000"
    allowed_domains = ["localhost:8000"]
    start_urls = (
        'http://localhost:8000/',
    )

    def parse(self, response):
        yield scrapy.Request(response.url + 'foobar')

Results in:

2016-12-07 12:23:54 [scrapy] DEBUG: Crawled (200) <GET http://localhost:8000/> (referer: None)
2016-12-07 12:23:54 [scrapy] DEBUG: Filtered offsite request to 'localhost': <GET http://localhost:8000/foobar>

imp0wd3r commented Mar 25, 2017

@redapple @immerrr I made the following change, and then it worked:

# scrapy/spidermiddlewares/offsite.py

def should_follow(self, request, spider):
    regex = self.host_regex
    # previously: host = urlparse_cached(request).hostname or ''
    host = urlparse_cached(request).netloc or ''
    return bool(regex.search(host))
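
For reference, a rough standalone check of why matching against the netloc makes the localhost:8000 repro pass, assuming the host regex has the shape ^(.*\.)?(<escaped domains>)$ (see the get_host_regex snippet quoted further down the thread):

# Rough sketch, not a test of the middleware itself: rebuild the assumed
# host regex and search the netloc instead of the hostname.
import re
from urllib.parse import urlparse

allowed_domains = ['localhost:8000']
regex = re.compile(r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains))

netloc = urlparse('http://localhost:8000/foobar').netloc
print(bool(regex.search(netloc)))  # True - 'localhost:8000' matches the pattern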

cathalgarvey (Contributor) commented Feb 20, 2018

Marking this as "Good First Issue", as there's a suggested patch already. For this to be "solved", a PR should include some additional test cases for this issue.

tensigh commented Feb 21, 2018

Thank you, that fix worked for me, too.

TakaakiFuruse commented Apr 2, 2018

@cathalgarvey
I'm trying to make a pull request (first pull request for a big project!!) now and found that applying the patch creates another error.
I think I need to research further...
Here's the test log.


============================================================ FAILURES ============================================================
____________________________________________________ EngineTest.test_crawler _____________________________________________________

result = None, g = <generator object test_crawler at 0x7f396bc25780>, deferred = <Deferred at 0x7f396b6a6a28 current result: None>

    def _inlineCallbacks(result, g, deferred):
        """
        See L{inlineCallbacks}.
        """
        # This function is complicated by the need to prevent unbounded recursion
        # arising from repeatedly yielding immediately ready deferreds.  This while
        # loop and the waiting variable solve that by manually unfolding the
        # recursion.

        waiting = [True, # waiting for result?
                   None] # result

        while 1:
            try:
                # Send the last result back as the result of the yield expression.
                isFailure = isinstance(result, failure.Failure)
                if isFailure:
                    result = result.throwExceptionIntoGenerator(g)
                else:
>                   result = g.send(result)

/scrapy/.tox/py27/local/lib/python2.7/site-packages/twisted/internet/defer.py:1386:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/scrapy/tests/test_engine.py:167: in test_crawler
    self._assert_visited_urls()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.test_engine.EngineTest testMethod=test_crawler>

    def _assert_visited_urls(self):
        must_be_visited = ["/", "/redirect", "/redirected",
                           "/item1.html", "/item2.html", "/item999.html"]
        urls_visited = set([rp[0].url for rp in self.run.respplug])
        urls_expected = set([self.run.geturl(p) for p in must_be_visited])
>       assert urls_expected <= urls_visited, "URLs not visited: %s" % list(urls_expected - urls_visited)
E       AssertionError: URLs not visited: ['http://localhost:42519/item2.html', 'http://localhost:42519/item999.html', 'http://localhost:42519/item1.html']

/scrapy/tests/test_engine.py:183: AssertionError
=============================================== 1 failed, 3 passed in 1.37 seconds ===============================================
ERROR: InvocationError: '/scrapy/.tox/py27/bin/py.test --cov=scrapy --cov-report= tests/test_engine.py'
____________________________________________________________ summary _____________________________________________________________

TakaakiFuruse commented Apr 2, 2018

"/scrapy/tests/test_engine.py" is failing.
It looks the the test creates spiders by using

allowed_domains = ["scrapytest.org", "localhost"] 

If I apply the suggested patch,

# scrapy/spidermiddlewares/offsite.py

def should_follow(self, request, spider):
    regex = self.host_regex
    # previously: host = urlparse_cached(request).hostname or ''
    host = urlparse_cached(request).netloc or ''
    return bool(regex.search(host))

it fails to visit some URLs with a port number, like 'http://localhost:37089/item1.html'.
Since the test server is launched each time we run the test, localhost gets a different port number on each run. This means we cannot add a fixed port number to "localhost" in the 'allowed_domains' list.
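
Roughly, the failure can be reproduced outside the test suite like this (same assumed regex shape as in the get_host_regex snippet quoted later in the thread; port 37089 is just whatever the test server happened to pick):

# Sketch of the failure mode: with the netloc-only patch, a domain entry
# without a port no longer matches URLs that carry one.
import re
from urllib.parse import urlparse

allowed_domains = ['scrapytest.org', 'localhost']
regex = re.compile(r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains))

netloc = urlparse('http://localhost:37089/item1.html').netloc  # 'localhost:37089'
print(bool(regex.search(netloc)))  # False - the ':37089' suffix breaks the match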

My idea is to patch it like this:

# scrapy/spidermiddlewares/offsite.py

    def should_follow(self, request, spider):
        regex = self.host_regex
        # hostname can be None for wrong urls (like javascript links)
        hostname = urlparse_cached(request).hostname or ''
        netloc = urlparse_cached(request).netloc or ''
        return bool(regex.search(hostname)) or bool(regex.search(netloc))

Any ideas?

Lukas0907 (Contributor) commented Mar 7, 2020

The problem is that a spider with allowed_domains = ["localhost:8080"] is not allowed to request http://localhost:8080. However, if the user wants to crawl a URL with an explicit port and doesn't want to shield the spider off from other ports on the same domain, they can use allowed_domains = ["localhost"], which does allow requests to http://localhost:8080 (i.e. there is an easy workaround). Still, the current behaviour is probably surprising/unintuitive to users and looks like a bug.

If we want to allow port numbers in allowed_domains, we have to decide on how it changes the semantics of allowed_domains (or the notion of a domain in the context of Scrapy in general).

A domain is used in two contexts: first, to check whether a URL is allowed to be followed (OffsiteMiddleware); second, to extract links from a site (LinkExtractor).

  • scrapy.utils.url.url_is_from_any_domain(url, domains) (used by LinkExtractor) matches against the netloc part of the URL. The netloc for http://example.com:80 is example.com:80 (host with port). Therefore the URL http://example.com:80 is not considered to be from the domain example.com.
  • should_follow() (used by OffsiteMiddleware) matches against the hostname part of the URL. The hostname for http://example.com:8080 is example.com (host without port). Therefore the URL http://example.com:8080 is allowed by the domain example.com.

(Note: The difference was introduced in b1f011d in an attempt to fix this issue #50.)

The following table goes into more detail:

URL                        Domain              url_is_from_any_domain()  allowed_domains
http://example.com         example.com         True                      True
https://example.com        example.com         True                      True
https://example.com        example.com:443     False                     False
http://example.com:80      example.com         False                     True
http://example.com         example.com:80      False                     False
http://example.com:80      example.com:80      True                      False
http://example.com:8080    example.com         False                     True
http://example.com         example.com:8080    False                     False
http://example.com:8080    example.com:8080    True                      False

(url_is_from_any_domain(): True iff hostnames and ports are equal. allowed_domains: True iff hostnames are equal and the domain entry has no port; the port of the URL is ignored.)
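
A few of the rows can be spot-checked directly (assuming scrapy is importable; the expected values are the ones listed above):

# Spot-check of some table rows; requires scrapy to be importable.
from urllib.parse import urlparse
from scrapy.utils.url import url_is_from_any_domain

print(url_is_from_any_domain('http://example.com:80', ['example.com']))     # False
print(url_is_from_any_domain('http://example.com:80', ['example.com:80']))  # True

# The hostname-based allowed_domains check drops the port before matching:
print(urlparse('http://example.com:8080').hostname)  # example.com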

We could change should_follow() to use the same semantics as url_is_from_any_domain(); however, this would be a potentially breaking change in the following case: a user scrapes example.com:8080 and sets allowed_domains = ['example.com']. The new semantics would require the user to change the spider so that the port is included in allowed_domains, i.e. allowed_domains = ['example.com:8080']. In other words, existing spiders would have to be modified.

Another solution would be to only consider the port in the URL if it is also given in the domain (this is the solution proposed by @TakaakiFuruse), i.e. allowed_domains = ['example.com'] would allow following http://example.com and http://example.com:8080 (the existing behavior), while allowed_domains = ['example.com:8080'] would only allow following http://example.com:8080, not http://example.com or http://example.com:80. This would be backwards-compatible.
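
A standalone sketch of those backwards-compatible semantics (the helper name is made up for illustration; this is not the actual middleware code):

# Hypothetical helper illustrating "the URL's port only matters if the
# allowed_domains entry also specifies one".
from urllib.parse import urlparse

def matches_allowed_domain(url, domain):
    parts = urlparse(url)
    if ':' in domain:
        target = parts.netloc            # compare host:port
    else:
        target = parts.hostname or ''    # ignore the URL's port
    return target == domain or target.endswith('.' + domain)

print(matches_allowed_domain('http://example.com:8080/x', 'example.com'))       # True
print(matches_allowed_domain('http://example.com:8080/x', 'example.com:8080'))  # True
print(matches_allowed_domain('http://example.com/x', 'example.com:8080'))       # False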

I think the best solution, however, would be to not allow ports in allowed_domains at all. This makes sense because the property is called allowed_domains, i.e. the user is expected to give a list of domains, and the port number is not part of the domain; it is part of the URL. Since users might not think about the difference between a domain and a URL, there is already a check in the offsite middleware that warns when a URL is passed:

    def get_host_regex(self, spider):
        # (...)
        url_pattern = re.compile("^https?://.*$")
        for domain in allowed_domains:
            if url_pattern.match(domain):
                message = ("allowed_domains accepts only domains, not URLs. "
                           "Ignoring URL entry %s in allowed_domains." % domain)
                warnings.warn(message, URLWarning)
        domains = [re.escape(d) for d in allowed_domains if d is not None]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)

We could now also issue a warning if a domain with a port number is added, and ignore that domain, just like it is done for URLs. I have prepared PR #4413 to fix this issue.
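
A minimal sketch of what that additional check might look like, mirroring the URL warning above (the PortWarning name and the exact message are placeholders, not necessarily what PR #4413 uses):

# Sketch only: warn about and drop allowed_domains entries that carry a port,
# mirroring the existing URL check in get_host_regex().
import re
import warnings

class PortWarning(UserWarning):
    pass

def drop_entries_with_ports(allowed_domains):
    port_pattern = re.compile(r'^(.+):\d+$')
    kept = []
    for domain in allowed_domains:
        if port_pattern.match(domain):
            message = ("allowed_domains accepts only domains without ports. "
                       "Ignoring entry %s in allowed_domains." % domain)
            warnings.warn(message, PortWarning)
        else:
            kept.append(domain)
    return kept

print(drop_entries_with_ports(['example.com', 'example.com:8080']))  # ['example.com']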

nyov (Contributor) commented Mar 8, 2020

I agree with @Lukas0907 on the principle here:
allowed_domains handles domains. I don't currently see a reason to include any ports; allowed_domains is not the start_urls setting.

Behaviour: allow any ports; don't check against netloc, but against hostname property.

What sensible reason is there to restrict the OffsiteMiddleware by port? Allow links on port 443 but not on port 80? Is this kind of granularity useful?
