
Fixed "Offsite middleware ignoring port #50" issue #3198

Closed
wants to merge 2 commits into from

Conversation


@TakaakiFuruse TakaakiFuruse commented Apr 2, 2018

ref: #50

Although a patch was suggested in that issue, simply applying it makes another test fail, as shown here:


============================================================ FAILURES ============================================================
____________________________________________________ EngineTest.test_crawler _____________________________________________________

result = None, g = <generator object test_crawler at 0x7f396bc25780>, deferred = <Deferred at 0x7f396b6a6a28 current result: None>

    def _inlineCallbacks(result, g, deferred):
        """
        See L{inlineCallbacks}.
        """
        # This function is complicated by the need to prevent unbounded recursion
        # arising from repeatedly yielding immediately ready deferreds.  This while
        # loop and the waiting variable solve that by manually unfolding the
        # recursion.

        waiting = [True, # waiting for result?
                   None] # result

        while 1:
            try:
                # Send the last result back as the result of the yield expression.
                isFailure = isinstance(result, failure.Failure)
                if isFailure:
                    result = result.throwExceptionIntoGenerator(g)
                else:
>                   result = g.send(result)

/scrapy/.tox/py27/local/lib/python2.7/site-packages/twisted/internet/defer.py:1386:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/scrapy/tests/test_engine.py:167: in test_crawler
    self._assert_visited_urls()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.test_engine.EngineTest testMethod=test_crawler>

    def _assert_visited_urls(self):
        must_be_visited = ["/", "/redirect", "/redirected",
                           "/item1.html", "/item2.html", "/item999.html"]
        urls_visited = set([rp[0].url for rp in self.run.respplug])
        urls_expected = set([self.run.geturl(p) for p in must_be_visited])
>       assert urls_expected <= urls_visited, "URLs not visited: %s" % list(urls_expected - urls_visited)
E       AssertionError: URLs not visited: ['http://localhost:42519/item2.html', 'http://localhost:42519/item999.html', 'http://localhost:42519/item1.html']

/scrapy/tests/test_engine.py:183: AssertionError
=============================================== 1 failed, 3 passed in 1.37 seconds ===============================================
ERROR: InvocationError: '/scrapy/.tox/py27/bin/py.test --cov=scrapy --cov-report= tests/test_engine.py'
____________________________________________________________ summary _____________________________________________________________

So "/scrapy/tests/test_engine.py" is failing. It looks like the test creates spiders using

allowed_domains = ["scrapytest.org", "localhost"]

The spider fails to visit URLs that include a port number, like 'http://localhost:37089/item1.html'.
Since the test server is launched on a random port each time the test runs, we cannot add a fixed port number to "localhost" in the 'allowed_domains' list.
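The root of the mismatch is in how the URL is parsed: `hostname` strips the port, while `netloc` keeps it. A minimal standard-library illustration (shown with Python 3's `urllib.parse`; the Python 2 `urlparse` module behaves the same, and the port number is taken from the failing test run above):

```python
from urllib.parse import urlparse

# hostname drops the port, netloc keeps it. The offsite middleware's
# host regex is built from allowed_domains entries like "localhost",
# which carry no port, so a port-qualified string never matches it.
u = urlparse("http://localhost:37089/item1.html")
print(u.hostname)  # localhost
print(u.netloc)    # localhost:37089
```

This is why matching against `hostname` alone lets "localhost" requests through regardless of port.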

So I modified the patch as follows:

# scrapy/spidermiddlewares/offsite.py

    def should_follow(self, request, spider):
        regex = self.host_regex
        parsed = urlparse_cached(request)
        # hostname can be None for wrong urls (like javascript links)
        hostname = parsed.hostname or ''
        netloc = parsed.netloc or ''
        return bool(regex.search(hostname)) or bool(regex.search(netloc))

- applied @imp0wd3r's patch
- added a test
  - it makes a request to 192.169.0.15 on port 8080, as in @eedeep's bug report
- found out that applying the suggested patch alone causes the spider to visit
  some URLs with a port number, like "http://www.test.org:8080/1.html"
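To illustrate what the patched check does, here is a self-contained sketch (not Scrapy's exact code: `make_host_regex` is a hypothetical stand-in for the middleware's host-regex construction, and the domain list mirrors the one used by the test). A request is followed if either its hostname (port stripped) or its full netloc (host:port) matches the allowed-domains regex:

```python
import re
from urllib.parse import urlparse

def make_host_regex(allowed_domains):
    # Roughly the shape of the regex the offsite middleware builds:
    # match the domain itself or any of its subdomains.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)

def should_follow(url, regex):
    parsed = urlparse(url)
    hostname = parsed.hostname or ""  # port stripped
    netloc = parsed.netloc or ""      # host:port
    return bool(regex.search(hostname)) or bool(regex.search(netloc))

regex = make_host_regex(["scrapytest.org", "localhost"])
print(should_follow("http://localhost:37089/item1.html", regex))  # True: hostname matches
print(should_follow("http://www.test.org:8080/1.html", regex))    # False: neither part matches
```

Checking the port-stripped hostname first is what makes the dynamic test-server port irrelevant, while the additional netloc check allows a user to list an explicit "host:port" entry in allowed_domains.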

codecov bot commented Apr 2, 2018

Codecov Report

Merging #3198 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3198      +/-   ##
==========================================
+ Coverage   82.12%   82.12%   +<.01%     
==========================================
  Files         228      228              
  Lines        9593     9594       +1     
  Branches     1385     1385              
==========================================
+ Hits         7878     7879       +1     
  Misses       1456     1456              
  Partials      259      259
Impacted Files Coverage Δ
scrapy/spidermiddlewares/offsite.py 100% <100%> (ø) ⬆️


isbm commented Mar 1, 2019

Is there any possibility of having this patch merged? The current behavior apparently causes problems with ports.
