
Fixed "Offsite middleware ignoring port #50" issue #3198

Closed
wants to merge 2 commits into from

Conversation


@TakaakiFuruse TakaakiFuruse commented Apr 2, 2018

ref: #50

Although a patch was suggested in that issue, simply applying it makes another test fail, as shown here:


============================================================ FAILURES ============================================================
____________________________________________________ EngineTest.test_crawler _____________________________________________________

result = None, g = <generator object test_crawler at 0x7f396bc25780>, deferred = <Deferred at 0x7f396b6a6a28 current result: None>

    def _inlineCallbacks(result, g, deferred):
        """
        See L{inlineCallbacks}.
        """
        # This function is complicated by the need to prevent unbounded recursion
        # arising from repeatedly yielding immediately ready deferreds.  This while
        # loop and the waiting variable solve that by manually unfolding the
        # recursion.

        waiting = [True, # waiting for result?
                   None] # result

        while 1:
            try:
                # Send the last result back as the result of the yield expression.
                isFailure = isinstance(result, failure.Failure)
                if isFailure:
                    result = result.throwExceptionIntoGenerator(g)
                else:
>                   result = g.send(result)

/scrapy/.tox/py27/local/lib/python2.7/site-packages/twisted/internet/defer.py:1386:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/scrapy/tests/test_engine.py:167: in test_crawler
    self._assert_visited_urls()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.test_engine.EngineTest testMethod=test_crawler>

    def _assert_visited_urls(self):
        must_be_visited = ["/", "/redirect", "/redirected",
                           "/item1.html", "/item2.html", "/item999.html"]
        urls_visited = set([rp[0].url for rp in self.run.respplug])
        urls_expected = set([self.run.geturl(p) for p in must_be_visited])
>       assert urls_expected <= urls_visited, "URLs not visited: %s" % list(urls_expected - urls_visited)
E       AssertionError: URLs not visited: ['http://localhost:42519/item2.html', 'http://localhost:42519/item999.html', 'http://localhost:42519/item1.html']

/scrapy/tests/test_engine.py:183: AssertionError
=============================================== 1 failed, 3 passed in 1.37 seconds ===============================================
ERROR: InvocationError: '/scrapy/.tox/py27/bin/py.test --cov=scrapy --cov-report= tests/test_engine.py'
____________________________________________________________ summary _____________________________________________________________

So "/scrapy/tests/test_engine.py" is failing. It looks like the test creates spiders using

allowed_domains = ["scrapytest.org", "localhost"]

The spider fails to visit URLs that include a port number, like 'http://localhost:37089/item1.html'.
Since the test server is launched on a random port each time the test runs, we cannot add a fixed port number to "localhost" in the 'allowed_domains' list.
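The root of the mismatch is in how the URL is parsed: `hostname` strips the port, while `netloc` keeps it. A minimal standard-library illustration (shown with Python 3's `urllib.parse`; the Python 2 `urlparse` module behaves the same, and the port number is taken from the failing test run above):

```python
from urllib.parse import urlparse

# hostname drops the port, netloc keeps it. The offsite middleware's
# host regex is built from allowed_domains entries like "localhost",
# which carry no port, so a port-qualified string never matches it.
u = urlparse("http://localhost:37089/item1.html")
print(u.hostname)  # localhost
print(u.netloc)    # localhost:37089
```

This is why matching against `hostname` alone lets "localhost" requests through regardless of port.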

So I modified the patch as follows:

# scrapy/spidermiddlewares/offsite.py

    def should_follow(self, request, spider):
        regex = self.host_regex
        parsed = urlparse_cached(request)
        # hostname can be None for wrong urls (like javascript links)
        hostname = parsed.hostname or ''
        netloc = parsed.netloc or ''
        return bool(regex.search(hostname)) or bool(regex.search(netloc))

- applied @imp0wd3r's patch
- added a test
  - it makes a request to 192.169.0.15 on port 8080, as in @eedeep's bug report
- found out that applying the suggested patch alone causes the spider to visit
  some URLs with a port number, like "http://www.test.org:8080/1.html"
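To illustrate what the patched check does, here is a self-contained sketch (not Scrapy's exact code: `make_host_regex` is a hypothetical stand-in for the middleware's host-regex construction, and the domain list mirrors the one used by the test). A request is followed if either its hostname (port stripped) or its full netloc (host:port) matches the allowed-domains regex:

```python
import re
from urllib.parse import urlparse

def make_host_regex(allowed_domains):
    # Roughly the shape of the regex the offsite middleware builds:
    # match the domain itself or any of its subdomains.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)

def should_follow(url, regex):
    parsed = urlparse(url)
    hostname = parsed.hostname or ""  # port stripped
    netloc = parsed.netloc or ""      # host:port
    return bool(regex.search(hostname)) or bool(regex.search(netloc))

regex = make_host_regex(["scrapytest.org", "localhost"])
print(should_follow("http://localhost:37089/item1.html", regex))  # True: hostname matches
print(should_follow("http://www.test.org:8080/1.html", regex))    # False: neither part matches
```

Checking the port-stripped hostname first is what makes the dynamic test-server port irrelevant, while the additional netloc check allows a user to list an explicit "host:port" entry in allowed_domains.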

codecov bot commented Apr 2, 2018

Codecov Report

Merging #3198 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3198      +/-   ##
==========================================
+ Coverage   82.12%   82.12%   +<.01%     
==========================================
  Files         228      228              
  Lines        9593     9594       +1     
  Branches     1385     1385              
==========================================
+ Hits         7878     7879       +1     
  Misses       1456     1456              
  Partials      259      259
Impacted Files Coverage Δ
scrapy/spidermiddlewares/offsite.py 100% <100%> (ø) ⬆️


isbm commented Mar 1, 2019

Is there any possibility of having this patch merged? The current behavior apparently causes problems with ports.
