
Refactor http/ftp URL tests, use urlparse_cached + add more tests #1

Conversation

redapple

The change is mainly about handling if url.username or url.password: instead of and, since HTTP credentials need to work for http://username@example.com and http://username:@example.com too.

The other change is to use urlparse_cached to potentially save a few URL-parsing operations.

This implementation also relaxes the tests on ftp_user and ftp_password so that both are no longer required, since the FTP password appears to be optional (in theory), leaving any password requirement to the FTP download handler. This is debatable.
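
For illustration, here is a minimal sketch of how the middleware's process_request could use urlparse_cached and the or-condition; the names and structure are assumptions, not the literal patch:

from w3lib.http import basic_auth_header
from scrapy.utils.httpobj import urlparse_cached


def process_request(self, request, spider):
    # Sketch only: read credentials from the request URL (cached parse)
    url = urlparse_cached(request)
    if url.username or url.password:
        if url.scheme in ('http', 'https'):
            # Either part may be empty, e.g. http://username@example.com
            request.headers['Authorization'] = basic_auth_header(
                url.username or '', url.password or '')
        elif url.scheme == 'ftp':
            # Hand the credentials to the FTP download handler via meta
            request.meta['ftp_user'] = url.username
            request.meta['ftp_password'] = url.password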


nyov commented Oct 20, 2016

I think this is great, and should be merged.
But I have found some possible improvements while testing.

It doesn't seem possible to give a spider-wide setting such as an "http_user"-style spider attribute, and the FTP request will lose the credentials on subsequent requests.
My start_urls was 'ftp://root:root@127.127.127.127/' and the follow-up request went to ftp://127.127.127.127/ with a KeyError: 'ftp_user'.
That actually seems more correct, as credentials should be site/domain/realm-wide instead of spider globals. (An addition to the middleware could be to cache credentials+realm until evicted, in the MW or as spider attributes; for FTP, the realm would be the domain or IP.)

I'm not quite sure about those raised KeyErrors. Would it be possible to handle them more gracefully, in the event those URLs come from a crawl and are not hand-picked targets?

ftp_user might default to some anonymous user without a password, as you mentioned (browser behavior is "Mozilla@" or something; the @ is for the case where an email might be expected),
or the MW might cancel the request early and log a message such as:
[scrapy] ERROR: Skipped downloading <GET ftp://127.127.127.127/>: No FTP credentials given.

The other issue I found is probably in the shell: scrapy fetch and scrapy view simply didn't acknowledge an FTP URL as a valid URL.


redapple commented Oct 20, 2016

Thanks for the feedback @nyov . I haven't actually tested this patch with a real FTP server.

It doesn't seem possible to give a spider-wide setting such as with "http_user" spider attribute

You mean a spider-wide ftp_user (/ftp_password)?
It can be added, simply to have coherent behavior for HTTP and FTP.
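
A minimal sketch of what that could look like, mirroring the existing http_user/http_pass handling (the ftp_* attribute names are assumptions, not part of the patch):

from w3lib.http import basic_auth_header


class AuthMiddleware(object):

    def spider_opened(self, spider):
        # Existing behavior: spider-wide HTTP Basic Auth credentials
        usr = getattr(spider, 'http_user', '')
        pwd = getattr(spider, 'http_pass', '')
        if usr or pwd:
            self.auth = basic_auth_header(usr, pwd)
        # Hypothetical addition: spider-wide FTP credentials
        self.ftp_user = getattr(spider, 'ftp_user', None)
        self.ftp_password = getattr(spider, 'ftp_password', None)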

My start_urls was 'ftp://root:root@127.127.127.127/' and the follow-up request went to ftp://127.127.127.127/ with a KeyError: 'ftp_user'.

Were these manually-crafted URLs?
Do you have some logs for these KeyErrors?

ftp_user might default to some anonymous user without password, (...)

The default FTP user for Twisted is username='anonymous',
and it seems that Twisted's FTPClient() does handle None FTP passwords.
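
For illustration, a hedged sketch of that fallback (ftp_credentials is a hypothetical helper, not Scrapy code, which currently raises KeyError when the meta keys are missing):

def ftp_credentials(meta):
    # Fall back to Twisted's default anonymous user when no credentials
    # are present in request.meta; FTPClient seems to tolerate a None password
    user = meta.get('ftp_user', 'anonymous')
    password = meta.get('ftp_password')
    return user, password


print(ftp_credentials({}))                                             # ('anonymous', None)
print(ftp_credentials({'ftp_user': 'root', 'ftp_password': 'root'}))   # ('root', 'root')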

The other issue I found is probably in the shell: scrapy fetch and scrapy view simply didn't acknowledge an FTP URL as a valid URL.

Ah right, I have not tested this.
Does it work for http:// (if you tested by any chance)?


redapple commented Oct 20, 2016

It does work with scrapy shell and http://

$ scrapy shell "http://user:pass@httpbin.org/headers"
2016-10-20 15:43:48 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: scrapybot)
(...)
2016-10-20 15:43:49 [scrapy] INFO: Spider opened
2016-10-20 15:43:49 [default] INFO: AuthMiddleware.process_request(<GET http://user:pass@httpbin.org/headers>, <DefaultSpider 'default' at 0x7f3a71923950>)
2016-10-20 15:43:49 [default] INFO: AuthMiddleware.process_request(<GET http://httpbin.org/headers>, <DefaultSpider 'default' at 0x7f3a71923950>)
2016-10-20 15:43:59 [scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/headers> (referer: None)
(...)
>>> print(response.body)
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip,deflate", 
    "Accept-Language": "en", 
    "Authorization": "Basic dXNlcjpwYXNz", 
    "Host": "httpbin.org", 
    "User-Agent": "Scrapy/1.2.0dev2 (+http://scrapy.org)"
  }
}


redapple commented Oct 20, 2016

@nyov , right, there's something fishy with ftp_user:

$ scrapy shell ftp://ftp.eu.metabrainz.org/pub/musicbrainz/data/fullexport/20161019-001816/MD5SUMS
2016-10-20 15:48:19 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: scrapybot)
2016-10-20 15:48:19 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-10-20 15:48:19 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-20 15:48:19 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.auth.AuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-20 15:48:19 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-20 15:48:19 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-20 15:48:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-20 15:48:19 [scrapy] INFO: Spider opened
2016-10-20 15:48:19 [default] INFO: AuthMiddleware.process_request(<GET ftp://ftp.eu.metabrainz.org/pub/musicbrainz/data/fullexport/20161019-001816/MD5SUMS>, <DefaultSpider 'default' at 0x7f1927558e10>)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapydev/bin/scrapy", line 11, in <module>
    load_entry_point('Scrapy', 'console_scripts', 'scrapy')()
  File "/home/paul/src/scrapy/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/src/scrapy/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/home/paul/src/scrapy/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/home/paul/src/scrapy/scrapy/commands/shell.py", line 71, in run
    shell.start(url=url)
  File "/home/paul/src/scrapy/scrapy/shell.py", line 47, in start
    self.fetch(url, spider)
  File "/home/paul/src/scrapy/scrapy/shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "/home/paul/.virtualenvs/scrapydev/local/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
KeyError: 'ftp_user'

Same issue with Scrapy 1.2.0, so it seems not related to this patch.

@redapple

Scrapy's FTP download handler requires both ftp_user and ftp_password, which may be a bug.

@redapple

@nyov , FYI, I opened scrapy#2342


nyov commented Oct 20, 2016

I thought this KeyError: 'ftp_user' was related, oops. Thanks, I will check scrapy#2342.
Regarding the shell, scrapy view and scrapy fetch just show the usage message when passed an FTP URL. scrapy shell seems to work.

I think requiring an FTP user/password is okay. Auth is pretty much required by the protocol, if I'm not mistaken. The accepted standard for anonymous FTP is for the server to allow a random username and an empty password.
edit: Correction: 'anonymous' username and random password.

@@ -28,19 +47,33 @@ def spider_opened(self, spider):
            self.auth = basic_auth_header(usr, pwd)

    def process_request(self, request, spider):
        auth = getattr(self, 'auth', None)
        if auth and 'Authorization' not in request.headers:
            request.headers['Authorization'] = auth
umrashrf (Owner)

Why are you deleting these lines? Doesn't your change break the HTTP auth attributes given in the spider?

redapple (Author)

@umrashrf , it's less of a removal than moving statements around.
getattr(self, 'auth', None) is done a few lines later so that request.headers['Authorization'] is set only once and not overwritten.
Also, checking 'Authorization' in request.headers is relevant for HTTP only.
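
A rough sketch of that ordering, under the same assumptions as above (not the literal diff):

from w3lib.http import basic_auth_header
from scrapy.utils.httpobj import urlparse_cached


def process_request(self, request, spider):
    parsed = urlparse_cached(request)
    if parsed.scheme in ('http', 'https'):
        if parsed.username or parsed.password:
            # Credentials embedded in the URL are applied first
            request.headers['Authorization'] = basic_auth_header(
                parsed.username or '', parsed.password or '')
        else:
            # The spider-wide fallback comes afterwards, and only fills in
            # when no Authorization header is set yet, so nothing is overwritten
            auth = getattr(self, 'auth', None)
            if auth and 'Authorization' not in request.headers:
                request.headers['Authorization'] = auth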

@redapple

@nyov , ah right, let's require user and password then. Thanks for your feedback

@umrashrf

@redapple so you decided to require both username and password for urls?

@redapple

@umrashrf , only for FTP.
For HTTP Basic Auth, an empty-string password or no password is still acceptable.
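
For illustration (assuming w3lib's basic_auth_header, which Scrapy relies on):

from w3lib.http import basic_auth_header

# An empty password still yields a valid Basic Auth header value
print(basic_auth_header('user', ''))      # b'Basic dXNlcjo='
print(basic_auth_header('user', 'pass'))  # b'Basic dXNlcjpwYXNz'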

@redapple

@umrashrf , I updated the PR to your PR with support for encoded delimiters.
I have not amended it to require a non-empty user and password.
I'll leave that to you; probably a warning to log and an IgnoreRequest to raise.
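
Something along these lines, perhaps (a sketch, not part of this PR):

import logging

from scrapy.exceptions import IgnoreRequest
from scrapy.utils.httpobj import urlparse_cached

logger = logging.getLogger(__name__)


def process_request(self, request, spider):
    parsed = urlparse_cached(request)
    if parsed.scheme == 'ftp' and not (parsed.username and parsed.password):
        # Refuse early instead of failing later in the download handler
        logger.error("Skipped downloading %(request)s: missing FTP credentials",
                     {'request': request}, extra={'spider': spider})
        raise IgnoreRequest("Missing FTP credentials")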

@@ -457,7 +457,7 @@ Default::

    {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
        'scrapy.downloadermiddlewares.auth.AuthMiddleware': 300,

umrashrf (Owner)

Ignore my last comment. I thought it was the settings file.

@@ -91,7 +91,7 @@
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.auth.AuthMiddleware': 300,
umrashrf (Owner)

scrapy#1466 (comment) made me not make this change

redapple (Author)

Good point

umrashrf (Owner)

@redapple if you remove it, I want to merge your changes.

@redapple

@umrashrf , I made the change.

umrashrf pushed a commit that referenced this pull request Nov 20, 2016
Add tests for crawl command non-default cases
@umrashrf

Sorry for the late reply @redapple, I rebased my branch onto scrapy/master and now I can't merge yours :(


redapple commented Nov 21, 2016

hey @umrashrf , I cherry-picked the commits and submitted #2

redapple closed this Nov 21, 2016