
Refactor http/ftp URL tests, use urlparse_cached + add more tests #1

Conversation

redapple

The change is mainly about handling if url.username or url.password: instead of and, since HTTP credentials need to work for http://username@example.com and http://username:@example.com too.

The other change is to use urlparse_cached to potentially save a few URL-parsing operations.

This implementation also relaxes the tests on ftp_user and ftp_password so that both are no longer required, since the FTP password appears to be optional (in theory), leaving any password requirement to the FTP download handler. This is debatable.
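
For illustration, here is a minimal sketch of how the middleware's process_request could use urlparse_cached and the or-condition; the names and structure are assumptions, not the literal patch:

from w3lib.http import basic_auth_header
from scrapy.utils.httpobj import urlparse_cached


def process_request(self, request, spider):
    # Sketch only: read credentials from the request URL (cached parse)
    url = urlparse_cached(request)
    if url.username or url.password:
        if url.scheme in ('http', 'https'):
            # Either part may be empty, e.g. http://username@example.com
            request.headers['Authorization'] = basic_auth_header(
                url.username or '', url.password or '')
        elif url.scheme == 'ftp':
            # Hand the credentials to the FTP download handler via meta
            request.meta['ftp_user'] = url.username
            request.meta['ftp_password'] = url.password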


nyov commented Oct 20, 2016

I think this is great, and should be merged.
But I have found some possible improvements while testing.

It doesn't seem possible to give a spider-wide setting such as an "http_user"-style spider attribute, and the FTP request will lose the credentials on subsequent requests.
My start_urls was 'ftp://root:root@127.127.127.127/' and the follow-up request went to ftp://127.127.127.127/ with a KeyError: 'ftp_user'.
That actually seems more correct, as credentials should be site/domain/realm-wide instead of spider globals. (An addition to the middleware could be to cache credentials+realm until evicted, in the MW or as spider attributes; for FTP, the realm would be the domain or IP.)

I'm not quite sure about those raised KeyErrors. Would it be possible to handle them more gracefully, in the event those URLs come from a crawl and are not hand-picked targets?

ftp_user might default to some anonymous user without a password, as you mentioned (browser behavior is "Mozilla@" or something; the @ is for the case where an email might be expected),
or the MW might cancel the request early and log a message such as:
[scrapy] ERROR: Skipped downloading <GET ftp://127.127.127.127/>: No FTP credentials given.

The other issue I found is probably in the shell: scrapy fetch and scrapy view simply didn't acknowledge an FTP URL as a valid URL.


redapple commented Oct 20, 2016

Thanks for the feedback @nyov . I haven't actually tested this patch with a real FTP server.

It doesn't seem possible to give a spider-wide setting such as with "http_user" spider attribute

You mean a spider-wide ftp_user (/ftp_password)?
It can be added, simply to have coherent behavior for HTTP and FTP.
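
A minimal sketch of what that could look like, mirroring the existing http_user/http_pass handling (the ftp_* attribute names are assumptions, not part of the patch):

from w3lib.http import basic_auth_header


class AuthMiddleware(object):

    def spider_opened(self, spider):
        # Existing behavior: spider-wide HTTP Basic Auth credentials
        usr = getattr(spider, 'http_user', '')
        pwd = getattr(spider, 'http_pass', '')
        if usr or pwd:
            self.auth = basic_auth_header(usr, pwd)
        # Hypothetical addition: spider-wide FTP credentials
        self.ftp_user = getattr(spider, 'ftp_user', None)
        self.ftp_password = getattr(spider, 'ftp_password', None)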

My start_urls was 'ftp://root:root@127.127.127.127/' and the follow-up request went to ftp://127.127.127.127/ with a KeyError: 'ftp_user'.

Were these manually-crafted URLs?
Do you have some logs for these KeyErrors?

ftp_user might default to some anonymous user without password, (...)

The default FTP user for Twisted is username='anonymous',
and it seems that Twisted's FTPClient() does handle None FTP passwords.
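
For illustration, a hedged sketch of that fallback (ftp_credentials is a hypothetical helper, not Scrapy code, which currently raises KeyError when the meta keys are missing):

def ftp_credentials(meta):
    # Fall back to Twisted's default anonymous user when no credentials
    # are present in request.meta; FTPClient seems to tolerate a None password
    user = meta.get('ftp_user', 'anonymous')
    password = meta.get('ftp_password')
    return user, password


print(ftp_credentials({}))                                             # ('anonymous', None)
print(ftp_credentials({'ftp_user': 'root', 'ftp_password': 'root'}))   # ('root', 'root')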

The other issue I found is probably in the shell: scrapy fetch and scrapy view simply didn't acknowledge an FTP URL as a valid URL.

Ah right, I have not tested this.
Does it work for http:// (if you tested by any chance)?


redapple commented Oct 20, 2016

It does work with scrapy shell and http://

$ scrapy shell "http://user:pass@httpbin.org/headers"
2016-10-20 15:43:48 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: scrapybot)
(...)
2016-10-20 15:43:49 [scrapy] INFO: Spider opened
2016-10-20 15:43:49 [default] INFO: AuthMiddleware.process_request(<GET http://user:pass@httpbin.org/headers>, <DefaultSpider 'default' at 0x7f3a71923950>)
2016-10-20 15:43:49 [default] INFO: AuthMiddleware.process_request(<GET http://httpbin.org/headers>, <DefaultSpider 'default' at 0x7f3a71923950>)
2016-10-20 15:43:59 [scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/headers> (referer: None)
(...)
>>> print(response.body)
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip,deflate", 
    "Accept-Language": "en", 
    "Authorization": "Basic dXNlcjpwYXNz", 
    "Host": "httpbin.org", 
    "User-Agent": "Scrapy/1.2.0dev2 (+http://scrapy.org)"
  }
}


redapple commented Oct 20, 2016

@nyov , right, there's something fishy with ftp_user:

$ scrapy shell ftp://ftp.eu.metabrainz.org/pub/musicbrainz/data/fullexport/20161019-001816/MD5SUMS
2016-10-20 15:48:19 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: scrapybot)
2016-10-20 15:48:19 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-10-20 15:48:19 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-20 15:48:19 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.auth.AuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-20 15:48:19 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-20 15:48:19 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-20 15:48:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-20 15:48:19 [scrapy] INFO: Spider opened
2016-10-20 15:48:19 [default] INFO: AuthMiddleware.process_request(<GET ftp://ftp.eu.metabrainz.org/pub/musicbrainz/data/fullexport/20161019-001816/MD5SUMS>, <DefaultSpider 'default' at 0x7f1927558e10>)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapydev/bin/scrapy", line 11, in <module>
    load_entry_point('Scrapy', 'console_scripts', 'scrapy')()
  File "/home/paul/src/scrapy/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/src/scrapy/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/home/paul/src/scrapy/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/home/paul/src/scrapy/scrapy/commands/shell.py", line 71, in run
    shell.start(url=url)
  File "/home/paul/src/scrapy/scrapy/shell.py", line 47, in start
    self.fetch(url, spider)
  File "/home/paul/src/scrapy/scrapy/shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "/home/paul/.virtualenvs/scrapydev/local/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
KeyError: 'ftp_user'

Same issue with Scrapy 1.2.0, so it seems not related to this patch.

@redapple

Scrapy's FTP download handler requires both ftp_user and ftp_password, which may be a bug.

@redapple

@nyov , FYI, I opened scrapy#2342


nyov commented Oct 20, 2016

I thought this KeyError: 'ftp_user' was related, oops. Thanks, I will check scrapy#2342.
Regarding the shell, scrapy view and scrapy fetch just show the usage message when passed an FTP URL. scrapy shell seems to work.

I think requiring an FTP user/password is okay. Auth is pretty much required by the protocol, if I'm not mistaken. The accepted standard for anonymous FTP is for the server to allow a random username and an empty password.
edit: Correction: 'anonymous' username and random password.

@@ -28,19 +47,33 @@ def spider_opened(self, spider):
            self.auth = basic_auth_header(usr, pwd)

    def process_request(self, request, spider):
        auth = getattr(self, 'auth', None)
        if auth and 'Authorization' not in request.headers:
            request.headers['Authorization'] = auth
umrashrf (Owner)

Why are you deleting these lines? Doesn't your change break the HTTP auth attributes given in the spider?

redapple (Author)

@umrashrf , it's less of a removal than moving statements around.
getattr(self, 'auth', None) is done a few lines later so that request.headers['Authorization'] is set only once and not overwritten.
Also, checking 'Authorization' in request.headers is relevant for HTTP only.
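
A rough sketch of that ordering, under the same assumptions as above (not the literal diff):

from w3lib.http import basic_auth_header
from scrapy.utils.httpobj import urlparse_cached


def process_request(self, request, spider):
    parsed = urlparse_cached(request)
    if parsed.scheme in ('http', 'https'):
        if parsed.username or parsed.password:
            # Credentials embedded in the URL are applied first
            request.headers['Authorization'] = basic_auth_header(
                parsed.username or '', parsed.password or '')
        else:
            # The spider-wide fallback comes afterwards, and only fills in
            # when no Authorization header is set yet, so nothing is overwritten
            auth = getattr(self, 'auth', None)
            if auth and 'Authorization' not in request.headers:
                request.headers['Authorization'] = auth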

@redapple

@nyov , ah right, let's require user and password then. Thanks for your feedback

@umrashrf

@redapple so you decided to require both username and password for urls?

@redapple

@umrashrf , only for FTP.
For HTTP Basic Auth, an empty-string password or no password is still acceptable.
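
For illustration (assuming w3lib's basic_auth_header, which Scrapy relies on):

from w3lib.http import basic_auth_header

# An empty password still yields a valid Basic Auth header value
print(basic_auth_header('user', ''))      # b'Basic dXNlcjo='
print(basic_auth_header('user', 'pass'))  # b'Basic dXNlcjpwYXNz'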

@redapple

@umrashrf , I updated the PR to your PR with support for encoded delimiters.
I have not amended it to require a non-empty user and password.
I'll leave that to you; probably a warning to log and an IgnoreRequest to raise.
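
Something along these lines, perhaps (a sketch, not part of this PR):

import logging

from scrapy.exceptions import IgnoreRequest
from scrapy.utils.httpobj import urlparse_cached

logger = logging.getLogger(__name__)


def process_request(self, request, spider):
    parsed = urlparse_cached(request)
    if parsed.scheme == 'ftp' and not (parsed.username and parsed.password):
        # Refuse early instead of failing later in the download handler
        logger.error("Skipped downloading %(request)s: missing FTP credentials",
                     {'request': request}, extra={'spider': spider})
        raise IgnoreRequest("Missing FTP credentials")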

@@ -457,7 +457,7 @@ Default::

    {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
        'scrapy.downloadermiddlewares.auth.AuthMiddleware': 300,

umrashrf (Owner)

Ignore my last comment. I thought it was the settings file.

@@ -91,7 +91,7 @@
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.auth.AuthMiddleware': 300,
umrashrf (Owner)

scrapy#1466 (comment) made me not make this change

redapple (Author)

Good point

umrashrf (Owner)

@redapple if you remove it, I want to merge your changes.

@redapple

@umrashrf , I made the change.

umrashrf pushed a commit that referenced this pull request Nov 20, 2016
Add tests for crawl command non-default cases
@umrashrf

Sorry for the late reply @redapple, I rebased my branch onto scrapy/master and now I can't merge yours :(


redapple commented Nov 21, 2016

hey @umrashrf , I cherry-picked the commits and submitted #2

redapple closed this Nov 21, 2016