Refactor http/ftp URL tests, use urlparse_cached + add more tests #1
Conversation
I think this is great and should be merged. It doesn't seem possible to give a spider-wide setting such as with [...]. I'm not quite sure about those raised [...]
The other issue I found is probably in the shell:
Thanks for the feedback @nyov . I haven't actually tested this patch with a real FTP server.
You mean a spider-wide setting?
Were these manually-crafted URLs?
Default FTP user for Twisted is
Ah right, I have not tested this.
It does work with scrapy shell and http://
@nyov , right, there's something fishy with
Same issue with Scrapy 1.2.0, so it does not seem related to this patch.
Scrapy's FTP download handler requires
@nyov , FYI, I opened scrapy#2342
I think requiring an FTP user/password is okay. Auth is pretty much required by the protocol if I'm not mistaken. The accepted standard for anonymous FTP is for the server to allow an arbitrary username and an empty password.
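The anonymous-FTP convention mentioned above can be illustrated with a small helper. This is a hypothetical sketch (`ftp_credentials` is not Scrapy code): it extracts credentials from an FTP URL and falls back to an arbitrary username with an empty password when none are given.

```python
from urllib.parse import urlparse

def ftp_credentials(url, default_user="anonymous", default_password=""):
    # Hypothetical helper (not Scrapy code): pull credentials out of an
    # FTP URL, falling back to the anonymous-FTP convention of an
    # arbitrary username and an empty password.
    parsed = urlparse(url)
    user = parsed.username or default_user
    password = parsed.password if parsed.password is not None else default_password
    return user, password

print(ftp_credentials("ftp://example.com/pub/file.txt"))  # ('anonymous', '')
print(ftp_credentials("ftp://bob:secret@example.com/f"))  # ('bob', 'secret')
```

Whether the fallback belongs in the middleware or in the FTP download handler is exactly the question debated in this thread.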
@@ -28,19 +47,33 @@ def spider_opened(self, spider):
            self.auth = basic_auth_header(usr, pwd)

    def process_request(self, request, spider):
        auth = getattr(self, 'auth', None)
        if auth and 'Authorization' not in request.headers:
            request.headers['Authorization'] = auth
Why are you deleting these lines? Does your change not break http attributes given in the spider?
@umrashrf , it's less of a removal than moving statements around. `getattr(self, 'auth', None)` is done a few lines later so that `request.headers['Authorization']` is set only once and not overwritten. Also, `'Authorization' in request.headers` is relevant for HTTP only.
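The reordering described in this comment can be sketched as follows. This is a simplified stand-in for the middleware under review, not the PR's exact code: requests are plain dicts, and `basic_auth_header` is re-implemented inline (same idea as w3lib's helper) to keep the sketch self-contained.

```python
import base64

def basic_auth_header(username, password):
    # Same idea as w3lib's basic_auth_header: RFC 7617 Basic credentials.
    return b"Basic " + base64.b64encode(f"{username}:{password}".encode())

class AuthMiddlewareSketch:
    # Simplified stand-in for the middleware discussed above (not the
    # PR's exact code); requests are plain dicts for illustration.
    def __init__(self, username=None, password=None):
        if username or password:
            self.auth = basic_auth_header(username or "", password or "")

    def process_request(self, request):
        # Look up the auth value here, and set the header only if absent,
        # so a pre-existing Authorization header is never overwritten.
        auth = getattr(self, "auth", None)
        if auth and "Authorization" not in request["headers"]:
            request["headers"]["Authorization"] = auth

mw = AuthMiddlewareSketch("user", "pass")
req = {"headers": {}}
mw.process_request(req)
print(req["headers"]["Authorization"])  # b'Basic dXNlcjpwYXNz'
```

A request that already carries an `Authorization` header passes through unchanged, which is the point of doing the `getattr` inside `process_request`.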
@nyov , ah right, let's require user and password then. Thanks for your feedback
@redapple so you decided to require both username and password for URLs?
@umrashrf , only for FTP
@umrashrf , I updated the PR with support for encoded delimiters, as in your PR.
@@ -457,7 +457,7 @@ Default::

     {
         'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
-        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
+        'scrapy.downloadermiddlewares.auth.AuthMiddleware': 300,
Ignore my last comment. I thought it was the settings file.
@@ -91,7 +91,7 @@
 DOWNLOADER_MIDDLEWARES_BASE = {
     # Engine side
     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
-    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
+    'scrapy.downloadermiddlewares.auth.AuthMiddleware': 300,
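For context on this `DOWNLOADER_MIDDLEWARES_BASE` dict: Scrapy merges the project's `DOWNLOADER_MIDDLEWARES` setting into it, and a value of `None` disables a middleware. A minimal sketch of that merge logic, simplified from how Scrapy builds its component list (assumptions: `merge_middlewares` is a made-up name, not a Scrapy API):

```python
def merge_middlewares(base, custom):
    # Sketch of how Scrapy combines DOWNLOADER_MIDDLEWARES with the BASE
    # dict: custom entries override base ones by key, and a None value
    # disables that middleware entirely.
    merged = {**base, **custom}
    return {path: order for path, order in merged.items() if order is not None}

base = {
    "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
    "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
}
custom = {"scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": None}
print(merge_middlewares(base, custom))
# {'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100}
```

This is why renaming the middleware path in the base dict is a backwards-compatibility concern: user settings that reference the old path by key would silently stop matching.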
scrapy#1466 (comment) is what made me decide against this change
Good point
@redapple if you remove it, I want to merge your changes.
@umrashrf , I made the change.
force-pushed from 2891e02 to 0d9e654
Add tests for crawl command non-default cases
Sorry for the late reply @redapple, I rebased my branch on scrapy/master and now I can't merge yours :(
The change is mainly about handling `if url.username or url.password:` instead of `and`, since HTTP credentials need to work for `http://username@example.com` and `http://username:@example.com` too.

The other change is to use `urlparse_cached` to potentially save on a few url-parsing ops.

This implementation also relaxes the tests on `ftp_user` and `ftp_password` to not require both of them, since the password in FTP looks optional (in theory), leaving the requirement for a password (if needed) to the FTP download handler. This is debatable.
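The `or` vs `and` distinction in the PR description comes down to how `urllib.parse.urlparse` reports credentials: a missing password is `None`, while an empty one (as in `http://username:@example.com`) is `''`, so an `and` check would skip both cases. A quick demonstration:

```python
from urllib.parse import urlparse

# urlparse reports a missing password as None but an empty one as '',
# which is why the check uses `or` rather than `and`:
for url in ("http://username@example.com",
            "http://username:@example.com",
            "http://username:secret@example.com"):
    p = urlparse(url)
    print(url, repr(p.username), repr(p.password),
          bool(p.username or p.password),   # the PR's check: True for all three
          bool(p.username and p.password))  # 'and' would miss the first two
```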