-
Notifications
You must be signed in to change notification settings - Fork 10.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use pathlib #5682
use pathlib #5682
Conversation
The tests are failing with "TypeError: 'ABCMeta' object is not subscriptable". There are also "unsupported-binary-operation" pylint errors and multiple different errors from mypy, including "X | Y syntax for unions requires Python 3.10". |
Codecov Report
@@ Coverage Diff @@
## master #5682 +/- ##
=======================================
Coverage 88.95% 88.95%
=======================================
Files 162 162
Lines 10938 10958 +20
Branches 1797 1793 -4
=======================================
+ Hits 9730 9748 +18
- Misses 927 929 +2
Partials 281 281
|
"Not subscriptable" is also a Python version thing. I thought I had used some new features but the tox file said 3.8 and it passed with tox so I thought it was good. I'll figure it out tomorrow. |
Indeed, tests on other versions fail for different reasons, e.g. "TypeError: unsupported operand type(s) for |: 'type' and 'types.GenericAlias'" on 3.9 and long/short file name comparison on Windows 3.10. |
The path comparison issue seems like something weird with the agent environment, but it should work now. I was not getting the same problem even though my working directory contains a component longer than 8 characters. The type hints should all be compatible with Python 3.7. When running the tests on 3.7, I got a weird error about mitmproxy not supporting the required version of SSL, but I don't see how I would of broken that so I'm assuming it's a problem with my environment. |
There are still some typing issues and |
And Windows tests still fail regarding short names. |
I don't know what |
I'll look into |
|
||
class Spider(object_ref): | ||
"""Base class for scrapy spiders. All spiders must inherit from this | ||
class. | ||
""" | ||
|
||
name: Optional[str] = None | ||
name: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behavior. It looks fine, but I'm not sure I understand the consequences fully. Would it make sense to keep it as it was, or should we investigate further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indeed looks wrong as base spider classes usually don't have the name attribute set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's required not to be None
by httpcache.py
.
scrapy\extensions\httpcache.py:335: error: Argument 2 to "Path" has incompatible type "Optional[str]"; expected "Union[str, PathLike[str]]" [arg-type]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what to do about this sort of issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well.
The base spiders I mentioned aren't supposed to be run so middlewares won't be executed :) Maybe it's enough to add a cast/assert to the middleware.
OTOH there is a WIP PR #4327 that would make the name optional even for spiders that can be run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it required that there is a name, even for the base Spider
s? InitSpider
does not have a name, but when you construct it you must provide a name as a parameter or else Spider
's __init__
will raise an exception. It's these lines being removed in the other PR: https://github.com/scrapy/scrapy/pull/4327/files#diff-bfeaff1ed9c7b0a9a61d76bca0e46e4e34289df77cbc1a5143c31e4d1be6fcadL28-L29
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Spider
class itself does not have a name
defined, and that is not an error. And subclasses do not need a name either, as long as they are not instantiated (which is when Spider.__init__
gets called, though a subclass could actually override its __init__
to prevent that exception in the first place).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a subclass doesn't call the base __init__
that seems like all promises made by the base class are off and functions provided by the base class may not work. Python doesn't enforce type hints at runtime so the code will work the same as before even if somebody skips over the code that asserts there is always a name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this is correct as str
. name
is intended not to be None
. If the type checker is really smart, having it declared to be str
could generate a warning if somebody does create a special subclass that doesn't call the base __init__
to set it (I haven't checked this, but I doubt it). The type checker will definitely generate a warning if somebody writes code that explicitly sets it to None
.
The tests passed! :) |
I wonder whether it's fine in general to replace |
There is actually So I guess we should keep using |
Is this PR waiting on me to change anything? I'm sure the |
I think it's not reviewed completely yet, so for now it's waiting for that. |
Could I get a hacktoberfest-accepted? |
Maybe we should create a feature request for this after this PR is merged. Well, at least for the accepting part. |
Thanks! I've finished reviewing this. |
sample_feeds_dir = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'sample_data', 'feeds') | ||
sample_feed_path = os.path.join(sample_feeds_dir, 'feed-sample3.csv') | ||
sample_feed2_path = os.path.join(sample_feeds_dir, 'feed-sample4.csv') | ||
sample_feed3_path = os.path.join(sample_feeds_dir, 'feed-sample5.csv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch @mdonoughe!
'..', | ||
'tox.ini', | ||
) | ||
tox_config_file_path = Path(__file__).parent / '..' / 'tox.ini' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tox_config_file_path = Path(__file__).parent / '..' / 'tox.ini' | |
tox_config_file_path = Path(__file__).parent.parent / 'tox.ini' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code was like this so I didn't change it. '..'
and .parent
are slightly different if there are symbolic links. It'd probably be okay to change it here.
os.remove(local_fname) | ||
self.assertTrue(local_fname.exists()) | ||
self.assertEqual(local_fname.read_bytes(), b"I have the power!") | ||
local_fname.unlink() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR: if a previous asserts fails, the file is not removed
Thanks for the careful implementation @mdonoughe! |
This is a pretty tedious change. To address #4916 and hopefully #4497, I went through and used pathlib whenever possible for interacting with the filesystem.
Functions around new pathlib usage have had type annotations added. This helped somewhat with getting the correctness right. I avoided changing public interfaces to accept or expose Paths if they didn't already.
In one of the test files, Path objects were previously tested using one test and all the other tests used strings. I reversed this so only one test is using strings because I changed the shared test code to use Paths.
The tests pass, at least on my Windows machine.
This PR contains no intentional changes to functionality.