Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use pathlib #5682

Merged
merged 13 commits into from
Nov 26, 2022
Merged

use pathlib #5682

merged 13 commits into from
Nov 26, 2022

Conversation

mdonoughe
Copy link
Contributor

This is a pretty tedious change. To address #4916 and hopefully #4497, I went through and used pathlib whenever possible for interacting with the filesystem.

Functions around new pathlib usage have had type annotations added. This helped somewhat with getting the correctness right. I avoided changing public interfaces to accept or expose Paths if they didn't already.

In one of the test files, Path objects were previously tested using one test and all the other tests used strings. I reversed this so only one test is using strings because I changed the shared test code to use Paths.

The tests pass, at least on my Windows machine.

This PR contains no intentional changes to functionality.

@wRAR
Copy link
Member

wRAR commented Oct 18, 2022

The tests are failing with "TypeError: 'ABCMeta' object is not subscriptable". There are also "unsupported-binary-operation" pylint errors and multiple different errors from mypy, including "X | Y syntax for unions requires Python 3.10".

@codecov
Copy link

codecov bot commented Oct 18, 2022

Codecov Report

Merging #5682 (d19a216) into master (3be7aa9) will increase coverage by 0.00%.
The diff coverage is 91.78%.

@@           Coverage Diff           @@
##           master    #5682   +/-   ##
=======================================
  Coverage   88.95%   88.95%           
=======================================
  Files         162      162           
  Lines       10938    10958   +20     
  Branches     1797     1793    -4     
=======================================
+ Hits         9730     9748   +18     
- Misses        927      929    +2     
  Partials      281      281           
Impacted Files Coverage Δ
scrapy/utils/job.py 75.00% <33.33%> (ø)
scrapy/crawler.py 87.17% <60.00%> (-0.78%) ⬇️
scrapy/spiders/__init__.py 96.66% <66.66%> (-3.34%) ⬇️
scrapy/utils/project.py 78.84% <73.33%> (+0.41%) ⬆️
scrapy/commands/__init__.py 74.71% <83.33%> (+1.45%) ⬆️
scrapy/commands/genspider.py 87.38% <88.00%> (-0.47%) ⬇️
scrapy/utils/conf.py 92.24% <93.33%> (+0.13%) ⬆️
scrapy/pipelines/files.py 71.33% <95.23%> (+0.09%) ⬆️
scrapy/extensions/httpcache.py 95.45% <96.42%> (+0.03%) ⬆️
scrapy/commands/runspider.py 93.47% <100.00%> (+4.34%) ⬆️
... and 12 more

@mdonoughe
Copy link
Contributor Author

"Not subscriptable" is also a Python version thing. I thought I had used some new features but the tox file said 3.8 and it passed with tox so I thought it was good. I'll figure it out tomorrow.

@wRAR
Copy link
Member

wRAR commented Oct 18, 2022

Indeed, tests on other versions fail for different reasons, e.g. "TypeError: unsupported operand type(s) for |: 'type' and 'types.GenericAlias'" on 3.9 and long/short file name comparison on Windows 3.10.

@mdonoughe
Copy link
Contributor Author

The path comparison issue seems like something weird with the agent environment, but it should work now. I was not getting the same problem even though my working directory contains a component longer than 8 characters. The type hints should all be compatible with Python 3.7.

When running the tests on 3.7, I got a weird error about mitmproxy not supporting the required version of SSL, but I don't see how I would of broken that so I'm assuming it's a problem with my environment.

@wRAR
Copy link
Member

wRAR commented Oct 19, 2022

There are still some typing issues and tests/test_dependencies.py::ScrapyUtilsTest::test_pinned_twisted_version failes on 3.7 (but not on later versions?)

@wRAR
Copy link
Member

wRAR commented Oct 19, 2022

And Windows tests still fail regarding short names.

@mdonoughe
Copy link
Contributor Author

I don't know what test_pinned_twisted_version is about. I don't think I changed it and it must require some special test setup.

@wRAR
Copy link
Member

wRAR commented Oct 19, 2022

I'll look into test_pinned_twisted_version

scrapy/commands/__init__.py Outdated Show resolved Hide resolved
scrapy/commands/runspider.py Outdated Show resolved Hide resolved
tests/test_dependencies.py Outdated Show resolved Hide resolved
scrapy/pipelines/files.py Outdated Show resolved Hide resolved

class Spider(object_ref):
"""Base class for scrapy spiders. All spiders must inherit from this
class.
"""

name: Optional[str] = None
name: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change in behavior. It looks fine, but I'm not sure I understand the consequences fully. Would it make sense to keep it as it was, or should we investigate further?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indeed looks wrong as base spider classes usually don't have the name attribute set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's required not to be None by httpcache.py.

scrapy\extensions\httpcache.py:335: error: Argument 2 to "Path" has incompatible type "Optional[str]"; expected "Union[str, PathLike[str]]"  [arg-type]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what to do about this sort of issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well.
The base spiders I mentioned aren't supposed to be run so middlewares won't be executed :) Maybe it's enough to add a cast/assert to the middleware.
OTOH there is a WIP PR #4327 that would make the name optional even for spiders that can be run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it required that there is a name, even for the base Spiders? InitSpider does not have a name, but when you construct it you must provide a name as a parameter or else Spider's __init__ will raise an exception. It's these lines being removed in the other PR: https://github.com/scrapy/scrapy/pull/4327/files#diff-bfeaff1ed9c7b0a9a61d76bca0e46e4e34289df77cbc1a5143c31e4d1be6fcadL28-L29

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Spider class itself does not have a name defined, and that is not an error. And subclasses do not need a name either, as long as they are not instantiated (which is when Spider.__init__ gets called, though a subclass could actually override its __init__ to prevent that exception in the first place).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a subclass doesn't call the base __init__ that seems like all promises made by the base class are off and functions provided by the base class may not work. Python doesn't enforce type hints at runtime so the code will work the same as before even if somebody skips over the code that asserts there is always a name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this is correct as str. name is intended not to be None. If the type checker is really smart, having it declared to be str could generate a warning if somebody does create a special subclass that doesn't call the base __init__ to set it (I haven't checked this, but I doubt it). The type checker will definitely generate a warning if somebody writes code that explicitly sets it to None.

@wRAR
Copy link
Member

wRAR commented Oct 21, 2022

The tests passed! :)

@wRAR
Copy link
Member

wRAR commented Oct 26, 2022

I wonder whether it's fine in general to replace os.path.abspath() with Path.resolve() which actually corresponds to os.path.realpath(). For example, running startproject in a location containing symlinks in its path will print a fully resolved dir which may be slightly confusing to the user.

@wRAR
Copy link
Member

wRAR commented Oct 26, 2022

There is actually Path.absolute() but it's only documented in 3.11 and it's also not equivalent to abspath: "os.path.abspath() normalizes the resulting path, which may change its meaning in the presence of symlinks, while Path.absolute() does not." ("posixpath.abspath() collapses "<path>/<symlink>/.." to "<path>", this is not correct")

So I guess we should keep using resolve().

@mdonoughe
Copy link
Contributor Author

Is this PR waiting on me to change anything? I'm sure the str type hint is correct because None is not a valid value for that attribute, but I could set it back to Optional[str] so it's documented that None is allowed but there is an exception raised if None is used.

@wRAR
Copy link
Member

wRAR commented Oct 31, 2022

I think it's not reviewed completely yet, so for now it's waiting for that.

@mdonoughe
Copy link
Contributor Author

Could I get a hacktoberfest-accepted?

@Gallaecio Gallaecio added the hacktoberfest-accepted This pull request looks valid for Hackoctoberfest even if it has not been approved yet. label Oct 31, 2022
@wRAR
Copy link
Member

wRAR commented Nov 24, 2022

I avoided changing public interfaces to accept or expose Paths if they didn't already.

Maybe we should create a feature request for this after this PR is merged. Well, at least for the accepting part.

@wRAR
Copy link
Member

wRAR commented Nov 25, 2022

Thanks! I've finished reviewing this.

sample_feeds_dir = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'sample_data', 'feeds')
sample_feed_path = os.path.join(sample_feeds_dir, 'feed-sample3.csv')
sample_feed2_path = os.path.join(sample_feeds_dir, 'feed-sample4.csv')
sample_feed3_path = os.path.join(sample_feeds_dir, 'feed-sample5.csv')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch @mdonoughe!

'..',
'tox.ini',
)
tox_config_file_path = Path(__file__).parent / '..' / 'tox.ini'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tox_config_file_path = Path(__file__).parent / '..' / 'tox.ini'
tox_config_file_path = Path(__file__).parent.parent / 'tox.ini'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code was like this so I didn't change it. '..' and .parent are slightly different if there are symbolic links. It'd probably be okay to change it here.

os.remove(local_fname)
self.assertTrue(local_fname.exists())
self.assertEqual(local_fname.read_bytes(), b"I have the power!")
local_fname.unlink()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR: if a previous asserts fails, the file is not removed

@kmike
Copy link
Member

kmike commented Nov 26, 2022

Thanks for the careful implementation @mdonoughe!
There were so many chances to get it wrong - and you avoided all of them :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted This pull request looks valid for Hackoctoberfest even if it has not been approved yet.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants