Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor storage operations into separate Backend classes #348

Merged
merged 32 commits into from
Feb 2, 2021

Conversation

elfjes
Copy link
Contributor

@elfjes elfjes commented Oct 15, 2020

Following the discussion in #253 and #325 I've created a first iteration on what a Backend interface could look like and how the current file storage operations may be refactored into this interface. It goes from the following principles

  • app.py talks only to core.py with regards to package operations
  • at configuration time, a Backend implementation is chosen and created for the lifetime of the configured app
  • core.py proxies requests for packages to this Backend()
  • The Backend interface/api is defined through three things
    • methods that an implementation must implement
    • methods that an implementation may override if it knows better than the defaults
    • the PkgFIle class that is (should be) the main carrier of data
  • where possible, implementation details must be hidden from concrete Backends to promote extensibility

Other things I've done in this PR:

  • I've tried to talk about packages and projects, rather than files and prefixes, since these are the domain terms PEP503 uses, and imho it's also more clear what it means
  • Better testability of the CacheManager (no more race conditions when watchdog is installed during testing)
  • Cleanup some more Python 2 code
  • Started moving away from os.path and py.path in favour of pathlib

Furthermore I've created a plugin.py with a sample of how I think plugin system could look like. This sampIe assumes we use argparse and allows for the extension of cli arguments that a plugin may need. I think the actual implementation of such a plugin system is beyond the scope of this PR, but I've used it as a target for the Backend refactoring. If requested, I'll remove it from this PR.

The following things still need to be done / discussed. These can be part of this PR or moved into their own, separate PRs

  • Simplify the PgkFile class. It currently consists of a number of attributes that don't necessarily belong with it, and not all attributes are aptly named (imho). I would like to minimalize the scope of PkgFile so that its only concern is being a data carrier between the app and the backends, and make its use more clear.
  • Add a PkgFile.metadata that backend implementations may use to store custom data for packages. For example the current PkgFile.root attribute is an implementation detail of the filestorage backends, and other Backend implementations should not be bothered by it.
  • Use pathlib wherever possible. This may also result in less attributes for PkgFile, since some things may be just contained in a single Path object, instead of multtiple strings.
  • Improve testing of the CacheManager.

@mplanchard
Copy link
Contributor

mplanchard commented Oct 15, 2020 via email

@elfjes
Copy link
Contributor Author

elfjes commented Oct 15, 2020

Sure no worries. Looking forward to your feedback 😄

@mplanchard
Copy link
Contributor

Got a chance to give this a quick overview this evening, and I really like the direction! I have an in-progress branch to replace all usage of the old config with the stuff from #339 with all tests passing. You can see that here. It seems like that could mesh really well with the approach taken here. I think we would update the RunConfig to expose a backend attribute, rather than having it expose an iter_packages method.

I've got no major concerns with what I've seen so far, but I'll be sure to make time to give this a more thorough review over the course of this week.

In regards to the things you mentioned as potentially needing more work:

Simplify the PgkFile class. It currently consists of a number of attributes that don't necessarily belong with it, and not all attributes are aptly named (imho). I would like to minimalize the scope of PkgFile so that its only concern is being a data carrier between the app and the backends, and make its use more clear.

I agree with this goal! I don't feel strongly either way on whether it should be a part of this PR, but I definitely think it's worth exploring

Add a PkgFile.metadata that backend implementations may use to store custom data for packages. For example the current PkgFile.root attribute is an implementation detail of the filestorage backends, and other Backend implementations should not be bothered by it.

Yeah, and having some access to metadata will also give us a route towards e.g. fully supporting PEP 503 and the data_requires attribute (a la #320).

Use pathlib wherever possible. This may also result in less attributes for PkgFile, since some things may be just contained in a single Path object, instead of multtiple strings.

I've been working in this direction in my branch as well, so definitely no issues with seeing more of it.

Improve testing of the CacheManager.

Yeah the existing watchdog cache had really minimal tests, so anything here is an improvement


Regarding getting both of our streams of work going harmoniously, we can either get this merged, and then I can rebase and update on top of this, or vice-versa. I'll probably go ahead and open a PR with that branch tonight or tomorrow, just so that we can better look at how they might work together, but there's no rush to get that one in, and I'm happy to do the work of rebasing if we get this one in first.

Copy link
Contributor

@mplanchard mplanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found some time this evening for a first pass!


def with_digester(func: t.Callable[..., t.Iterable[PkgFile]]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to keep the decorator from overriding the underlying type, you can use a generic like:

DigesterFunc = t.TypeVar("DigesterFunc", bound=t.Callable[..., t.Iterable[PkgFile]])

def with_digester(func: DigesterFunc) -> DigesterFunc:
  ...

you may need a # type: ignore where you're returning your wrapper function inside the decorator, but by telling mypy that you're returning a type of the same type that was put into the function, you allow it to infer that the resulting decorated function takes the same arguments as the original

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! :)

pypiserver/manage.py Outdated Show resolved Hide resolved

# @pytest.mark.xfail(
# ENABLE_CACHING, reason="race condition when caching is enabled"
# )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will remove

@@ -104,7 +91,7 @@ def root():
fp = request.custom_fullpath

try:
numpkgs = len(list(packages()))
numpkgs = len(list(core.get_all_packages()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we added a "package_count" or similar method to the backend so that, for example, we wouldn't actually have to parse all the files into PkgFile instances in order to get their count, and could instead just nab all files whose names match a regex or something. It would also be useful with imagined remote backends, where pulling all the files might be a problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a package_count method sounds good, but I'm not sure about adding "query" or filter functionality. This would require additional complexity from each backend (including said imagined ones) without adding apparent value. What would be a use case for having such filtering functionality in a package_count method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just meant a count without any custom filtering, but I expressed that poorly. Where I was talking about a regex, I essentially just meant "count every file that looks like a package based on the filename"

links = [
(f.relfn_unix, urljoin(fp, f.fname_and_hash(config.hash_algo)))
for f in files
(pkg.relfn_unix, urljoin(fp, pkg.fname_and_hash)) for pkg in packages
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we try replacing this list comprehension ([]) with a generator comprehension (())? We definitely need to allocate a list for sorting, but here I think we may be able to be lazy and just pass a generator into the template call, which should save some RAM on systems handling a lot of packages

Copy link
Contributor Author

@elfjes elfjes Oct 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure :)

Good catch on the fname_and_hash. Apparently no test caught that nevermind

)


def as_file(fh: t.BinaryIO, destination: PathLike):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bikeshed: as_file() or to_file() feels to me like the function will be returning something file-like. What about write_file()?

os.remove(pkg.fn)

def exists(self, filename):
# TODO: Also look in subdirectories?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently we're recursing into root directories, so we may want to do so here too


class CachingFileBackend(SimpleFileBackend):
def __init__(
self, config: Configuration, roots: t.List[PathLike], cache_manager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to define either a base class or a protocol for the cache_manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, no Protocols in python<3.8. I see that I put the type_hint for cache_manager in the __init__ assignment rather that the function signature, so I'm changing that.

Did you mean to add a sort of CacheManager interface/ base implementation that different backends may consume and/or extend? This might be an idea, but I'm afraid that there is not going to be much common requirements between the different backends, since most of the logic will lie in cache invalidation, which is going to be very backend-specific

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, no Protocols in python<3.8

This is true, but I think that if we want to use it, you can get it for earlier pythons by installing typing-extensions.

Did you mean to add a sort of CacheManager interface/ base implementation that different backends may consume and/or extend

Yep, that was the idea, so that you could potentially have like a WatchdogCachingFileBackend and a WhateverOtherCachingFileBackend, so long as the mangers share the same interface. It seems like all
we need for that at the moment is listdir() and digest_file(), so it might not be too bad to specify an abstract interface

)


class SimpleFileBackend(Backend):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mypy will probably want us to define types for the subclasses. Since this is a new file, we can add it to the typechecks in tox.ini to make sure it catches anything we miss

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with mypy and I'm not quite sure what you mean with this. A subclass is a type, right? Or do you mean something else?
I'll add the new file(s) to the tox.ini and see what needs be done :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I just meant that we would need to type these methods for mypy to be happy, and that mypy will enforce that they match the base class types to satisfy the liskov substitution principle

pypiserver/backend.py Outdated Show resolved Hide resolved
@elfjes
Copy link
Contributor Author

elfjes commented Oct 19, 2020

Thanks for the review, much appreciated :)

Got a chance to give this a quick overview this evening, and I really like the direction! I have an in-progress branch to replace all usage of the old config with the stuff from #339 with all tests passing. You can see that here. It seems like that could mesh really well with the approach taken here. I think we would update the RunConfig to expose a backend attribute, rather than having it expose an iter_packages method.

Yeah I was also thinking of having a --backend attribute which picks a backend. See also plugin.py in which I philosophized a bit on how we can support selection and configuration of different backends

Simplify the PgkFile class. It currently consists of a number of attributes that don't necessarily belong with it, and not all attributes are aptly named (imho). I would like to minimalize the scope of PkgFile so that its only concern is being a data carrier between the app and the backends, and make its use more clear.

I agree with this goal! I don't feel strongly either way on whether it should be a part of this PR, but I definitely think it's worth exploring

Maybe it's best to give it it's own PR, so that we don't have to make this one larger than it already is :)

Add a PkgFile.metadata that backend implementations may use to store custom data for packages. For example the current PkgFile.root attribute is an implementation detail of the filestorage backends, and other Backend implementations should not be bothered by it.

Yeah, and having some access to metadata will also give us a route towards e.g. fully supporting PEP 503 and the data_requires attribute (a la #320).

Yeah I realized that there are two types of metadata that we must distinguish. One is the PEP503 metadata that we need for full support of this PEP. The other is geared towards extra data that a specific backend implementation may need to keep track of it's packages, such as a package root for a FileBackend, a bucket name for an imagined MultiBucketGCSBackend or a package's primary key in a database, and have a designated place to store such additional attributes. My thought initially was with the latter, but we need the former as well.

Regarding getting both of our streams of work going harmoniously, we can either get this merged, and then I can rebase and update on top of this, or vice-versa. I'll probably go ahead and open a PR with that branch tonight or tomorrow, just so that we can better look at how they might work together, but there's no rush to get that one in, and I'm happy to do the work of rebasing if we get this one in first.

Yeah sure, I'll finish up this PR and then create a new one for the additional work to ease the integration pain a little bit

mplanchard added a commit that referenced this pull request Oct 25, 2020
This PR is a pretty substantial refactor of the entrypoints of pypiserver (`__main__` and `__init__`) to use the argparse-based config added in #339.

- Updated `RunConfig` and `UpdateConfig` classes to have exclusive init kwargs, instead of taking an namespace. This turned out to be much easier when working with the library-style app initialization in `__init__`, both for direct instantiation and via paste config
- Added an `iter_packages()` method to the `RunConfig` to iterate over packages specified by the configuration (note @elfjes, I think that replacing this with e.g. a `backend` reference will be a nice way to tie in #348)
- Added a general-purpose method to map legacy keyword arguments to the `app()` and `paste_app_factory()` functions to updated forms
- Refactored the `paste_app_factory()` to not mutate the incoming dictionary
- Removed all argument-parsing and config-related code from `__main__` and `core`
- Moved `_logwrite` from `__init__` to `__main__`, since that was the only place it was being used after the updates to `core`
- Updated `digest_file` to use `hashlib.new(algo)` instead of `getattr(hashlib, algo)`, because the former supports more algorithms
- Updated `setup.py` to, instead of calling `eval()` on the entirety of `__init__`, to instead just evaluate the line that defines the version
- Assigned the config to a `._pypiserver_config` attribute on the `Bottle` instance to reduce hacky test workarounds
- Fixed the tox config, which I broke in #339 

* Config: add auth & absolute path resolution

* Config: check pkg dirs on config creation

* Instantiate config with kwargs, not namespace

* WIP: still pulling the threads

* Init seems to be working

* tests passing locally, still need to update cache

* Fix tox command

* unused import

* Fix typing

* Be more selective in exec() in setup.py

* Require accurate casing for hash algos

* Remove old comment

* Comments, minor updates and simplifications

* move _logwrite to a more reasonable place

* Update config to work with cache

* Type cachemanager listdir in core

* Update config module docstring, rename method

* Add more comments re: paste config

* Add comments to main, remove unneded check

* Remove commented code

* Use {posargs} instead of [] for clarity in tox

* Add dupe check for kwarg updater

* Remove unused references on app instance

* Fix typo

* Remove redundancy in log level parsing
@elfjes
Copy link
Contributor Author

elfjes commented Oct 30, 2020

I've added a --backend option to the cli that by default automatically chooses between the caching and the simple file backend.

It took me a while to figure out how to exactly add the argument and make it work with the Config.default_with_overrides function that is used in the tests. Which got me thinking, if we want to add pluggable backends, these new backends may want to add new command line arguments. How would we deal with them in the currently design of the (Run)Config, which has fixed supported arguments? I guess we'd have to go back to some form of dynamic config objects, although that's probably a discussion for a different issue/PR

Speaking of pluggable backends, @mplanchard when you have time, could you give me your feedback on the proposal I made for plugins in plugin.py?

Also, if you're ok with the current status of this PR, we can merge it, so that I can start working on the other things in a separate PR.

@mplanchard
Copy link
Contributor

It took me a while to figure out how to exactly add the argument and make it work with the Config.default_with_overrides function that is used in the tests

There is definitely still room for improvement in making these obvious/easy to add to.

Which got me thinking, if we want to add pluggable backends, these new backends may want to add new command line arguments. How would we deal with them in the currently design of the (Run)Config, which has fixed supported arguments? I guess we'd have to go back to some form of dynamic config objects, although that's probably a discussion for a different issue/PR

I think the way we would approach this would be to have a dynamic portion of the config object, e.g. backend_config, which is a generic namespace. Plugins would need to provide some function to parse arguments and return a namespace, which we'd use to populate that property. Or something like that 🤷

@elfjes
Copy link
Contributor Author

elfjes commented Nov 15, 2020

yeah that could work. Or like a **custom argument to the config object that gets converted to a namespace. Or just kept as a dictionary

@@ -0,0 +1,2 @@
[run]
omit = pypiserver/bottle.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good thinking


Observer = None

ENABLE_CACHING = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but I wonder if we should (for 2.0) make this a config option instead of implicitly using the cache if the package is available. The implicitness feels potentially surprising to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see you added this already!


backend = available_backends[arg]

return BackendProxy(backend(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this in the _ConfigCommon so that it's available for the update command too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense to me. I think the main reason I didn't do that yet, was because it currently requires hash_algo to be part of the same config and that one's on the RunConfig. But if you're fine with moving that to common, then I see no reason to not do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, now that we need it in both the run and update configs, we might as well move it



class _ConfigCommon:
hash_algo: t.Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the hash_algo init argument to this common config?

"digest", # The file digest in the form of <algo>=<hash>
"relfn_unix", # The relative file path in unix notation
"parsed_version", # The package version as a tuple of parts
"digester", # a function that calculates the digest for the package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this documentation!

Copy link
Contributor Author

@elfjes elfjes Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I mostly put it there for my own sanity until I have the chance to refactor PkgInfo

self.fn = fn
self.root = root
self.relfn = relfn
self.relfn_unix = None if relfn is None else relfn.replace("\\", "/")
self.replaces = replaces
self.digest = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be digester to match the __slots__?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both digest and digester are part of slots. I do see that I need to assign self.digester=None in __init__ to be consistent

from pypiserver.backend import SimpleFileBackend, CachingFileBackend
from pypiserver import get_file_backend

DEFAULT_PACKAGE_DIRECTORIES = ["~/packages"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull this from config.DEFAULTS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in my opinion, if we're doing this as plugins, then the package root is a config setting specific to the file backends. Which would also mean that the cli argument should not be part of the global config, but provided by the plugin. Which is why I added that definition there, and not take it from the global config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does make sense. I just want to avoid needing to update this twice if we update it (although the likelihood that we'll ever change this is pretty low tbf)

@elfjes
Copy link
Contributor Author

elfjes commented Jan 15, 2021

Hi @mplanchard,

It's been a while, how are you? What do you think about the status of this PR? As far as I'm concerned it's ready to be merged. I've processed most, if not all of your feedback so far. Are there still any blocking things?

mplanchard
mplanchard previously approved these changes Feb 2, 2021
Copy link
Contributor

@mplanchard mplanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being slow on this. Q4 and the beginning of Q1 have been crazy at my day job. Let's get it in!

# if arg not in available_backends:
# raise argparse.ArgumentTypeError(
# f"Value must be one of {', '.join(available_backends.keys())}"
# )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, yeah I thought I might need to check that here, but it's actually picked up by argparse

@mplanchard
Copy link
Contributor

Leaving it to you to merge. Let me know if you don't have permissions, and I'll hit the button instead

@dee-me-tree-or-love
Copy link
Member

@mplanchard @elfjes if no rights to merge, I can do also!
(sorry for my low participation lately as well... also quite overloaded with various tasks on the job. @mplanchard, what do you also think, actually is it okay for me to also be merging/approving when I get a moment for it, maybe I will be poking for your call upfront in the future still?)

@mplanchard
Copy link
Contributor

@mplanchard, what do you also think, actually is it okay for me to also be merging/approving when I get a moment for it, maybe I will be poking for your call upfront in the future still

Yeah I think so! For bigger stuff, we should probably both review, but for small things I'm okay with either of us approving and merging. And feel free to tag me for a review if there's anything at all you're uncertain about or would like me to weigh in on. If there's anything time sensitive, bug me in Zulip. It's harder for me to ignore real time chat than GH issues 🤪.

@elfjes
Copy link
Contributor Author

elfjes commented Feb 2, 2021

I removed that dead code you mentioned, so I guess it needs reapproval. I don't have a merge button so if any of you @mplanchard @dee-me-tree-or-love can merge it, I'd be much obliged :)

@mplanchard mplanchard merged commit cf424c9 into pypiserver:master Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants