Don't require a full archive mirror #5

Open
iainlane opened this Issue Dec 1, 2015 · 5 comments

iainlane commented Dec 1, 2015

We just talked about deploying appstream-dep11 into production for Ubuntu. In our environment we don't have access to a mountable full mirror of the archive, so we will need to create one locally. That is quite a lot of space if you multiply by all arches and (future) releases.

Is it possible to rely on the Contents files instead to tell us which packages are interesting, and download only those packages, so that we don't need a full mirror?
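A minimal sketch of what such a Contents-based selection could look like, assuming the usual Contents layout of one file path per line followed by a comma-separated list of `section/package` entries. The function name `interesting_packages` and the path patterns are illustrative assumptions, not part of appstream-dep11:

```python
import fnmatch

# Hypothetical patterns for files that carry AppStream-relevant metadata.
INTERESTING_PATTERNS = (
    "usr/share/applications/*.desktop",
    "usr/share/appdata/*.xml",
    "usr/share/metainfo/*.xml",
)


def interesting_packages(contents_lines):
    """Return the set of package names that ship at least one interesting file."""
    packages = set()
    for line in contents_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # The file path may contain spaces, so split on the last whitespace run.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        path, locations = parts
        if not any(fnmatch.fnmatchcase(path, pat) for pat in INTERESTING_PATTERNS):
            continue
        for location in locations.split(","):
            # Each location is "section/package"; keep only the package name.
            packages.add(location.rsplit("/", 1)[-1])
    return packages
```

The result contains only package names, because Contents files carry no version information; mapping names to concrete .deb files would still need the Packages index.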

Would be happy to work on this if it's going to be feasible.

ximion (Owner) commented Dec 1, 2015

Hi!
While using the Contents file to ignore packages we don't need is possible (ideally by adding "ignore" tags to the database based on the Contents data prior to processing), there are a few reasons why we don't do that:

  • Currently, we mainly process .desktop files and AppStream metainfo XML, but in future there will be support for extracting even more metadata, e.g. from pkg-config .pc files, or other useful data (firmware components, drivers, binaries in PATH, etc.), so we will have to process a large number of packages anyway. Narrowing down the set of packages up front might therefore not have a big effect, and we might miss metadata that way.
  • The Contents.gz files do not contain version information for packages, and we have no idea when they were regenerated or whether their data is in sync with the rest of the archive. This might result in us ignoring packages which do contain metadata we want, or in us not reprocessing data even though it was updated. Contents files are not really an exact data source.

To mitigate the last issue, one could move the "select interesting packages" step to the archive software, and have it return a list of package/version/arch triplets for the DEP-11 generator to process. But that would require some additional effort.

iainlane commented Dec 2, 2015

It seems annoying to have to maintain an extra place (dak, Launchpad) to tell us which packages are interesting.

It would be possible to remember which versions you've looked at, so that you don't miss any new packages, but the problem that Contents can lag behind the archive remains.

ximion (Owner) commented Dec 9, 2015

A potential solution could be to mirror packages based on previous processing of the Contents file, and to not treat a missing .deb package as an error when some special PartialMirror option is set.
That way, we would get all package versions of a potentially interesting package processed. There might still be some delays in case the Contents file is not in sync, but the risk of "forgetting" packages should be almost zero.
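As a rough sketch of the "missing .deb is not an error" behaviour, assuming the option is exposed as a boolean: the names `locate_debs`, `partial_mirror` and `MissingPackageError` are hypothetical and do not refer to existing appstream-dep11 code.

```python
import os


class MissingPackageError(Exception):
    """Raised when a package listed in the index is absent from the mirror."""


def locate_debs(filenames, mirror_root, partial_mirror=False):
    """Split index filenames into local paths that exist and entries to skip.

    With partial_mirror=False, a missing file is an error, as on a full
    mirror. With partial_mirror=True, missing files are collected and skipped.
    """
    found, skipped = [], []
    for rel in filenames:
        path = os.path.join(mirror_root, rel)
        if os.path.isfile(path):
            found.append(path)
        elif partial_mirror:
            skipped.append(rel)
        else:
            raise MissingPackageError(rel)
    return found, skipped
```

Returning the skipped entries rather than dropping them silently keeps it possible to log them or retry once the mirror has caught up.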

cjwatson commented Dec 9, 2015

We could also perhaps rescan anything newer than (Contents timestamp - fuzz factor) when Contents information becomes available.
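A small sketch of that rescan rule, assuming a publication time is known for each package and the Contents file has a usable timestamp. The function name, the `publish_times` mapping and the six-hour default are assumptions chosen for illustration:

```python
from datetime import datetime, timedelta


def packages_to_rescan(publish_times, contents_time, fuzz=timedelta(hours=6)):
    """Return the names of packages published after (contents_time - fuzz).

    publish_times maps a package name to the datetime it was published.
    The fuzz factor widens the window so that packages published shortly
    before Contents was regenerated are rescanned as well.
    """
    cutoff = contents_time - fuzz
    return sorted(name for name, published in publish_times.items()
                  if published > cutoff)
```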

An alternative option comes to mind. One way or another, it sounds like we'll end up fetching a large fraction of the archive, but we only use any given file once because the DEP-11 generator remembers which packages it's already examined. Would it make sense to have an option to stream packages on demand rather than generating them? That wouldn't change the network requirements, and the first run would of course take a very long time, but it would radically reduce the disk storage requirements which might simplify things in the sort of environment we'll be deploying to.
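The streaming idea could be sketched roughly as follows: each package is downloaded to a temporary file, processed, and deleted again, so only one .deb is on disk at a time. The names `process_streamed`, `fetch`, `process` and `already_seen` are hypothetical; `fetch` could be something like `urllib.request.urlretrieve`, and `process` stands in for whatever metadata extraction the generator performs.

```python
import os
import tempfile


def process_streamed(package_urls, fetch, process, already_seen):
    """Fetch, process and discard each package that has not been seen yet.

    package_urls maps a package identifier (e.g. a name/version/arch tuple)
    to its download URL. fetch(url, dest_path) downloads one file and
    process(path) extracts metadata from it. already_seen is a set that is
    updated in place so later runs can skip packages handled here.
    """
    results = {}
    for pkg_id, url in package_urls.items():
        if pkg_id in already_seen:
            continue
        fd, tmp_path = tempfile.mkstemp(suffix=".deb")
        os.close(fd)
        try:
            fetch(url, tmp_path)
            results[pkg_id] = process(tmp_path)
            already_seen.add(pkg_id)
        finally:
            # Remove the download whether or not processing succeeded.
            os.remove(tmp_path)
    return results
```

A package is only marked as seen after processing succeeds, so a failed download or extraction is retried on the next run.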

cjwatson commented Dec 9, 2015

Regarding putting the "is this package interesting" test in the archive software: while this makes sense at some level, it also introduces quite a lot of coupling that I think would get annoying as time goes on, so I think I agree with @iainlane there. The current design has the virtue of having practically no coupling with the archive software at all; in fact, as far as I can tell, it requires no changes to Launchpad itself, only to some Ubuntu-specific publishing scripts.

iainlane pushed a commit to iainlane/appstream-dep11 that referenced this issue Feb 19, 2016