Split processing into separate scan and generate operations #6

Closed
neilmayhew opened this issue Apr 29, 2016 · 3 comments
@neilmayhew
Contributor

I understand that speed and cost are an issue for generating metadata. How would it be if you only looked at newly-added packages? You already have data about the existing packages cached in the database, so they don't all need to be rescanned. The job that processes incoming packages could tell asgen which packages to consider by passing it the .changes files to look at. That would avoid the expensive loading of indices. (This would be a good option in my infrastructure, where we could call asgen from the cron job that scans the incoming directories for changes.)

It would still be necessary to regenerate the whole of the Components and icons files, but this would be done from cached data in the database and would be considerably less expensive than reading the indices, since there are far fewer apps than packages. Alternatively, it could be done by a separate cron job, which could run more frequently since this part of the operation is less costly.

I propose that the current process operation be split into two separate operations, scan and generate. The scan operation would take a list of .changes or .deb files to look at. There would still be a process operation to scan an entire suite, but it would use these two other operations internally.
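
To make the proposal concrete, here is a minimal sketch of how the split could look. All names here (`MetadataCache`, `extract_metadata`, `scan`, `generate`, `process`) are hypothetical and chosen for illustration only; this is not asgen's actual code, and the cache is an in-memory stand-in for the real database.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataCache:
    """Hypothetical stand-in for the database of already-extracted metadata."""
    components: dict = field(default_factory=dict)


def extract_metadata(package_file: str) -> dict:
    # Placeholder: a real implementation would unpack the package and read
    # its metadata. Here we only record which file the entry came from.
    return {"source": package_file}


def scan(cache: MetadataCache, package_files: list) -> int:
    """Scan only the given package files; return how many were new."""
    new_entries = 0
    for path in package_files:
        if path not in cache.components:
            cache.components[path] = extract_metadata(path)
            new_entries += 1
    return new_entries


def generate(cache: MetadataCache) -> list:
    """Rebuild the complete output from cached data, without rescanning."""
    return [cache.components[key] for key in sorted(cache.components)]


def process(cache: MetadataCache, suite_packages: list) -> list:
    """Full-suite run, expressed in terms of the two smaller operations."""
    scan(cache, suite_packages)
    return generate(cache)
```

With this shape, an incoming-package job would call only `scan` with the new files, and `generate` could run on its own schedule.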

@ximion
Owner

ximion commented Apr 29, 2016

No ETA for this yet. It would be a rather big change, requiring changes to the backends and the engine, but it's a useful feature.

@ximion
Owner

ximion commented Mar 26, 2018

I am looking into this now, and there is no way around loading the index, because we need to know which packages are currently in a suite.
For efficiency, the index is currently loaded only once: in the first step, data is extracted from the index, and in the second step, output is generated from the extracted metadata and media content.

What I can (and will) add is a way to process a specific package given its filename, section and suite (maybe we'll also parse a .changes file at some point). This does not require much cache loading, so it should be really fast.
However, it won't save you from loading the whole index as soon as you want to publish your archive with new metadata.

At the moment, there is also a lot of internal state which decides which steps do or don't happen (e.g. we don't publish metadata if we didn't just scan a package which had new data, or if the index wasn't changed). That state would have to be written to disk. So I assume that, at first, the split version of asgen will be much less efficient.
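
As a rough illustration of what "written to disk" could mean here, the sketch below persists a single flag between a scan step and a later publish step. The JSON file, the `metadata_changed` key and all function names are assumptions made for this example; asgen's real internal state is more involved and is not described in this thread.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the persisted state, or return defaults if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"metadata_changed": False}


def record_scan(path: Path, found_new_data: bool) -> None:
    """After a scan step: remember whether any scan so far found new data."""
    state = load_state(path)
    state["metadata_changed"] = state["metadata_changed"] or found_new_data
    path.write_text(json.dumps(state))


def should_publish(path: Path) -> bool:
    """Before a publish step, possibly in a different process."""
    return load_state(path)["metadata_changed"]


def mark_published(path: Path) -> None:
    """After publishing: reset the flag so unchanged data is not republished."""
    path.write_text(json.dumps({"metadata_changed": False}))
```

Every step now pays for a file read and write that a single in-process run avoids, which is one reason a split version can be slower.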

ximion added a commit that referenced this issue Mar 26, 2018
See #6
This is really inefficient right now and will need to be improved.
ximion added a commit that referenced this issue Mar 26, 2018
See #6
This needs backend support now, as well as a bunch of experimental
changes on how we deal with the arch:all architecture.
@ximion
Owner

ximion commented Apr 5, 2018

This is implemented now, although not very efficiently (and it likely never will be).
I still consider this a somewhat experimental feature, and I have a few ideas on how to improve it and make it more useful for unit testing, or for package archives where you know that appstream-generator does not have to hunt down icons across the whole package repository (with a flag for that, efficiency and speed could be greatly increased).

For now, you can play with this feature using the process-file command, which takes a suite name, an archive component (section), and a list of (.deb) package files as parameters. The publish command can be used to compile the final icon tarballs, YAML or XML metadata, and HTML and JSON files.
Don't rely on those features yet, though: it's early work and not as well tested as the rest of the generator (in particular, I am not yet sure whether arch:all packages will behave the same way as they do in a complete generator run).
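
For scripting this from an incoming-packages job, a small helper that assembles the process-file invocation might look like the sketch below. The argument order follows the description above (suite, section, then package files). The executable name `appstream-generator`, the helper name, and the example suite and file names are assumptions. The arguments of the publish command are not spelled out in this thread, so it is left out here.

```python
import shlex

# Assumed executable name; adjust to match the local installation.
ASGEN = "appstream-generator"


def process_file_argv(suite: str, section: str, package_files: list) -> list:
    """Build the argument vector for a process-file run (hypothetical helper)."""
    if not package_files:
        raise ValueError("at least one package file is required")
    return [ASGEN, "process-file", suite, section, *package_files]


if __name__ == "__main__":
    # Example values only; pass the list to subprocess.run() to execute it.
    argv = process_file_argv("stable", "main", ["pool/main/f/foo/foo_1.0_amd64.deb"])
    print(shlex.join(argv))
```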

ximion closed this as completed Apr 5, 2018