New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to set repodata mtime to specific value #9

Open
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
3 participants
@marmarek

marmarek commented Oct 6, 2018

This allows to create repodata reproducibly (together with 'revision'
option). One use case is setting this to SOURCE_DATE_EPOCH during installation iso compose, to make it reproducible.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1430662

@james-antill

This comment has been minimized.

Show comment
Hide comment
@james-antill

james-antill Oct 7, 2018

Contributor

This seems like a bad idea, you are just brute forcing the timestamp of the metadata no matter what the input is. If you want to go this way make it a "use this timestamp" instead of the clamp idea.

What I think you want is something more like setting the timestamp of the primary/etc. metadata based on the latest timestamp of the rpms (even this needs more thought though because there are problems here where you have a set of rpms generate metadata and then remove the newest rpm and regenerate the metadata ... and the timestamp goes backwards). The obvious solution is to also take any basedir directories mtime into account.

Contributor

james-antill commented Oct 7, 2018

This seems like a bad idea, you are just brute forcing the timestamp of the metadata no matter what the input is. If you want to go this way make it a "use this timestamp" instead of the clamp idea.

What I think you want is something more like setting the timestamp of the primary/etc. metadata based on the latest timestamp of the rpms (even this needs more thought though because there are problems here where you have a set of rpms generate metadata and then remove the newest rpm and regenerate the metadata ... and the timestamp goes backwards). The obvious solution is to also take any basedir directories mtime into account.

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Oct 7, 2018

If you want to go this way make it a "use this timestamp" instead of the clamp idea.

What do you think about binding this with 'revision'? Like --revision XXX --set-timestamp-to-revision? Both options needs to be used to get reproducible result anyway.
I was also considering only adjusting timestamp fields in repomd.xml, but skip os.utime calls. But IMO that could be confusing.

(even this needs more thought though because there are problems here where you have a set of rpms generate metadata and then remove the newest rpm and regenerate the metadata ... and the timestamp goes backwards).

This is exactly why why I discarded this idea before (see linked bugzilla comment). As for basedir mtime - do you mean doing this by default, without separate option? Otherwise I don't see why it would be better than an option to force specific timestamp. That would only change how to provide timestamp value (touch --date ... instead of an option).

A totally different solution could be skipping <timestamp> nodes in repomd.xml at all. But I'm not sure if that's a valid thing to do.

marmarek commented Oct 7, 2018

If you want to go this way make it a "use this timestamp" instead of the clamp idea.

What do you think about binding this with 'revision'? Like --revision XXX --set-timestamp-to-revision? Both options needs to be used to get reproducible result anyway.
I was also considering only adjusting timestamp fields in repomd.xml, but skip os.utime calls. But IMO that could be confusing.

(even this needs more thought though because there are problems here where you have a set of rpms generate metadata and then remove the newest rpm and regenerate the metadata ... and the timestamp goes backwards).

This is exactly why why I discarded this idea before (see linked bugzilla comment). As for basedir mtime - do you mean doing this by default, without separate option? Otherwise I don't see why it would be better than an option to force specific timestamp. That would only change how to provide timestamp value (touch --date ... instead of an option).

A totally different solution could be skipping <timestamp> nodes in repomd.xml at all. But I'm not sure if that's a valid thing to do.

@james-antill

This comment has been minimized.

Show comment
Hide comment
@james-antill

james-antill Oct 7, 2018

Contributor

So the main problem is people use the timestamp value as a meaningful value of "this always goes up, so newer is better". Eg. yum will use it to decide you've hit an old mirror and not download older data. mirrormanager also uses it to order metalink data (the SuSE metalink generator may do something similar).
yum will also show the value as the "repo-updated" value in repoinfo.

Like --revision XXX --set-timestamp-to-revision?

If you prefer that I'm fine with it, either way. My main concern was the clamping would do confusing things in weird edge cases, so it'd be better to just always set it to a value instead of clamp.

I was also considering only adjusting timestamp fields in repomd.xml, but skip os.utime calls. But IMO that could be confusing.

I don't think yum cares about the timestamp on the files but I wouldn't trust that until someone looked at the code and did some tests.

As for basedir mtime - do you mean doing this by default, without separate option?

Yeh, like that seems a useful feature that multiple runs of createrepo will generate the exact same repodata ... and I can't even think of a reason to not do it.

A totally different solution could be skipping nodes in repomd.xml at all. But I'm not sure if that's a valid thing to do.

My guess at that is that everything blows up in some horrible way, but I haven't tested that either :).

Contributor

james-antill commented Oct 7, 2018

So the main problem is people use the timestamp value as a meaningful value of "this always goes up, so newer is better". Eg. yum will use it to decide you've hit an old mirror and not download older data. mirrormanager also uses it to order metalink data (the SuSE metalink generator may do something similar).
yum will also show the value as the "repo-updated" value in repoinfo.

Like --revision XXX --set-timestamp-to-revision?

If you prefer that I'm fine with it, either way. My main concern was the clamping would do confusing things in weird edge cases, so it'd be better to just always set it to a value instead of clamp.

I was also considering only adjusting timestamp fields in repomd.xml, but skip os.utime calls. But IMO that could be confusing.

I don't think yum cares about the timestamp on the files but I wouldn't trust that until someone looked at the code and did some tests.

As for basedir mtime - do you mean doing this by default, without separate option?

Yeh, like that seems a useful feature that multiple runs of createrepo will generate the exact same repodata ... and I can't even think of a reason to not do it.

A totally different solution could be skipping nodes in repomd.xml at all. But I'm not sure if that's a valid thing to do.

My guess at that is that everything blows up in some horrible way, but I haven't tested that either :).

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Oct 8, 2018

Unfortunately the basedir mtime idea wouldn't work that way, because running createrepo alters that mtime (by creating new repodata dir). I'll go with setting timestamp to fixed value, same as --revision.

marmarek commented Oct 8, 2018

Unfortunately the basedir mtime idea wouldn't work that way, because running createrepo alters that mtime (by creating new repodata dir). I'll go with setting timestamp to fixed value, same as --revision.

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Oct 8, 2018

Actually, maybe even separate --set-timestamp-to-revision switch is not needed? All of it applies to files created during createrepo run, so their mtime will be "now". Always resetting mtime (and so <timestamp> repomd.xml fields) to --revision shouldn't be a problem, as --revision defaults to "now". This may be even better, because all the repodata will look like modified at the same time, instead of affected by the time needed to compute them. And since repodata is created in temporary directory, it shouldn't result in any outside side effects.
The only effect would be inability to set custom --revision without affecting timestamps, which would matter if one use arbitrary number instead of timestamp there. Is that the case anywhere (doesn't look to be in Fedora at least)?
What do you think about it?

marmarek commented Oct 8, 2018

Actually, maybe even separate --set-timestamp-to-revision switch is not needed? All of it applies to files created during createrepo run, so their mtime will be "now". Always resetting mtime (and so <timestamp> repomd.xml fields) to --revision shouldn't be a problem, as --revision defaults to "now". This may be even better, because all the repodata will look like modified at the same time, instead of affected by the time needed to compute them. And since repodata is created in temporary directory, it shouldn't result in any outside side effects.
The only effect would be inability to set custom --revision without affecting timestamps, which would matter if one use arbitrary number instead of timestamp there. Is that the case anywhere (doesn't look to be in Fedora at least)?
What do you think about it?

@dmnks

This comment has been minimized.

Show comment
Hide comment
@dmnks

dmnks Oct 12, 2018

Contributor

The only effect would be inability to set custom --revision without affecting timestamps, which would matter if one use arbitrary number instead of timestamp there. Is that the case anywhere (doesn't look to be in Fedora at least)?
What do you think about it?

I agree with you that we should always strive to keep things simple so, with that in mind, setting mtime of the metadata files and <timestamp> tags to --revision seems like a good idea because then you don't need yet another option such as --set-timestamp-to-revision to get fully reproducible builds.

OTOH, as you say, people may be using that field arbitrarily (all the man page says is "Arbitrary string for a repository revision.") and changing the default behavior could easily introduce unexpected regressions somewhere.

So, if we decide to go down the explicit route, one random idea to avoid having too many new options would be to change the option name --clamp-mtime-to to just --clamp-timestamp and its semantics to something like "clamp metadata revision and modification times to the value of the SOURCE_DATE_EPOCH environment variable, to ensure reproducible builds". That looks like a good blend of configurable mtime/revision as well as keeping things (and options) simple, with the added bonus of supporting the actual specification out of the box.

It's basically what you proposed originally but without the need to specify the desired mtime twice on the command line; once for --clamp-mtime-to and once for --revision (it would read the env var instead, and set both the mtime and revision to that value).

One drawback would be that people might not be familiar with the specification and the whole idea, so the option might not make that much sense in such form. What do you think?

Contributor

dmnks commented Oct 12, 2018

The only effect would be inability to set custom --revision without affecting timestamps, which would matter if one use arbitrary number instead of timestamp there. Is that the case anywhere (doesn't look to be in Fedora at least)?
What do you think about it?

I agree with you that we should always strive to keep things simple so, with that in mind, setting mtime of the metadata files and <timestamp> tags to --revision seems like a good idea because then you don't need yet another option such as --set-timestamp-to-revision to get fully reproducible builds.

OTOH, as you say, people may be using that field arbitrarily (all the man page says is "Arbitrary string for a repository revision.") and changing the default behavior could easily introduce unexpected regressions somewhere.

So, if we decide to go down the explicit route, one random idea to avoid having too many new options would be to change the option name --clamp-mtime-to to just --clamp-timestamp and its semantics to something like "clamp metadata revision and modification times to the value of the SOURCE_DATE_EPOCH environment variable, to ensure reproducible builds". That looks like a good blend of configurable mtime/revision as well as keeping things (and options) simple, with the added bonus of supporting the actual specification out of the box.

It's basically what you proposed originally but without the need to specify the desired mtime twice on the command line; once for --clamp-mtime-to and once for --revision (it would read the env var instead, and set both the mtime and revision to that value).

One drawback would be that people might not be familiar with the specification and the whole idea, so the option might not make that much sense in such form. What do you think?

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Oct 12, 2018

If going to support actual specification, mere presence of SOURCE_DATE_EPOCH variable should trigger this behavior: Build processes MUST use this variable for embedded timestamps in place of the "current" date and time.. It makes sense as in generic case, the createrepo/createrepo_c call could be buried deep in the build process. In this case, the caller would use "if SOURCE_DATE_EPOCH is set, add --clamp-mtime option" anyway. So I propose to not mix those two things and either:

  • trigger this feature by mere presence of SOURCE_DATE_EPOCH (per specification), no extra option, or
  • use explicit value from command line option (either --revision, or else)

Here, I think the second version will be less confusing, as there are also other use cases than reproducibly building installation image.

marmarek commented Oct 12, 2018

If going to support actual specification, mere presence of SOURCE_DATE_EPOCH variable should trigger this behavior: Build processes MUST use this variable for embedded timestamps in place of the "current" date and time.. It makes sense as in generic case, the createrepo/createrepo_c call could be buried deep in the build process. In this case, the caller would use "if SOURCE_DATE_EPOCH is set, add --clamp-mtime option" anyway. So I propose to not mix those two things and either:

  • trigger this feature by mere presence of SOURCE_DATE_EPOCH (per specification), no extra option, or
  • use explicit value from command line option (either --revision, or else)

Here, I think the second version will be less confusing, as there are also other use cases than reproducibly building installation image.

@dmnks

This comment has been minimized.

Show comment
Hide comment
@dmnks

dmnks Oct 12, 2018

Contributor

Oh, you're right, thanks for the correction. I didn't read the spec carefully and missed the "MUST use" part.

The first approach of checking the env var presence looks better in this context, however, it's still a bit brittle since you could theoretically have such an env var defined (for whatever reason) without knowing anything about the spec or reproducible builds.

What I also didn't realize is that the linked bugzilla was not originally filed by you and it wasn't about this particular specification.

Given all that, I agree that simply supporting reproducible builds (regardless of the specification) is the better approach here.

I still think, though, that having a single option to achieve this (instead of --revision XXX --set-timestamp-to-revision) would be nice. By using these two, it's not clear whether that's all that has to be "clamped" or if there are some other variables that could break the reproducibility. So having something like --clamp-timestamp could make for a nicer UX in terms of ensuring that I really get reproducible builds.

Contributor

dmnks commented Oct 12, 2018

Oh, you're right, thanks for the correction. I didn't read the spec carefully and missed the "MUST use" part.

The first approach of checking the env var presence looks better in this context, however, it's still a bit brittle since you could theoretically have such an env var defined (for whatever reason) without knowing anything about the spec or reproducible builds.

What I also didn't realize is that the linked bugzilla was not originally filed by you and it wasn't about this particular specification.

Given all that, I agree that simply supporting reproducible builds (regardless of the specification) is the better approach here.

I still think, though, that having a single option to achieve this (instead of --revision XXX --set-timestamp-to-revision) would be nice. By using these two, it's not clear whether that's all that has to be "clamped" or if there are some other variables that could break the reproducibility. So having something like --clamp-timestamp could make for a nicer UX in terms of ensuring that I really get reproducible builds.

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Oct 12, 2018

So having something like --clamp-timestamp could make for a nicer UX in terms of ensuring that I really get reproducible builds.

I assume with explicit value to that option of course, right?
I'll adjust the patch. And then make similar patch to createrepo_c...

marmarek commented Oct 12, 2018

So having something like --clamp-timestamp could make for a nicer UX in terms of ensuring that I really get reproducible builds.

I assume with explicit value to that option of course, right?
I'll adjust the patch. And then make similar patch to createrepo_c...

@dmnks

This comment has been minimized.

Show comment
Hide comment
@dmnks

dmnks Oct 12, 2018

Contributor

So having something like --clamp-timestamp could make for a nicer UX in terms of ensuring that I really get reproducible builds.

I assume with explicit value to that option of course, right?
I'll adjust the patch. And then make similar patch to createrepo_c...

Yes :) (you can also name it --clamp-timestamp-to if you wish)

One alternative that also came to mind would be that we would keep the --revision XXX --set-timestamp-to-revision paradigm but document it in the man page (in a separate section such as "REPRODUCIBLE BUILDS" or something).

But honestly, I don't know what's better. @james-antill , any idea?

Contributor

dmnks commented Oct 12, 2018

So having something like --clamp-timestamp could make for a nicer UX in terms of ensuring that I really get reproducible builds.

I assume with explicit value to that option of course, right?
I'll adjust the patch. And then make similar patch to createrepo_c...

Yes :) (you can also name it --clamp-timestamp-to if you wish)

One alternative that also came to mind would be that we would keep the --revision XXX --set-timestamp-to-revision paradigm but document it in the man page (in a separate section such as "REPRODUCIBLE BUILDS" or something).

But honestly, I don't know what's better. @james-antill , any idea?

@dmnks

This comment has been minimized.

Show comment
Hide comment
@dmnks

dmnks Oct 12, 2018

Contributor

I'll adjust the patch. And then make similar patch to createrepo_c...

OK. I'll be watching what createrepo_c thinks about that proposal as well before merging this so that we don't have it implemented in different ways.

Contributor

dmnks commented Oct 12, 2018

I'll adjust the patch. And then make similar patch to createrepo_c...

OK. I'll be watching what createrepo_c thinks about that proposal as well before merging this so that we don't have it implemented in different ways.

Add option to set repodata timestamps to --revision value
This allows to create repodata reproducibly.
BZ 1430662
@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Oct 12, 2018

Pushed update with --set-timestamp-to-revision and its documentation.

marmarek commented Oct 12, 2018

Pushed update with --set-timestamp-to-revision and its documentation.

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek commented Oct 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment