
client refactor: simplify mirrors loop #1352

Merged
merged 4 commits into theupdateframework:experimental-client on Apr 28, 2021

Conversation

sechkova
Contributor

@sechkova sechkova commented Apr 15, 2021

Addresses: #1318, #1339

Description of the changes being introduced by the pull request:

This set of changes simplifies the code and makes it easier to follow. As a result, it:

  • Removes the mirror_*_download functions and moves the mirrors loop inside the Updater
  • Replaces file objects used as function arguments with the read file content for metadata files, which are fully read into memory right after download anyway. Temporary files are now handled in a single function.
  • Does some small code cleanup
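The second bullet can be sketched roughly like this (a minimal illustration under assumed names — download_metadata_bytes and the fetcher interface are made up, not the actual updater_rework API):

```python
import tempfile

# Hypothetical sketch: the temporary file lives entirely inside this
# function; callers receive plain bytes instead of an open file object.
def download_metadata_bytes(fetch, url, max_length):
    """Download metadata from url and return its full content as bytes."""
    with tempfile.TemporaryFile() as temp_obj:
        for chunk in fetch(url, max_length):
            temp_obj.write(chunk)
        temp_obj.seek(0)
        return temp_obj.read()

# Usage with a stub fetcher that yields response chunks:
def stub_fetch(url, max_length):
    yield b'{"signed": '
    yield b"{}}"

content = download_metadata_bytes(
    stub_fetch, "https://example.com/root.json", 1024
)
```

With the temp file confined to one function, the callers never have to worry about closing it.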

Please verify and check that the pull request fulfills the following requirements:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

@sechkova sechkova added the experimental-client Items related to the development of a new client (see milestone/8 and the experimental-client branch) label Apr 15, 2021
@sechkova sechkova added this to the Experimental client milestone Apr 15, 2021
@jku
Member

jku commented Apr 16, 2021

One of the reasons for using a file object (a streaming object, if you will) is that we have not always needed to read the file into memory. With metadata that's a little moot, as we do have to read it to verify and process it... but with targets, not loading the whole file is a very good idea. So I think the file object commit is not good. This also means the updater_rework _check_hashes() is wrong.

@jku
Member

jku commented Apr 16, 2021

This is not really a review (sorry) but maybe background for future discussion:

Moving the download loop back to the updater looks like the right move (in other words, my original idea was bad): I can understand this code :) . However, I'm not sure the idea of separate loops is better than what the current/old updater does. The download loop is not trivial (we've had multiple bugs with file objects etc.), so is multiplying it going to help?

I think the underlying idea with your design is that

  • in the current updater it is very hard to follow how, e.g., specific metadata gets verified (what actually gets checked before new root metadata is accepted)
  • this becomes easier to understand when the root download code is in its own function and not intermingled with, e.g., the snapshot download code

This is valid... but I think the cost of duplicating the download code is too high -- this is visible in root download, where you already had to refactor that one into another function. I think having a generic metadata download function is fine; we just have to figure out how to make it clear what is verified when. I'm hoping #1313 will help there -- my vision of what would happen is something like this:

  • root download loop starts going through versions N+1
    • it calls generic metadata download loop that start downloading files from mirrors
      • for each candidate file, it tries to create a new metadata object: this verifies the internal state
      • and tries to insert the object into the current valid metadata bundle (aka local repository controller) as new root: this does most(?) of the required tuf spec checks in the root update part
      • if anything fails we continue the loop, otherwise our metadata bundle is now using the new root and we can break the loop
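The envisioned flow might look roughly like this sketch (FakeBundle, update_root and DownloadError are assumed names standing in for the #1313 design, not existing tuf API):

```python
class DownloadError(Exception):
    pass

class FakeBundle:
    """Stand-in for the envisioned trusted metadata bundle."""
    def __init__(self):
        self.root_version = 1

    def update_root(self, data):
        # the real bundle would parse the bytes into a Metadata object
        # and run the spec-mandated root checks (signatures, version,
        # expiry) before accepting it
        self.root_version += 1

def update_root(bundle, download_metadata, max_root_rotations=32):
    """Walk root versions N+1, N+2, ... until no newer root is found."""
    lower = bundle.root_version + 1
    for version in range(lower, lower + max_root_rotations):
        try:
            data = download_metadata("root", version)
        except DownloadError:
            break  # no newer root available: rotation is finished
        bundle.update_root(data)

# Usage with a stub downloader serving two newer root versions:
versions = {2: b"root-v2", 3: b"root-v3"}

def stub_download(role, version):
    try:
        return versions[version]
    except KeyError:
        raise DownloadError(version)

bundle = FakeBundle()
update_root(bundle, stub_download)
```

The point of the sketch is the division of labor: the loop only drives version numbers, while the bundle decides whether each candidate is acceptable.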

Another observation:

  • the Updater.verify*() functions are named confusingly: they create a Metadata object, then they do additional verification and return the object. It's not a verifying method, it's a factory method to create a Metadata object that meets some criteria
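The naming point could be illustrated with a sketch (the classes and the load_timestamp name are invented for illustration): a factory-style name makes it clearer that the method builds and returns a verified object rather than merely checking one.

```python
# Minimal stand-ins for Metadata and a trusted verifier:
class Metadata:
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_bytes(cls, data):
        if not data:
            raise ValueError("empty metadata")
        return cls(data)  # parsing verifies the internal state

class TrustedRoot:
    def verify(self, md):
        pass  # spec-mandated checks would happen here

# Factory-style: constructs a Metadata object that meets some criteria,
# rather than "verifying" an existing one.
def load_timestamp(data, trusted_root):
    md = Metadata.from_bytes(data)
    trusted_root.verify(md)
    return md

md = load_timestamp(b"payload", TrustedRoot())
```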

if self._metadata["root"].keys(
    "timestamp"
) != intermediate_root.keys("timestamp"):
    # FIXME: use abstract storage
Collaborator

Could we use TODO instead of FIXME?
I did a quick search in the tuf repo and found only one place in the code, besides this file, where we use those.
It seems we aren't using FIXME in our code, and if (hopefully) one day we decide to fix our TODOs, this one will be left out.

Member

@jku jku Apr 28, 2021


I don't really want to use abstract storage in the client (or make it a priority, even if Metadata supports it), but as this part of the code is hopefully going to move to the bundle anyway, I'm not looking at this sort of thing too closely now.

@MVrachev
Collaborator

This is valid... but I think the cost of duplicating the download code is too high -- this is visible in root download where you already had to refactor that one to another function. I think having a generic metadata download function is fine we just have to figure out how to make it clear what is verified when: I'm hoping #1313 will help there -- my vision of what would happen is something like this:

I was going to write the same thing and then I saw your comment.
I also feel like there is too much repetition and I hope we find another way to do this.
Will follow your Metadata bundle suggestion.

@sechkova sechkova force-pushed the mirrors-loop branch 3 times, most recently from 6561f7a to 660eb25 Compare April 20, 2021 11:06
@sechkova
Contributor Author

Agree that now that the mirror download code is moved back to the Updater, the code is more readable, but it also makes the repetition obvious. However, replacing it with a single function in the current state of the Updater code does not result in the desired clean implementation. I propose keeping these changes as a base to test the work proposed in #1313, see how it fits in, and at the same time reduce the code repetition.
I force-pushed, fixed a bug that @jku noticed in the mirrors loop, and also split the _check_hashes function in two: one correctly dealing with file objects for target file downloads and one using file content read into memory for metadata files.
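The split might look something like this sketch (the function names and the single-algorithm restriction are simplifications, not the real _check_hashes signatures):

```python
import hashlib
import io

def check_hashes_obj(file_obj, trusted_hash, chunk_size=8192):
    """Verify a file object in chunks, suitable for large target files."""
    digest = hashlib.sha256()
    file_obj.seek(0)
    for chunk in iter(lambda: file_obj.read(chunk_size), b""):
        digest.update(chunk)
    if digest.hexdigest() != trusted_hash:
        raise ValueError("observed hash != trusted hash")

def check_hashes_data(data, trusted_hash):
    """Verify metadata content that is already fully in memory."""
    if hashlib.sha256(data).hexdigest() != trusted_hash:
        raise ValueError("observed hash != trusted hash")

good = hashlib.sha256(b"hello").hexdigest()
check_hashes_data(b"hello", good)             # passes silently
check_hashes_obj(io.BytesIO(b"hello"), good)  # passes silently
```

The chunked variant never materializes the whole target in memory, which is exactly why targets keep using file objects while metadata moves to bytes.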

@sechkova sechkova changed the title WIP: client refactor: simplify mirrors loop client refactor: simplify mirrors loop Apr 20, 2021
@sechkova sechkova marked this pull request as ready for review April 20, 2021 11:54
Member

@jku jku left a comment


So my previous comment still stands:

I think the cost of duplicating the download code is too high

... but we have a plan:

After these steps we will be able to evaluate whether the branch has value or if we should just consider it lessons learned and start over...

So with this idea in mind I've reviewed again: I think I found one actual issue in target_download, and then some comments...

Comment on lines +167 to +174
if temp_obj:
    temp_obj.close()
Member

@jku jku Apr 26, 2021


This applies to every case with close(): the current client does temp_obj = None here (inside the if-clause), and I think it makes sense: otherwise temp_obj might still be defined on the next iteration even when it should not be, leading to us calling close() twice...

Calling close() twice should not break things, so this probably does not fail anything, but it looks fishy.

In fact, now that you've made the metadata handling bytes-based (and not file-object based), all of the metadata download loops could just use a download method that returns bytes -- but that does not need to happen in this PR. Maybe adding temp_obj = None is more straightforward for now?

Contributor Author


You're right, thanks for noticing. I added temp_obj = None for now, since changing the download function would mean a difference between metadata and targets. I'll keep this in mind.
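The resulting pattern can be sketched like this (fetch_from_mirrors and the mirror callables are illustrative stand-ins, not the actual Updater code):

```python
import io

def fetch_from_mirrors(mirrors):
    """Try each mirror until one succeeds, collecting per-mirror errors."""
    file_mirror_errors = {}
    temp_obj = None
    for mirror in mirrors:
        try:
            temp_obj = io.BytesIO(mirror())  # stand-in for the download
            return temp_obj
        except Exception as exception:
            file_mirror_errors[mirror] = exception
            if temp_obj:
                temp_obj.close()
                temp_obj = None  # so a later failure cannot close it twice
    raise RuntimeError(file_mirror_errors)

def broken_mirror():
    raise IOError("unreachable mirror")

def working_mirror():
    return b"target content"

obj = fetch_from_mirrors([broken_mirror, working_mirror])
```

Resetting temp_obj to None right after close() is the detail under discussion: without it, a later iteration's failure would try to close the already-closed object.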

except Exception as exception:
    if temp_obj:
        temp_obj.close()
    raise exceptions.NoWorkingMirrorError({file_mirror: exception})
Member


this should store the exception like the other functions, no?

Contributor Author


Yes, fixed, thanks

tuf/client_rework/updater_rework.py

# Raise an exception if any of the hashes are incorrect.
if trusted_hash != computed_hash:
    raise sslib_exceptions.BadHashError(trusted_hash, computed_hash)
Member


we have a BadHashError of our own and should not raise sslib errors
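The fix boils down to raising tuf's own exception type at this boundary. A generic sketch of the pattern (BadHashError here is a locally defined stand-in, not an import of the real tuf.exceptions class):

```python
class BadHashError(Exception):
    """tuf-style exception carrying the expected and observed digests."""
    def __init__(self, expected_hash, observed_hash):
        super().__init__(
            f"expected hash {expected_hash}, observed {observed_hash}"
        )
        self.expected_hash = expected_hash
        self.observed_hash = observed_hash

def ensure_hash(trusted_hash, computed_hash):
    # Raise our own exception type instead of securesystemslib's, so
    # client code only ever catches the project's exception hierarchy.
    if trusted_hash != computed_hash:
        raise BadHashError(trusted_hash, computed_hash)
```

Keeping library-internal exceptions from leaking into the public API means callers do not need to depend on securesystemslib directly.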

@sechkova
Contributor Author

Oops, my "fixup" commit is not signed off... but I'll want to rebase and squash this one after your review anyway.
On the other hand, do I add the linter fix by running the latest black version in this PR?

@jku
Member

jku commented Apr 28, 2021

Looks good enough to me -- let's get this merged so we can see if streamlining mirrors helps...

I'll merge develop into experimental-client now (#1365): After that if you'd like to rebase this on experimental-client (to get the linter fix) and also rebase --autosquash (to squash the fixup commit), that works for me.

The mirror_target_download and mirror_meta_download functions that
return iterators are causing more trouble than good.

Since there is a need to download and verify each file inside
the mirrors loop (before continuing to the next mirror), it feels
like the correct component to do this is the Updater itself.
This means slightly repetitive code inside the Updater class,
but with the benefit of better readability and a less bug-prone
implementation.

Signed-off-by: Teodora Sechkova <tsechkova@vmware.com>
The current implementation does not bring much benefit and
can be replaced with a simple f-string.

Signed-off-by: Teodora Sechkova <tsechkova@vmware.com>
By replacing file objects passed as function arguments with
the read file content, we simplify the life-cycle management
of temporary file objects. Temporary files are handled in a
single function. This is done for metadata files, which are
fully read into memory right after download anyway.

The same is not true for target files, which should preferably
be processed in chunks, so target download and verification
still deal with file objects. _check_hashes is split in two
functions: one dealing correctly with file objects and one
using the file content directly.

Signed-off-by: Teodora Sechkova <tsechkova@vmware.com>
Replace sslib exceptions raised in the Updater with the
corresponding ones defined in tuf.

Signed-off-by: Teodora Sechkova <tsechkova@vmware.com>
@sechkova
Contributor Author

Thanks, it should be ready now.

@jku jku merged commit 79aa42a into theupdateframework:experimental-client Apr 28, 2021