Missing features for Archives #1091

Open
bari12 opened this issue Apr 24, 2018 · 8 comments

Comments

@bari12
Member

commented Apr 24, 2018

Archive missing features tracker

In this ticket we are tracking/discussing the features missing from the Rucio archive support.

  • Download full archive and extract a single file #46
  • xrdcp transfer a single file part of a zip file #1137
  • Transparent list-replicas support of zip contents #1138
  • Forward of rule creation from constituent to archive #1376
  • list_dataset_replicas should resolve archives as well #1375
  • extract flag needs to be added to metalink #1353
  • Bug with checksums on xrdcp streamed replicas needs to be fixed #1613
  • Checksum checking needs to be enabled again due to fix in #1613
  • Forward of constituent rule to archive in case of tape replica #1663
  • Client needs to extract files from archive #1354
  • Revert xrdcp workaround #1598
  • Handling archives in the reaper #1431
  • Handling of lost files in the necromancer (Should not be removed if there is a zip replica)
  • Full transparent handling of constituents in the judge/rules
  • list_replicas should re-order zips and prioritise root protocol over everything else #2313
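A minimal sketch of the last item, covering only the "prioritise root protocol" part. The function name `prioritise_root` and the example PFNs are hypothetical and not part of Rucio; how zip replicas should be ordered relative to plain replicas is left open here.

```python
def prioritise_root(pfns):
    """Return the PFNs with root:// entries first.

    sorted() is stable, so the original order is preserved
    within the root group and within the non-root group.
    """
    # False sorts before True, so root:// PFNs come first.
    return sorted(pfns, key=lambda pfn: not pfn.startswith("root://"))


example = ["srm://a.example/f", "root://b.example//f", "davs://c.example/f"]
print(prioritise_root(example))
```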

The advantage of transparently listing the replicas of archive contents (using the # format that root supports) is that little needs to change in the Rucio client itself, since downloading a regular file and downloading a file inside a zip file then use the same protocol.
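To illustrate the # format, here is a small sketch that splits such a replica URL into the archive URL and the constituent name. The helper names and the example host and paths are assumptions for illustration only, not Rucio or root APIs.

```python
def split_archive_pfn(pfn):
    """Split 'scheme://host/path/archive.zip#member' into (archive_url, member).

    member is None when the PFN has no '#' part, i.e. it is a normal file.
    """
    archive_url, _, member = pfn.partition("#")
    return archive_url, (member or None)


def join_archive_pfn(archive_url, member):
    """Build the PFN of a constituent inside an archive."""
    return f"{archive_url}#{member}"


pfn = join_archive_pfn("root://host.example//data/archive.zip", "file1.root")
print(split_archive_pfn(pfn))
```

A client that treats the part after # as opaque can hand the whole URL to a root-capable tool, which is the point made above: no separate code path for archived files.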

Other discussions

@rodwalker

commented Apr 26, 2018

I am a bit concerned that the only info so far is a list of files inside a zip file. To see whether a file is in a zip file, you'd need to search this table, even when the file is not archived.
It seems to me you need a special replica, which is just the name of the zip file. Then list-replicas would also show the locations of the zip file.
Adding files to the zip would mean adding a special replica to all the content files.

Anyway, I'm glad someone else has to worry about this stuff.

Cheers,
Rod.

@vingar

Contributor

commented Apr 26, 2018

Without entering into all the details, listing replicas without any additional entries is doable.

About the format of the replica URL for a file contained in a zip: you proposed <scheme>://zipfile#constituent1. I'm wondering whether this is a well-adopted semantic, as in web browsers, etc.

@bari12

Member Author

commented Apr 26, 2018

We have to see how we do this. Adding a "special" replica for each zip content is a possibility, but this adds workflow complexity to keep everything consistent. I prefer the approach we discussed in the dev meeting today of just incorporating the information we already have. The implication there, though, is performance: each list-replicas call gets more complicated, and we really have to make sure that this doesn't slow things down in general.

@rodwalker

commented May 2, 2018

So, any progress/thoughts?

@bari12

Member Author

commented May 2, 2018

For #46 @TomasJavurek will work on this, no update yet.
We discussed the transparent list-replicas support and we think it should be fine with the information that is already there, without creating fake replicas, which is better for consistency. Essentially, when you list the replicas of a file, the information that it is also a constituent of a zip file is already there, so we can list the replicas of the parent archive as well. It might have performance implications for the general workflow, so this needs to be done carefully. We didn't discuss a development plan specifically, but I was hoping that @vingar could work on this. Maybe he can comment?
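As a rough sketch of that idea, using plain dictionaries in place of the real replica and constituent tables (the function name, the data layout, and the example names are all hypothetical, not the actual Rucio schema or API):

```python
def list_replicas_with_archives(name, replicas, constituents):
    """List direct replicas of a file plus replicas of archives containing it.

    replicas:     dict mapping a file name to the RSEs holding a replica of it
    constituents: dict mapping a constituent name to the archives containing it
    """
    # Direct replicas of the file itself.
    result = [{"rse": rse, "archive": None} for rse in replicas.get(name, [])]
    # Replicas of every parent archive; no fake replica rows are needed.
    for archive in constituents.get(name, []):
        for rse in replicas.get(archive, []):
            result.append({"rse": rse, "archive": archive})
    return result


replicas = {"file1": ["RSE_A"], "archive.zip": ["RSE_B", "RSE_C"]}
constituents = {"file1": ["archive.zip"]}
print(list_replicas_with_archives("file1", replicas, constituents))
```

The extra lookup in `constituents` happens for every listed file, archived or not, which is where the performance concern mentioned above comes from.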

@bari12

Member Author

commented May 7, 2018

Hi @rodwalker,
So we discussed the timeline of the transparent archive support (list-replicas of files also exposes parent archives), and we are aiming to have this by the end of June (hopefully in the 1.17.0 release).

@bari12

Member Author

commented Jun 25, 2018

Just to follow up on the results so far (There will be a presentation in the ATLAS S&C week this Thursday about this as well):

  • rucio download <constituent> --archive-did <archive> issues an xrdcp streamed download of the constituent in the archive. However, the archive has to be specified explicitly for this to be possible.
  • list_replicas has been adapted so that it also outputs the archive with the streaming option. This should enable the client to download a constituent transparently. As far as I know, this will need an update of root (@mlassnig ?), as the streaming is currently only possible via xrdcp and not gfal.

What is not there:

  • rucio add-rule <constituent> makes a rule on the archive. This will be discussed on Thursday.
  • With the list_replicas change by @mlassnig, if the constituent is not on a root-enabled storage, the replica dictionary has a client_extract=True flag, which tells the client to download the archive and extract the file. This client-side functionality still needs to be implemented. (@TWAtGH or @TomasJavurek ?)
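
A minimal sketch of what the client-side step could look like once the archive has been downloaded, using only the Python standard library. Only the client_extract key comes from the description above; the function names, the shape of the replica dictionary, and the file paths are assumptions, not the actual Rucio client code.

```python
import zipfile


def extract_constituent(archive_path, constituent, dest_dir):
    """Extract one member from a local zip archive and return its path."""
    with zipfile.ZipFile(archive_path) as zf:
        if constituent not in zf.namelist():
            raise KeyError(f"{constituent} not found in {archive_path}")
        return zf.extract(constituent, path=dest_dir)


def finish_download(replica, archive_path, constituent, dest_dir):
    """Extract the constituent only if the replica asks the client to do so."""
    if replica.get("client_extract"):
        return extract_constituent(archive_path, constituent, dest_dir)
    # Otherwise the file was streamed directly and nothing is left to do.
    return None
```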

@bari12

Member Author

commented Jul 24, 2018

I have added the missing features/bugs to the overview at the top. Please comment here in this ticket if anything else is missing.

@bari12 bari12 added the Overview label Aug 9, 2018
