Function to return reader of nested member #119

Closed
bruth opened this Issue Aug 18, 2018 · 6 comments

bruth commented Aug 18, 2018

Hi! I am looking to implement a function that would ideally leverage the recursive unpacking and decompression this package already does. The function signature would look something like this:

func ReadMember(path string, member string) (io.ReadCloser, error)

Where path would be the path to the source file and member would be the name of the member within the file whose byte stream will be returned in the io.ReadCloser. For now, member would be the filename returned by Siegfried that delimits paths by # when denoting nested files.

For example, given a (contrived) archive:

foo.zip
    dir/bar.zip
        baz.csv.gz

Calling ReadMember("foo.zip", "foo.zip#dir/bar.zip#baz.csv.gz#baz.csv") would return an io.ReadCloser that yields the decompressed contents of baz.csv.

The use case is to dynamically read out portions of an archive given the semantics of Siegfried.

A more general function would walk the members of an input, though it would need to be limited to the leaves of the hierarchy.

Do you have a suggestion on how to implement this given the components available in this package?
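
To make the proposed semantics concrete, here is a minimal sketch that resolves a #-delimited member name using only the Go standard library. It is not based on Siegfried's code. It guesses each container's type from its file extension (.zip or .gz) rather than identifying it by signature, it buffers every level in memory, and it assumes # never appears inside a real file name. The helpers readNested, unzipEntry, gunzip, zipOne and buildExample are hypothetical names introduced for illustration.

```go
package main

import (
	"archive/zip"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"strings"
)

// ReadMember resolves a "#"-delimited member name against the file at path.
func ReadMember(path, member string) (io.ReadCloser, error) {
	if !strings.HasPrefix(member, path+"#") {
		return nil, fmt.Errorf("member %q is not inside %q", member, path)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return readNested(path, data, strings.Split(member[len(path)+1:], "#"))
}

// readNested unpacks one container per segment, buffering each level in memory.
// The container type is guessed from the extension (an assumption of this sketch).
func readNested(name string, data []byte, segs []string) (io.ReadCloser, error) {
	for _, seg := range segs {
		var err error
		switch {
		case strings.HasSuffix(name, ".zip"):
			data, err = unzipEntry(data, seg)
		case strings.HasSuffix(name, ".gz"):
			data, err = gunzip(data)
		default:
			err = fmt.Errorf("%q is not a supported container", name)
		}
		if err != nil {
			return nil, err
		}
		name = seg
	}
	return io.NopCloser(bytes.NewReader(data)), nil
}

// unzipEntry returns the contents of one named entry of an in-memory zip.
func unzipEntry(data []byte, entry string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	f, err := zr.Open(entry)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return io.ReadAll(f)
}

// gunzip decompresses an in-memory gzip stream.
func gunzip(data []byte) ([]byte, error) {
	gr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer gr.Close()
	return io.ReadAll(gr)
}

// zipOne builds an in-memory zip holding a single entry (demo helper).
func zipOne(name string, content []byte) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, _ := zw.Create(name) // errors ignored in this demo helper
	w.Write(content)
	zw.Close()
	return buf.Bytes()
}

// buildExample recreates the contrived foo.zip from the example above.
func buildExample() []byte {
	var gz bytes.Buffer
	gw := gzip.NewWriter(&gz)
	gw.Write([]byte("id,value\n1,42\n"))
	gw.Close()
	return zipOne("dir/bar.zip", zipOne("baz.csv.gz", gz.Bytes()))
}

func main() {
	segs := strings.Split("dir/bar.zip#baz.csv.gz#baz.csv", "#")
	rc, err := readNested("foo.zip", buildExample(), segs)
	if err != nil {
		panic(err)
	}
	defer rc.Close()
	io.Copy(os.Stdout, rc) // writes the decompressed contents of baz.csv
}
```

The demo builds the archive in memory and calls readNested directly, so it does not touch the disk; ReadMember is the same walk started from a file path. A real implementation would stream rather than buffer, and would let Siegfried decide which segments are containers.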

richardlehane (Owner) commented Aug 19, 2018

Hi Byron,
thanks for the issue. I'll have a think about this. But just to clarify: do you need siegfried at all for its file format ID functionality, or are you just trying to replicate some of the ancillary file walk/unpacking functionality from the command line tool? (That is, if you know the member path ahead of time, then you also know ahead of time which formats you need to unpack.)

bruth commented Aug 19, 2018

Thanks Richard. I should have stated this up front: yes, file format detection that relies specifically on the standards is necessary for my use case. My team and I are building a data catalog and archive for biomedical data. We are currently using Archivematica as a pipeline to prepare archive packages, and it uses PRONOM as the file format standard.

I need to look at the Roy tool more, but an unrelated question is how to add support for "unofficial" or non-registered file formats. We have genomic data files that we are cataloging such as VCF and FASTQ files. My assumption is that I can create a custom signature file that includes a detection mechanism for these formats?

richardlehane (Owner) commented Aug 20, 2018

I've had a look at this again this morning and it is definitely possible, but unfortunately I think at the moment any solution would be pretty ugly and involve a lot of copy/paste of non-exported bits of the siegfried codebase: specifically the decompress.go file within the cmd/sf package and the internal/siegreader package (which is what you'd need to get an io.Reader). I'm currently working on a new release and will look at either exporting some of this stuff so it can be used externally, or creating a helper function for this use case within the top level siegfried package.

Re. a custom signature file - yes you'd use the roy tool for this. See this wiki page for instructions.

Basically the steps are:

  1. use Ross Spencer's signature development utility to make a DROID-compatible signature file;
  2. copy that file into a "custom" folder in your siegfried home e.g. ~/home/siegfried/custom/my_sig.xml;
  3. then use the "-extend" flag with roy build (e.g. roy build -name biomedical -extend my_sig.xml biomedical.sig).

You can invoke sf with custom signatures using the -sig flag. E.g. sf -sig biomedical.sig ....
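
If the catalog code is itself written in Go, the two commands above could be driven from a program. The following is only a sketch: it assumes roy and sf are on the PATH, it uses exactly the flags quoted above (-name, -extend, -sig), and buildSigCmd, identifyCmd and the input file sample.vcf are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildSigCmd mirrors step 3: roy build -name <name> -extend <custom> <out>.
func buildSigCmd(name, customSig, out string) *exec.Cmd {
	return exec.Command("roy", "build", "-name", name, "-extend", customSig, out)
}

// identifyCmd mirrors the invocation: sf -sig <sigFile> <target>.
func identifyCmd(sigFile, target string) *exec.Cmd {
	return exec.Command("sf", "-sig", sigFile, target)
}

func main() {
	cmds := []*exec.Cmd{
		buildSigCmd("biomedical", "my_sig.xml", "biomedical.sig"),
		identifyCmd("biomedical.sig", "sample.vcf"), // sample.vcf is a hypothetical input
	}
	for _, c := range cmds {
		// Skip gracefully when the tool is not installed on this machine.
		if _, err := exec.LookPath(c.Args[0]); err != nil {
			fmt.Println("skipping (not installed):", c.Args)
			continue
		}
		c.Stdout, c.Stderr = os.Stdout, os.Stderr
		if err := c.Run(); err != nil {
			fmt.Fprintln(os.Stderr, "command failed:", err)
			return
		}
	}
}
```

Shelling out keeps the program independent of siegfried's Go packages; loading the signature file through the library instead would be the alternative.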

bruth commented Aug 20, 2018

Thank you. Having looked through the codebase before, I was going to start there anyway. I will trace my way back from the command entrypoint.

Re: sig. Great, I will try this out. I appreciate it.

richardlehane (Owner) commented Aug 30, 2018

Hi Byron - v1.7.9, released today, now exports a decompress package that you should be able to use for your purposes. I left the siegreader package internal but exposed a public Reader() method on the siegreader.Buffer type. You can already get Buffers from the main siegfried package, and with this new method you can now create io.Readers from those Buffers.

See this gist for a worked example along the lines of the ReadMember func you proposed.

bruth commented Aug 30, 2018

@richardlehane Thank you, this looks great and I really appreciate you adding support for it. I hope to give it a try tomorrow.
