feat: add docker-archive unpacking #106
Conversation
@mladkau I think by answering I'll address both of your comments! The current implementation does not yet handle key binaries hashing; we would most likely need to pass through the archive twice, as there isn't a clean way to consume a stream N times. The current approach is for finding specific individual files and applying some sort of "transformation" to them if necessary, something like having an array and applying .map() to it. The idea is that processing files has two phases: the first gets or locates the file in the archive; the second applies the different analyzers (e.g. apk, apt, rpm) once a file is obtained. I would like to get the initial implementation in here so we can start working with something, and when it comes time to worry about key binaries we can adapt the code, for example by passing the stream directly to the callbacks so the callback handler can pipe it to its own processing logic or listen to 'data' events. This is a really good example of how we can adapt it: https://stackoverflow.com/a/51143558 As to why we have the …
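A minimal sketch of the two-phase idea described above: phase one locates files of interest in the archive, phase two applies analyzer callbacks to whatever was extracted, .map()-style. The names here (ExtractAction, applyActions, filePathMatches) are illustrative assumptions, not the PR's actual API.

```typescript
// Hypothetical sketch; names are illustrative, not the real image-extractor API.
type FileCallback = (content: string) => string;

interface ExtractAction {
  // phase 1: decide whether a tar entry path is interesting
  filePathMatches: (path: string) => boolean;
  // phase 2: transform/analyze the extracted content
  callback: FileCallback;
}

function applyActions(
  extractedFiles: Record<string, string>,
  actions: ExtractAction[],
): Record<string, string> {
  const results: Record<string, string> = {};
  for (const [filePath, content] of Object.entries(extractedFiles)) {
    for (const action of actions) {
      if (action.filePathMatches(filePath)) {
        results[filePath] = action.callback(content);
      }
    }
  }
  return results;
}

// Example: pick out apk's installed-packages database and trim it
const apkAction: ExtractAction = {
  filePathMatches: (p) => p === '/lib/apk/db/installed',
  callback: (content) => content.trim(),
};

const out = applyActions(
  { '/lib/apk/db/installed': 'pkg-a\npkg-b\n', '/etc/hostname': 'box\n' },
  [apkAction],
);
console.log(out); // only the matched file is processed
```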
My main concern here is memory consumption. You might encounter some large files in a file system, and loading them into memory is not a good idea. If you really cannot work with streams, then stream to a temp file.
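The stream-to-temp-file fallback suggested above could look something like this minimal sketch (not the PR's code): pipe a large tar entry to disk so memory stays bounded, and hand the path to later processing.

```typescript
// Sketch of streaming a large entry to a temp file instead of buffering it.
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';
import { Readable } from 'stream';

function streamToTempFile(stream: Readable, name: string): Promise<string> {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'extract-'));
  const tmpPath = path.join(dir, name);
  return new Promise((resolve, reject) => {
    const out = fs.createWriteStream(tmpPath);
    stream.pipe(out);
    out.on('finish', () => resolve(tmpPath)); // file fully flushed to disk
    out.on('error', reject);
    stream.on('error', reject);
  });
}

// Usage: memory use stays constant regardless of the entry's size.
streamToTempFile(Readable.from(['large file contents']), 'layer.tar')
  .then((p) => console.log('written to', p));
```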
@mladkau these are great concerns, thanks for raising them! The package manager manifest files would still be … Basically my question is: with the current implementation, are we simply not improving some aspects of the scan, or are we actually making some aspects worse, to your knowledge?
(force-pushed from b86f89b to 04246ec)
@mladkau I have updated the PR; we can now process the stream directly in the callback! This will help us do things like key binaries hashing efficiently.
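As a hedged sketch of what "hashing efficiently" in a stream callback could look like: Node's crypto Hash consumes chunks incrementally, so a key binary can be hashed without ever holding it fully in memory. This is illustrative, not the PR's implementation.

```typescript
// Incremental hashing of a stream: O(1) memory regardless of file size.
import * as crypto from 'crypto';
import { Readable } from 'stream';

function hashStream(stream: Readable): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    stream.on('data', (chunk: Buffer | string) => hash.update(chunk));
    stream.on('end', () => resolve(hash.digest('hex')));
    stream.on('error', reject);
  });
}

// Usage: a callback handed the tar entry's stream can hash it directly.
hashStream(Readable.from(['node binary bytes'])).then((digest) =>
  console.log('sha256:', digest),
);
```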
(force-pushed from 04246ec to 77d0b00)
@Shesekino You are correct, we do run into the same memory consumption errors with the current approach; that is one reason why we don't do proper key binary lookup at the moment. Getting static scanning to work properly in terms of architecture and performance would be a HUGE step forward! @ivanstanev giving the stream directly to the callback avoids loading everything into memory. But we now need to be careful, as the stream can only be read once, right?
@mladkau If we can trust the first two answers here https://stackoverflow.com/questions/51076356/multiple-listeners-reading-from-the-same-stream-in-nodejs then the approach I went with (running the tar file stream through PassThrough streams for every callback) should allow us to clone the stream and process it independently in every callback! If that really doesn't work then we'll have to introduce some code changes, but I think it's a problem only for static analysis so far, and it'll be on me to make it work 😄
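The PassThrough fan-out described above can be sketched like this (a simplified illustration, not the PR's code): each callback gets its own cloned stream, so a single pass over the tar can feed several independent consumers.

```typescript
// Fan a single readable stream out to N independent PassThrough clones.
import { PassThrough, Readable } from 'stream';

function fanOut(source: Readable, count: number): PassThrough[] {
  const clones = Array.from({ length: count }, () => new PassThrough());
  for (const clone of clones) {
    source.pipe(clone); // every clone receives every chunk
  }
  return clones;
}

const [a, b] = fanOut(Readable.from(['/lib/apk/db/installed contents']), 2);
// Each consumer can now read the same data independently:
a.on('data', (chunk) => console.log('callback 1:', chunk.toString()));
b.on('data', (chunk) => console.log('callback 2:', chunk.toString()));
```

One design caveat worth noting: with pipe, back-pressure is governed by the slowest clone, so one slow callback throttles the others.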
🎉 This PR is included in version 1.31.0 🎉 Your semantic-release bot 📦🚀
🎉 This issue has been resolved in version 6.7.0 🎉 Your semantic-release bot 📦🚀
As part of static image analysis we need to process an image tar on the filesystem.
This PR adds functionality to allow unpacking of what is known as a docker-archive.
For the Runtime team this means processing docker-archive as produced by the Skopeo tool; the format there is slightly different but handled in this PR as well.
For every unpacked file of interest we can run a bunch of actions: useful for key binaries hashing in the future.
Also added unit tests with mocks/fixtures to verify it's working as intended.
Where should the reviewer start?
Have a look at the image-extractor.ts file: the entry point is extractFromTar.
Any background context you want to provide?
Part of static image analysis for @snyk/runtime
What are the relevant tickets?
Jira ticket RUN-450
Jira ticket RUN-462