
feat: add docker-archive unpacking #106

Merged
merged 1 commit into master on Sep 27, 2019

Conversation

ivanstanev
Contributor

As part of static image analysis we need to process an image tar on the filesystem.
This PR adds functionality for unpacking what is known as a docker-archive.
For the Runtime team this means processing docker-archives as produced by the Skopeo tool; the format there is slightly different, but it is handled in this PR as well.
For every unpacked file of interest we can run a set of actions, which will be useful for hashing key binaries in the future.

Also added unit tests with mocks and fixtures to verify it works as intended.

  • Ready for review
  • Follows CONTRIBUTING rules
  • Reviewed by Snyk internal team

Where should the reviewer start?

Have a look at the image-extractor.ts file: the entry point is extractFromTar.

Any background context you want to provide?

Part of static image analysis for @snyk/runtime

What are the relevant tickets?

Jira ticket RUN-450
Jira ticket RUN-462

@ivanstanev ivanstanev requested a review from a team September 24, 2019 16:13
@ivanstanev ivanstanev self-assigned this Sep 24, 2019
@ghost ghost requested review from hisenb3rg and removed request for a team September 24, 2019 16:13
@ivanstanev
Contributor Author

ivanstanev commented Sep 25, 2019

@mladkau I think one answer will address both your comments! The current implementation does not yet account for key binary hashing; we would most likely pass through the archive twice, as there isn't a clean way to consume a stream N times. The current approach is for finding specific files and applying some sort of "transformation" to each if necessary, much like applying .map() to an array.

The idea is that we will have two phases for processing files: first, get or locate the file in the archive; second, apply the different analyzers (e.g. apk, apt, rpm) once the file is obtained.

I would like to land the initial implementation here so we can start working with something; when it comes time to worry about key binaries we can adapt the code. For example, we could pass the stream directly to the callbacks, and each callback handler could pipe it into its own processing logic or listen for 'data' events. This is a really good example of how we could adapt it: https://stackoverflow.com/a/51143558
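The two-phase idea could be sketched roughly like this (the names `ExtractAction` and `applyActions` are hypothetical, for illustration only — not code from this PR): each callback receives the stream of a matched file and returns a result, and the analyzers later consume those results.

```typescript
import { Readable } from "stream";

// Hypothetical type: an action receives the stream of a matched file
// and produces some result (a hash, a parsed manifest, raw contents).
type ExtractAction = (fileStream: Readable) => Promise<string>;

// Phase 1 sketch: for every file of interest found in the archive,
// apply the registered action, much like .map() over an array.
async function applyActions(
  files: Map<string, Readable>,
  actions: Record<string, ExtractAction>,
): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  for (const [filePath, stream] of files) {
    const action = actions[filePath];
    if (action) {
      results[filePath] = await action(stream);
    }
  }
  return results;
}
```

Phase 2 (the apk/apt/rpm analyzers) would then operate purely on the collected results, without touching the archive again.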

As to why we have the streamToBuffer function: it's just the default case, returning the file as binary. We could default to streamToString instead, but not every file is a text file!
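A streamToBuffer-style default like the one described could look roughly like this (a sketch, not the PR's actual implementation): collect every emitted chunk and concatenate into a single Buffer, so binary files survive intact.

```typescript
import { Readable } from "stream";

// Sketch of a streamToBuffer-style helper: gather all emitted chunks
// and concatenate them into one Buffer at the end.
function streamToBuffer(stream: Readable): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
    stream.on("error", reject);
    stream.on("end", () => resolve(Buffer.concat(chunks)));
  });
}
```

A streamToString variant would have the same shape, just decoding the concatenated buffer at the end.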

@mladkau
Contributor

mladkau commented Sep 25, 2019

My main concern here is memory consumption. You might encounter some large files in a filesystem, and loading them into memory is not a good idea. If you really cannot work with streams, then write the stream to a temp file.
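The temp-file fallback suggested here could be sketched as follows (hypothetical helper name and naming scheme, not code from this PR): pipe the extracted file straight to disk instead of buffering it in memory.

```typescript
import { createWriteStream, readFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";
import { pipeline, Readable } from "stream";

// Sketch: instead of buffering a large extracted file in memory,
// pipe it straight into a temp file and hand back the path.
function streamToTempFile(stream: Readable, name: string): Promise<string> {
  const target = join(tmpdir(), name); // hypothetical naming scheme
  return new Promise((resolve, reject) => {
    pipeline(stream, createWriteStream(target), (err) =>
      err ? reject(err) : resolve(target),
    );
  });
}

// Usage: write a stream out, then read the file back.
streamToTempFile(Readable.from(["big file contents"]), "demo-archive.bin")
  .then((path) => console.log(readFileSync(path, "utf8"))); // prints "big file contents"
```

stream.pipeline handles error propagation and cleanup across the pipe, which a bare .pipe() call does not.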

@Shesekino
Contributor

Shesekino commented Sep 25, 2019

@mladkau these are great concerns, thanks for raising them!
Just out of curiosity, today (with the "dynamic" scanning), aren't we already running into the same memory consumption scenario?

The package manager manifest files would still be read (`cat`) in their entirety into a variable, and the binaries would still be executed (which requires loading them into memory, I guess?).

Basically my question is: with the current implementation, are we simply not improving some aspects of the scan, or are we actually making some aspects worse, to your knowledge?

@ivanstanev ivanstanev requested a review from a team as a code owner September 25, 2019 15:29
@ghost ghost removed their request for review September 25, 2019 15:29
@ivanstanev
Contributor Author

@mladkau I have updated the PR; we can now process the stream directly in the callback! This should help us do things like key binary hashing efficiently.

@mladkau
Contributor

mladkau commented Sep 26, 2019

@Shesekino You are correct, we do run into the same memory consumption issues with the current approach; that is one reason why we don't do a proper key binary lookup at the moment. Getting static scanning to work properly, in terms of both architecture and performance, would be a HUGE step forward!

@ivanstanev Giving the stream directly to the callback avoids loading everything into memory. But now we need to be careful, as the stream can only be read once, right?

@ivanstanev
Contributor Author

@mladkau If we can trust the first two answers here https://stackoverflow.com/questions/51076356/multiple-listeners-reading-from-the-same-stream-in-nodejs, then the approach I went with (running the tar file stream through a PassThrough stream for every callback) should allow us to clone the stream and process it independently in every callback! If that really doesn't work, then we'll have to make some code changes, but I think it's a problem only for static analysis so far, and it'll be on me to make it work 😄
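The PassThrough cloning approach can be sketched like this (a minimal stdlib-only illustration, not the PR's actual code). One caveat: with .pipe(), backpressure means the source advances at the pace of the slowest consumer, so every branch must actually be consumed.

```typescript
import { PassThrough, Readable } from "stream";

// Sketch: pipe one source stream into a separate PassThrough per
// callback, so each consumer reads an independent copy of the data.
function fanOut(source: Readable, consumerCount: number): PassThrough[] {
  const copies: PassThrough[] = [];
  for (let i = 0; i < consumerCount; i++) {
    const copy = new PassThrough();
    source.pipe(copy);
    copies.push(copy);
  }
  return copies;
}

// Two independent consumers of the same layer stream, e.g. one hashing
// a key binary while another parses a manifest.
const [first, second] = fanOut(Readable.from(["layer contents"]), 2);
```

Each PassThrough buffers the bytes it has received until its consumer reads them, so the branches can proceed at slightly different speeds without losing data.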

@ivanstanev ivanstanev merged commit 00043ca into master Sep 27, 2019
@ivanstanev ivanstanev deleted the feat/image-unpacking branch September 27, 2019 09:44
@snyksec

snyksec commented Sep 27, 2019

🎉 This PR is included in version 1.31.0 🎉


Your semantic-release bot 📦🚀

@team-lumos
Collaborator

🎉 This issue has been resolved in version 6.7.0 🎉


Your semantic-release bot 📦🚀
