
Support packing deterministic CAR files #76

Open
vasco-santos opened this issue Aug 13, 2021 · 5 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature P3 stack/api-protocols

Comments

@vasco-santos
Collaborator

Write the graph out in deterministic graph traversal order instead of in the order it parses the files

Current state

The current implementation of ipfs-car writes the CAR file blocks in no particular order. The pipeline is as follows:

  • receives a glob source
  • relies on ipfs-unixfs-importer for layout and chunking of the source, storing each UnixFS generated block into a temporary blockstore
  • iterates over the blocks stored in the blockstore and writes them into the CAR file

This means that we currently produce a different output for the same file than go-ipfs and js-ipfs, which do an ordered walk.
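A rough sketch of that two-pass flow, for illustration only (the blockstore iteration method `getAll()` and its pair shape are assumptions here, not the exact ipfs-car internals):

```js
import fs from 'node:fs'
import { Readable } from 'node:stream'
import { importer } from 'ipfs-unixfs-importer'
import { CarWriter } from '@ipld/car'

// Sketch of the current two-pass approach: import into a blockstore, then
// dump the blockstore into a CAR in whatever order it happens to yield blocks.
async function packUnordered (files, blockstore, carPath) {
  let root
  // Pass 1: layout + chunking via the UnixFS importer; every generated block
  // ends up in the (temporary) blockstore.
  for await (const entry of importer(files, blockstore, { wrapWithDirectory: true })) {
    root = entry.cid // the last entry emitted is the DAG root
  }

  // Pass 2: iterate the blockstore and write blocks out as they come.
  const { writer, out } = await CarWriter.create([root])
  const written = new Promise((resolve, reject) => {
    Readable.from(out)
      .pipe(fs.createWriteStream(carPath))
      .on('finish', resolve)
      .on('error', reject)
  })
  // getAll() stands in for whatever iteration the blockstore exposes; the point
  // is that this is storage order, not graph order, so the output is not deterministic.
  for await (const { cid, block } of blockstore.getAll()) {
    await writer.put({ cid, bytes: block })
  }
  await writer.close()
  await written
}
```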

Motivation

Supporting deterministic outputs will enable ipfs-car to produce the same output CAR as the core IPFS implementations and move us towards supporting other use cases like interacting directly with Filecoin (and perhaps offline deals).

Implementation

Given we currently have two iterations (unixfs importer + blockstore iteration), we can support a deterministic output by getting the root and traversing the graph as in https://github.com/ipld/js-datastore-car/blob/master/car.js#L198-L221
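A minimal sketch of that kind of walk, assuming @ipld/car for the writer, dag-pb and raw as the only codecs, and a blockstore whose `get(cid)` returns the encoded block bytes (the function names here are illustrative, not existing ipfs-car APIs):

```js
import fs from 'node:fs'
import { Readable } from 'node:stream'
import { CarWriter } from '@ipld/car'
import * as dagPb from '@ipld/dag-pb'
import * as raw from 'multiformats/codecs/raw'

// Depth-first, root-first walk: write the root block, then each child subtree
// in link order, skipping blocks already written. The same DAG therefore
// always produces the same CAR layout.
async function walkAndWrite (cid, blockstore, writer, seen = new Set()) {
  if (seen.has(cid.toString())) return
  seen.add(cid.toString())

  const bytes = await blockstore.get(cid) // assumed to return the encoded block bytes
  await writer.put({ cid, bytes })

  if (cid.code === raw.code) return // raw leaves have no links
  if (cid.code !== dagPb.code) {
    // this is where pluggable codecs would slot in
    throw new Error(`unsupported codec: 0x${cid.code.toString(16)}`)
  }
  for (const link of dagPb.decode(bytes).Links) {
    await walkAndWrite(link.Hash, blockstore, writer, seen)
  }
}

async function writeDeterministicCar (root, blockstore, carPath) {
  const { writer, out } = await CarWriter.create([root])
  const written = new Promise((resolve, reject) => {
    Readable.from(out)
      .pipe(fs.createWriteStream(carPath))
      .on('finish', resolve)
      .on('error', reject)
  })
  await walkAndWrite(root, blockstore, writer)
  await writer.close()
  await written
}
```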

We should make this optional and pluggable, given we will need to add codecs and hashers, which would increase the dependency footprint for users who do not need deterministic CAR files.

Alternatively, we could support a different function that skips the two iterations and keeps everything in memory. This would be faster, and some users could be OK with the extra memory consumption. That said, write performance when creating the CAR file is not the biggest concern; we have been focusing more on efficiency for reads than for writes.

cc @rvagg @olizilla @mikeal @alanshaw

@vasco-santos vasco-santos added the kind/enhancement A net-new feature or improvement to an existing feature label Aug 13, 2021
@rvagg
Contributor

rvagg commented Aug 13, 2021

Additional note on an ideal here: we should not hide yet another "ordered DAG walk" implementation in here. I think it probably belongs in js-multiformats; it's in the same family of concerns as the Block functionality already there. It's just complicated by the need to have multiple codecs available. Such a walk function could be provided with:

  1. a list of supported codecs so it can decode blocks with those
  2. a list of codecs that are OK to not decode (this has an easy default of raw, json and cbor, but there are potentially more a user may want to supply)
  3. an indication of what to do when encountering a block that can't be decoded with the existing codecs - bail, or ignore?

https://github.com/ipfs/js-ipfs/blob/6a2c710e4b66e76184320769ff9789f1fbabe0d8/packages/ipfs-core/src/components/dag/export.js#L82-L107 has an implementation a little like this that we did for dag export. It would be good to implement something shared so we could even remove code from there.
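A hedged sketch of what such a shared helper could look like (the name `walkDag` and the options shape are invented here for illustration; it leans on `createUnsafe` and `links()` from multiformats/block to discover child CIDs):

```js
import { createUnsafe } from 'multiformats/block'
import * as raw from 'multiformats/codecs/raw'

// Ordered DAG walk with pluggable codecs. `codecs` maps codec code -> codec
// implementation used to decode blocks and discover links; `skipCodes` lists
// codes that may be treated as leaves without decoding; `onUndecodable`
// chooses what to do for anything else: 'bail' or 'ignore'.
async function * walkDag (root, loadBytes, options = {}, seen = new Set()) {
  const {
    codecs = new Map(),
    skipCodes = new Set([raw.code]),
    onUndecodable = 'bail'
  } = options

  if (seen.has(root.toString())) return
  seen.add(root.toString())

  const bytes = await loadBytes(root) // caller-supplied block loader (blockstore, HTTP, ...)
  yield { cid: root, bytes }

  if (skipCodes.has(root.code)) return // declared leaf, no decode needed

  const codec = codecs.get(root.code)
  if (codec === undefined) {
    if (onUndecodable === 'bail') throw new Error(`no codec for code 0x${root.code.toString(16)}`)
    return // 'ignore': treat as a leaf
  }

  const block = createUnsafe({ bytes, cid: root, codec })
  for (const [, cid] of block.links()) {
    yield * walkDag(cid, loadBytes, options, seen)
  }
}
```

Feeding each yielded `{ cid, bytes }` pair to a CAR writer would then give the same kind of ordered output as the dag export code linked above, under those assumptions.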

@mikeal

mikeal commented Aug 13, 2021

One requirement I’d like to surface here.

Users with large amounts of data are writing custom tooling to get their file data “into IPFS” so that they can then write out a CAR file suitable for Filecoin (which really needs to be deterministic).

There are obvious perf issues with moving this much data and suffering excessive copying in memory and on disc.

For these users:

  • This only needs to work in Node.js, which has libraries for working with memory that allow for more optimizations than bare Uint8Array.
  • When we parse the file into a graph we should avoid any new memory allocation, or disc copy, for the raw blocks. We’re just going to write them all out again anyway so we can use a reference to the memory we’ve already read.
  • These customers will prefer allocating 40GB of memory to the process in order to avoid unnecessary disc writes to a block store.
  • Since everything is pulled into memory and written out as a single CAR file, a single writev() will be substantially faster than trying to stream, because it'll reduce the syscalls (see the sketch after this list). Same thing with reading the origin files: if a single process is going to output one CAR file and then die, there's no efficiency gained in streaming the reads.
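For example, a minimal sketch of that last point, assuming the CAR bytes are already available as an async iterable of Uint8Array chunks (e.g. the `out` side of @ipld/car's CarWriter) and buffering them all before a single writev():

```js
import { open } from 'node:fs/promises'

// Buffer every CAR chunk in memory, then flush with one writev() call,
// trading memory for fewer syscalls compared to streaming chunk by chunk.
async function flushCarInOneCall (carChunks, path) {
  const buffers = []
  for await (const chunk of carChunks) {
    buffers.push(chunk) // keep a reference only; no extra copy of the block bytes
  }
  const fh = await open(path, 'w')
  try {
    // In practice you may still need to loop on bytesWritten or split a very
    // long buffer list, but the intent is a single gather-write for the whole CAR.
    await fh.writev(buffers)
  } finally {
    await fh.close()
  }
}
```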

@dchoi27

dchoi27 commented Aug 26, 2021

Given that you can retrieve the full DAG fine with a non-deterministic CAR file, this probably isn't the highest priority.

@dchoi27 dchoi27 added the P3 label Aug 26, 2021
@AugustoL

AugustoL commented Sep 7, 2021

Hello, it would be great if the code maintainers or project managers could give more priority to this. I'm working on a decentralized application and looking forward to migrating the content from IPFS to WEB3Storage, but I want to do it in a deterministic way.

@olizilla
Contributor

olizilla commented Sep 7, 2021

@AugustoL can you say more about what you need? For many use cases, the CAR itself won't need to be deterministically packed. You can import an identical DAG from it with ipfs dag import.
