
Support packing deterministic CAR files #76

Open
vasco-santos opened this issue Aug 13, 2021 · 5 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature P3 stack/api-protocols

Comments

@vasco-santos
Collaborator

Write the graph out in deterministic graph traversal order instead of in the order it parses the files

Current state

The current implementation of ipfs-car writes the CAR file blocks in no particular order. The pipeline is as follows:

  • receives a glob source
  • relies on ipfs-unixfs-importer for layout and chunking of the source, storing each UnixFS generated block into a temporary blockstore
  • iterates over the blocks stored in the blockstore and writes them into the CAR file

This means that we currently produce a different output for the same file than go-ipfs and js-ipfs, which do an ordered walk.
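A rough sketch of that two-pass flow, for illustration only (the blockstore iteration method `getAll()` and its pair shape are assumptions here, not the exact ipfs-car internals):

```js
import fs from 'node:fs'
import { Readable } from 'node:stream'
import { importer } from 'ipfs-unixfs-importer'
import { CarWriter } from '@ipld/car'

// Sketch of the current two-pass approach: import into a blockstore, then
// dump the blockstore into a CAR in whatever order it happens to yield blocks.
async function packUnordered (files, blockstore, carPath) {
  let root
  // Pass 1: layout + chunking via the UnixFS importer; every generated block
  // ends up in the (temporary) blockstore.
  for await (const entry of importer(files, blockstore, { wrapWithDirectory: true })) {
    root = entry.cid // the last entry emitted is the DAG root
  }

  // Pass 2: iterate the blockstore and write blocks out as they come.
  const { writer, out } = await CarWriter.create([root])
  const written = new Promise((resolve, reject) => {
    Readable.from(out)
      .pipe(fs.createWriteStream(carPath))
      .on('finish', resolve)
      .on('error', reject)
  })
  // getAll() stands in for whatever iteration the blockstore exposes; the point
  // is that this is storage order, not graph order, so the output is not deterministic.
  for await (const { cid, block } of blockstore.getAll()) {
    await writer.put({ cid, bytes: block })
  }
  await writer.close()
  await written
}
```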

Motivation

Supporting deterministic outputs will enable ipfs-car to produce the same output CAR as the core IPFS implementations and move us towards supporting other use cases like interacting directly with Filecoin (and perhaps offline deals).

Implementation

Given we currently have two iterations (unixfs importer + blockstore iteration), we can support a deterministic output by getting the root and traversing the graph as in https://github.com/ipld/js-datastore-car/blob/master/car.js#L198-L221
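A minimal sketch of that kind of walk, assuming @ipld/car for the writer, dag-pb and raw as the only codecs, and a blockstore whose `get(cid)` returns the encoded block bytes (the function names here are illustrative, not existing ipfs-car APIs):

```js
import fs from 'node:fs'
import { Readable } from 'node:stream'
import { CarWriter } from '@ipld/car'
import * as dagPb from '@ipld/dag-pb'
import * as raw from 'multiformats/codecs/raw'

// Depth-first, root-first walk: write the root block, then each child subtree
// in link order, skipping blocks already written. The same DAG therefore
// always produces the same CAR layout.
async function walkAndWrite (cid, blockstore, writer, seen = new Set()) {
  if (seen.has(cid.toString())) return
  seen.add(cid.toString())

  const bytes = await blockstore.get(cid) // assumed to return the encoded block bytes
  await writer.put({ cid, bytes })

  if (cid.code === raw.code) return // raw leaves have no links
  if (cid.code !== dagPb.code) {
    // this is where pluggable codecs would slot in
    throw new Error(`unsupported codec: 0x${cid.code.toString(16)}`)
  }
  for (const link of dagPb.decode(bytes).Links) {
    await walkAndWrite(link.Hash, blockstore, writer, seen)
  }
}

async function writeDeterministicCar (root, blockstore, carPath) {
  const { writer, out } = await CarWriter.create([root])
  const written = new Promise((resolve, reject) => {
    Readable.from(out)
      .pipe(fs.createWriteStream(carPath))
      .on('finish', resolve)
      .on('error', reject)
  })
  await walkAndWrite(root, blockstore, writer)
  await writer.close()
  await written
}
```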

We should make this optional and pluggable, given we will need to add codecs and hashers, which would increase the dependency footprint for users who do not need deterministic CAR files.

Alternatively, we could support a different function that skips the two iterations and keeps everything in memory. This would be faster, and some users could be OK with the extra memory consumption. That said, write performance when creating the CAR file is not the biggest concern; we have been focusing more on efficiency for reads than for writes.

cc @rvagg @olizilla @mikeal @alanshaw

@vasco-santos vasco-santos added the kind/enhancement A net-new feature or improvement to an existing feature label Aug 13, 2021
@rvagg
Contributor

rvagg commented Aug 13, 2021

Additional note on an ideal here: we should not hide yet another "ordered DAG walk" implementation in here. I think it probably belongs in js-multiformats; it's in the same family of concerns as the Block functionality already there. It's just complicated by the need to have multiple codecs available. Such a walk function could be provided with:

  1. a list of supported codecs so it can decode blocks with those
  2. a list of codecs that are OK to not decode (this has an easy default of raw, json and cbor, but there are potentially more a user may want to supply)
  3. an indication of what to do when encountering a block that can't be decoded with the existing codecs - bail, or ignore?

https://github.com/ipfs/js-ipfs/blob/6a2c710e4b66e76184320769ff9789f1fbabe0d8/packages/ipfs-core/src/components/dag/export.js#L82-L107 has an implementation a little like this that we did for dag export. It would be good to implement something shared so we could even remove code from there.
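A hedged sketch of what such a shared helper could look like (the name `walkDag` and the options shape are invented here for illustration; it leans on `createUnsafe` and `links()` from multiformats/block to discover child CIDs):

```js
import { createUnsafe } from 'multiformats/block'
import * as raw from 'multiformats/codecs/raw'

// Ordered DAG walk with pluggable codecs. `codecs` maps codec code -> codec
// implementation used to decode blocks and discover links; `skipCodes` lists
// codes that may be treated as leaves without decoding; `onUndecodable`
// chooses what to do for anything else: 'bail' or 'ignore'.
async function * walkDag (root, loadBytes, options = {}, seen = new Set()) {
  const {
    codecs = new Map(),
    skipCodes = new Set([raw.code]),
    onUndecodable = 'bail'
  } = options

  if (seen.has(root.toString())) return
  seen.add(root.toString())

  const bytes = await loadBytes(root) // caller-supplied block loader (blockstore, HTTP, ...)
  yield { cid: root, bytes }

  if (skipCodes.has(root.code)) return // declared leaf, no decode needed

  const codec = codecs.get(root.code)
  if (codec === undefined) {
    if (onUndecodable === 'bail') throw new Error(`no codec for code 0x${root.code.toString(16)}`)
    return // 'ignore': treat as a leaf
  }

  const block = createUnsafe({ bytes, cid: root, codec })
  for (const [, cid] of block.links()) {
    yield * walkDag(cid, loadBytes, options, seen)
  }
}
```

Feeding each yielded `{ cid, bytes }` pair to a CAR writer would then give the same kind of ordered output as the dag export code linked above, under those assumptions.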

@mikeal

mikeal commented Aug 13, 2021

One requirement I’d like to surface here.

Users with large amounts of data are writing custom tooling to get their file data “into IPFS” so that they can then write out a CAR file suitable for Filecoin (which really needs to be deterministic).

There are obvious perf issues with moving this much data and suffering excessive copying in memory and on disc.

For these users:

  • This only needs to work in Node.js, which has libraries for working with memory that allow for more optimizations than bare Uint8Array.
  • When we parse the file into a graph we should avoid any new memory allocation, or disc copy, for the raw blocks. We’re just going to write them all out again anyway so we can use a reference to the memory we’ve already read.
  • These customers will prefer allocating 40GB of memory to the process in order to avoid unnecessary disc writes to a block store.
  • Since everything is pulled into memory and written out as a single CAR file, a single writev() will be substantially faster than trying to stream, because it'll reduce the syscalls (see the sketch after this list). Same thing with reading the origin files: if a single process is going to output one CAR file and then die, there's no efficiency gained in streaming the reads.
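For example, a minimal sketch of that last point, assuming the CAR bytes are already available as an async iterable of Uint8Array chunks (e.g. the `out` side of @ipld/car's CarWriter) and buffering them all before a single writev():

```js
import { open } from 'node:fs/promises'

// Buffer every CAR chunk in memory, then flush with one writev() call,
// trading memory for fewer syscalls compared to streaming chunk by chunk.
async function flushCarInOneCall (carChunks, path) {
  const buffers = []
  for await (const chunk of carChunks) {
    buffers.push(chunk) // keep a reference only; no extra copy of the block bytes
  }
  const fh = await open(path, 'w')
  try {
    // In practice you may still need to loop on bytesWritten or split a very
    // long buffer list, but the intent is a single gather-write for the whole CAR.
    await fh.writev(buffers)
  } finally {
    await fh.close()
  }
}
```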

@dchoi27

dchoi27 commented Aug 26, 2021

Given that you can retrieve the full DAG fine with a non-deterministic CAR file, this probably isn't the highest priority.

@dchoi27 dchoi27 added the P3 label Aug 26, 2021
@AugustoL

AugustoL commented Sep 7, 2021

Hello, it would be great if the code maintainers or project managers could give more priority to this. I'm working on a decentralized application and looking forward to migrating the content from IPFS to WEB3Storage, but I want to do it in a deterministic way.

@olizilla
Contributor

olizilla commented Sep 7, 2021

@AugustoL can you say more about what you need? For many use cases, the CAR itself won't need to be deterministically packed. You can import an identical DAG from it with ipfs dag import.
