
Spike: Self-Hosted Web3.Storage #1

Closed
hannahhoward opened this issue Mar 12, 2024 · 2 comments
@hannahhoward

Goals

Time box to one week: attempt to produce an upload node that can persist data and a retrieval node, and have the two talk to each other on a local system.

For this round, don't worry about WebAssembly; feel free to run Docker or whatever you like.

  1. Try to make an upload api node that can be run locally (accept store/* & upload/*)
    1. Should persist CAR files to any kind of store (a minimal store interface is sketched after this list)
    2. Should write minimum content claims required to lookup content
  2. Try to make a freeway node that can be run locally
    1. Try to connect and read from local node
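
A minimal sketch of the "any kind of store" part of goal 1, assuming a plain TypeScript interface with a filesystem-backed implementation; the CarStore and FsCarStore names are illustrative, not part of the web3.storage codebase:

```ts
// Hypothetical sketch: persist CAR files behind a store abstraction.
// Names here (CarStore, FsCarStore) are illustrative only.
import { mkdir, readFile, writeFile } from 'node:fs/promises'
import { join } from 'node:path'

/** Minimal store abstraction: persist/retrieve CAR bytes keyed by CID string. */
interface CarStore {
  put(cid: string, bytes: Uint8Array): Promise<void>
  get(cid: string): Promise<Uint8Array | undefined>
}

/** Filesystem-backed implementation. */
class FsCarStore implements CarStore {
  constructor(private dir: string) {}

  async put(cid: string, bytes: Uint8Array): Promise<void> {
    await mkdir(this.dir, { recursive: true })
    await writeFile(join(this.dir, `${cid}.car`), bytes)
  }

  async get(cid: string): Promise<Uint8Array | undefined> {
    try {
      return await readFile(join(this.dir, `${cid}.car`))
    } catch {
      return undefined
    }
  }
}
```

Any other backend (S3, R2, in-memory) could satisfy the same interface, which is the portability property this spike is probing for.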

The goal here is to identify hidden dependencies on cloud services and/or assumptions that make portability difficult.

If this is trivially easy, that's wonderful news. The next move will be to separate the upload node from the storage device, and start building code to register storage nodes with the uploader and have it round-robin content uploads between them.
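
The round-robin part could look something like the sketch below; StorageNode and StorageNodeRegistry are hypothetical names, not an existing API:

```ts
// Hypothetical sketch: register storage nodes and rotate uploads between them.
interface StorageNode {
  id: string
  endpoint: URL
}

class StorageNodeRegistry {
  private nodes: StorageNode[] = []
  private next = 0

  register(node: StorageNode): void {
    this.nodes.push(node)
  }

  /** Pick the node that should receive the next content upload (round robin). */
  nextNode(): StorageNode {
    if (this.nodes.length === 0) throw new Error('no storage nodes registered')
    const node = this.nodes[this.next % this.nodes.length]
    this.next++
    return node
  }
}
```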


reidlw commented Mar 27, 2024

Alan is going to add some additional notes, but this is effectively done.


alanshaw commented Apr 8, 2024

DEMO: https://youtu.be/eJVA97t-jaw?si=CL9BQCaRKZ15UHVb&t=2748

local.storage

  • https://github.com/w3s-project/local.storage
  • The implementation IS sufficiently de-coupled from centralized services.
    • I was able to pick the parts of the web3.storage stack that I wanted to implement.
    • Known limitations/issues/workarounds:
      • Not surprisingly, we have dependencies on resources that are in the process of being phased out, i.e. the carpark and dudewhere buckets. Carpark will soon be replaced by content-claims-backed HTTP reads from anywhere. Dudewhere is essentially an inclusion claim.
  • Content claims
  • Blob read interface
    • You can read data by base58btc encoded multihash at /blob/:multihash (see the fetch sketch after this list)
    • You can get hold of CARs and CARv2 indexes
    • It supports HTTP range requests
    • Could probably serve content claims from this interface instead of via the content claims read API if we wanted (since they are also hash addressable)
  • Data storage
    • Instead of DynamoDB and S3 buckets, local.storage stores all data in a DAG (including uploaded blobs)
    • Data is persisted on disk in an IPFS compatible FS blockstore
    • Data is managed by Pail, a library that implements key/value style storage similar to LevelDB
    • For the most part web3.storage requires simple get/put, but sometimes needs to list data by prefix/range; this is a perfect fit for pail, which is optimized for exactly that.
    • I chose to use a single "pail", so that the entire state of the system can be captured by a single CID at any given time.
    • I created a simple software partitioning system which allows each "store" to operate as if it were its own pail (see the partitioning sketch after this list).
    • In order to ensure consistency, I implemented a simple transaction system which ensures only a single transaction runs at any given time.
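
A short sketch of reading a byte range via the blob read interface described above; the /blob/:multihash route and Range support come from the notes, while the base URL/port is an assumption for illustration:

```ts
// Hedged sketch: read a byte range of a blob from local.storage's blob read interface.
const LOCAL_STORAGE_URL = 'http://localhost:3000' // assumed local.storage address

/** Fetch `length` bytes starting at `offset` from the blob identified by a base58btc multihash. */
async function readBlobRange(multihash: string, offset: number, length: number): Promise<Uint8Array> {
  const res = await fetch(`${LOCAL_STORAGE_URL}/blob/${multihash}`, {
    headers: { Range: `bytes=${offset}-${offset + length - 1}` }
  })
  if (!res.ok) throw new Error(`blob read failed: HTTP ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```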
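
And a hedged sketch of the partitioning and single-transaction ideas from the Data storage bullets, written against a generic key/value stand-in rather than pail's actual API:

```ts
// Hedged sketch: prefix partitioning and serialized transactions over a generic
// key/value interface. The KV interface is a stand-in, not pail's API.
interface KV {
  get(key: string): Promise<string | undefined>
  put(key: string, value: string): Promise<void>
}

/** Present a prefix of the shared store as if it were an independent store. */
function partition(kv: KV, prefix: string): KV {
  return {
    get: key => kv.get(`${prefix}/${key}`),
    put: (key, value) => kv.put(`${prefix}/${key}`, value)
  }
}

/** Serialize transactions: each one starts only after the previous has settled. */
class Transactor {
  private tail: Promise<unknown> = Promise.resolve()

  transact<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn, fn)
    this.tail = run.catch(() => {})
    return run
  }
}
```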

local.freeway

  • https://github.com/w3s-project/local.freeway
  • This was extremely simple to set up; we use our gateway-lib to do the heavy lifting here.
  • The only new/interesting bit is the content claims index, which is a content claims backed index that maps a CID to a URL and byte offset. Obviously, for our implementation the URL always points to local.storage, but in theory it could point to any node serving the CAR/blob.
  • For a given CID, resolution looks like the following (sketched in code after this list):
    1. Call /claims/:cid with walk parameters parts and includes
    2. In the response, we expect to receive a partition claim and...
      • For each part (shard) we expect a location claim, and an inclusion claim
      • For each include (index) in the inclusion claims, we expect a location claim
    3. We can then read the includes (indexes), by location URL
    4. We can then read individual blocks at byte offsets, by shard location URL
  • Note: this resolution method does not allow random access!
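
A hedged sketch of that resolution walk; the JSON shapes and query parameter encoding are simplified stand-ins (real claims responses are encoded claim data, and the CARv2 index is not parsed here):

```ts
// Hedged sketch of the claims-based resolution walk described above.
const CLAIMS_URL = 'http://localhost:3000' // assumed local.storage address

// Simplified stand-in shapes for the claims we expect back.
interface Claims {
  partition?: { parts: string[] }                   // content CID -> shard (part) CIDs
  locations: Record<string, { url: string }>        // CID -> location URL
  inclusions: Record<string, { includes: string }>  // shard CID -> index CID
}

async function resolveContent(cid: string): Promise<void> {
  // 1. Fetch claims for the content CID, walking parts and includes
  //    (query parameter shape assumed for illustration).
  const res = await fetch(`${CLAIMS_URL}/claims/${cid}?walk=parts,includes`)
  const claims = (await res.json()) as Claims // simplified: not the real wire format

  // 2. The partition claim lists the shards (parts) the content was written into.
  for (const part of claims.partition?.parts ?? []) {
    const shardLocation = claims.locations[part]  // location claim for the shard
    const inclusion = claims.inclusions[part]     // inclusion claim: shard -> index
    const indexLocation = inclusion && claims.locations[inclusion.includes]
    if (!shardLocation || !indexLocation) continue

    // 3. Read the index by its location URL; it maps block CIDs to byte offsets.
    const indexBytes = new Uint8Array(await (await fetch(indexLocation.url)).arrayBuffer())

    // 4. Individual blocks can then be read from the shard location URL with HTTP
    //    Range requests at the offsets the index yields, e.g.
    //    fetch(shardLocation.url, { headers: { Range: `bytes=${offset}-${end}` } })
    void indexBytes // index parsing (CARv2 index format) omitted in this sketch
  }
}
```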
