This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Serverless repository #25

Open
KOLANICH opened this issue Nov 25, 2022 · 4 comments
Labels
chat: brainstorming Wild ideas to spur on the inspiration.

Comments

@KOLANICH

If I understand right, you want to store the files in a shared content-addressable store and then access them by symlinks. I wonder whether it would be possible to store the files in IPFS, turning every PC where a package is installed into part of a distributed peer-to-peer repository. No explicit package files would then be needed for online installation: when installing a package, the manager would fetch its files from IPFS, and when repairing or updating, it would fetch only the broken or changed files. The metadata should also be stored in IPFS.
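A minimal sketch of the idea, assuming the Kubo `ipfs` CLI is available; the store path, CID and target path are made-up examples, not anything the package manager actually uses:

```python
# Sketch: fetch a package file from IPFS by its CID into a local
# content-addressable store, then expose it via a symlink.
import subprocess
from pathlib import Path

STORE = Path("/var/cache/pkg-store")  # hypothetical store location

def install_file(cid: str, target: Path) -> None:
    blob = STORE / cid
    if not blob.exists():
        STORE.mkdir(parents=True, exist_ok=True)
        # `ipfs get <cid> -o <path>` writes the object to the given path
        subprocess.run(["ipfs", "get", cid, "-o", str(blob)], check=True)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    target.symlink_to(blob)

# install_file("bafy...exampleCid", Path("/usr/bin/hello"))
```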

Of course, some measures should be taken to let users maintain privacy and security. For example, by default:

  1. Taking part in the distributed repo must be opt-in.
  2. Only the latest version of a package should be announced.
  3. Certain packages can carry metadata forbidding announcing them.
  4. For each package, and each pair of packages, popularity is tracked (in a counting Bloom filter?), and with it the amount of information exposed by the fact that a certain set of packages is present on a certain machine. Only the set of packages that carries the least information (i.e. the set that is the same for almost all users) is then announced (see the sketch after this list).
  5. Unfortunately, not announcing also carries information. I don't know how to deal with that.
  6. Users can opt in to distributing packages they don't need: their files are fetched into the IPFS store, but the symlinks are not created.
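As a rough illustration of the counting structure imagined in item 4, here is a tiny counting Bloom filter; the sizes and hash scheme are arbitrary and not a worked-out design:

```python
# Counting Bloom filter sketch for tracking how often a package
# (or package pair) is reported by participating machines.
import hashlib

class CountingBloomFilter:
    def __init__(self, size: int = 1 << 16, hashes: int = 4):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _indexes(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for idx in self._indexes(key):
            self.counters[idx] += 1

    def estimate(self, key: str) -> int:
        # Upper bound on the true count (hash collisions only inflate it).
        return min(self.counters[idx] for idx in self._indexes(key))

popularity = CountingBloomFilter()
popularity.add("nano")
popularity.add("nano")
print(popularity.estimate("nano"))  # at least 2
```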
@ermo
Member

ermo commented Dec 23, 2022

If I understand right, you want to store the files in a shared content-addressable store and then access them by symlinks ....

The files in the local PC content-addressable storage are 1) fetched as packages, 2) shared per transaction via hard links, and 3) may in the future not map to all of the sub-packages in the package to which they belong (due to user preference).
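For illustration only, point 2) amounts to something like the following; the paths are made up and do not reflect the real store layout:

```python
# Sketch: materialise a file from a content-addressable store into a
# transaction root via a hard link (shared inode, no extra disk space).
import os
from pathlib import Path

CAS = Path("/os/store")           # hypothetical content-addressable store
TRANSACTION = Path("/os/tx/42")   # hypothetical transaction root

def materialise(file_hash: str, relative_path: str) -> None:
    source = CAS / file_hash
    dest = TRANSACTION / relative_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.link(source, dest)

# materialise("3a7bd3e2360a3d...", "usr/bin/nano")
```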

Is there a use case for this proposal or is it just a "huh, wouldn't it be cool if (...)" thing where you're just playing with the idea?

@ermo ermo added the chat: brainstorming Wild ideas to spur on the inspiration. label Dec 23, 2022
@KOLANICH
Author

The files in the local PC content-addressable storage are 1) fetched as packages, 2) shared per transaction via hard links, and 3) may in the future not map to all of the sub-packages in the package to which they belong (due to user preference).

Thanks for the info.

Is there a use case for this proposal or is it just a "huh, wouldn't it be cool if (...)" thing where you're just playing with the idea?

Just a wild idea, inspired by things like BitTorrent and PeerTube. The use case is speeding up package fetching by using local peers (with maybe gigabit speeds and without consuming internet bandwidth when the peers are on the same LAN, which is a likely situation in orgs) and taking load off the central servers.

@ermo
Member

ermo commented Dec 23, 2022

Is a reverse squid proxy not sufficient if the goal is to maximise caching and local bandwidth usage...?

@KOLANICH
Author

No. A proxy is not serverless, and it requires dedicated hardware. A distributed system just shifts the costs to the end users: no need to set up a server, no need to maintain it, no need to upgrade it when it becomes overloaded. That is the killer feature of distributed systems: one just uses a client, and the "server" "magically" emerges.

Think about it as BitTorrent + DHT + webseeds on the central servers. When not enough machines have the update installed, it is fetched from the webseeds; when there are enough, the update is fetched via p2p.
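A toy sketch of that fallback, with made-up URLs and plain HTTP standing in for a real p2p protocol:

```python
# Sketch: try known peers first, fall back to the central webseed.
import urllib.error
import urllib.request

def fetch_file(file_hash: str, peers: list, webseed: str) -> bytes:
    for base in list(peers) + [webseed]:
        try:
            with urllib.request.urlopen(f"{base}/{file_hash}", timeout=5) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError):
            continue  # peer offline or missing the file; try the next source
    raise RuntimeError(f"no source had {file_hash}")

# fetch_file("deadbeef0123", ["http://192.168.1.10:8080"], "https://packages.example.org")
```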

The problem with the usual update systems is that updates are distributed as packed archives which are unpacked and then deleted on installation, so in order to serve updates, clients would have to keep the archives around, which is a big storage overhead.

That's why the p2p layer has to be coupled with the update system. Instead of exposing update archives, the p2p level of the update system should allow fetching individual files. So it is not the archive as a whole that is checksummed, but the individual files in it, forming a Merkle DAG.
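A minimal flavour of per-file checksumming with a single root hash over the package contents; a real Merkle DAG would also cover directories and chunk large files:

```python
# Sketch: hash each file in an extracted package, then derive a root
# hash from the sorted (path, hash) pairs.
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def package_root_hash(root: Path) -> str:
    leaves = sorted(
        (str(p.relative_to(root)), file_hash(p))
        for p in root.rglob("*") if p.is_file()
    )
    digest = hashlib.sha256()
    for rel, h in leaves:
        digest.update(f"{rel}\0{h}\n".encode())
    return digest.hexdigest()

# print(package_root_hash(Path("./extracted-package")))
```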

About storage: certain compression libraries expose an API to pretrain a dictionary (zstd allows this, for example; unfortunately brotli, which has the best compression in my experience, even better than LZMA for some data, has removed the API for creating dictionaries). So I guess storing a pretrained dictionary (the dictionary should be part of the package archive, so it is created only once) could reduce the cost of recompressing data before sending it over the net.
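For reference, a sketch of the dictionary workflow using the third-party zstandard Python bindings (pip install zstandard); where such a dictionary would actually live in the package format is left open here:

```python
# Sketch: train a shared dictionary once at package-build time, then
# compress/decompress individual files against it.
import zstandard
from pathlib import Path

samples = [p.read_bytes() for p in Path("./extracted-package").rglob("*") if p.is_file()]
dictionary = zstandard.train_dictionary(112_640, samples)  # ~110 KiB dictionary

compressor = zstandard.ZstdCompressor(dict_data=dictionary)
compressed = compressor.compress(samples[0])

decompressor = zstandard.ZstdDecompressor(dict_data=dictionary)
assert decompressor.decompress(compressed) == samples[0]
```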
