Support storing multiple files in the same backend object ("fragments") #52
Comments
Please add an option to enable/disable fragments.
Another option would be to use range downloads, so that only the part of the fragment that is needed at the time gets downloaded.
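For illustration, a minimal sketch of what such a range download could look like, shown with boto3 against an S3-style backend. The function and object names are made up, this is not S3QL's backend API, and it glosses over the fact that stored objects are compressed and encrypted:

```python
# Hypothetical sketch: fetch one file's bytes out of a larger fragment object
# using an HTTP range request (illustrated with boto3, not S3QL's backend layer).
import boto3

s3 = boto3.client("s3")

def read_file_from_fragment(bucket, fragment_key, offset, length):
    """Download only `length` bytes starting at `offset` from the fragment object."""
    # "bytes=start-end" is inclusive on both ends.
    byte_range = f"bytes={offset}-{offset + length - 1}"
    resp = s3.get_object(Bucket=bucket, Key=fragment_key, Range=byte_range)
    return resp["Body"].read()

# Example: the metadata layer would have recorded that some small file lives
# at offset 4096, length 1312, inside the (hypothetical) fragment object below.
# data = read_file_from_fragment("mybucket", "s3ql_data_1234", 4096, 1312)
```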
Google Storage supports batched object uploads and downloads. This would give us all the advantages of fragments without any of the drawbacks. Need to check if S3 has something similar and, if not, if this is reason enough to stick with the old plan...
Note that if we drop the plan to implement fragments we'd also be able to simplify the metadata schema and drop one table completely.
S3 doesn't support batched operations. But maybe the latency issue can be addressed by decoupling the number of parallel uploads from the number of upload threads (which can't be very high, because it determines the number of concurrent compression and encryption operations).
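A rough sketch of that decoupling, with hypothetical `prepare`/`upload` stand-ins (zlib stands in for S3QL's compression and encryption step): the CPU-bound preparation pool stays small while many more uploads can be in flight at once.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

N_PREP_THREADS = 4    # bounds concurrent compression/encryption (CPU-bound)
N_UPLOAD_CONNS = 32   # bounds concurrent uploads (mostly waiting on the network)

def prepare(block: bytes) -> bytes:
    # Stand-in for S3QL's compression + encryption of one block.
    return zlib.compress(block)

def upload(key: str, payload: bytes) -> None:
    # Stand-in for a backend PUT request.
    pass

def store_blocks(blocks: dict) -> None:
    with ThreadPoolExecutor(N_PREP_THREADS) as prep_pool, \
         ThreadPoolExecutor(N_UPLOAD_CONNS) as upload_pool:
        # Blocks are prepared by the small pool, but as soon as each one is
        # ready it is handed to the much larger upload pool, so the number of
        # concurrent uploads is no longer tied to the number of prep threads.
        futures = [
            upload_pool.submit(upload, key, payload)
            for key, payload in zip(blocks, prep_pool.map(prepare, blocks.values()))
        ]
        for fut in futures:
            fut.result()  # propagate upload errors
```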
I've decided not to implement this. Revisiting the pros and cons, it is not worth it. Upload speed is better increased by using batched uploads or more concurrent connections (i.e., in the backend layer). For downloading many small files, we should get much better results by implementing some sort of read-ahead than by hoping that files happen to be in the same fragment. Thus I opened #63 and #62 instead (I won't create a bug for read-ahead unless someone actually plans to work on it).
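Since read-ahead comes up here, a minimal sketch of the idea (purely hypothetical, not an existing S3QL feature): when a block is requested, speculatively fetch the next few blocks so their download latency overlaps with work on the current one. Block ids are assumed to be sequential integers and `fetch_block` is a made-up callable.

```python
from concurrent.futures import ThreadPoolExecutor

READAHEAD = 8  # number of blocks to prefetch speculatively

class ReadaheadFetcher:
    def __init__(self, fetch_block):
        self._fetch = fetch_block           # e.g. one backend GET per block id
        self._pool = ThreadPoolExecutor(READAHEAD)
        self._pending = {}                  # block id -> Future

    def get(self, block_id):
        # Start prefetching the blocks we expect to be asked for next.
        for bid in range(block_id + 1, block_id + 1 + READAHEAD):
            if bid not in self._pending:
                self._pending[bid] = self._pool.submit(self._fetch, bid)
        # Serve the requested block, reusing a prefetch if one is in flight.
        fut = self._pending.pop(block_id, None)
        return fut.result() if fut is not None else self._fetch(block_id)
```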
Some providers throttle or ban based on the number of requests. If we have millions of little files this is a problem, because putting or getting the files takes a long time.
E.g. when backing up servers I put /etc into a tar archive so that it is only one file.
Could you provide some examples of providers where this is a problem, examples of some "other FS" that do this, and elaborate on what "pretty fast" means in this context? Otherwise I remain unconvinced :-).
For example, GDrive and Backblaze ban you when you make too many requests in a fixed time window. So if you are uploading 100,000 1 KB files, it will take a long time because of the banning; of course S3QL's retry logic can handle it by waiting and retrying, but it takes a lot of time and you get soft-banned. ProxyFS uses the fragments concept: https://github.com/swiftstack/ProxyFS. Sorry, "pretty fast" was not very descriptive 🥇 Too many requests are a problem because of the latency and provider soft-bans. Anyway @Nikratio thank you for your amazing work!! I love s3ql.
I second that. |
Neither GDrive nor Backblaze is currently supported by S3QL, so I don't think that rate limits on their end should influence this decision. The fsck time is an interesting point - but I do not fully understand the problem. With 900k objects you'd need only 900 separate listing requests (at 1000 keys per request). Even assuming a (pretty high) 0.5 second round-trip latency, that's only 7.5 minutes. To need 60 minutes, a single request would have to take 4 seconds - that seems too high. Could you file a separate issue about this? Please include the backend that you are using. If the time needed for bucket listings really is a problem, I can also think of other solutions (issuing multiple listing requests in parallel); we don't need to introduce fragments just for that.
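For reference, one way such parallel listings could look, sketched with boto3 and sharded on an assumed "s3ql_data_<number>" key naming; a real implementation would have to cover every possible prefix, this is only meant to show the idea.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")

def list_prefix(bucket, prefix):
    # One paginated listing stream for a single key prefix.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def list_all_parallel(bucket, base="s3ql_data_"):
    # Shard the listing into ten independent streams by the first digit after
    # the common prefix and run them concurrently (illustrative only; keys not
    # matching one of these prefixes would be missed).
    prefixes = [f"{base}{d}" for d in "0123456789"]
    with ThreadPoolExecutor(len(prefixes)) as pool:
        results = pool.map(lambda p: list_prefix(bucket, p), prefixes)
    return [key for chunk in results for key in chunk]
```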
Is there documentation for ProxyFS somewhere? Looking at the README file, it sounds to me as if ProxyFS is actually mapping files to objects 1:1. |
GDrive object listing is slow; it can take 2-10 s per request.
Original issue description [migrated from BitBucket]:
Storing lots of small files is very inefficient, since every file requires its own block.
We should add support for fragments, so that multiple files can be stored in the same block.
With the new bucket interface, we should be able to implement this relatively easily:
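To make the idea concrete, here is a toy sketch of packing several small files into a single backend object with an (offset, length) index so that one PUT/GET covers many files. The names and layout are invented for illustration and are not the design referenced above.

```python
import io

def pack_fragment(files: dict):
    """Return (payload, index) where index maps file name -> (offset, length)."""
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index

def unpack_file(payload: bytes, index: dict, name: str) -> bytes:
    offset, length = index[name]
    return payload[offset:offset + length]

# Usage: one backend object instead of three.
payload, index = pack_fragment({"a.conf": b"x" * 100, "b.conf": b"y" * 50, "c.conf": b"z" * 10})
assert unpack_file(payload, index, "b.conf") == b"y" * 50
```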