Preallocate files #2169
Comments
It's easily possible; the information is inside the yEnc header. That's how nzbget does it, and I already made a proof of concept a year or so ago. A much better use case is that this way we could actually implement a proper Retry, where only the missing articles are tried again and filled in at the right spot in the file. That's the reason nzbget implemented it.
There are tons of obfuscation methods: purposely not including PARs, leaving stuff incomplete on purpose (forcing you to get the actual NZB from the source) to prevent site leeching or avoid DMCA, and so on. How often are you getting these NZBs, and is it only from one site? I wonder how much actual benefit there would be, when the majority of the time you just have the data to write anyway. Writing placeholder files is time-consuming and wears on the HDD. Doing the zero prefill would also require additional accounting, since you'd have to know what has been filled and what hasn't.
@thezoggy you're not actually writing data, you say
True, implementing this would make proper Retry possible too.
I'm not familiar with NNTP. If an article is missing, where would the header come from? @thezoggy As for the zero prefill, it doesn't write anything if implemented properly; the OS handles that. I mean, full preallocation is not required to fill missing articles, but I felt it was a very related issue.
If there's no article, indeed we can't parse the header. Does it really need to preallocate the whole file? Why not add data as we go, as long as it ends up in the right location inside the file?
No, full instant preallocation is not required. I guess I shouldn't have mixed these two features into one issue, but they both require accounting for the size of missing articles. I'm guessing most upload tools use a fixed data length for every post in a single NZB. Can we use that to calculate the correct size when an article is missing? Would that work?
Every article that is present will tell you in its yEnc header exactly in what spot its data belongs. So if an article is missing, the next article will just say where its data should be and we can start writing there:
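For illustration, a minimal sketch of that idea (not SABnzbd's actual code; the helper names are made up): the `=ypart` line of a multi-part yEnc article carries 1-based `begin`/`end` byte offsets into the final file, so a decoded article can be written straight at that position, provided the target file has been created beforehand.

```python
import re

def part_range(ypart_line):
    """Parse an '=ypart begin=... end=...' line; offsets are 1-based byte positions."""
    m = re.match(r"=ypart begin=(\d+) end=(\d+)", ypart_line)
    begin, end = int(m.group(1)), int(m.group(2))
    return begin - 1, end                 # 0-based start offset, exclusive end

def write_article(path, ypart_line, decoded_bytes):
    """Write one decoded article at the position its yEnc header declares."""
    start, _ = part_range(ypart_line)
    with open(path, "r+b") as f:          # open the existing file without truncating it
        f.seek(start)
        f.write(decoded_bytes)
```

A missing article then simply leaves a hole that a later Retry (or an external repair source) could fill, since every other article still lands at its correct offset.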
There are benefits to pre-allocation, though the exact behaviour varies across filesystems.
IMO, pre-allocating files is the most sensible approach for a downloader, if possible.
If preallocation gets a go, please make sure to use instant file initialization. IMO, this and filling missing articles should be separate options.
It's interesting, but as I wrote before, I don't see many convincing arguments (e.g. something that would benefit 50%+ of users) for implementing it right now.
Across other applications, I often see it offered as a choice between no preallocation, fast preallocation (using OS-supplied calls, or maybe seeking to the last byte and writing a 0) and "full" preallocation (explicitly zero-filling the file), with fast preallocation as the default. Of course, whether it's worth implementing is another judgement altogether. If I were implementing the system from scratch, I'd definitely take that approach, but changing an existing system is a different cost.
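A rough sketch of those three modes in Python (an example of mine, not code from this thread; it presumes the final size is known up front and uses os.posix_fallocate where available, with the seek-and-write-one-byte trick as the fallback):

```python
import os

def preallocate(path, size, mode="fast"):
    """Create `path` with `size` bytes reserved.

    "none": just create an empty file.
    "fast": ask the OS to reserve the space without writing zeros
            (os.posix_fallocate where available, otherwise seek to the
            last byte and write a single zero).
    "full": explicitly zero-fill the whole file.
    """
    with open(path, "wb") as f:
        if mode == "none" or size == 0:
            return
        if mode == "fast":
            if hasattr(os, "posix_fallocate"):      # Linux and most other POSIX systems
                os.posix_fallocate(f.fileno(), 0, size)
            else:                                   # portable fallback (e.g. Windows)
                f.seek(size - 1)
                f.write(b"\0")
        elif mode == "full":
            block = b"\0" * (1024 * 1024)
            remaining = size
            while remaining > 0:
                n = min(len(block), remaining)
                f.write(block[:n])
                remaining -= n
```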
Sparse files are usually what you want to stay away from, at least on conventional spinning rust: insane fragmentation, and they're extremely slow to read from.
I've yet to see an app that transparently explains this to the end user. qBittorrent still takes the heat for stalling while it writes gigabytes of zeros on Windows for hours with preallocation enabled; the reason for that is the particular libtorrent call it uses for preallocation.
Actually, the way you put it, sparse files are probably what you do want most of the time. Keep in mind that torrent clients tend to download in random order, whilst Usenet would be largely (if not always) sequential, so a sparse file is no worse than having no preallocation. Thanks for sharing the info though!
This sparse file thing is much harder than I hoped. I modified newswrapper to extract ybegin and yend and tried to use them with fseek in assemble, like some examples seemed to imply I could. Unfortunately it creates broken files when I dump all available parts in every loop. Apparently you can't update data inside a file. I don't think the files generated this way are sparse at all, and I'm not sure it would work this way even if they were. If they were, what happens to data that crosses sectors? I think maybe we would have to write data in 4 KB blocks and join sections of parts to make sure they fit properly. I'm using Windows and NTFS. I have verified that the ybegin value I extract for each part is correct (after adjusting for yEnc quirks) by comparing it to
Related to #2459
Hmm, Python has strange open modes; I can't tell which one would be "open in binary for writing without truncating". Did you try that?
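For reference, a small sketch of the open modes in question (my own summary, not the original reply): "r+b" is the one that means "open in binary for read/write without truncating", whereas "wb" truncates on open and "ab" forces writes to the end.

```python
# "wb"  : create/truncate, write-only  -> wipes any existing data
# "r+b" : read/write, no truncation    -> seek-and-overwrite works,
#                                         but the file must already exist
# "ab"  : append-only                  -> writes always land at the end,
#                                         regardless of seek position

with open("target.bin", "r+b") as f:
    f.seek(1024)                       # offset taken from the yEnc header
    f.write(b"decoded article data")
```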
@mnightingale please take a look at the actual implementation of buffering in CPython; I did, and it's much more basic than you might think. There is barely anything smart about it, and the buffering limit is just a constant. Setting buffering=0 gives us a direct file pointer instead of all the useless overhead of buffering logic that we don't need, because we always write blocks much bigger (750 KB) than the buffering limit.
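A tiny illustration of that point (a sketch of mine; buffering=0 is only accepted for binary-mode files):

```python
import io

# Default: binary writes go through a BufferedWriter with its own memory buffer.
with open("target.bin", "wb") as buffered:
    assert isinstance(buffered, io.BufferedWriter)

# buffering=0: open() hands back the raw FileIO object, so every write()
# goes straight to the OS -- fine when each write is already ~750 KB.
with open("target.bin", "wb", buffering=0) as raw:
    assert isinstance(raw, io.FileIO)
```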
@puzzledsab Do you have some code we can look at?
The filesystem APIs are supposed to hide the notion of sectors, so you shouldn't have to worry about that. Essentially, the data will get placed accordingly and split across sectors if necessary.
@mnightingale: I did try it, but apparently not hard enough. It seems to work now.
Right now, missing articles simply get skipped: the remaining data is misaligned and the resulting file size is different.
This behaviour complicates a lot of simple ways of restoring missing chunks, even if you can get them from other sources. You can't easily fill in just the missing data from P2P networks. You can't use RAR's internal recovery record. You can't unpack the existing data from multipart archives, even when the total number of lost articles is relatively small.
I know, I know: NZB files do not provide the length of the actual data, only the number of raw bytes. But you can still make a reasonable guess, and I believe some clients already do that quite well.
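As a rough illustration of such a guess (purely a sketch; the 2% overhead figure and the header allowance are my assumptions, not figures from this issue):

```python
def guess_decoded_size(raw_bytes, overhead=0.02, header_bytes=200):
    """Estimate an article's decoded size from the NZB 'bytes' attribute.

    yEnc escaping adds a few percent on top of the decoded data, plus a
    small =ybegin/=ypart/=yend header and trailer, so this can only ever
    be an approximation.
    """
    return max(0, int(raw_bytes / (1.0 + overhead)) - header_bytes)
```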
I've just had to pull 50 GB of data once again while I was missing less than 1 MB. This is really frustrating. I also see an increasing number of uploaders that intentionally do not provide any PARs, so one missing article trashes the whole batch.
Anyway, this has been brought up a few times on the forums with a lot of valid points, but I don't think it got any traction:
/viewtopic.php?t=9851
/viewtopic.php?t=15373