Preallocate files #2169
Comments
It's easily possible; the information is inside the yEnc header. That's how nzbget does it, and I already made a proof of concept a year or so ago. A much better use case is that this way we could actually implement a proper Retry, where only the missing articles are tried again and filled in at the right spot in the file. That's the reason nzbget implemented it.
There are tons of obfuscation methods: purposely not including PARs, leaving stuff incomplete on purpose (forcing you to get the actual NZB from the source) to prevent site leeching or avoid DMCA, and so on. How often are you getting these NZBs, and is it only from one site? I wonder how much actual benefit there would be, when the majority of the time you just have the data to write anyway. Writing placeholder files is time-consuming and wears on the HDD. Doing the zero prefill would also require additional accounting, since you'd have to know what has been filled and what hasn't.
@thezoggy you're not actually writing data, you say
True, implementing this would make proper Retry possible too.
I'm not familiar with NNTP. If an article is missing, where would the header come from? @thezoggy As for the zero prefill, it doesn't write anything if implemented properly; the OS handles that. I mean, full preallocation is not required to fill missing articles, but I felt it was a very related issue.
If there's no article, indeed we can't parse the header. Does it really need to preallocate the whole file? Why not add data as we go, as long as it ends up in the right location inside the file?
No, full instant preallocation is not required. I guess I shouldn't have mixed these two features into one issue, but they both require accounting for the size of missing articles. I'm guessing most upload tools use a fixed data length for every post in a single NZB. Can we use that to calculate the correct size when an article is missing? Would that work?
Every article that is present will tell you in its yEnc header exactly in what spot its data belongs. So if an article is missing, the next article will just say where its data should be and we can start writing there:
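For illustration, a minimal sketch of that idea (not SABnzbd's actual code; the helper names are made up): the `=ypart` line of a multi-part yEnc article carries 1-based `begin`/`end` byte offsets into the final file, so a decoded article can be written straight at that position, provided the target file has been created beforehand.

```python
import re

def part_range(ypart_line):
    """Parse an '=ypart begin=... end=...' line; offsets are 1-based byte positions."""
    m = re.match(r"=ypart begin=(\d+) end=(\d+)", ypart_line)
    begin, end = int(m.group(1)), int(m.group(2))
    return begin - 1, end                 # 0-based start offset, exclusive end

def write_article(path, ypart_line, decoded_bytes):
    """Write one decoded article at the position its yEnc header declares."""
    start, _ = part_range(ypart_line)
    with open(path, "r+b") as f:          # open the existing file without truncating it
        f.seek(start)
        f.write(decoded_bytes)
```

A missing article then simply leaves a hole that a later Retry (or an external repair source) could fill, since every other article still lands at its correct offset.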
There are benefits to pre-allocation, though the exact behaviour varies across filesystems.
IMO, pre-allocating files is the most sensible approach for a downloader, if possible.
If preallocation gets a go, please make sure to use instant file initialization. IMO, this and filling missing articles should be separate options.
It's interesting, but as I wrote before, I don't see many convincing arguments (e.g. something that would benefit 50%+ of users) for implementing it right now.
Across other applications, I often see it offered as a choice between no preallocation, fast preallocation (using OS-supplied calls, or maybe seeking to the last byte and writing a 0) and "full" preallocation (explicitly zero-filling the file), with fast preallocation as the default. Of course, whether it's worth implementing is another judgement altogether. If I were implementing the system from scratch, I'd definitely take that approach, but changing an existing system is a different cost.
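A rough sketch of those three modes in Python (an example of mine, not code from this thread; it presumes the final size is known up front and uses os.posix_fallocate where available, with the seek-and-write-one-byte trick as the fallback):

```python
import os

def preallocate(path, size, mode="fast"):
    """Create `path` with `size` bytes reserved.

    "none": just create an empty file.
    "fast": ask the OS to reserve the space without writing zeros
            (os.posix_fallocate where available, otherwise seek to the
            last byte and write a single zero).
    "full": explicitly zero-fill the whole file.
    """
    with open(path, "wb") as f:
        if mode == "none" or size == 0:
            return
        if mode == "fast":
            if hasattr(os, "posix_fallocate"):      # Linux and most other POSIX systems
                os.posix_fallocate(f.fileno(), 0, size)
            else:                                   # portable fallback (e.g. Windows)
                f.seek(size - 1)
                f.write(b"\0")
        elif mode == "full":
            block = b"\0" * (1024 * 1024)
            remaining = size
            while remaining > 0:
                n = min(len(block), remaining)
                f.write(block[:n])
                remaining -= n
```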
Sparse files are usually what you want to stay away from, at least on conventional spinning rust: insane fragmentation, and they're extremely slow to read from.
I've yet to see an app that transparently explains this to the end user. qBittorrent still takes the heat for stalling while it writes gigabytes of zeros on Windows for hours with preallocation enabled; the reason for that is the particular libtorrent call it uses for preallocation.
Actually, the way you put it, sparse files are probably what you do want most of the time. Keep in mind that torrent clients tend to download in random order, whilst Usenet would be largely (if not always) sequential, so a sparse file is no worse than having no preallocation. Thanks for sharing the info though!
This sparse file thing is much harder than I hoped. I modified newswrapper to extract ybegin and yend and tried to use them with fseek in assemble, like some examples seemed to imply I could. Unfortunately it creates broken files when I dump all available parts in every loop. Apparently you can't update data inside a file. I don't think the files generated this way are sparse at all, and I'm not sure it would work this way even if they were. If they were, what happens to data that crosses sectors? I think maybe we would have to write data in 4 KB blocks and join sections of parts to make sure they fit properly. I'm using Windows and NTFS. I have verified that the ybegin value I extract for each part is correct (after adjusting for yEnc quirks) by comparing it to
Related to #2459
Hmm, Python has strange open modes; I can't tell which one would be "open in binary for writing without truncating". Did you try that?
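For reference, a small sketch of the open modes in question (my own summary, not the original reply): "r+b" is the one that means "open in binary for read/write without truncating", whereas "wb" truncates on open and "ab" forces writes to the end.

```python
# "wb"  : create/truncate, write-only  -> wipes any existing data
# "r+b" : read/write, no truncation    -> seek-and-overwrite works,
#                                         but the file must already exist
# "ab"  : append-only                  -> writes always land at the end,
#                                         regardless of seek position

with open("target.bin", "r+b") as f:
    f.seek(1024)                       # offset taken from the yEnc header
    f.write(b"decoded article data")
```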
@mnightingale please take a look at the actual implementation of buffering in CPython; I did, and it's much more basic than you might think. There is barely anything smart about it, and the buffering limit is just a constant. Setting buffering=0 gives us a direct file pointer instead of all the useless overhead of buffering logic that we don't need, because we always write blocks much bigger (750 KB) than the buffering limit.
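A tiny illustration of that point (a sketch of mine; buffering=0 is only accepted for binary-mode files):

```python
import io

# Default: binary writes go through a BufferedWriter with its own memory buffer.
with open("target.bin", "wb") as buffered:
    assert isinstance(buffered, io.BufferedWriter)

# buffering=0: open() hands back the raw FileIO object, so every write()
# goes straight to the OS -- fine when each write is already ~750 KB.
with open("target.bin", "wb", buffering=0) as raw:
    assert isinstance(raw, io.FileIO)
```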
@puzzledsab Do you have some code we can look at?
The filesystem APIs are supposed to hide the notion of sectors, so you shouldn't have to worry about that. Essentially, the data will get placed accordingly and split across sectors if necessary.
@mnightingale: I did try it, but apparently not hard enough. It seems to work now.
Right now, missing articles simply get skipped: the remaining data is misaligned and the resulting file size is different.
This behaviour complicates a lot of simple ways of restoring missing chunks, even if you can get them from other sources. You can't easily fill in just the missing data from P2P networks. You can't use RAR's internal recovery record. You can't unpack the existing data from multipart archives, even when the total number of lost articles is relatively small.
I know, I know: NZB files do not provide the length of the actual data, only the number of raw bytes. But you can still make a reasonable guess, and I believe some clients already do that quite well.
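As a rough illustration of such a guess (purely a sketch; the 2% overhead figure and the header allowance are my assumptions, not figures from this issue):

```python
def guess_decoded_size(raw_bytes, overhead=0.02, header_bytes=200):
    """Estimate an article's decoded size from the NZB 'bytes' attribute.

    yEnc escaping adds a few percent on top of the decoded data, plus a
    small =ybegin/=ypart/=yend header and trailer, so this can only ever
    be an approximation.
    """
    return max(0, int(raw_bytes / (1.0 + overhead)) - header_bytes)
```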
I've just had to pull 50 GB of data once again while I was missing less than 1 MB. This is really frustrating. I also see an increasing number of uploaders that intentionally do not provide any PARs, so one missing article trashes the whole batch.
Anyway, this has been brought up a few times on the forums with a lot of valid points, but I don't think it got any traction:
/viewtopic.php?t=9851
/viewtopic.php?t=15373