
qbt with lt 1.1 has worse torrent rechecking/creating speed on Windows #9061

Closed
airium opened this issue Jun 6, 2018 · 34 comments

@airium
Contributor

airium commented Jun 6, 2018

This issue extends #8181 (the only one I found relevant). It identifies a probable libtorrent-side performance degradation with respect to torrent rechecking on Windows. Maybe this should be forwarded to libtorrent, but I think it is better forwarded by the qbt main devs; otherwise I will do it later.

qBittorrent version and Operating System

Windows 10 1709/1803, but I think this is general to at least Windows 7 and later
qBittorrent 3.3.16, 4.0.3, 4.0.4, 4.1.0, 4.1.1, but I think it's general to qbt 3.3+ and 4+
libtorrent 1.0.11 (at 62c9679) and 1.1.7, but I think it's general to 1.0.6+ and 1.1+

What is the problem

I benchmarked torrent rechecking/creating with qbt 4.1.1 + lt 1.0.11/1.1.7, plus uT 2.2.1 for comparison. As the results below show, qBittorrent 4.1.1 has worse torrent rechecking/creating performance when using libtorrent 1.1.7. This slower speed is actually common to any qBittorrent 3.3+ or 4+ built against libtorrent 1.1.x. Built against libtorrent 1.0.x, e.g. 1.0.11, qBittorrent is much faster at this. The potential disk read and hash capability of libtorrent 1.0.11 should be much higher than uTorrent's, as shown by the 1000 MB/s creating speed on the NVMe SSD, but there is a degradation on rechecking, and a further drive-dependent degradation on the SATA3 SSD that is not seen with uTorrent.

torrent creating (speed / disk active time)

| | NVMe SSD (SM951) | SATA SSD (X400) | 3.5" HDD (ST8000DM) | 2.5" HDD (ST2000LM) |
| --- | --- | --- | --- | --- |
| 4.1.1 + 1.1.7 | 260-280 MB/s / 100% | 130-150 MB/s / ~60% | 80-85 MB/s* / ~80% | 130-135 MB/s / ~60% |
| 4.1.1 + 1.0.11 | ~1.0 GB/s / 100% | 460-480 MB/s / ~85% | 195-205 MB/s / ~97% | 130-135 MB/s / ~96% |
| 2.2.1 | 450-470 MB/s / 100% | 450-470 MB/s / ~90% | 195-200 MB/s / 90-95% | 130-135 MB/s / ~95% |

torrent rechecking (speed / disk active time)

| | NVMe SSD (SM951) | SATA SSD (X400) | 3.5" HDD (ST8000DM) | 2.5" HDD (ST2000LM) |
| --- | --- | --- | --- | --- |
| 4.1.1 + 1.1.7 | 260-280 MB/s / 100% | 130-150 MB/s / ~60% | 80-85 MB/s* / ~80% | 130-135 MB/s / ~60% |
| 4.1.1 + 1.0.11 | 550-620 MB/s / ~30% | 230-280 MB/s / ~65% | 195-205 MB/s / ~76% | 130-135 MB/s / ~85% |
| 2.2.1 | 465-475 MB/s / 100% | 440-460 MB/s / ~90% | 195-200 MB/s / 90-95% | 130-135 MB/s / ~95% |

*The performance here is strange, but I tested it multiple times and the value is accurate. I have no idea why qbt 4.1.1 + lt 1.1.7 should be slower on the ST8000 than on the ST2000.

4.1.1+1.1.7 = the qbt official x64 build of qbt 4.1.1, lt 1.1.7, boost 1.67.0, Qt 5.10.1
4.1.1+1.0.11 = my x64 build of qbt 4.1.1, lt 1.0.11 (at 62c9679), boost 1.65.1, Qt 5.10.1
2.2.1 = uTorrent 2.2.1 build 25302
all using default configuration

Sequential read capability (outermost cylinder for HDDs):
SM951 = Samsung SM951 512G MLC M.2 NVMe SSD, ~1500MB/s, internal PCIe 3.0 4X
X400 = SanDisk X400 1TB TLC M.2 SATA3 SSD, ~500MB/s, internal SATA3
ST8000 = Seagate ST8000DM004, 3.5" 8T 5425R 256M, ~200MB/s, mounted via USB3.0
ST2000 = Seagate ST2000LM003, 2.5" 2T 5400R 32M, ~135MB/s, internal SATA3
CPU is 6700K, RAM 48GB

The test is based on 30 GB of large files (typically 2 GB per file), with a 4 MiB piece size
Speed and active time are the values shown in Task Manager
Both HDDs were reformatted before testing

What is the expected behaviour

Libtorrent 1.1.x should have at least the same disk performance when creating/rechecking torrents as 1.0.x.
Maybe qBittorrent should reconsider which libtorrent version to use until libtorrent 1.1 improves.

Steps to reproduce

You need at least an SSD (assuming you are not using a "flash disk"-grade SSD at 300 MB/s or lower). For comparison, the lt 1.1.x group can include any official qbt 4.x build (all of which are built against lt 1.1.x), and the lt 1.0.x group can include the official qbt 3.3.16 build (which is on lt 1.0.11) and my builds of qbt 4.1.1 against lt 1.0.11. You can also build other qbt + lt combinations. My note here is that lt 1.0.11 at 4e90eb1 (the one on the releases page) or earlier does not seem to compile with boost 1.65.0 or above, but 1.64.0 is fine; lt 1.0.11 at 62c9679 does not compile with boost 1.66.0 or above, but 1.65.1 is fine.

Note also that, due to the Windows system read cache, you should do enough other read/write work before creating/rechecking to evict the torrent content from memory; otherwise the operation will hit the in-memory cache and run at an artificially high speed.

Extra info (if any)

I can upload all screenshots later (maybe in one or two days) if necessary.

As for Linux, I did not perform a similar test, as I always compile libtorrent 1.0.11 there. If necessary I can test it on my seedboxes, but that might take a week or more.

Previously I noticed the discussion about the Windows I/O and cache issue, so I also investigated the related libtorrent session settings (which typically become settings_pack entries in the 1.1 branch). I altered settings including enable_os_cache, low_priority_disk and coalesce_reads (via the qBittorrent GUI or by modifying the qbt source code), but the results showed no change. This needs further investigation.
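
For reference, a minimal sketch (assuming libtorrent 1.1's settings_pack API; the setting I call low_priority_disk is `low_prio_disk` there, and enable_os_cache is a value of `disk_io_read_mode`; the values here are illustrative only, not what qbt ships) of how such options can be applied at runtime instead of patching the source:

```cpp
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

namespace lt = libtorrent;

// Illustrative only: apply a few of the disk-related settings mentioned above
// to a running libtorrent 1.1 session.
void apply_disk_tweaks(lt::session& ses)
{
    lt::settings_pack pack;
    // use the OS read cache for disk reads
    pack.set_int(lt::settings_pack::disk_io_read_mode,
                 lt::settings_pack::enable_os_cache);
    // do not lower the I/O priority of disk operations
    pack.set_bool(lt::settings_pack::low_prio_disk, false);
    // merge adjacent 16 KiB block reads into larger read calls
    pack.set_bool(lt::settings_pack::coalesce_reads, true);
    ses.apply_settings(pack);
}
```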

I have been trying to fix this issue, but I don't think I can do it faster or better than the current main qbt and lt devs. If the main devs are willing to fix it, I can help run benchmarks.

@thalieht
Contributor

thalieht commented Jun 7, 2018

Thanks for the benchmarks! Although I have nothing to add, I'd like to say that maybe you want to benchmark libtorrent master, which (I think) can be used with this qBt repo: https://github.com/zeule/qbit
But maybe you should first ask arvidn whether there are any improvements in that regard in libtorrent master.

@Seeker2

Seeker2 commented Jun 7, 2018

There may have been a change to make force rechecking less "hard" on drives and CPUs.

@Chocobo1
Member

Chocobo1 commented Jun 7, 2018

@arvidn
Ping, you might want to see this.

@arvidn
Contributor

arvidn commented Jun 7, 2018

@airium what is the percent figure in the table?

My first guess at the factor dominating this throughput is the settings_pack::checking_mem_usage configuration option. This essentially caps the buffer size of pieces in-flight, being checked. It's specified in blocks of 16kiB, and defaults to 256 blocks (i.e. 4 MiB). If the bandwidth delay product exceeds this, you will experience degraded performance. In libtorrent 1.1.x, this delay includes the communication between the main thread and the disk thread(s), which likely explains the degraded performance; the bandwidth delay product is larger.

Perhaps the default should be 8 MiB. @airium would you mind doubling or quadrupling this setting and see how it affects the checking throughput?
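
To be concrete, here's a sketch of what that change looks like through settings_pack (values only for illustration; the unit is 16 kiB blocks, so 512 = 8 MiB and 1024 = 16 MiB):

```cpp
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

namespace lt = libtorrent;

// quadruple the in-flight checking buffer from the current default of
// 256 blocks (4 MiB) to 1024 blocks (16 MiB)
void bump_checking_mem(lt::session& ses)
{
    lt::settings_pack pack;
    pack.set_int(lt::settings_pack::checking_mem_usage, 1024);
    ses.apply_settings(pack);
}
```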

@airium
Contributor Author

airium commented Jun 7, 2018

@arvidn

> what is the percent figure in the table?

It is simply the disk utilisation percentage shown in Task Manager. I just logged it, but this value might not really indicate much. Empirically, 100% does not necessarily mean the disk throughput has been exhausted.
[screenshot: Task Manager]

> would you mind doubling or quadrupling this setting and see how it affects the checking throughput?

Of course, I will upload the result maybe later today.

@airium
Contributor Author

airium commented Jun 9, 2018

@arvidn Sorry for my delay. Here is the result, which I put in a Google Sheet:
https://docs.google.com/spreadsheets/d/1q5eL3j4HVGU_wZel23M7bvcRomJRdAnNylo0yqYHKCQ/edit?usp=sharing
The data in the table were all re-run. I also added some entries for the official deluge 1.3.15 build for comparison; it is on top of libtorrent 1.0.11. It seems that for qbt 4.1.1 + 1.1.7 the piece size doesn't really matter, though a slight difference is visible. But with libtorrent 1.0.11, both qbt and deluge see some improvement with smaller piece sizes.

@airium
Contributor Author

airium commented Jun 9, 2018

Also, sorry, I don't have time to fill in every entry in the table, but if you want specific entries just tell me.

@arvidn
Contributor

arvidn commented Jun 9, 2018

I'm primarily interested in whether settings_pack::checking_mem_usage improves checking throughput or not. I don't see this being reflected in the table.

@airium
Contributor Author

airium commented Jun 9, 2018

@arvidn Oh sorry, I misunderstood your reply. I will now change settings_pack::checking_mem_usage.
Stand by and I will update the results within an hour.

@airium
Contributor Author

airium commented Jun 9, 2018

@arvidn So this time I altered settings_pack::checking_mem_usage to:

- `SET(checking_mem_usage, 1024, 0)`, i.e. 4x the default value
- `SET(checking_mem_usage, 32, 0)`, i.e. 1/8 of the default
- `SET(checking_mem_usage, 256000, 0)`, i.e. 1000x the default, as an aggressive one

However, the three builds with 32, 1024 and 256000 show no observable difference in torrent rechecking/creating speed compared with the default build with 256.

Here are my builds: https://drive.google.com/open?id=16cerOCIJOAW2ksQhu8PgslDJxWLZvUx_

@khnielsen

I am actually seeing the same thing here. It is not limited to checking speed: actual torrent speeds are also much faster and more stable when libtorrent 1.0.11 is used rather than 1.1.x.

arvidn, I have been emailing you about this, as I can reproduce it not only on Windows but also on Linux (using deluge); 1.0.11 outperforms the newer versions in all of my testing.

I have tried nearly everything I can think of, nearly all options using "ltconfig" on my Linux installs, and I have simply not been able to get 1.1.x anywhere near the speeds of 1.0.11, in terms of network throughput but also in terms of creating and checking torrents.

@arvidn
Contributor

arvidn commented Jun 13, 2018

@khnielsen would you be able to enable session stats logging (in the alert mask) and post the logs here?
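
In case it helps, here is a rough sketch (assuming libtorrent 1.1's C++ API; adapt as needed for the client you use) of one way to dump the session stats counters periodically:

```cpp
#include <libtorrent/session.hpp>
#include <libtorrent/session_stats.hpp>
#include <libtorrent/alert_types.hpp>
#include <iostream>
#include <vector>

namespace lt = libtorrent;

// call this periodically (e.g. once per second) while the transfer/check runs
void log_session_stats(lt::session& ses)
{
    ses.post_session_stats(); // queues a session_stats_alert

    std::vector<lt::alert*> alerts;
    ses.pop_alerts(&alerts);
    for (lt::alert* a : alerts)
    {
        // session_stats_alert::message() renders all counters on one line;
        // the column names can be printed once via session_stats_metrics()
        if (lt::alert_cast<lt::session_stats_alert>(a))
            std::cout << a->message() << "\n";
    }
}
```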

@khnielsen

I would if you could provide some documentation on how to do that; I am a bit lost on how to proceed. My findings on Linux are not backed by anything but testing and messing around with it, and I have very little knowledge of how to get the proper logs for you.

@Chocobo1
Member

Chocobo1 commented Jun 16, 2018

@airium
I'm not sure if this is relevant but have you tried bumping settings_pack::aio_threads?
arvidn/libtorrent#3005

@arvidn
Contributor

arvidn commented Jun 16, 2018

I ran some tests to see what I could find on macOS with a spinning disk. Creating and checking torrents are slightly different. For checking, there is currently logic to concentrate the checking jobs onto a single thread (as long as there are 4 or more disk threads).

There's an effect where spreading out the checking over more threads slows it down, presumably because of disk I/O requests not being perfectly sequential. It's possible that having aio_threads < 4 primarily makes it worse by spreading out the hashing across 3 threads.

However, when creating a torrent, the hashing happens in a single thread.

In my testing, the CPU time used for hashing is insignificant compared to the time of the disk read operations. This is on a spinning disk on macOS. People who experience this problem may have a system with different characteristics, like an SSD for instance.

@arvidn
Contributor

arvidn commented Jun 16, 2018

looking closer at the code, I think the main reason for the poor performance is probably that libtorrent reads 16kiB at a time from the disk. I think the original motivation for this is to keep the code simple and to not use excessive amounts of memory when the piece size is large.

I will try to improve this.

@arvidn
Contributor

arvidn commented Jun 16, 2018

anyone wants to give this patch a try? arvidn/libtorrent#3112

@arvidn
Contributor

arvidn commented Jun 17, 2018

iirc, Windows especially is very sensitive to small read calls. This patch reads entire pieces at a time, rather than 16 kiB, but it still falls back to the generic (16 kiB at a time) logic in case it fails to allocate enough space in the cache.

@airium
Contributor Author

airium commented Jun 18, 2018

Sorry for my delay in response.

> I'm not sure if this is relevant but have you tried bumping settings_pack::aio_threads?

@Chocobo1 Thank you for your advice, but this setting does not actually matter. You can find my test packages at the link below; the ones named "aio_thread_1" and "aio_thread_64" are the relevant builds.

> anyone wants to give this patch a try?

@arvidn I tried this one and it works! The creating/rechecking speed on the SSDs is almost 2x that of the current RC_1_1, close to the speed of RC_1_0. As for the previously noted strange observation, i.e. the abnormally slow speed on my 3.5" HDD, I have identified it as a fault on my side, so nothing is needed for it.
Here is the test package, the one with "fix_hash_perf_1.1": test builds

@arvidn
Contributor

arvidn commented Jun 18, 2018

thanks for testing and making these builds! This will be part of the 1.1.8 release

@airium
Contributor Author

airium commented Jun 18, 2018

@arvidn btw, before I close this issue, do you have any immediate ideas to further improve this performance? As we can see, uTorrent still outperforms libtorrent in some cases (e.g. 500 MB/s vs 200-300 MB/s on the SATA3 SSD). I am still interested in beating uT and plan to look into how libtorrent implements this part; maybe you could offer a starting point.

@arvidn
Contributor

arvidn commented Jun 18, 2018

the only other thing I can think of would be to bump the settings_pack::checking_mem_usage

@khnielsen

What kinds of things changed in the past, arvid? I can concur that the patch improves performance, but from my testing it is still ~250 MB/s for 1.1.7 with the patch vs 420 MB/s for 1.0.11, which is a massive difference.

I would assume that newer versions bring better performance, not the other way around. From my testing, 1.0.11 generally outperforms the 1.1.x branch in basically everything, from connecting to peers and network throughput on 10G interfaces to hashing speed.

@arvidn
Contributor

arvidn commented Jun 18, 2018

@khnielsen a lot of things changed, perhaps the most important one was to support more than one disk I/O thread.

would you like to build and run performance regression tests, to avoid changes that negatively impact performance in the future?
You can try the master branch today, and there's even a new-disk-io branch which overhauls the disk I/O subsystem in an attempt to make it simpler and faster.

@airium
Contributor Author

airium commented Jun 18, 2018

@arvidn I have some new findings, by chance. On top of the hash-job-performance-1.1 branch, a large increase in aio_threads and checking_mem_usage, both at once, can effectively boost disk throughput during torrent rechecking. As an example, when I increase aio_threads to 64 (= 16x the default) and checking_mem_usage to 25600 (= 100x the default), the speed on the SM951 reaches 1.5+ GB/s and the X400 reaches 550 MB/s, both at their theoretical max bandwidth, fully utilised.

Screenshots first:

[screenshot: NVMe SSD] *columns are for different torrents, as 1.7 GB/s checks any torrent very quickly; notably, the CPU usage is correspondingly high

[screenshot: SATA3 SSD] *stable ~550 MB/s on the SATA3 SSD

However, increasing only one of the two has no effect, i.e. only increasing aio_threads or only increasing checking_mem_usage. Apart from this, an identical change (4 -> 64 and 256 -> 25600) but without the patch shows a weaker improvement (~250 MB/s -> 300-400 MB/s). Further testing shows that smaller increases do not have a similar effect; e.g. only doubling aio_threads and checking_mem_usage to 8 and 512 respectively gives nearly no improvement. Finally, the change has no effect on creating (maybe because creating a torrent only uses one thread, as said).
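
For reference, the combination boils down to roughly this (a sketch using the settings_pack API; in my test builds I actually changed the defaults in the libtorrent source, and these values are just what I tested, not recommended defaults):

```cpp
#include <libtorrent/settings_pack.hpp>

namespace lt = libtorrent;

// both values must be raised together to see the effect described above
lt::settings_pack fast_checking_pack()
{
    lt::settings_pack pack;
    pack.set_int(lt::settings_pack::aio_threads, 64);           // default 4
    pack.set_int(lt::settings_pack::checking_mem_usage, 25600); // default 256 (in 16 KiB blocks)
    return pack;
}
```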

Windows binaries for test: the same google drive link

I also noticed that in the past hours a few more I/O-related patches were introduced, but I won't have time to test them or other changes in the next few days. Sorry in advance.

And just a note here: I also found some other problems during these tests. One is that libtorrent 1.1 executes all queued torrent rechecking jobs simultaneously, unlike 1.0. This might not be expected, especially on HDDs, as they are slower in that case. Another is that, when creating a torrent, clicking "Cancel" does not actually stop the job immediately; the hard disk is still being read until the job finishes (but produces no torrent). These two problems should be discussed as new issues.

@arvidn
Contributor

arvidn commented Jun 19, 2018

> One is that libtorrent 1.1 executes all queued torrent rechecking jobs simultaneously, unlike 1.0

It's not supposed to do this by default. It suggests the active_checking setting is greater than 1. Another possible explanation is that force started torrents also apply to checking. They will be force checked so to speak.

@arvidn
Contributor

arvidn commented Jun 19, 2018

@ssiloti I can't recall the motivation for concentrating hash jobs into specific threads. I imagine it would have benefits of making disk I/O more sequential when checking a whole torrent (by serialising them all in a single thread). But I can also imagine a benefit being that it prevents hash jobs from starving out other disk jobs, just because there are so many of them.

I get the feeling, though, that in this case, with an SSD, the bottleneck is not the disk I/O but the CPU cost of SHA-1. Does that sound reasonable? I can't think of a good architecture that would satisfy both the I/O-bound and the CPU-bound case.

As for the checking_mem_usage setting, I believe this is a direct consequence of moving the loop over the pieces out into the network thread (out of the disk thread, as it was in 1.0). Perhaps there should be some more sophisticated logic to determine the number of outstanding hash jobs, based on the bandwidth delay product. Or maybe not; maybe the limit should just be raised 100x. As long as hash jobs are posted to dedicated threads and a dedicated job queue, there is no significant cost to having a long backlog of hash jobs in there. They don't allocate any memory up-front.
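
To put a rough number on the bandwidth delay product (the latencies here are purely illustrative, not measurements): at 500 MB/s of sustained disk throughput and, say, 5 ms of round-trip delay between the network thread and the disk/hash threads, the product is 500 MB/s × 0.005 s = 2.5 MB, roughly 160 blocks of 16 kiB, which still fits under the old 256-block default. At 1.5 GB/s and 10 ms it is 15 MB, roughly 960 blocks, far above it.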

@Chocobo1
Member

> It's not supposed to do this by default. It suggests the active_checking setting is greater than 1.

Just want to add that qbt doesn't touch that parameter, it should be at the default 1.

@ssiloti

ssiloti commented Jun 20, 2018

I think the splitting of hash jobs into their own threads was before my time, but preventing them from starving other jobs seems like the most plausible justification.

A 6700K CPU is capable of computing SHA1 at well over 2GB/s with four hash threads, so it shouldn't be a bottleneck even with the NVMe drive as long as aio_threads is set to at least 16.

I suspect the biggest drag on performance here is insufficient pipelining of hash jobs combined with insufficient parallelism. SSDs, especially high speed NVMe drives, really need to have multiple requests outstanding to hit their peak performance, so it makes sense that @airium needed both an increased number of active jobs and multiple hash threads to get the most out of those drives.

I think increasing aio_threads is probably a good idea, at least on master where we have a scalable thread pool. We should also probably increase the minimum number of active hash jobs when checking to 4 or 8 or maybe even 16. I think increasing checking_mem_usage 100x is probably overkill; our latency should not be high enough to need that many active jobs. In fact, I wonder if we even need the checking_mem_usage setting. Calculating the number of active checking jobs by applying a fixed multiplier to aio_threads should Just Work for any reasonable situation I can think of.
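
Purely as a sketch of that last idea (hypothetical code, not anything in libtorrent; the multiplier and floor are made-up numbers):

```cpp
#include <algorithm>

// derive the number of outstanding checking jobs from the number of hash
// threads rather than from a fixed memory cap
int active_checking_jobs(int const aio_threads)
{
    int const jobs_per_hash_thread = 4; // assumed multiplier
    return std::max(8, aio_threads * jobs_per_hash_thread);
}
```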

@ssiloti

ssiloti commented Jun 20, 2018

Actually, never mind about increasing the minimum active jobs and removing checking_mem_usage; I think it does make sense the way it is. I still think the number of active jobs should be multiplied by the number of hash threads, though.

@arvidn
Contributor

arvidn commented Jun 24, 2018

could someone make a new test build with the latest RC_1_1 of libtorrent? I believe the situation has been improved now.

  1. hash disk jobs now read whole pieces at a time from disk
  2. read and write jobs are coalesced by default on windows (as the preadv/writev emulation seems to be slow)
  3. the default checking_mem_usage has been increased
  4. the minimum number of outstanding disk jobs now scales with the number of hash threads

@airium
Contributor Author

airium commented Jun 29, 2018

Sorry, I have been busy. Here is my build of qbt 4.1.1 + lt 1.1.8, both at release: Google Drive
As far as I can tell, it makes no difference compared with the previous build on hash-job-performance-1.1. In other words, the performance of disk reading / torrent rechecking / creating is similar to that of 1.0.11 (maybe still 15% lower than the latter, but much better than 1.1.7 and earlier).

This time I want to point to the statistics panel of qBittorrent. I found that on 1.1.8, during rechecking it shows queued I/O jobs at 3 and total_buffer_size at 4 MB, and gives 200-300 MB/s on the SATA3 SSD, while older builds with default 1.1.7 settings show queued I/O jobs at 1 and total_buffer_size at 4 MB, with 70-140 MB/s. The build I compiled with aio_threads = 64 and checking_mem_usage = 25600 pushes queued I/O jobs to 84 and total_buffer_size to several hundred MB, giving 500+ MB/s (and amazingly 2.2 GB/s on the NVMe SSD if also given a cache size larger than 64 MB). Maybe this gives some hints for further improvement (e.g. there are simply too few I/O jobs). So I am curious about libtorrent's internal mechanism that leads to the different queued I/O job counts, but I cannot spare the time to look into it.

And I still encounter the concurrent rechecking problem this time. To be specific, when multiple torrents are queued for rechecking, they should normally run one by one while the others stall in the "queued for checking" status. However, with libtorrent 1.1.x none of them currently enters this queued state. I noticed #9120 discussed the rechecking behaviour; maybe a better policy would be to allow concurrent rechecking only across different hard disks, while torrents on the same disk run one by one. I don't know whether this is feasible in libtorrent. A similar symptom should be noted here: moving torrents is also executed concurrently. That is, if one moves several torrents, all of them are now moved at the same time instead of one by one, which is especially unwanted when moving to HDDs. I have no idea whether this problem comes from libtorrent or qBittorrent.

Lastly, some of my friends reported crash issues with 1.1.8, with both deluge and qBittorrent. I may also have encountered one yesterday on one of my seedboxes, but we didn't collect any trace info.

@seiferflo

I just want to report that I tested airium's latest build and it's night and day. Before, I constantly had many error / checking issues, mainly with big files >30 GB going to a NAS. Now I have almost no issues.
Download speed is much better as well, almost maxed out, though I need to keep playing with the cache / connections per torrent, as I get a speed drop after a couple of hours or with more than 2 files.

Also, when I exit qBittorrent, it really stops now. Before, it used to keep running for 5 minutes in the background... Launching qBittorrent again led to "checking" on all files.

So thanks for fixing all this, I was about to drop qBittorrent. Nice one!

@airium
Contributor Author

airium commented Aug 2, 2018

Considering the disk performance is now much better, I am closing this issue.
Thank you, devs, for improving things.
