Getting around the 16 KB SSL speed limit #2352
-
I am very pessimistic about the SSL patch being accepted 😢 I have looked at using multiple processes but was always held back by the amount of synchronization that I think is required for it to work. Additionally, it spins up 2 Python processes, each with the base memory usage caused by the imports that we have. Since that's already something like 35 MB, we would need 70 MB of memory just to boot up. Personally, I think the most value we can get is from moving things to C extension(s) where we can lift the GIL.

EDIT: The biggest conceptual issue there is that we need to write the data in sequence so we can calculate the MD5. Otherwise we could write the data directly to disk. There are also things like: https://github.com/mypyc/mypyc
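The in-sequence constraint comes from hashing: an MD5 over the assembled file has to consume the bytes in file order. A running hash means whole articles need not be buffered, but out-of-order disk writes would force a second read pass to compute the digest. A minimal stdlib illustration (the article chunks here are made up, not from SABnzbd):

```python
import hashlib

# Articles must be fed to the hash in file order; a running MD5 avoids
# keeping the whole file in memory, but out-of-order writes would still
# require re-reading the file to compute the digest afterwards.
articles = [b"article-1 ", b"article-2 ", b"article-3"]

running = hashlib.md5()
for chunk in articles:          # in order: hash can be updated as we write
    running.update(chunk)

whole = hashlib.md5(b"".join(articles))
assert running.hexdigest() == whole.hexdigest()
```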
-
... or Rust (safer than C), combined with PyO3: https://github.com/PyO3/pyo3
Oh, wow, that is cool. Pure Python versus the compiled module ... pure magic!
-
Let me check: completely outside Python, with a C / Rust program, or with mypyc? Let SAB/Python set up the socket to the news server, and use that from a C / Rust / mypyc module: is that what we're talking about? So writing to a file on disk, or should the article go into memory?
-
Multiple processes would use more RAM, but it wouldn't double with two processes. Fork uses copy-on-write, so only data that gets changed after the fork uses extra RAM. I think we'd have to use the spawn method, though, because Windows can't use fork and using different methods on different OSes would be harder to maintain. On the other hand, spawn only loads what is necessary for the child process. Yes, there would be a lot of messaging back and forth. That's not necessarily a problem, but some calls that are blocking now might need to be made non-blocking. That could be hard.

The thing is, almost everything that can be done in C already is. Here are the top methods from a Yappi profile of Sab after downloading for a short while:

ttot = CPU time used by that method plus sub-methods (including C modules, I believe)

If my interpretation is correct, the downloader thread uses the most time by far, and it's mostly because of recv_chunk and reading sockets. Assembler.assembler uses almost no Python time at all. It's all in the reading/writing to disk and calculating MD5. Both already release the GIL while running.

The SSL patch would obviously be the best and easiest solution. It might also be possible to fork the SSL code and make it a standalone module installable through pip. Maybe you could ask if someone would be willing to help on r/usenet or something. It would have to be maintained, but it might not be too hard if fixes could be merged from the Python code. Other than that, I don't see any other solution than multiprocessing. It's the downloading itself that uses most of the CPU time, so optimizing other parts will have limited effect.
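For reference, the fork-versus-spawn trade-off above is visible straight from the stdlib; which start methods exist depends on the platform (this snippet is just an illustration, not from the thread):

```python
import multiprocessing as mp

# 'fork' (POSIX only) shares the parent's memory copy-on-write, so the
# ~35 MB of import overhead is not duplicated up front; 'spawn' (the only
# method on Windows) starts a fresh interpreter and re-imports only what
# the child process actually needs.
print(mp.get_all_start_methods())   # e.g. includes 'fork' on Linux, only 'spawn' on Windows
```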
-
I've made a version which makes it easier to take advantage of mypyc to read sockets. It doesn't replace the elusive SSL patch, but it may be a noticeable improvement. See the pull requests.
-
I can't believe it, but the trick I imagined to hack this actually seems to work. I made a sub-struct of the internal CPython one:

```c
typedef struct {
    PyObject_HEAD
    PyObject *Socket; /* weakref to socket on which we're layered */
    SSL *ssl;
} PySSLSocket;
```

And then made a C extension function with the core code:

```c
PyObject* test(PyObject* self, PyObject* Py_ssl_socket) {
    PySSLSocket *test = (PySSLSocket *)Py_ssl_socket;
    PyObject *dest = NULL;
    char *mem;
    size_t len = 1000;
    size_t count = 0;
    int retval;
    dest = PyBytes_FromStringAndSize(NULL, 1200);
    mem = PyBytes_AS_STRING(dest);
    retval = SSL_read_ex(test->ssl, mem, 1000, &count);
    printf("%s", mem);
    return dest;
}
```

And then trial code:

```python
import socket
import ssl
import time

import sabyenc3

hostname = 'eunews.frugalusenet.com'
context = ssl.create_default_context()
with socket.create_connection((hostname, 563)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        ssock.setblocking(False)
        time.sleep(1)
        sabyenc3.test(ssock._sslobj)
```

Prints:
Even though my code is compiled against a slightly different OpenSSL version than the one Python is using, it seems that doesn't matter as long as the structs stay the same. Let's see if I can make this into something workable.
-
SABnzbd 4.0, here we come? :)
-
Any progress on this?
-
So, I was thinking of making a new C extension called
-
I implemented a proof of concept that works on Windows and Ubuntu.
-
@puzzledsab we should indeed see if we need to tweak the way the loop works. Right now it works "the same" as non-SSL connections, so our tweaks would work for both types. However, I did already see it "work" by printing
-
If you want to test locally (not influenced by your Internet speed or Usenet plan), you can with nzbget's built-in NNTP/NNTPS server "nserv". I use the self-signed keys generated by SAB itself (self-signed, so set verification to Disabled).

So: running on port 9999. Within SABnzbd, add the NZB you will find inside data_nzbget_nserv.

Patched SABnzbd & sabyenc3, on my i3 laptop, 8 GB RAM, NVMe disk: Non-patched sabnzbdplus (with patched sabyenc in place, but not used, I guess). Higher speed with non-patched?
-
I've tried Windows on a 5800X (8 cores) with NServ and 800 KB segments. I found the NServ cache option helped (2-3x) over a RAM disk; my main disk is an NVMe 980 Pro, and I think disk alone would be quick enough. I've tried 'Article Cache Limit' as high as 16 GB but haven't seen it get over ~4 GB, so I set it down to 1 GB since it doesn't seem to limit the speed even though it fills. Higher connection counts seemed to hurt performance; I don't know why, maybe more thread lock contention?

10 GB download, 8 connections, unlocked_ssl_recv: Downloaded in 11 seconds at an average of 924.1 MB/s (start 1000 MB/s, end 800 MB/s)
10 GB download, 30 connections, unlocked_ssl_recv: Downloaded in 14 seconds at an average of 729.2 MB/s (pretty flat, 650-750)

My interpretation is that it does improve performance quite dramatically, but I think there could still be delays caused by lots of thread locks. NZBGet really suffered with a low connection count so I bumped it to 30 connections, AES128-SHA, and enabled the SkipWrite option (it can hit the same speed with it); it was getting around 1400-1800 MB/s. A yappi profile follows; I'm not quite sure what to make of it.
-
@puzzledsab a while back I worked on allocating a buffer once for each article and then use
-
I've forked feature/unlocked_ssl_recv and made some changes here: https://github.com/puzzledsab/sabnzbd/tree/feature/unlocked_ssl_recv
-
Made some progress on using a buffer per connection and loading the data into it, something we should have done long ago since for non-SSL connections it allows "whatever is available" to be loaded. Next step is to also use that buffer when decoding.
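The buffer-per-connection idea maps directly onto the stdlib: `socket.recv_into` with a reusable `memoryview` reads "whatever is available" without allocating a fresh bytes object per read. A minimal sketch, not SABnzbd's actual code; the `read_available` helper is hypothetical and a socketpair stands in for the news server:

```python
import socket

# One preallocated buffer per connection; recv_into fills it in place
# instead of allocating a new bytes object on every read.
BUF_SIZE = 65536

def read_available(conn, buf):
    """Read up to len(buf) bytes into the reusable buffer, return a copy
    of only the filled slice for downstream decoding."""
    n = conn.recv_into(buf)
    return bytes(buf[:n])

a, b = socket.socketpair()
buffer = memoryview(bytearray(BUF_SIZE))
a.sendall(b"222 0 <article@example> body follows\r\n")
chunk = read_available(b, buffer)
a.close()
b.close()
print(chunk)
```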
-
I managed to crash it quite a lot yesterday when playing with threads and pools of connections. The crash was at https://github.com/sabnzbd/sabyenc/blob/a43f01e4044ece02a2632e094f2998c92757339d/src/sabyenc3.cc#L716. I think that at the time, what I was doing on the Python side would have been triggering an error state (I think I wasn't reading the whole article before the next was being written). I've fixed it since, but will try to figure out what was going wrong. I think the whole method needs to more closely match https://github.com/python/cpython/blob/main/Modules/_ssl.c. The multiple_pools version of https://gist.github.com/mnightingale/06fb3d4341dc72c2d9fd290cf724cb3a seems to work well for lots of connections; I think I'm going to try it against a real server from a VPS to see what real speed it can do.
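On the Python side, `_ssl.c`'s WANT_READ/WANT_WRITE states surface as `ssl.SSLWantReadError` / `ssl.SSLWantWriteError` on non-blocking sockets, and a read loop has to treat them as "retry later", not as an error or as data. A sketch of that handling; `read_nonblocking` and the `FakeSSLSocket` stand-in are hypothetical names for illustration:

```python
import ssl

def read_nonblocking(ssock):
    """Return data, b'' on clean EOF, or None when the record layer
    needs another event-loop pass (WANT_READ / WANT_WRITE)."""
    try:
        return ssock.recv(16384)
    except (ssl.SSLWantReadError, ssl.SSLWantWriteError):
        return None            # not an error: retry after select/poll

class FakeSSLSocket:
    """Stand-in that first signals WANT_READ, then yields data."""
    def __init__(self):
        self._calls = 0
    def recv(self, n):
        self._calls += 1
        if self._calls == 1:
            raise ssl.SSLWantReadError()
        return b"article data"

sock = FakeSSLSocket()
first = read_nonblocking(sock)     # first pass: retry signal
second = read_nonblocking(sock)    # second pass: real data
```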
-
@Safihre : Do you have any plans for branches and releases going forward? Will the buffer branch be a 4.0 release, or do you plan to merge it all into one branch? I'm a bit uncertain which branch to use as a base for updates, or if I should just wait until things have settled a bit before I try to change anything. No point in creating more merge conflicts than necessary.
-
FYI: I am moving to a new house this week, so it's been a bit busy at home and will be for the next week or so!
-
@Safihre : how optimistic are you about the SSL patch being accepted? Eventually Sab won't be able to keep up with increasing speeds, because single-core speed seems to be hitting a ceiling. On the other hand, CPUs are getting more and more cores, so multiprocessing seems like the answer. I've been looking into it a bit, trying to figure out how the code could be rearranged to support it.

My first idea was to have a main process that manages the queue and finds articles ready for downloading. Then there would be x number of processes which the main process sends requests to and receives responses from. This way it could work almost like now, with the main thread handling the decoding and writing.

Unfortunately, it turns out that transferring large amounts of data between Python processes is quite slow when using pipes or queues on Windows. I measured it at about 60 MB/s on an Intel i5, with two cores using 100% CPU just for reading from and writing to the queues. The latency is low, though: sending 1 MB messages back and forth 1,000 times takes almost exactly as much time as sending 100 KB messages 10,000 times. In other words, it's probably possible, but the downloaded data can't be sent back. The download processes would have to decode and write the data as well, unless there's a way to speed this up that I don't know about.

I wanted to make an experimental branch just to see if it would actually help, but because of this setback I'll probably wait and see if adding the SSL patch continues to drag on. Anyway, I just wanted to put the idea out there and get some feedback. What do you think? Any other ideas?
Here's the program I used for testing queue speed: https://gist.github.com/puzzledsab/ad11755796694551e1e0ed46f6bac0d4
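On the "a way to speed this up" question: since Python 3.8 the stdlib has `multiprocessing.shared_memory`, which would let a worker process place decoded data in a buffer the parent can read directly, sending only the segment name over the (cheap, low-latency) queue. A single-process sketch of the API round-trip, not a claim about what SABnzbd should do:

```python
from multiprocessing import shared_memory

# A worker could write a decoded article here and send just the segment
# name plus length through the queue; the parent then reads the bytes
# without the pipe/queue copy that capped throughput at ~60 MB/s.
payload = b"decoded article bytes" * 1000

shm = shared_memory.SharedMemory(create=True, size=len(payload))
try:
    shm.buf[:len(payload)] = payload              # "worker" writes in place

    view = shared_memory.SharedMemory(name=shm.name)  # "parent" attaches by name
    received = bytes(view.buf[:len(payload)])
    view.close()
finally:
    shm.close()
    shm.unlink()                                  # free the segment
```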