Getting around the 16 KB SSL speed limit #2352
-
I am very pessimistic about the SSL patch being accepted 😢 I have looked at using multiple processes but was always held back by the amount of synchronization that I think is required for it to work. Additionally, it spins up 2 Python processes, each with the base memory usage caused by the imports that we have. Since that's already something like 35 MB, we would need 70 MB of memory just to boot up. Personally, I think the most value we can get is from moving things to C extension(s) where we can lift the GIL.

EDIT: The biggest conceptual issue there is that we need to write the data in sequence so we can calculate the MD5. Otherwise we could write the data directly to disk. There are also things like: https://github.com/mypyc/mypyc
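The in-sequence constraint comes from hashing: an MD5 over the assembled file has to consume the bytes in file order. A running hash means whole articles need not be buffered, but out-of-order disk writes would force a second read pass to compute the digest. A minimal stdlib illustration (the article chunks here are made up, not from SABnzbd):

```python
import hashlib

# Articles must be fed to the hash in file order; a running MD5 avoids
# keeping the whole file in memory, but out-of-order writes would still
# require re-reading the file to compute the digest afterwards.
articles = [b"article-1 ", b"article-2 ", b"article-3"]

running = hashlib.md5()
for chunk in articles:          # in order: hash can be updated as we write
    running.update(chunk)

whole = hashlib.md5(b"".join(articles))
assert running.hexdigest() == whole.hexdigest()
```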
-
... or Rust (safer than C), combined with PyO3: https://github.com/PyO3/pyo3
Oh, wow, that is cool. Pure Python versus the compiled module ... pure magic!
-
Let me check: completely outside Python, with a C / Rust program, or with mypyc? Let SAB/Python set up the socket to the news server, and use that from a C / Rust / mypyc module: is that what we're talking about? So writing to a file on disk, or should the article go into memory?
-
Multiple processes would use more RAM, but it wouldn't double with two processes. Fork uses copy-on-write, so only data that gets changed after the fork uses extra RAM. I think we'd have to use the spawn method, though, because Windows can't use fork and using different methods on different OSes would be harder to maintain. On the other hand, spawn only loads what is necessary for the child process. Yes, there would be a lot of messaging back and forth. That's not necessarily a problem, but some calls that are blocking now might need to be made non-blocking. That could be hard.

The thing is, almost everything that can be done in C already is. Here are the top methods from a Yappi profile of Sab after downloading for a short while:

ttot = CPU time used by that method plus sub-methods (including C modules, I believe)

If my interpretation is correct, the downloader thread uses the most time by far, and it's mostly because of recv_chunk and reading sockets. Assembler.assembler uses almost no Python time at all. It's all in the reading/writing to disk and calculating MD5. Both already release the GIL while running.

The SSL patch would obviously be the best and easiest solution. It might also be possible to fork the SSL code and make it a standalone module installable through pip. Maybe you could ask if someone would be willing to help on r/usenet or something. It would have to be maintained, but it might not be too hard if fixes could be merged from the Python code. Other than that, I don't see any other solution than multiprocessing. It's the downloading itself that uses most of the CPU time, so optimizing other parts will have limited effect.
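For reference, the fork-versus-spawn trade-off above is visible straight from the stdlib; which start methods exist depends on the platform (this snippet is just an illustration, not from the thread):

```python
import multiprocessing as mp

# 'fork' (POSIX only) shares the parent's memory copy-on-write, so the
# ~35 MB of import overhead is not duplicated up front; 'spawn' (the only
# method on Windows) starts a fresh interpreter and re-imports only what
# the child process actually needs.
print(mp.get_all_start_methods())   # e.g. includes 'fork' on Linux, only 'spawn' on Windows
```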
-
I've made a version which makes it easier to take advantage of mypyc to read sockets. It doesn't replace the elusive SSL patch, but it may be a noticeable improvement. See the pull requests.
-
I can't believe it, but the trick I imagined to hack this actually seems to work. I made a sub-struct of the internal CPython one:

```c
typedef struct {
    PyObject_HEAD
    PyObject *Socket; /* weakref to socket on which we're layered */
    SSL *ssl;
} PySSLSocket;
```

And then made a C extension function with the core code:

```c
PyObject* test(PyObject* self, PyObject* Py_ssl_socket) {
    PySSLSocket *test = (PySSLSocket *)Py_ssl_socket;
    PyObject *dest = NULL;
    char *mem;
    size_t len = 1000;
    size_t count = 0;
    int retval;
    dest = PyBytes_FromStringAndSize(NULL, 1200);
    mem = PyBytes_AS_STRING(dest);
    retval = SSL_read_ex(test->ssl, mem, 1000, &count);
    printf("%s", mem);
    return dest;
}
```

And then trial code:

```python
import socket
import ssl
import time

import sabyenc3

hostname = 'eunews.frugalusenet.com'
context = ssl.create_default_context()
with socket.create_connection((hostname, 563)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        ssock.setblocking(False)
        time.sleep(1)
        sabyenc3.test(ssock._sslobj)
```

Prints:
Even though my code is compiled against a slightly different OpenSSL version than the one Python is using, it seems that doesn't matter as long as the structs stay the same. Let's see if I can make this into something workable.
-
SABnzbd 4.0, here we come? :)
-
Any progress on this?
-
So, I was thinking of making a new C extension called
-
I implemented a proof of concept that works on Windows and Ubuntu.
-
@puzzledsab we should indeed see if we need to tweak the way the loop works. Right now it works "the same" as non-SSL connections, so our tweaks would work for both types. However, I did already see it "work" by printing
-
If you want to test locally (not influenced by your Internet speed or Usenet plan), you can with nzbget's built-in NNTP/NNTPS server "nserv". I use the self-signed keys generated by SAB itself (self-signed, so set verification to Disabled).

So: running on port 9999. Within SABnzbd, add the NZB you will find inside data_nzbget_nserv.

Patched SABnzbd & sabyenc3, on my i3 laptop, 8 GB RAM, NVMe disk: Non-patched sabnzbdplus (with patched sabyenc in place, but not used, I guess). Higher speed with non-patched?
-
I've tried Windows on a 5800X (8 cores) with NServ and 800 KB segments. I found the NServ cache option helped (2-3x) over a RAM disk; my main disk is an NVMe 980 Pro, and I think disk alone would be quick enough. I've tried 'Article Cache Limit' as high as 16 GB but haven't seen it get over ~4 GB, so I set it down to 1 GB since it doesn't seem to limit the speed even though it fills. Higher connection counts seemed to hurt performance; I don't know why, maybe more thread lock contention?

10 GB download, 8 connections, unlocked_ssl_recv: Downloaded in 11 seconds at an average of 924.1 MB/s (start 1000 MB/s, end 800 MB/s)
10 GB download, 30 connections, unlocked_ssl_recv: Downloaded in 14 seconds at an average of 729.2 MB/s (pretty flat, 650-750)

My interpretation is that it does improve performance quite dramatically, but I think there could still be delays caused by lots of thread locks. NZBGet really suffered with a low connection count so I bumped it to 30 connections, AES128-SHA, and enabled the SkipWrite option (it can hit the same speed with it); it was getting around 1400-1800 MB/s. A yappi profile follows; I'm not quite sure what to make of it.
-
@puzzledsab a while back I worked on allocating a buffer once for each article and then use
-
I've forked feature/unlocked_ssl_recv and made some changes here: https://github.com/puzzledsab/sabnzbd/tree/feature/unlocked_ssl_recv
-
Made some progress on using a buffer per connection and loading the data into it, something we should have done long ago since for non-SSL connections it allows "whatever is available" to be loaded. Next step is to also use that buffer when decoding.
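The buffer-per-connection idea maps directly onto the stdlib: `socket.recv_into` with a reusable `memoryview` reads "whatever is available" without allocating a fresh bytes object per read. A minimal sketch, not SABnzbd's actual code; the `read_available` helper is hypothetical and a socketpair stands in for the news server:

```python
import socket

# One preallocated buffer per connection; recv_into fills it in place
# instead of allocating a new bytes object on every read.
BUF_SIZE = 65536

def read_available(conn, buf):
    """Read up to len(buf) bytes into the reusable buffer, return a copy
    of only the filled slice for downstream decoding."""
    n = conn.recv_into(buf)
    return bytes(buf[:n])

a, b = socket.socketpair()
buffer = memoryview(bytearray(BUF_SIZE))
a.sendall(b"222 0 <article@example> body follows\r\n")
chunk = read_available(b, buffer)
a.close()
b.close()
print(chunk)
```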
-
I managed to crash it quite a lot yesterday when playing with threads and pools of connections. The crash was at https://github.com/sabnzbd/sabyenc/blob/a43f01e4044ece02a2632e094f2998c92757339d/src/sabyenc3.cc#L716. I think that at the time, what I was doing on the Python side would have been triggering an error state (I think I wasn't reading the whole article before the next was being written). I've fixed it since, but will try to figure out what was going wrong. I think the whole method needs to more closely match https://github.com/python/cpython/blob/main/Modules/_ssl.c. The multiple_pools version of https://gist.github.com/mnightingale/06fb3d4341dc72c2d9fd290cf724cb3a seems to work well for lots of connections; I think I'm going to try it against a real server from a VPS to see what real speed it can do.
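On the Python side, `_ssl.c`'s WANT_READ/WANT_WRITE states surface as `ssl.SSLWantReadError` / `ssl.SSLWantWriteError` on non-blocking sockets, and a read loop has to treat them as "retry later", not as an error or as data. A sketch of that handling; `read_nonblocking` and the `FakeSSLSocket` stand-in are hypothetical names for illustration:

```python
import ssl

def read_nonblocking(ssock):
    """Return data, b'' on clean EOF, or None when the record layer
    needs another event-loop pass (WANT_READ / WANT_WRITE)."""
    try:
        return ssock.recv(16384)
    except (ssl.SSLWantReadError, ssl.SSLWantWriteError):
        return None            # not an error: retry after select/poll

class FakeSSLSocket:
    """Stand-in that first signals WANT_READ, then yields data."""
    def __init__(self):
        self._calls = 0
    def recv(self, n):
        self._calls += 1
        if self._calls == 1:
            raise ssl.SSLWantReadError()
        return b"article data"

sock = FakeSSLSocket()
first = read_nonblocking(sock)     # first pass: retry signal
second = read_nonblocking(sock)    # second pass: real data
```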
-
@Safihre : Do you have any plans for branches and releases going forward? Will the buffer branch be a 4.0 release, or do you plan to merge it all into one branch? I'm a bit uncertain which branch to use as a base for updates, or if I should just wait until things have settled a bit before I try to change anything. No point in creating more merge conflicts than necessary.
-
FYI: I am moving to a new house this week, so it's been a bit busy at home and will be for the next week or so!
-
@Safihre : how optimistic are you about the SSL patch being accepted? Eventually Sab won't be able to keep up with increasing speeds, because single-core speed seems to be hitting a ceiling. On the other hand, CPUs are getting more and more cores, so multiprocessing seems like the answer. I've been looking into it a bit, trying to figure out how the code could be rearranged to support it.

My first idea was to have a main process that manages the queue and finds articles ready for downloading. Then there would be x number of processes which the main process sends requests to and receives responses from. This way it could work almost like now, with the main thread handling the decoding and writing.

Unfortunately, it turns out that transferring large amounts of data between Python processes is quite slow when using pipes or queues on Windows. I measured it at about 60 MB/s on an Intel i5, with two cores using 100% CPU just for reading from and writing to the queues. The latency is low, though: sending 1 MB messages back and forth 1,000 times takes almost exactly as much time as sending 100 KB messages 10,000 times. In other words, it's probably possible, but the downloaded data can't be sent back. The download processes would have to decode and write the data as well, unless there's a way to speed this up that I don't know about.

I wanted to make an experimental branch just to see if it would actually help, but because of this setback I'll probably wait and see if adding the SSL patch continues to drag on. Anyway, I just wanted to put the idea out there and get some feedback. What do you think? Any other ideas?
Here's the program I used for testing queue speed: https://gist.github.com/puzzledsab/ad11755796694551e1e0ed46f6bac0d4
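On the "a way to speed this up" question: since Python 3.8 the stdlib has `multiprocessing.shared_memory`, which would let a worker process place decoded data in a buffer the parent can read directly, sending only the segment name over the (cheap, low-latency) queue. A single-process sketch of the API round-trip, not a claim about what SABnzbd should do:

```python
from multiprocessing import shared_memory

# A worker could write a decoded article here and send just the segment
# name plus length through the queue; the parent then reads the bytes
# without the pipe/queue copy that capped throughput at ~60 MB/s.
payload = b"decoded article bytes" * 1000

shm = shared_memory.SharedMemory(create=True, size=len(payload))
try:
    shm.buf[:len(payload)] = payload              # "worker" writes in place

    view = shared_memory.SharedMemory(name=shm.name)  # "parent" attaches by name
    received = bytes(view.buf[:len(payload)])
    view.close()
finally:
    shm.close()
    shm.unlink()                                  # free the segment
```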