
Session threads left running after Client.stop, occurs in long running process #68

Closed
Teriks opened this issue Apr 30, 2018 · 13 comments

@Teriks
Contributor

Teriks commented Apr 30, 2018

Threads created in Session.start() appear to be left running in a loop after stopping a client that has been running for some time.

On CTRL-C this causes the program to hang, because the threads are not daemon threads and keep the process alive.

It appears that new threads might be getting created before the previous ones have exited, when the session reconnects while running?

In my project I have added something like this for debugging:

client.start()
client.idle()  # blocks until a signal (e.g. CTRL-C) is received
time.sleep(5)  # give the worker threads a chance to exit

# print any thread that is still alive after shutdown
for i in (x for x in threading.enumerate() if x.name != 'MainThread'):
    print(i.name)

Occasionally, after a long run, I get this output on CTRL-C (I put some debug prints in Syncer.remove because I suspected a deadlock there, but that is not it):

Syncer.remove lock before
Syncer.remove cls.sync(client) after
Syncer.remove cls.stop() before
Syncer.stop cls.event.set() before
Syncer.stop cls.event.set() after
Syncer.remove cls.stop() after
Syncer.remove lock after
NetWorker#1
RecvThread
NextSaltThread
PingThread

... program hangs ...
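The leftover-thread check above can be made a little more precise. This is a hypothetical helper, not part of pyrogram: since names like RecvThread are unique per session, the same name appearing twice suggests an old session's threads were never joined.

```python
# Hypothetical helper: detect leaked threads by counting duplicate names.
import threading
from collections import Counter

def duplicate_thread_names():
    """Return thread names that appear more than once (likely leaks)."""
    counts = Counter(t.name for t in threading.enumerate()
                     if t.name != "MainThread")
    return sorted(name for name, n in counts.items() if n > 1)
```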

I think the problem might be in Session.start() itself but I am not certain.

@Teriks
Contributor Author

Teriks commented Apr 30, 2018

This is the bulk of what my program is doing (cut down for testing purposes).

I am experiencing this on Windows.

import threading
import time

from pyrogram import Client, RawUpdateHandler, Message
from pyrogram.api import types as api_types
from pyrogram.client import message_parser

import uuid

app = Client("pyrogram.session", api_id=999999, api_hash="hash_here")


def update_handler(client, update, users: dict, chats: dict):
    if not isinstance(update, (api_types.UpdateNewChannelMessage, api_types.UpdateNewMessage)):
        return

    update_message: api_types.Message = update.message

    parsed_message: Message = message_parser.parse_message(
        client,
        update.message,
        users,
        chats
    )

    if isinstance(update_message.media, (api_types.MessageMediaPhoto, api_types.MessageMediaDocument)):
        m_name = str(uuid.uuid4())
        print('Download media: downloads/'+m_name)
        client.download_media(parsed_message, file_name=m_name, block=False)

    print(update_message.message)


app.add_handler(RawUpdateHandler(update_handler))

app.start()
app.idle()  # blocks until CTRL-C

time.sleep(5)  # give the worker threads a chance to exit

for i in (x for x in threading.enumerate() if x.name != 'MainThread'):
    print(i.name)

@delivrance
Member

Hi, thanks for the detailed report!

It appears that new threads might be getting created before the previous ones have exited, when the session reconnects while running?

Hm, currently this can be true only for the NetWorker thread, because Session.stop() isn't waiting for it to finish, and it would be a really rare case. I think de33247 is going to fix this. Every other thread was already being join()ed.

Regarding the thread count: I have a welcome bot running 24/7 on a Linux box and I often monitor its resource usage, but the thread count has never exceeded the maximum expected amount. I also tried stopping it with CTRL+C and restarting it successfully just now, after it had been online for days, so this could really be a Windows-only problem.

Having said this, it looks like Session.start() somehow got called twice in a row and the references to the previous threads were lost, but I can't see any case where this can happen. Did you notice any weird behaviour when handling updates? These extra threads (which, again, I've seen none of on Linux, so this might be a Windows-only issue) should mess with the queues.

About your snippet: I tried running it on Windows, simulating connection drops multiple times on a rather busy account, but I wasn't able to replicate the issue.
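The join-on-stop pattern described above can be sketched generically (this is a minimal illustration with made-up names, not pyrogram's actual code): stop() must both signal the worker loop to exit and join() the thread, otherwise a quick stop/start cycle can spawn a second worker while the first is still alive.

```python
# Minimal sketch of the join-on-stop pattern (hypothetical NetWorker class).
import queue
import threading

class NetWorker:
    def __init__(self):
        self.queue = queue.Queue()
        self.thread = None

    def start(self):
        self.thread = threading.Thread(target=self._loop, name="NetWorker#1")
        self.thread.start()

    def _loop(self):
        while True:
            item = self.queue.get()
            if item is None:  # sentinel value: exit the loop
                break
            # real code would process the network packet here

    def stop(self):
        self.queue.put(None)  # signal the loop to exit...
        self.thread.join()    # ...and wait until it actually has
        self.thread = None
```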

@Teriks
Contributor Author

Teriks commented May 1, 2018

I begin to get "AUTH_BYTES_INVALID" API errors from the download worker; or rather, the issue starts with that error, and possibly with other API exceptions too.

I've put this at the top of my update handler, both in my project and in the test program I gave:

print("Update thread: " + threading.current_thread().name)
print("Other threads: " +
      (',\n' + ' ' * 15).join(x.name for x in threading.enumerate() if x.name != 'MainThread'))

After a while I get the following:

A bunch of good updates appear where the active threads look normal, and then some duplicates start showing up:

Update thread: UpdateWorker#4
Other threads: NetWorker#1,
               RecvThread,
               NextSaltThread,
               PingThread,
               UpdatesWorker#1,
               DownloadWorker#1,
               DownloadWorker#2,
               DownloadWorker#3,
               DownloadWorker#4,
               UpdateWorker#1,
               UpdateWorker#2,
               UpdateWorker#3,
               UpdateWorker#4,
               Syncer
- > MESSAGE HERE
[2018-04-30 19:09:59.077] client.py:734 ERROR: [520 Unknown error]: {
    "_": "types.RpcError",
    "error_code": 400,
    "error_message": "AUTH_BYTES_INVALID"
}
Traceback (most recent call last):
  File "j:\projects\python\pyrogram\pyrogram\client\client.py", line 726, in download_worker
    progress=progress
  File "j:\projects\python\pyrogram\pyrogram\client\client.py", line 2704, in get_file
    bytes=exported_auth.bytes
  File "j:\projects\python\pyrogram\pyrogram\session\session.py", line 409, in send
    return self._send(data)
  File "j:\projects\python\pyrogram\pyrogram\session\session.py", line 396, in _send
    Error.raise_it(result, type(data))
  File "j:\projects\python\pyrogram\pyrogram\api\errors\error.py", line 63, in raise_it
    raise UnknownError(x=rpc_error, query_type=query_type)
pyrogram.api.errors.error.UnknownError: [520 Unknown error]: {
    "_": "types.RpcError",
    "error_code": 400,
    "error_message": "AUTH_BYTES_INVALID"
}
Update thread: UpdateWorker#3
Other threads: NetWorker#1,
               RecvThread,
               NextSaltThread,
               PingThread,
               UpdatesWorker#1,
               DownloadWorker#1,
               DownloadWorker#2,
               DownloadWorker#3,
               DownloadWorker#4,
               UpdateWorker#1,
               UpdateWorker#2,
               UpdateWorker#3,
               UpdateWorker#4,
               Syncer,
               NetWorker#1,
               RecvThread,
               NextSaltThread,
               PingThread
- > MESSAGE HERE

In the above, it looks like NextSaltThread, PingThread, RecvThread, and NetWorker#1 get leaked.

I get "AUTH_BYTES_INVALID" pretty frequently after this occurs, and each time it happens more threads are spawned.

I do have one lock on a threading.Lock, followed by an inter-process lock, inside my update handler; they are taken when adding message content to a full-text index. That could be aggravating this; I'm still narrowing it down, since it takes a while to occur.

For now I've removed my synchronization code to see if that's what's causing the threads to hang; if so, I can probably hoist the indexing out into a background thread instead of locking in the update handler.
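The hoisting idea mentioned above could look roughly like this (a hypothetical sketch, not actual project code): instead of taking the locks inside the update handler, and thereby blocking pyrogram's UpdateWorker threads, the handler enqueues the text and a single background thread does the locking and indexing.

```python
# Hypothetical sketch: index in a background thread instead of the handler.
import queue
import threading

index_queue = queue.Queue()
indexed = []  # stands in for the real full-text index

def index_worker():
    while True:
        text = index_queue.get()
        if text is None:  # sentinel: shut the worker down
            break
        # real code would take the threading.Lock and the
        # inter-process lock here, away from the update handler
        indexed.append(text)

worker = threading.Thread(target=index_worker, name="IndexWorker")
worker.start()

def on_update_text(text):
    # Called from pyrogram's update threads: enqueue and return at once.
    index_queue.put(text)
```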

@Teriks
Contributor Author

Teriks commented May 1, 2018

Quick update.

While writing the above post I had both the test program attached to this issue and my project running, with my locking logic taken out.

Both appear to have started exhibiting the same behavior after "AUTH_BYTES_INVALID" was encountered.

I am currently running your master branch, installed in development mode.

@Teriks
Contributor Author

Teriks commented May 1, 2018

I think I see what might be happening: a pyrogram.api.errors.error.UnknownError exception is raised from here and not handled.

The session created just before it is never stopped, which leaks its threads.

session.send(
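The failure mode described above maps onto a classic try/except-shaped fix. This is a generic sketch with made-up names, not pyrogram's code: if send() raises after the session has started its threads, stop() must still run, otherwise the threads outlive the failed call.

```python
# Generic sketch of the leak and its fix (hypothetical FakeSession class).
class FakeSession:
    def __init__(self):
        self.started = False  # real code spawns threads in start()

    def start(self):
        self.started = True

    def send(self, data):
        # stands in for the call that raised UnknownError
        raise RuntimeError("AUTH_BYTES_INVALID")

    def stop(self):
        self.started = False  # real code joins the threads here

def exported_send(session, data):
    session.start()
    try:
        return session.send(data)
    except Exception:
        session.stop()  # the missing cleanup: no leaked threads
        raise
```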

@delivrance
Copy link
Member

Nice and detailed explanations, really appreciated!

I get "AUTH_BYTES_INVALID" pretty frequently after this occurs, and each time it happens more threads are spawned.

Both appear to have started exhibiting the same behavior after "AUTH_BYTES_INVALID" was encountered.

I think I see what might be happening: a pyrogram.api.errors.error.UnknownError exception is raised from here and not handled. The session created just before it is never stopped, which leaks its threads.

Good, you got it! I now do really think this is the cause of the issue.

Can I ask how many media files you are downloading in a given time frame, say, 5 minutes? This is just to know how often new media download sessions are started.

#40 fixes some other session-related problems, but I think it can be applied to this as well, and that would also make it possible to completely avoid AUTH_BYTES_INVALID errors.

I haven't merged it yet because, with the current implementation of Session, it would create quite a lot of threads for a single client (there are 5 main data centers plus some CDNs, and there could be a session for each of them).
This thread count problem, though, can be solved/contained as soon as we manage to convert Session to fully use asyncio; until then, I'm going to merge it just to (hopefully) solve this issue.

@delivrance
Member

Hi, this issue should be fixed now. To test it right away, simply install the master branch version, but be aware that there have been quite a few other changes as well. Relevant commit: 83febf2

@Teriks
Contributor Author

Teriks commented May 1, 2018

Can I ask how many media files you are downloading in a given time frame, say, 5 minutes? This is just to know how often new media download sessions are started.

Just to answer this: about 1 file per minute over the 80-minute span I measured midday.

It seems like caching the session object would definitely help with flood waits and other pitfalls; I'll try it out and see if anything crops up.

@delivrance
Member

delivrance commented May 1, 2018

The real cause of AUTH_BYTES_INVALID was concurrent access to get_file(): one download worker (you had more than one) creates a new auth key, and the next worker would do the same if they were running almost in parallel, invalidating all the previous keys and causing the import to fail.

Now you can safely use more than one download worker. I've tested against https://t.me/bestmemes with 10 download workers and I was able to download all the media (~8k files / 815 MB) in about 5 minutes.
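The fix described above can be sketched generically (hypothetical class and names, not pyrogram's implementation): the export/import of auth bytes happens at most once per data center, behind a lock, so a second download worker reuses the cached session instead of invalidating the first worker's key.

```python
# Hypothetical sketch: one shared media session per DC, created under a lock.
import threading

class MediaSessionCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def get(self, dc_id, make_session):
        with self._lock:
            session = self._sessions.get(dc_id)
            if session is None:
                # only one thread ever gets here for a given dc_id,
                # so the auth key is exported/imported exactly once
                session = make_session(dc_id)
                self._sessions[dc_id] = session
            return session
```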

@Teriks
Contributor Author

Teriks commented May 1, 2018

I see, that makes sense.

I did notice, though, that on master the session might still manage to leak threads if this throws something (since the try block was taken out):

session.send(

I'm not sure if that's still relevant given what it could possibly throw now, but it might be a good idea to shut the session down there.

@delivrance
Member

Hmm, right, but it has become really unlikely to fail now. For that to happen again you would need to restart the whole Client hundreds of times in a short period (which you never do, unless you are deliberately stress testing things). #40 was getting flood waits after about 400 downloads, which means Export/Import was called 400 times; now it is called only once per Client run, per session.

I think that adding the sessions to media_sessions before even starting them would allow the Client to raise an error in case something really bad happens, like Telegram going down at the same time Pyrogram wants to start a new media session (any reconnections after a session has successfully connected are already handled).
This can be done right now because it's simple, and I think it works and won't leave any zombie threads (the worst-case scenario is Pyrogram throwing some error). Any other solution to make this more resilient should be proposed in a separate issue/enhancement, mainly because it requires quite some extra work, and because this issue is now unlikely to ever happen again.

What do you think?

@Teriks
Contributor Author

Teriks commented May 1, 2018

I think things might get a little complicated if a session that has not started gets into the cache (assuming the program is going to continue running after that), i.e. if starting the session fails but it is already cached and not running.

You would have to attempt to start it when it is next pulled out of the cache, and also take into account which sessions are not started when shutting them all down later on.

It could be easier to guarantee that only started sessions exist in the cache, since they can then be re-used immediately without prior setup.

Caching them immediately after they start successfully, and before they are given any tasks, could work, since the session will clean itself up when it fails to start.
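The "cache only after a successful start" ordering can be sketched as follows (made-up names, an illustration rather than pyrogram's code): a session that fails to start never enters the cache, and stopping the client discards every cached session so stale ones aren't reused.

```python
# Hypothetical sketch: sessions are cached only after start() succeeds,
# and the cache is cleared when everything is shut down.
class MediaSessions:
    def __init__(self):
        self.sessions = {}

    def get(self, dc_id, make_session):
        session = self.sessions.get(dc_id)
        if session is None:
            session = make_session(dc_id)
            session.start()  # may raise; the session cleans itself up
            self.sessions[dc_id] = session  # cached only once started
        return session

    def stop_all(self):
        for session in self.sessions.values():
            session.stop()
        self.sessions.clear()  # forget closed sessions
```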

delivrance added a commit that referenced this issue May 1, 2018
@EriHoss is right, the session can know how to clean itself in
case for some reason it fails to start, hence it makes much more
sense to cache sessions right after they start. Also, clear the
media_sessions dict when stopping to get rid of closed sessions.
@delivrance
Member

Caching them immediately after they start successfully, and before they are given any tasks, could work, since the session will clean itself up when it fails to start.

That's correct and makes much more sense, thanks! ^^

You might have to attempt to start it when it eventually gets pulled out of the cache next, and also take into account which sessions are not started when shutting them all down later on.

Regarding this: I forgot to clear the session cache dict (f29b8bb fixes this). The idea is to discard every media session when you stop the client and recreate them on the next start. Media sessions are also cheap enough now, and their keys can only be used to download/upload media, nothing else.

So, I think that keeping the sessions online and cleaning up when stopping the client is enough for the current implementation. As I said, ideas on how to improve this even further are better kept in a separate issue as an enhancement, but if you have other concerns let me know. Oh, and thanks again for your detailed bug reports!
