Connection reset while transferring stages #509
A quick update. The responses with and without Thanks
Another update. The following changes would be sufficient to change the default parameters used by
I have set both. Then, when I test the toy case, there is no longer a connection resetting problem.
Hello @Weiming-Hu, let me try to reproduce it and I will get back to you. Can you let me know where your RMQ is running relative to EnTK, and which resource you are using for this toy example? Can you also let me know if your workaround is okay for now? Thank you
Thank you for the response. My workaround is currently working. I'm using NCAR Cheyenne. Below is my stack information:
I'm connecting to an RMQ instance prepared for me by @mturilli. I'm sorry, I don't know the details about the RMQ though.
I see two things that have to do with RabbitMQ heartbeats and connections closing. The first is that RMQ changed the default heartbeat from 580 seconds to 60 seconds after version 3.5.5; our RMQ version is 3.8.9. The second is that the RMQ server disconnects when a connection is idle for several minutes (here is a stackoverflow discussion). They also suggest setting the heartbeat to 0, as that disables it completely and keeps the connection active. Thank you.
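For reference, disabling the heartbeat at the pika level looks roughly like the sketch below; the environment variables are the same ones used in the scripts later in this thread, and the keyword is heartbeat in pika >= 1.0 (older 0.x releases used heartbeat_interval):

```python
import os
import pika

# heartbeat=0 disables AMQP heartbeats entirely, so the broker will not
# close the connection for missed heartbeats (pika >= 1.0 keyword name;
# older 0.x releases used heartbeat_interval).
credentials = pika.PlainCredentials(os.environ['RMQ_USERNAME'],
                                    os.environ['RMQ_PASSWORD'])
params = pika.ConnectionParameters(host=os.environ['RMQ_HOSTNAME'],
                                   port=int(os.environ['RMQ_PORT']),
                                   credentials=credentials,
                                   heartbeat=0)
connection = pika.BlockingConnection(params)
```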
Thank you. Is there a way to set the heartbeat of the EnTK task manager in the user code? Currently I'm changing the source code at
I believe the environment variables should be enough. In a simple example, I changed the default values of heartbeat and timeout, as you show above. I logged the heartbeat value that the appmanager picked up, and it was 0. Please remove the change you made in the appmanager, and try to run a small example of your workflow by only setting the two default values in the user script. I think it will work, but can you let me know as well? I'm still investigating how to solve it. I'll let you know when I have more.
Thanks for the reply. I'm afraid I'm not sure how to set the heartbeat in my user code. Could you give me an example without changing the source code? Thank you! |
This is how:
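A hypothetical sketch of the idea being described here: override the heartbeat and timeout defaults from the user script before the AppManager is constructed. The two environment variable names below are placeholders, not confirmed EnTK settings:

```python
import os

# Hypothetical stand-ins for the two EnTK defaults discussed above
# (heartbeat and timeout); the real names are whatever EnTK reads.
os.environ['RMQ_HEARTBEAT'] = '0'    # placeholder: disable heartbeats
os.environ['RMQ_TIMEOUT']   = '600'  # placeholder: connection timeout

# Import EnTK only after the overrides are in place, so the AppManager
# picks them up at construction time.
from radical.entk import AppManager

amgr = AppManager(hostname=os.environ['RMQ_HOSTNAME'],
                  port=int(os.environ['RMQ_PORT']))
```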
Ah. My bad. I misunderstood you. Thank you very much for the solution. I'm closing this. |
I am reopening this issue as it is not resolved cleanly. A workaround exists. |
@iparask do we know how we want to solve this beyond the initial workaround? Is there anything to discuss before you work on the PR?
FYI, I experienced a similar error:
This was a large run asking for 6120 concurrent tasks.
Srinivas also experiences this issue. I think we should disable pika's heartbeat timeout completely. There are several things that can affect this issue. The one that I think causes it is that we stopped acknowledging messages, but I am not very sure. I think I have a replicator somewhere. @lee212 how long did the tasks of your run execute for? It might make sense to send an ack every now and then to make sure that the connection stays open, but not often enough to nullify the performance gain that @lee212 introduced.
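One way to read that suggestion, as a rough sketch rather than the change that was eventually made: keep acknowledging in batches for throughput, but flush the batch whenever the channel has been quiet for too long, so the connection still sees some traffic. The thresholds below are illustrative:

```python
import time

ACK_BATCH   = 100   # ack at most every N messages (keeps the throughput gain)
ACK_MAX_GAP = 60    # ...but never stay silent for more than this many seconds

_pending = []
_last_ack = time.time()

def on_message(channel, method, properties, body):
    """Collect deliveries and acknowledge them in batches."""
    global _last_ack
    _pending.append(method.delivery_tag)

    # Flush either when the batch is full or when no ack has been sent for
    # too long -- whichever comes first.
    if len(_pending) >= ACK_BATCH or time.time() - _last_ack > ACK_MAX_GAP:
        # multiple=True acknowledges everything up to the newest tag at once
        channel.basic_ack(delivery_tag=_pending[-1], multiple=True)
        _pending.clear()
        _last_ack = time.time()
```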
I found the reason. If the channel stays idle for long enough, the connection is dropped by RMQ. This is also discussed here. I created a producer and a worker. The producer sleeps for x seconds and then sends a message. The worker prints the message.

Producer:

```python
#!/usr/bin/env python

import pika
import os
import time

# RMQ endpoint and credentials come from the environment
hostname = os.environ['RMQ_HOSTNAME']
password = os.environ['RMQ_PASSWORD']
port = int(os.environ['RMQ_PORT'])  # ensure the port is an int
username = os.environ['RMQ_USERNAME']

credentials = pika.PlainCredentials(username, password)
rmq_conn_params = pika.connection.ConnectionParameters(host=hostname, port=port,
                                                       credentials=credentials)

connection = pika.BlockingConnection(rmq_conn_params)
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

# Sleep for progressively longer intervals before each publish to find the
# idle time after which the connection gets dropped.
for i in range(0, 1800, 100):
    message = str(i)
    print(" [x] Sending %r" % message)
    time.sleep(i)
    channel.basic_publish(
        exchange='',
        routing_key='task_queue',
        body=message,
        properties=pika.BasicProperties(
            delivery_mode=2,  # make message persistent
        ))

connection.close()
```

Worker:

```python
#!/usr/bin/env python

import pika
import time
import os

hostname = os.environ['RMQ_HOSTNAME']
password = os.environ['RMQ_PASSWORD']
port = int(os.environ['RMQ_PORT'])  # ensure the port is an int
username = os.environ['RMQ_USERNAME']

credentials = pika.PlainCredentials(username, password)
rmq_conn_params = pika.connection.ConnectionParameters(host=hostname, port=port,
                                                       credentials=credentials)

connection = pika.BlockingConnection(rmq_conn_params)
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

print(' [*] Waiting for messages. To exit press CTRL+C')

def callback(ch, method, properties, body):
    # Print the delivery and acknowledge it immediately.
    sleep = body.decode()
    print(" [x] Received %r" % sleep)
    #time.sleep(int(sleep))
    print(" [x] Done")
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='task_queue', consumer_callback=callback)
channel.start_consuming()
```

This is the output I got:
There are several places where we share a pika channel; an example is here. There are three possible solutions.
I prefer solution 2. It may be the slowest one, but it is the safest.
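For context, one generic way to avoid publishing on a long-idle shared channel is to open a short-lived channel per publish and close it right away. A sketch of that pattern, not necessarily any of the numbered solutions above:

```python
import pika

def publish(connection, queue, body):
    """Publish one message on a channel opened just for this call.

    Avoids keeping a shared channel idle between publishes; the cost is an
    extra channel open/close round trip per message.
    """
    channel = connection.channel()
    try:
        channel.queue_declare(queue=queue, durable=True)
        channel.basic_publish(exchange='',
                              routing_key=queue,
                              body=body,
                              properties=pika.BasicProperties(delivery_mode=2))
    finally:
        if channel.is_open:
            channel.close()
```

Note that, as the next update shows, it is the whole connection that gets dropped, so a fresh channel alone would not be enough.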
Thanks Giannis, great job! Would solution 3 be as safe as 2 but without the potential performance issue?
Update: The whole connection closes. I changed the producer to handle the specific exception.

```python
#!/usr/bin/env python

import pika
import os
import time

hostname = os.environ['RMQ_HOSTNAME']
password = os.environ['RMQ_PASSWORD']
port = int(os.environ['RMQ_PORT'])  # ensure the port is an int
username = os.environ['RMQ_USERNAME']

credentials = pika.PlainCredentials(username, password)
rmq_conn_params = pika.connection.ConnectionParameters(host=hostname, port=port,
                                                       credentials=credentials)

connection = pika.BlockingConnection(rmq_conn_params)
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

for i in range(0, 1800, 100):
    message = str(i)
    print(" [x] Sending %r" % message)
    time.sleep(i)
    try:
        channel.basic_publish(exchange='', routing_key='task_queue', body=message,
                              properties=pika.BasicProperties(delivery_mode=2))
    except pika.exceptions.ConnectionClosed:
        # The broker dropped the connection while we were sleeping:
        # report the connection/channel state, reconnect and retry.
        print('Connection closed? %s' % connection.is_closed)
        print('Channel Status: %s' % channel.is_open)
        connection = pika.BlockingConnection(rmq_conn_params)
        channel = connection.channel()
        channel.basic_publish(exchange='', routing_key='task_queue', body=message,
                              properties=pika.BasicProperties(delivery_mode=2))

channel.queue_delete(queue='new_task')
connection.close()
```

and the output looks like (channel status is
I'll apply the change where we have a publish and open a PR. |
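The change being described would amount to something like the following wrapper around each publish call. This is a minimal sketch of the reconnect-and-retry pattern from the snippet above, not the exact code that went into the PR:

```python
import pika

def safe_publish(connection, channel, conn_params, queue, body):
    """Publish a message, reconnecting once if the connection was dropped.

    Returns the (possibly new) connection and channel so the caller can keep
    using them for subsequent publishes.
    """
    props = pika.BasicProperties(delivery_mode=2)  # persistent message
    try:
        channel.basic_publish(exchange='', routing_key=queue,
                              body=body, properties=props)
    except pika.exceptions.ConnectionClosed:
        # The broker dropped the idle connection: rebuild it and retry once.
        connection = pika.BlockingConnection(conn_params)
        channel = connection.channel()
        channel.basic_publish(exchange='', routing_key=queue,
                              body=body, properties=props)
    return connection, channel
```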
I'm testing on a small toy example. I have one executable that simply waits for 10 minutes. This workflow requires 1 core and 1 hour of resources. When the executable finishes waiting, EnTK gets stuck and never makes any progress.
The following error message can be found in the client-side logs:
I have studied this issue a bit and found a workaround.
I have added the following line to radical/entk/appman/appmanager.py:

heartbeat_interval=0

Please note this is to disable active termination of the connection from the server side. Once I do this, the program runs correctly.
However, I'm not sure about the impact of this. Please advise.
Thank you very much.