Testing indications against a container sometimes fails #3022
Comments
For example, this is a test that failed:
However, this is clearly sending significant quantities of indications from an OpenPegasus container. The test repeated a send of 1000 indications 10 times successfully.
Or this test with 4000 indications sent:
We are still failing some of the time, but it is not clear at this point whether this is a pywbem issue or an OpenPegasus issue. The same issue appears to occur much more frequently in pywbemtools, where the test sends only a few indications before failing in the same way. I propose that we can better analyze the issue with that code to determine whether OpenPegasus or pywbem is causing the problem. Note that the comments below document that the issue can be seen in OpenPegasus: in the indication send and receive, the failure shows up as a return of 0 bytes received within a very short time (i.e. less than a millisecond).
It appears that there is some timing issue between the pywbemlistener and OpenPegasus such that indication delivery fails. However, there is another issue behind that problem in our test design. OpenPegasus includes the following two configuration parameters, minIndicationDeliveryRetryInterval and maxIndicationDeliveryRetryAttempts, to set the minimum time to the next retry and the maximum number of retries. The default values of these parameters in the container are: minIndicationDeliveryRetryInterval=30 and maxIndicationDeliveryRetryAttempts=3.
In effect, they are designed to delay for 30 seconds (delaying the failed indication and any subsequent indications) after a delivery failure, where a delivery failure is one call without any retry.
The following is a test I ran after changing minIndicationDeliveryRetryInterval from 30 seconds (the default) to 2 seconds. It indicates that we do keep retrying, and that if the minimum retry interval is less than the test program's maximum time value for failure (right now a few seconds), the client does continue to receive indications but with frequent retries. Note, however, that in the example below we still finally get a failure, but only after many cases where we go through the server's 2 second timeout and then receive indications again. In addition, it would appear that indications are being dropped at the time of the failures.
This, however, introduces other questions. The reason for the long delay is that the OpenPegasus authors thought that failures would be largely due to indication receivers being offline for some time, not the type of issue we see here. With the default delays, the indication will still be sent if the indication receiver is missing for a couple of minutes. Setting the delay to just 2-3 seconds means that the indication receiver can be missing for only a few seconds before Pegasus starts throwing away indications after 3 delayed retries.
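The trade-off described above can be put into rough numbers. The following is a minimal sketch, assuming that retries are spaced exactly at the minimum interval (OpenPegasus may wait longer). `retry_window_seconds` is a hypothetical helper for illustration and is not part of pywbem or OpenPegasus.

```python
def retry_window_seconds(min_retry_interval, max_retry_attempts):
    """Approximate lower bound on how long a listener can be unreachable
    before the last delayed retry fails and the indication is discarded.

    Assumption: each retry happens exactly min_retry_interval seconds
    after the previous failed attempt.
    """
    return min_retry_interval * max_retry_attempts


# Default container settings: 30 second interval, 3 retries.
print(retry_window_seconds(30, 3))  # 90
# Shortened interval used for local testing: 2 seconds, 3 retries.
print(retry_window_seconds(2, 3))   # 6
```

With the defaults the listener has roughly a minute and a half to come back, while with a 2 second interval it has only a few seconds.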
We have also noticed that:
Testing with pywbem examples. In a number of tests with the pywbem examples listen.py and send_indications.py/send_indications.sh we are able to send many thousands of indications without an issue, both with the senders on the same machine as the listener and with the senders in a container.
From container:
From local sender, send_indications.sh: 10,000 indications in 1:45 min (105 sec), or about 95 per sec.
We do not have Python on the OpenPegasus container, so we did not run that test. However, my conclusion at this point is that we do not lose indications when they are sent either locally or from the container using just the test tools. We clearly do have issues, however, when using OpenPegasus.
The following is an example of an indication that failed to send and was then resent successfully after the minimum retry interval. See the lines with "===================" for comments.
1700345666s-250704us: Http [11:139770333767424:HTTPConnection.cpp:401]: HTTPConnection::handleEnqueue - HTTP_MESSAGE
1700345666s-250773us: Http [11:139770333767424:HTTPConnection.cpp:1029]: HTTPConnection::_handleWriteEvent: Sending non-chunked data.
1700345666s-251289us: Http [11:139770334353152:HTTPConnection.cpp:1029]: HTTPConnection::_handleWriteEvent: Sending non-chunked data.
1700345666s-252267us: Http [11:139770333767424:HTTPConnection.cpp:401]: HTTPConnection::handleEnqueue - HTTP_MESSAGE
1700345666s-252322us: Http [11:139770333767424:HTTPConnection.cpp:1029]: HTTPConnection::_handleWriteEvent: Sending non-chunked data.
1700345668s-254432us: Http [11:139770333767424:HTTPConnection.cpp:401]: HTTPConnection::handleEnqueue - HTTP_MESSAGE
1700345668s-254568us: Http [11:139770333767424:HTTPConnection.cpp:1029]: HTTPConnection::_handleWriteEvent: Sending non-chunked data.
1700345668s-259700us: Http [11:139770333767424:HTTPConnection.cpp:401]: HTTPConnection::handleEnqueue - HTTP_MESSAGE
1700345668s-259829us: Http [11:139770333767424:HTTPConnection.cpp:1029]: HTTPConnection::_handleWriteEvent: Sending non-chunked data.
1700345668s-264619us: Http [11:139770333767424:HTTPConnection.cpp:401]: HTTPConnection::handleEnqueue - HTTP_MESSAGE
1700345668s-264845us: Http [11:139770333767424:HTTPConnection.cpp:1029]: HTTPConnection::_handleWriteEvent: Sending non-chunked data.
The actual lines in the trace that show the problem are:
two seconds later
Note that there is no indication of why the request for the response terminated in less than a millisecond with 0 bytes received and no error code.
OpenPegasus trace showing a case where an indication from the container fails with 0 bytes received, followed by a retry where the same indication is delivered successfully:
ACTION KS: Next I want to try an immediate retry when the indication response read operation in the server returns 0 bytes read, before putting the indication into the delayed retry queue.
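The proposed change in delivery behavior could look roughly like the following. This is only a sketch of the policy in Python, not OpenPegasus code: `deliver`, `send`, and `immediate_retries` are hypothetical names, and `send` is assumed to return the number of response bytes read, with 0 modeling the failure seen in the trace.

```python
from collections import deque


def deliver(indication, send, delayed_queue, immediate_retries=1):
    """Try to send an indication, retrying immediately on a zero-byte
    response before falling back to the delayed retry queue.

    Returns True if the indication was delivered, False if it was queued.
    """
    for _ in range(1 + immediate_retries):
        if send(indication) > 0:
            return True
    # All immediate attempts returned 0 bytes: defer to the retry timer.
    delayed_queue.append(indication)
    return False


# Simulated listener: first response read returns 0 bytes, second succeeds.
responses = iter([0, 128])
queue = deque()
ok = deliver("indication-1", lambda ind: next(responses), queue)
print(ok, len(queue))
```

If the immediate retry succeeds, the indication never enters the delayed queue, so a listener-side idle timeout of a few seconds would not be triggered by a single zero-byte response.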
Sending indications from the container sometimes fails by not sending the requested number of indications. This shows up in the end2end test but also when running the pegasusindicationtest.py example.
It typically sends fewer than the number requested but usually sends some, and the remainder appear to still be in the OpenPegasus output queue.
So far, this appears to be an issue with the container and not with testing where OpenPegasus and pywbem are in the same system.
It almost never sends zero indications before the issue occurs.
The following is an example of passing the test when sending 10 indications and failing the test when the indication count is 100, with 23 indications received and 77 still in the OpenPegasus output queue.
In general, when insufficient indications are sent from the container, the remainder can be found in a queue in OpenPegasus after the test using code similar to the following:
```python
# Assumes self.conn is a pywbem.WBEMConnection to the OpenPegasus server.
insts = self.conn.EnumerateInstances(
    'PG_ListenerDestinationQueue', namespace='root/PG_Internal')
# One instance per listener destination; report its queue counters.
for inst in insts:
    print('ListenerDestinationName={}:\n  QueueFullDropped={}, '
          'RetryAttemptsExceeded={}, InQueue={}'
          .format(inst['ListenerDestinationName'],
                  inst['QueueFullDroppedIndications'],
                  inst['RetryAttemptsExceededIndications'],
                  inst['CurrentIndications']))
```
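For a test that needs a pass/fail decision rather than printed output, the same properties can be reduced to a small summary. The following is a minimal sketch; `pending_indications` is a hypothetical helper, and the sample data uses plain dicts as stand-ins for the instances returned by EnumerateInstances (it assumes the property values compare like integers).

```python
def pending_indications(insts):
    """Return {destination name: queued count} for every destination
    that still has undelivered indications in its queue."""
    return {
        inst['ListenerDestinationName']: inst['CurrentIndications']
        for inst in insts
        if inst['CurrentIndications'] > 0
    }


# Stand-in data modeling the failed run with 77 indications still queued.
sample = [
    {'ListenerDestinationName': 'dest1', 'CurrentIndications': 77},
    {'ListenerDestinationName': 'dest2', 'CurrentIndications': 0},
]
print(pending_indications(sample))  # {'dest1': 77}
```

An empty result would mean nothing is waiting in the server, so any missing indications were really dropped rather than delayed.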
The conclusion reached is that there is some sort of occasional timing issue between the listener and the WBEM server that causes the server to sometimes get a zero-length completion to its request for the listener response. This appears to be independent of whether the listener and sender are in the same environment or in containers, etc. In pywbem this might happen for about 1 indication in 4000 when the server is sending indications without delay. However, the indications are not being lost.
The OpenPegasus indication sender attempts to send an indication once and then requeues it for resend based on a retry timer. It does this the number of times defined by configuration variables. The pywbem test environment checks the time between received indications (ex. 5 sec.), but the WBEM server delays for about 30 seconds before retrying the indication. Therefore our tests fail, thinking that an indication was lost. Note that the example code in the examples directory actually confirms that the indications are still in the OpenPegasus delayed queue, to be sent later.
We have not figured out the cause of the zero-length response, but we propose that for now we document the issue in the pywbem documentation (ex. troubleshooting) as follows:
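The mismatch between the test's idle timeout and the server's retry delay can be illustrated with a small simulation. This is a sketch only: `first_stall` is a hypothetical helper, and the arrival times are made-up values that model one failed delivery followed by a retry about 30 seconds later.

```python
def first_stall(arrival_times, idle_timeout):
    """Return the index of the first indication that arrived after a gap
    longer than idle_timeout seconds, or None if there is no such gap."""
    for i in range(1, len(arrival_times)):
        if arrival_times[i] - arrival_times[i - 1] > idle_timeout:
            return i
    return None


# Hypothetical arrival times in seconds: three fast deliveries, then the
# server waits out its 30 second retry interval before continuing.
arrivals = [0.0, 0.01, 0.02, 30.05, 30.06]

print(first_stall(arrivals, idle_timeout=5))   # 3
print(first_stall(arrivals, idle_timeout=35))  # None
```

A listener that gives up after 5 idle seconds reports a failure at the fourth indication, while one that waits longer than the retry interval sees all five.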
=========================== proposed issue resolution
If there is a case where the pywbem listener appears to be losing indications
sent from at least the OpenPegasus server, this may be due to timeout/retry
settings issues between the WBEM server and the pywbem listener.
OpenPegasus has two configuration settings that can impact sending indications:
maxIndicationDeliveryRetryAttempts (Default: 3)
If set to a positive integer, this value defines the number of times the
indication service will enable the reliableIndication feature
and try to deliver an indication to a particular listener destination.
This does not affect the original delivery attempt. A value of 0
disables the reliable indication feature completely, and cimserver will
deliver the indication once.
minIndicationDeliveryRetryInterval (Default: 30 seconds).
If set to a positive integer, this value defines the minimal time interval
in seconds for the indication service to wait before retrying to deliver an
indication to a listener destination that previously failed. Cimserver may
take longer due to QoS or other processing.
Together these configuration variables try to ensure that indications will be
delivered. If there is an issue sending any single indication, it is put into
a delay queue for the destination along with any succeeding indications that
are created for the same destination. After the timeout defined by the
configuration variable minIndicationDeliveryRetryInterval, OpenPegasus
attempts to send the indication again. It repeats this process the number of
times determined by the maxIndicationDeliveryRetryAttempts configuration
variable.
Thus, by default, after receiving anything but a successful response from the
listener, OpenPegasus waits 30 seconds and retries. It repeats this process
3 times before discarding the indication.
As noted in pywbem issue #3022, tests
with OpenPegasus under high indication loading have indicated that occasionally
the WBEM server receives a zero-length response immediately after sending the
indication. This is treated as
an error and the retry process is started. If there are any timeouts or time
checks in the listener (ex. very short times in tests between received
indications), these timeouts could be interpreted as lost indications when, in
fact, OpenPegasus will wait 30 seconds and then retry the indication that the
server thought had failed.
This was the case with testing against local OpenPegasus Docker containers where
the WBEM server was requested to deliver a fixed number of indications as fast
as possible but the test listener set a timeout of 3 seconds with no indication
received to indicate that the delivery had stopped before all requested
indications had been delivered. However, the server was simply waiting out its
30 second delay before resending the failed indications. Setting the
OpenPegasus WBEM server to different timeout times can correct this problem
(ex. delay 2 seconds, retry attempts 5 for local testing).
The OpenPegasus configuration variables can be set with the OpenPegasus
cimconfig
command line utility either when the server is running or stopped.
See the OpenPegasus documentation or OpenPegasus
cimconfig --help
for detailed information on the command parameters for setting these configuration
variables.
============================
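As an illustration of the local-testing settings mentioned in the proposed text (delay 2 seconds, retry attempts 5), the following sketch builds the corresponding cimconfig command lines without running them. It assumes the `cimconfig -s name=value` form, with `-p` selecting the planned rather than the current configuration; `cimconfig_set_command` is a hypothetical helper, and `cimconfig --help` remains the authoritative reference for the options available in a given OpenPegasus build.

```python
import shlex


def cimconfig_set_command(name, value, planned=False):
    """Build the argument list for setting one OpenPegasus config variable.

    Assumption: 'cimconfig -s name=value' sets the current value and
    adding '-p' sets the planned value used at the next server start.
    """
    args = ['cimconfig', '-s', '{}={}'.format(name, value)]
    if planned:
        args.append('-p')
    return args


# Example values for local testing taken from the text above.
settings = {
    'minIndicationDeliveryRetryInterval': 2,
    'maxIndicationDeliveryRetryAttempts': 5,
}
commands = [cimconfig_set_command(n, v) for n, v in settings.items()]
for cmd in commands:
    print(shlex.join(cmd))
```

The argument lists can be passed to `subprocess.run` inside the container once the options have been checked against the installed cimconfig.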