
SpiffWorkflow node not sending messages #5195

Closed
jakubgs opened this issue May 21, 2024 · 28 comments

@jakubgs
Member

jakubgs commented May 21, 2024

Problem

The SpiffWorkflow node appears to have issues sending messages. Spiff shows errors like:

{
  "level": "ERROR",
  "message": "Error executing Service Task. Received error code ReadTimeout from service waku/SendMessage ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host=test.app.spiff.status.im, port=443): Read timed out. (read timeout=45). ",
  "loggerName": "spiffworkflow_backend",
  "processName": "MainProcess",
  "processID": 36,
  "threadName": "ThreadPoolExecutor-0_9",
  "threadID": 139861463910080,
  "timestamp": "2024-05-20T13:11:21.225Z"
}

But debugging this is difficult because as far as I can tell 98% of the node logs are errors:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep -v ERROR /var/log/docker/spiff-workflow-wakunode/docker.* | wc -l
267247
jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep ERROR /var/log/docker/spiff-workflow-wakunode/docker.* | wc -l 
17513290
@jakubgs jakubgs self-assigned this May 21, 2024
@jakubgs
Member Author

jakubgs commented May 21, 2024

The error reported from the UI is:

"message_text": "There was an exception running process model manage-procurement/procurement/requisition-order-management/request-goods-services for instance 3256.\nOriginal Error:\nWorkflowExecutionServiceError("Error executing Service Task. Received error code 'ReadTimeout' from service 'waku/SendMessage' ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host='test.app.spiff.status.im', port=443): Read timed out. (read timeout=45)")"

And if we look at the time when this was generated:

image

Nothing can be found in the backend-api logs, and the connector logs show only this:

image

Utterly useless.

@jakubgs
Member Author

jakubgs commented May 21, 2024

This is the only thing I can find that matches:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep '07:48:2[23]' /var/log/docker/spiff-workflow-wakunode/docker.log
ERROR[05-21|07:48:22.564|github.com/status-im/mvds/node/node.go:364]                                         error processing acks                    namespace=mvds error="state not found"
ERROR[05-21|07:48:23.272|github.com/status-im/mvds/node/node.go:446]                                         Error while removing sync state.         namespace=mvds error="state not found"
ERROR[05-21|07:48:23.272|github.com/status-im/mvds/node/node.go:364]                                         error processing acks                    namespace=mvds error="state not found"
ERROR[05-21|07:48:23.276|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|07:48:23.276|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"

@jakubgs
Member Author

jakubgs commented May 21, 2024

The error appears to be coming from here:

func (m *Messenger) sendDataSync(receiver state.PeerID, payload *datasyncproto.Payload) error {
	ctx := context.Background()
	if !payload.IsValid() {
		m.logger.Error("payload is invalid")
		return errors.New("payload is invalid")
	}

And more specifically from:
func (s *SegmentMessage) IsValid() bool {
	return s.SegmentsCount >= 2 || s.ParitySegmentsCount > 0
}

Assuming this is even the right error to be chasing, considering that most of the logs are errors.

@jakubgs
Member Author

jakubgs commented May 21, 2024

The logs are full of all kinds of other errors:

ERROR[05-21|08:14:26.725|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:26.725|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:30.276|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:30.276|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:30.496|github.com/status-im/status-go/protocol/pushnotificationclient/client.go:900]       could not retrieve message               error="record not found"
ERROR[05-21|08:14:36.318|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:14:39.127|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:39.127|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:39.668|github.com/status-im/status-go/protocol/pushnotificationclient/client.go:900]       could not retrieve message               error="record not found"

No idea how relevant or if it's just noise.

@jakubgs
Member Author

jakubgs commented May 21, 2024

There's also shit like this:

ERROR[05-21|08:15:22.523|rpc/service.go:200]                                                                 RPC method wakuext_addContact crashed: runtime error: slice bounds out of range [:2] with length 1
goroutine 322642907 [running]:
github.com/ethereum/go-ethereum/rpc.(*callback).call.func1()
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/service.go:199 +0x89
panic({0x278d760, 0xc04fc28558})
#011/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/status-im/status-go/api/multiformat.multibaseDecode({0x4864f90?, 0xc000f673b8?})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:242 +0x7f
github.com/status-im/status-go/api/multiformat.DeserializePublicKey({0x4864f90?, 0x483f0a0?}, {0x3145710, 0x1})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:50 +0x35
github.com/status-im/status-go/api/multiformat.DeserializeCompressedKey({0x4864f90?, 0x0?})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:87 +0x36
github.com/status-im/status-go/protocol/requests.ConvertCompressedToLegacyKey(...)
#011/go/src/github.com/status-im/status-go/protocol/requests/send_contact_request.go:35
github.com/status-im/status-go/protocol/requests.(*AddContact).HexID(0x1de8e002658780?)
#011/go/src/github.com/status-im/status-go/protocol/requests/add_contact.go:26 +0x34
github.com/status-im/status-go/protocol.(*Messenger).AddContact(0x2694a60?, {0x316b058, 0xc036a67d10}, 0xc036a67d60)
#011/go/src/github.com/status-im/status-go/protocol/messenger_contacts.go:650 +0x85
github.com/status-im/status-go/services/ext.(*PublicAPI).AddContact(...)
#011/go/src/github.com/status-im/status-go/services/ext/api.go:863
reflect.Value.call({0xc00054f740?, 0xc00052ae78?, 0x7fcd524912d8?}, {0x2902f65, 0x4}, {0xc036a67db0, 0x3, 0x0?})
#011/usr/local/go/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0xc00054f740?, 0xc00052ae78?, 0x16?}, {0xc036a67db0?, 0x1?, 0x1?})
#011/usr/local/go/src/reflect/value.go:370 +0xbc
github.com/ethereum/go-ethereum/rpc.(*callback).call(0xc0007bb740, {0x316b058?, 0xc036a67d10}, {0xc04fc28540, 0x12}, {0xc05c6035a8, 0x1, 0x4e2997?})
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/service.go:205 +0x3c5
github.com/ethereum/go-ethereum/rpc.(*handler).runMethod(0xc05385bcc0?, {0x316b058?, 0xc036a67d10?}, 0xc011c65180, 0x1?, {0xc05c6035a8?, 0xc028a11cb0?, 0x90?})
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:391 +0x45
github.com/ethereum/go-ethereum/rpc.(*handler).handleCall(0xc028a11b00, 0xc06f78ef30, 0xc011c65180)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:336 +0x239
github.com/ethereum/go-ethereum/rpc.(*handler).handleCallMsg(0xc028a11b00, 0x30?, 0xc011c65180)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:297 +0x237
github.com/ethereum/go-ethereum/rpc.(*handler).handleMsg.func1(0xc06f78ef30)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:138 +0x35
github.com/ethereum/go-ethereum/rpc.(*handler).startCallProc.func1()
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:225 +0xc5
created by github.com/ethereum/go-ethereum/rpc.(*handler).startCallProc
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:221 +0x8d

@jakubgs
Member Author

jakubgs commented May 21, 2024

But I sent a test message to someone using this command:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "chats": [
      {
        "id": "0x04a0dcd3fc989e897089587d1d227c91d5c3898f4e74d72c6c076a18113d7c469d469f276ac599b5ce6c147a708689c3340d5cdd8e48c8dd6148477fbaec5d7942",
        "name": "0x04a0dc",
        "description": "",
        "color": "",
        "emoji": "",

And it worked. So the message delivery works fine.

@jakubgs
Member Author

jakubgs commented May 21, 2024

But I did manage to get it to time out a few times:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received

Which suggests an issue does indeed exist. But the logs show me nothing useful.

@jakubgs
Member Author

jakubgs commented May 21, 2024

And the host is definitely not overloaded:

image

@jakubgs
Member Author

jakubgs commented May 21, 2024

What's interesting is that I stopped the connector container, which should be responsible for sending waku messages, and I still see garbage like this in the status-go node logs:

ERROR[05-21|08:31:36.328|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:31:50.984|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:31:50.984|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:36.328|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:32:52.261|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:52.261|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"

Which suggests both payload is invalid and could not find network: 10 are not caused by messages being sent.

@jakubgs
Member Author

jakubgs commented May 21, 2024

There are lots of different types of errors:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep -v peerID /var/log/docker/spiff-workflow-wakunode/docker.log | awk '/ERROR/{$1=$2=$3=""; a[$0]++}END{for (x in a) { printf "%d -> %s\n", a[x], x}}' | sort -h | tail    
344 ->    failed to sync filter error=EOF
371 ->    error syncing topics error=EOF
517 ->    failed to get curated communities from contract error="could not find network: 10"
987 ->    could not retrieve message error="record not found"
1150 ->    failed to handle an encrypted message site=handleEncryptionLayer error="failed to handle Encryption message: can't skip previous chain message keys: bad until: probably an out-of-order message that was deleted"
1353 ->    failed to send message error="payload is invalid"
1353 ->    payload is invalid
1675 ->    error processing acks namespace=mvds error="state not found"
1675 ->    Error while removing sync state. namespace=mvds error="state not found"
9662 ->    failed to decode message err="couldn't decrypt using asymmetric key: ecies: invalid message"

No idea which of these are relevant. The fact that message delivery sometimes works suggests that most of them mean nothing.

@jakubgs
Member Author

jakubgs commented May 21, 2024

Here are the same stats, but grouped by the source file the error originates from:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep -v peerID /var/log/docker/spiff-workflow-wakunode/docker.log | awk -F'[][|:]' '/ERROR/{a[$12]++}END{for (x in a) { printf "%d -> %s\n", a[x], x}}' | sort -h
19 -> github.com/status-im/status-go/protocol/messenger_raw_message_resend.go
27 -> github.com/status-im/status-go/protocol/messenger_mailserver_cycle.go
87 -> rpc/service.go
396 -> github.com/waku-org/go-waku/waku/v2/protocol/legacy_store/waku_store_client.go
524 -> github.com/status-im/status-go/protocol/messenger_curated_communities.go
999 -> github.com/status-im/status-go/protocol/pushnotificationclient/client.go
1208 -> github.com/status-im/status-go/protocol/common/message_sender.go
1366 -> github.com/status-im/status-go/protocol/messenger_peersyncing.go
1370 -> github.com/status-im/status-go/protocol/datasync/transport.go
1465 -> github.com/status-im/status-go/protocol/messenger_mailserver.go
3530 -> github.com/status-im/mvds/node/node.go
9815 -> github.com/status-im/status-go/wakuv2/common/message.go

@jakubgs
Member Author

jakubgs commented May 21, 2024

I can reproduce the timeout even with an 8 second limit:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:/docker/spiff-workflow % c --max-time 8 -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shQsUWc9yQMba7M3vyJxuw1wLK6jcrg3YMshrvazA3iiGM","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 8001 milliseconds with 0 bytes receive

But less frequently. It happens more often with a 7 second limit.
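
To probe this more systematically, a rough sketch like the one below could repeat the same JSON-RPC call with a fixed limit and count how often it is exceeded. It assumes the same endpoint and recipient key as the curl commands above; the 8 second limit and the attempt count are arbitrary choices.

import json
import time

import requests

# Same JSON-RPC call as the curl commands above, repeated to see how often it
# exceeds a fixed time limit. Endpoint and recipient key mirror those commands;
# the limit and number of attempts are arbitrary.
URL = "http://127.0.0.1:8545/"
BODY = {
    "jsonrpc": "2.0",
    "method": "wakuext_sendOneToOneMessage",
    "params": [{"id": "zQ3shQsUWc9yQMba7M3vyJxuw1wLK6jcrg3YMshrvazA3iiGM",
                "message": "TEST MESSAGE"}],
    "id": 1,
}

timeouts = 0
for attempt in range(20):
    start = time.monotonic()
    try:
        requests.post(URL, data=json.dumps(BODY),
                      headers={"Content-Type": "application/json"}, timeout=8)
        print(f"attempt {attempt}: {time.monotonic() - start:.2f}s")
    except requests.exceptions.Timeout:
        timeouts += 1
        print(f"attempt {attempt}: timed out after 8s")

print(f"{timeouts}/20 attempts timed out")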

@jakubgs
Member Author

jakubgs commented May 21, 2024

If we look at the connector code, we can see the request is made with requests.post() without any timeout specified:

            raw_response = requests.post(url, json.dumps(request_body), headers=headers)

https://github.com/status-im/spiffworkflow-connector/blob/9b01806811b295c52d50c8cb6ad95c1444fd208b/connectors/connector-waku/connector_waku/commands/sendMessage.py#L54

And according to the docs, that should mean an indefinite timeout:

By default, requests do not time out unless a timeout value is set explicitly. Without a timeout, your code may hang for minutes or more.

https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

Which would contradict my hypothesis. But there are other timeouts involved, like the frontend one.
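
For comparison, here is a minimal sketch of what an explicit timeout on the connector side could look like. The url, request_body, and headers values are hypothetical stand-ins for the connector's real ones, and the timeout values are only illustrative:

import json

import requests

# Hypothetical stand-ins for the connector's real url/body/headers.
url = "http://wakunode:8545/"
request_body = {"jsonrpc": "2.0", "method": "wakuext_sendOneToOneMessage",
                "params": [], "id": 1}
headers = {"Content-Type": "application/json"}

try:
    # Explicit (connect, read) timeout so the call fails fast instead of
    # hanging indefinitely when the node is unresponsive.
    raw_response = requests.post(url, json.dumps(request_body), headers=headers,
                                 timeout=(5, 30))
    raw_response.raise_for_status()
except requests.exceptions.Timeout as ex:
    print("Timed out talking to the node:", ex)
except requests.exceptions.HTTPError as ex:
    print("HTTP error from the node:", ex)

With a read timeout shorter than the backend's 45 seconds, the connector itself would report the failure instead of the backend's proxy call timing out first.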

@jakubgs
Member Author

jakubgs commented May 21, 2024

It appears the Read timed out. (read timeout=45) error comes from CONNECTOR_PROXY_COMMAND_TIMEOUT:

CONNECTOR_PROXY_COMMAND_TIMEOUT = 45

https://github.com/sartography/spiff-arena/blob/822e4f499b90736d53483d96f6bea08628f3d0e6/spiffworkflow-backend/src/spiffworkflow_backend/config/__init__.py#L13

Which is then used by requests.post():

proxied_response = requests.post(call_url, json=params, timeout=CONNECTOR_PROXY_COMMAND_TIMEOUT)

https://github.com/sartography/spiff-arena/blob/5840db56cfc471fff54930cf54880131f4586ea1/spiffworkflow-backend/src/spiffworkflow_backend/services/service_task_service.py#L201
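
Putting the two together: the backend calls the connector proxy with a 45 second read timeout, while the connector's own call to the waku node has no timeout at all. So if the node hangs, the backend is the first to give up, which matches the ReadTimeout error seen in the Spiff UI. A rough sketch of the backend side of the chain, with an illustrative URL:

import requests

# Backend side of the chain as configured above: a 45s timeout on the call to
# the connector proxy. The URL here is illustrative.
CONNECTOR_PROXY_COMMAND_TIMEOUT = 45
call_url = "https://test.app.spiff.status.im/connector-proxy/v1/do/waku/SendMessage"

try:
    proxied_response = requests.post(call_url, json={"example": "params"},
                                     timeout=CONNECTOR_PROXY_COMMAND_TIMEOUT)
except requests.exceptions.ReadTimeout as ex:
    # If the connector is still stuck in its own timeout-less requests.post()
    # to the waku node, this fires after 45s and surfaces as the
    # "Read timed out. (read timeout=45)" error in the Spiff UI.
    print(ex)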

@cammellos
Contributor

So far, it looks like the problem is CPU-related.
Usage is constantly around 100%, and the node seems to be stuck in a loop of fetching messages from store nodes (I can't tell yet whether it's a loop or there's just a lot of data; investigating).

@jakubgs
Member Author

jakubgs commented May 24, 2024

Andrea made a PR to reduce CPU usage by disabling fetching of messages:

But even that is not enough to prevent the status-go node for Spiff from abusing the CPU for no reason:

image

@jakubgs
Member Author

jakubgs commented May 24, 2024

I have added extra logging to the connector container code on the app-test env:

diff --git a/connectors/connector-waku/connector_waku/commands/sendMessage.py b/connectors/connector-waku/connector_waku/commands/sendMessage.py
index 09a81bf..4d59e32 100644
--- a/connectors/connector-waku/connector_waku/commands/sendMessage.py
+++ b/connectors/connector-waku/connector_waku/commands/sendMessage.py
@@ -51,7 +51,10 @@ class SendMessage:
         status_code = None
         successful = False
         try:
+            print('POST', url, request_body)
             raw_response = requests.post(url, json.dumps(request_body), headers=headers)
+            print('RESPONSE:', raw_response.status_code)
+            print('RESPONSE TEXT:', raw_response.text)
             raw_response.raise_for_status()
             status_code = raw_response.status_code
             parsed_response = json.loads(raw_response.text)
@@ -59,9 +62,11 @@ class SendMessage:
             if not self.response_has_error(response) and status_code == 200:
                 successful = True
         except HTTPError as ex:
+            print('HTTPError:', ex)
             status_code = ex.response.status_code
             response['error'] = str(ex)
         except Exception as ex:
+            print('Exception:', ex)
             response['error'] = str(ex)
             status_code = 500
         return (response, status_code, successful)

It's not pretty but it works. I want to see how it actually fails and what the timings are. So far I have been unable to reproduce a failure.

@jakubgs
Member Author

jakubgs commented May 24, 2024

I was unable to reproduce the issue using this process:
https://test.app.spiff.status.im/process-models/misc:test-waku-kb

So I'm going to leave reproducing it with new logging to Spiff testers.

@jakubgs
Member Author

jakubgs commented May 27, 2024

@jakubgs
Member Author

jakubgs commented May 27, 2024

What's fascinating is that if we look at https://test.app.spiff.status.im/i/637, which started at 08:18:57 and failed at 08:19:42, and check the connector logs around the start time, we see:

[2024-05-27 06:18:57 +0000] [143966] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:19:43 +0000] [143697] [DEBUG] POST /connector-proxy/v1/do/http/GetRequest
172.17.3.1 - - [27/May/2024:06:19:43 +0000] "POST /connector-proxy/v1/do/http/GetRequest HTTP/1.0" 200 176 "-" "python-requests/2.31.0"
[2024-05-27 06:19:44 +0000] [143937] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:20:28 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143966)
[2024-05-27 06:20:28 +0000] [143966] [INFO] Worker exiting (pid: 143966)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:20:28 +0000] [161550] [INFO] Booting worker with pid: 161550
[2024-05-27 06:20:29 +0000] [143697] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:20:47 +0000] [161550] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:21:15 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143937)
[2024-05-27 06:21:15 +0000] [143937] [INFO] Worker exiting (pid: 143937)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:21:15 +0000] [161578] [INFO] Booting worker with pid: 161578
[2024-05-27 06:21:15 +0000] [161578] [DEBUG] Ignoring EPIPE
[2024-05-27 06:21:48 +0000] [161550] [DEBUG] Ignoring EPIPE
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
RESPONSE: 200
RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}
[2024-05-27 06:22:00 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143697)
[2024-05-27 06:22:00 +0000] [143697] [INFO] Worker exiting (pid: 143697)

We can see 3 new interesting error messages:

  • [1] [CRITICAL] WORKER TIMEOUT (pid:143966)
  • [DEBUG] Ignoring EPIPE
  • {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}

Most interestingly, the task says it failed at 08:19:42, but we see no error at that time, only later.
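
For context, the [1] [CRITICAL] WORKER TIMEOUT lines look like gunicorn's log format: the master process kills a worker that has not finished its request within the configured worker timeout. Assuming the connector proxy is indeed served by gunicorn, that limit is set via the timeout setting, for example in a Python config file; the values below are only illustrative, not the current configuration:

# gunicorn.conf.py -- illustrative settings, assuming the connector proxy runs
# under gunicorn (the "[CRITICAL] WORKER TIMEOUT" lines match its log format).
timeout = 90           # seconds a worker may spend on a single request before being killed
graceful_timeout = 30  # seconds a worker gets to finish up after a restart signal
workers = 4            # number of worker processes

If a worker is stuck long enough to hit this limit, the backend's own 45 second timeout has already fired, which would fit the ReadTimeout errors above.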

@madhurrya

madhurrya commented May 27, 2024

This is something I have noticed in the past and noticed again today. Normally I open the Status app only when I want to test the messages. That can be once every few days, or sometimes once a week. I have noticed that most of the time when I open the Status app, the instances start to error for some time.
Yesterday and this morning I created about 100+ instances and none of them errored. I didn't have the Status app open at that time. But later today I opened the Status app and then created those instances (which I shared above) and they started getting the error.
This is something I have observed in the past as well. Just sharing what I have observed; not sure whether it's a valid point or not.

I ran these after starting the Status app, and some of them errored.
image

I ran 50 instances of this in the morning before opening the Status app without any error. And another 100+ instances yesterday without any error.

image

@jakubgs
Member Author

jakubgs commented May 27, 2024

The database is locked and Ignoring EPIPE errors might be a red herring, as I can find only 4 instances of each in the connector logs:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'database is locked' /var/log/docker/spiff-workflow-connector/docker.log | wc -l
4

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'Ignoring EPIPE' /var/log/docker/spiff-workflow-connector/docker.log | wc -l    
4

But the WORKER TIMEOUT certainly seems more prevalent:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'WORKER TIMEOUT' /var/log/docker/spiff-workflow-connector/docker.log | wc -l
10

@jakubgs
Member Author

jakubgs commented May 27, 2024

And indeed, if I search for instances of WORKER TIMEOUT, I find next to them an error like this:

POST http://wakunode:8545/
{'jsonrpc': '2.0',
 'method': 'wakuext_sendOneToOneMessage',
 'params': [{'id': 'zQ3shwDqFxGXHvbCeQxTURrtSTCH4UtPHaEDu5LgNMpsjtqdF',
   'message': 'There was an exception running process model manage-procurement/procurement/requisition-order-management/request-goods-services for instance 355.\nOriginal Error:\nWorkflowExecutionServiceError("Error executing Service Task. Received error code \'ReadTimeout\' from service \'waku/SendMessage\' ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host=\'test.app.spiff.status.im\', port=443): Read timed out. (read timeout=45)")'}],
 'id': 1}

Crucial part:

Error executing Service Task. Received error code \'ReadTimeout\' from service \'waku/SendMessage\'
::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties.
::: The original message: HTTPSConnectionPool(host=\'test.app.spiff.status.im\', port=443): Read timed out. (read timeout=45)")'}],

@jakubgs
Member Author

jakubgs commented May 27, 2024

Another instance, started at 06:24:30, failed at 06:25:15: https://test.app.spiff.status.im/i/642

[2024-05-27 06:24:28 +0000] [161614] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:24:30 +0000] [161706] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:25:12 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:161578)
[2024-05-27 06:25:12 +0000] [161578] [INFO] Worker exiting (pid: 161578)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:25:13 +0000] [161742] [INFO] Booting worker with pid: 161742
[2024-05-27 06:25:13 +0000] [161742] [DEBUG] Ignoring EPIPE
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
RESPONSE: 200
RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"result":{"chats":[{"id":" ... }]}}
[2024-05-27 06:25:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:161614)

We can see WORKER TIMEOUT and Ignoring EPIPE go hand in hand.

@jakubgs
Member Author

jakubgs commented May 27, 2024

Interestingly, neither the Exception nor the HTTPError lines were reached even once, which means the send_message() method in the connector never fails and the failure is somewhere higher up.

@jakubgs
Member Author

jakubgs commented May 27, 2024

Here's an example of failure: https://test.app.spiff.status.im/process-instances/misc:qa:send-waku-message/803
And here are the logs:

2024-05-27T15:48:40.801101+00:00 [9] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
2024-05-27T15:48:53.401283+00:00 SendMessage(message='=====Waku message test from test.app===', message_type='wakuext_sendOneToOneMessage', recipient=['zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ'])
2024-05-27T15:48:53.401293+00:00 execute START
2024-05-27T15:48:53.401300+00:00 send_message START
2024-05-27T15:48:53.401305+00:00 POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
2024-05-27T15:48:53.401309+00:00 RESPONSE: 200
2024-05-27T15:48:53.401313+00:00 RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}
2024-05-27T15:48:53.401318+00:00 send_message END
2024-05-27T15:48:53.401322+00:00 execute END
2024-05-27T15:49:40.869893+00:00 [2024-05-27 15:49:40 +0000] [9] [DEBUG] Ignoring EPIPE

There are some other examples of database is locked errors as well. But the interesting part is that it returns right after the request.

There is no delay, so why would it cause a timeout?

@jakubgs
Member Author

jakubgs commented May 27, 2024

Possibly relevant issue:

@jakubgs
Member Author

jakubgs commented Jun 4, 2024

After migrating the app-test environment to DigitalOcean in:

The problems have subsided, for now at least. So I'm closing this issue for now.

@jakubgs jakubgs closed this as completed Jun 4, 2024