
SpiffWorkflow node not sending messages #5195

Closed
jakubgs opened this issue May 21, 2024 · 28 comments

@jakubgs
Member

jakubgs commented May 21, 2024

Problem

The SpiffWorkflow node appears to have issues sending messages. Spiff shows errors like:

{
  "level": "ERROR",
  "message": "Error executing Service Task. Received error code ReadTimeout from service waku/SendMessage ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host=test.app.spiff.status.im, port=443): Read timed out. (read timeout=45). ",
  "loggerName": "spiffworkflow_backend",
  "processName": "MainProcess",
  "processID": 36,
  "threadName": "ThreadPoolExecutor-0_9",
  "threadID": 139861463910080,
  "timestamp": "2024-05-20T13:11:21.225Z"
}

But debugging this is difficult because as far as I can tell 98% of the node logs are errors:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep -v ERROR /var/log/docker/spiff-workflow-wakunode/docker.* | wc -l
267247
jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep ERROR /var/log/docker/spiff-workflow-wakunode/docker.* | wc -l 
17513290
@jakubgs jakubgs self-assigned this May 21, 2024
@jakubgs
Member Author

jakubgs commented May 21, 2024

The error reported from the UI is:

"message_text": "There was an exception running process model manage-procurement/procurement/requisition-order-management/request-goods-services for instance 3256.\nOriginal Error:\nWorkflowExecutionServiceError("Error executing Service Task. Received error code 'ReadTimeout' from service 'waku/SendMessage' ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host='test.app.spiff.status.im', port=443): Read timed out. (read timeout=45)")"

And if we look at the time when this was generated:

image

Nothing can be found in the backend-api logs, and the connector logs show only this:

image

Utterly useless.

@jakubgs
Member Author

jakubgs commented May 21, 2024

This is the only thing I can find that matches:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep '07:48:2[23]' /var/log/docker/spiff-workflow-wakunode/docker.log
ERROR[05-21|07:48:22.564|github.com/status-im/mvds/node/node.go:364]                                         error processing acks                    namespace=mvds error="state not found"
ERROR[05-21|07:48:23.272|github.com/status-im/mvds/node/node.go:446]                                         Error while removing sync state.         namespace=mvds error="state not found"
ERROR[05-21|07:48:23.272|github.com/status-im/mvds/node/node.go:364]                                         error processing acks                    namespace=mvds error="state not found"
ERROR[05-21|07:48:23.276|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|07:48:23.276|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"

@jakubgs
Member Author

jakubgs commented May 21, 2024

The error appears to be coming from here:

func (m *Messenger) sendDataSync(receiver state.PeerID, payload *datasyncproto.Payload) error {
	ctx := context.Background()
	if !payload.IsValid() {
		m.logger.Error("payload is invalid")
		return errors.New("payload is invalid")
	}

And more specifically from:
func (s *SegmentMessage) IsValid() bool {
	return s.SegmentsCount >= 2 || s.ParitySegmentsCount > 0
}

Assuming this is even the right error to be chasing, considering that most of the logs are errors.

@jakubgs
Member Author

jakubgs commented May 21, 2024

The logs are full of all kinds of other errors:

ERROR[05-21|08:14:26.725|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:26.725|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:30.276|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:30.276|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:30.496|github.com/status-im/status-go/protocol/pushnotificationclient/client.go:900]       could not retrieve message               error="record not found"
ERROR[05-21|08:14:36.318|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:14:39.127|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:39.127|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:39.668|github.com/status-im/status-go/protocol/pushnotificationclient/client.go:900]       could not retrieve message               error="record not found"

No idea how relevant or if it's just noise.

@jakubgs
Member Author

jakubgs commented May 21, 2024

There's also shit like this:

ERROR[05-21|08:15:22.523|rpc/service.go:200]                                                                 RPC method wakuext_addContact crashed: runtime error: slice bounds out of range [:2] with length 1
goroutine 322642907 [running]:
github.com/ethereum/go-ethereum/rpc.(*callback).call.func1()
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/service.go:199 +0x89
panic({0x278d760, 0xc04fc28558})
#011/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/status-im/status-go/api/multiformat.multibaseDecode({0x4864f90?, 0xc000f673b8?})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:242 +0x7f
github.com/status-im/status-go/api/multiformat.DeserializePublicKey({0x4864f90?, 0x483f0a0?}, {0x3145710, 0x1})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:50 +0x35
github.com/status-im/status-go/api/multiformat.DeserializeCompressedKey({0x4864f90?, 0x0?})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:87 +0x36
github.com/status-im/status-go/protocol/requests.ConvertCompressedToLegacyKey(...)
#011/go/src/github.com/status-im/status-go/protocol/requests/send_contact_request.go:35
github.com/status-im/status-go/protocol/requests.(*AddContact).HexID(0x1de8e002658780?)
#011/go/src/github.com/status-im/status-go/protocol/requests/add_contact.go:26 +0x34
github.com/status-im/status-go/protocol.(*Messenger).AddContact(0x2694a60?, {0x316b058, 0xc036a67d10}, 0xc036a67d60)
#011/go/src/github.com/status-im/status-go/protocol/messenger_contacts.go:650 +0x85
github.com/status-im/status-go/services/ext.(*PublicAPI).AddContact(...)
#011/go/src/github.com/status-im/status-go/services/ext/api.go:863
reflect.Value.call({0xc00054f740?, 0xc00052ae78?, 0x7fcd524912d8?}, {0x2902f65, 0x4}, {0xc036a67db0, 0x3, 0x0?})
#011/usr/local/go/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0xc00054f740?, 0xc00052ae78?, 0x16?}, {0xc036a67db0?, 0x1?, 0x1?})
#011/usr/local/go/src/reflect/value.go:370 +0xbc
github.com/ethereum/go-ethereum/rpc.(*callback).call(0xc0007bb740, {0x316b058?, 0xc036a67d10}, {0xc04fc28540, 0x12}, {0xc05c6035a8, 0x1, 0x4e2997?})
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/service.go:205 +0x3c5
github.com/ethereum/go-ethereum/rpc.(*handler).runMethod(0xc05385bcc0?, {0x316b058?, 0xc036a67d10?}, 0xc011c65180, 0x1?, {0xc05c6035a8?, 0xc028a11cb0?, 0x90?})
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:391 +0x45
github.com/ethereum/go-ethereum/rpc.(*handler).handleCall(0xc028a11b00, 0xc06f78ef30, 0xc011c65180)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:336 +0x239
github.com/ethereum/go-ethereum/rpc.(*handler).handleCallMsg(0xc028a11b00, 0x30?, 0xc011c65180)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:297 +0x237
github.com/ethereum/go-ethereum/rpc.(*handler).handleMsg.func1(0xc06f78ef30)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:138 +0x35
github.com/ethereum/go-ethereum/rpc.(*handler).startCallProc.func1()
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:225 +0xc5
created by github.com/ethereum/go-ethereum/rpc.(*handler).startCallProc
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:221 +0x8d

@jakubgs
Member Author

jakubgs commented May 21, 2024

But I sent a test message to someone using this command:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "chats": [
      {
        "id": "0x04a0dcd3fc989e897089587d1d227c91d5c3898f4e74d72c6c076a18113d7c469d469f276ac599b5ce6c147a708689c3340d5cdd8e48c8dd6148477fbaec5d7942",
        "name": "0x04a0dc",
        "description": "",
        "color": "",
        "emoji": "",

And it worked. So the message delivery works fine.

@jakubgs
Member Author

jakubgs commented May 21, 2024

But I did manage to get it to time out a few times:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received

Which suggests an issue does indeed exist. But the logs show me nothing useful.

@jakubgs
Member Author

jakubgs commented May 21, 2024

And the host is definitely not overloaded:

image

@jakubgs
Member Author

jakubgs commented May 21, 2024

What's interesting is that I stopped the connector container, which should be responsible for sending waku messages, and I still see garbage like this in the status-go node logs:

ERROR[05-21|08:31:36.328|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:31:50.984|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:31:50.984|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:36.328|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:32:52.261|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:52.261|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"

Which suggests both payload is invalid and could not find network: 10 are not caused by messages being sent.

@jakubgs
Member Author

jakubgs commented May 21, 2024

There are lots of different types of errors:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep -v peerID /var/log/docker/spiff-workflow-wakunode/docker.log | awk '/ERROR/{$1=$2=$3=""; a[$0]++}END{for (x in a) { printf "%d -> %s\n", a[x], x}}' | sort -h | tail    
344 ->    failed to sync filter error=EOF
371 ->    error syncing topics error=EOF
517 ->    failed to get curated communities from contract error="could not find network: 10"
987 ->    could not retrieve message error="record not found"
1150 ->    failed to handle an encrypted message site=handleEncryptionLayer error="failed to handle Encryption message: can't skip previous chain message keys: bad until: probably an out-of-order message that was deleted"
1353 ->    failed to send message error="payload is invalid"
1353 ->    payload is invalid
1675 ->    error processing acks namespace=mvds error="state not found"
1675 ->    Error while removing sync state. namespace=mvds error="state not found"
9662 ->    failed to decode message err="couldn't decrypt using asymmetric key: ecies: invalid message"

No idea which of these are relevant. The fact that message delivery sometimes works suggests that most of them mean nothing.

@jakubgs
Member Author

jakubgs commented May 21, 2024

Here are the same stats, but grouped by the source file the error originates from:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep -v peerID /var/log/docker/spiff-workflow-wakunode/docker.log | awk -F'[][|:]' '/ERROR/{a[$12]++}END{for (x in a) { printf "%d -> %s\n", a[x], x}}' | sort -h
19 -> github.com/status-im/status-go/protocol/messenger_raw_message_resend.go
27 -> github.com/status-im/status-go/protocol/messenger_mailserver_cycle.go
87 -> rpc/service.go
396 -> github.com/waku-org/go-waku/waku/v2/protocol/legacy_store/waku_store_client.go
524 -> github.com/status-im/status-go/protocol/messenger_curated_communities.go
999 -> github.com/status-im/status-go/protocol/pushnotificationclient/client.go
1208 -> github.com/status-im/status-go/protocol/common/message_sender.go
1366 -> github.com/status-im/status-go/protocol/messenger_peersyncing.go
1370 -> github.com/status-im/status-go/protocol/datasync/transport.go
1465 -> github.com/status-im/status-go/protocol/messenger_mailserver.go
3530 -> github.com/status-im/mvds/node/node.go
9815 -> github.com/status-im/status-go/wakuv2/common/message.go

@jakubgs
Member Author

jakubgs commented May 21, 2024

I can reproduce the timeout even with an 8 second limit:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:/docker/spiff-workflow % c --max-time 8 -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shQsUWc9yQMba7M3vyJxuw1wLK6jcrg3YMshrvazA3iiGM","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 8001 milliseconds with 0 bytes receive

But less frequently. It happens more often with a 7 second limit.
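
To probe this more systematically, a rough sketch like the one below could repeat the same JSON-RPC call with a fixed limit and count how often it is exceeded. It assumes the same endpoint and recipient key as the curl commands above; the 8 second limit and the attempt count are arbitrary choices.

import json
import time

import requests

# Same JSON-RPC call as the curl commands above, repeated to see how often it
# exceeds a fixed time limit. Endpoint and recipient key mirror those commands;
# the limit and number of attempts are arbitrary.
URL = "http://127.0.0.1:8545/"
BODY = {
    "jsonrpc": "2.0",
    "method": "wakuext_sendOneToOneMessage",
    "params": [{"id": "zQ3shQsUWc9yQMba7M3vyJxuw1wLK6jcrg3YMshrvazA3iiGM",
                "message": "TEST MESSAGE"}],
    "id": 1,
}

timeouts = 0
for attempt in range(20):
    start = time.monotonic()
    try:
        requests.post(URL, data=json.dumps(BODY),
                      headers={"Content-Type": "application/json"}, timeout=8)
        print(f"attempt {attempt}: {time.monotonic() - start:.2f}s")
    except requests.exceptions.Timeout:
        timeouts += 1
        print(f"attempt {attempt}: timed out after 8s")

print(f"{timeouts}/20 attempts timed out")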

@jakubgs
Member Author

jakubgs commented May 21, 2024

If we look at the connector code, we can see the request is made with requests.post() without any timeout specified:

            raw_response = requests.post(url, json.dumps(request_body), headers=headers)

https://github.com/status-im/spiffworkflow-connector/blob/9b01806811b295c52d50c8cb6ad95c1444fd208b/connectors/connector-waku/connector_waku/commands/sendMessage.py#L54

And according to the docs, that should mean an indefinite timeout:

By default, requests do not time out unless a timeout value is set explicitly. Without a timeout, your code may hang for minutes or more.

https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

Which would contradict my hypothesis. But there are other timeouts involved, like the frontend one.
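
For comparison, here is a minimal sketch of what an explicit timeout on the connector side could look like. The url, request_body, and headers values are hypothetical stand-ins for the connector's real ones, and the timeout values are only illustrative:

import json

import requests

# Hypothetical stand-ins for the connector's real url/body/headers.
url = "http://wakunode:8545/"
request_body = {"jsonrpc": "2.0", "method": "wakuext_sendOneToOneMessage",
                "params": [], "id": 1}
headers = {"Content-Type": "application/json"}

try:
    # Explicit (connect, read) timeout so the call fails fast instead of
    # hanging indefinitely when the node is unresponsive.
    raw_response = requests.post(url, json.dumps(request_body), headers=headers,
                                 timeout=(5, 30))
    raw_response.raise_for_status()
except requests.exceptions.Timeout as ex:
    print("Timed out talking to the node:", ex)
except requests.exceptions.HTTPError as ex:
    print("HTTP error from the node:", ex)

With a read timeout shorter than the backend's 45 seconds, the connector itself would report the failure instead of the backend's proxy call timing out first.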

@jakubgs
Member Author

jakubgs commented May 21, 2024

It appears the Read timed out. (read timeout=45) error comes from CONNECTOR_PROXY_COMMAND_TIMEOUT:

CONNECTOR_PROXY_COMMAND_TIMEOUT = 45

https://github.com/sartography/spiff-arena/blob/822e4f499b90736d53483d96f6bea08628f3d0e6/spiffworkflow-backend/src/spiffworkflow_backend/config/__init__.py#L13

Which is then used by requests.post():

proxied_response = requests.post(call_url, json=params, timeout=CONNECTOR_PROXY_COMMAND_TIMEOUT)

https://github.com/sartography/spiff-arena/blob/5840db56cfc471fff54930cf54880131f4586ea1/spiffworkflow-backend/src/spiffworkflow_backend/services/service_task_service.py#L201
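
Putting the two together: the backend calls the connector proxy with a 45 second read timeout, while the connector's own call to the waku node has no timeout at all. So if the node hangs, the backend is the first to give up, which matches the ReadTimeout error seen in the Spiff UI. A rough sketch of the backend side of the chain, with an illustrative URL:

import requests

# Backend side of the chain as configured above: a 45s timeout on the call to
# the connector proxy. The URL here is illustrative.
CONNECTOR_PROXY_COMMAND_TIMEOUT = 45
call_url = "https://test.app.spiff.status.im/connector-proxy/v1/do/waku/SendMessage"

try:
    proxied_response = requests.post(call_url, json={"example": "params"},
                                     timeout=CONNECTOR_PROXY_COMMAND_TIMEOUT)
except requests.exceptions.ReadTimeout as ex:
    # If the connector is still stuck in its own timeout-less requests.post()
    # to the waku node, this fires after 45s and surfaces as the
    # "Read timed out. (read timeout=45)" error in the Spiff UI.
    print(ex)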

@cammellos
Contributor

So far, it looks like the problem is CPU-related.
Usage is constantly around 100%, and the node seems to be stuck in a loop of fetching messages from store nodes (I can't tell yet whether it's a loop or there's just a lot of data; investigating).

@jakubgs
Member Author

jakubgs commented May 24, 2024

Andrea made a PR to reduce CPU usage by disabling fetching of messages:

But even that is not enough to prevent the status-go node for Spiff from abusing the CPU for no reason:

image

@jakubgs
Member Author

jakubgs commented May 24, 2024

I have added extra logging to the connector container code on the app-test env:

diff --git a/connectors/connector-waku/connector_waku/commands/sendMessage.py b/connectors/connector-waku/connector_waku/commands/sendMessage.py
index 09a81bf..4d59e32 100644
--- a/connectors/connector-waku/connector_waku/commands/sendMessage.py
+++ b/connectors/connector-waku/connector_waku/commands/sendMessage.py
@@ -51,7 +51,10 @@ class SendMessage:
         status_code = None
         successful = False
         try:
+            print('POST', url, request_body)
             raw_response = requests.post(url, json.dumps(request_body), headers=headers)
+            print('RESPONSE:', raw_response.status_code)
+            print('RESPONSE TEXT:', raw_response.text)
             raw_response.raise_for_status()
             status_code = raw_response.status_code
             parsed_response = json.loads(raw_response.text)
@@ -59,9 +62,11 @@ class SendMessage:
             if not self.response_has_error(response) and status_code == 200:
                 successful = True
         except HTTPError as ex:
+            print('HTTPError:', ex)
             status_code = ex.response.status_code
             response['error'] = str(ex)
         except Exception as ex:
+            print('Exception:', ex)
             response['error'] = str(ex)
             status_code = 500
         return (response, status_code, successful)

It's not pretty but it works. I want to see how it actually fails and what the timings are. So far I have been unable to reproduce a failure.

@jakubgs
Member Author

jakubgs commented May 24, 2024

I was unable to reproduce the issue using this process:
https://test.app.spiff.status.im/process-models/misc:test-waku-kb

So I'm going to leave reproducing it with new logging to Spiff testers.

@jakubgs
Member Author

jakubgs commented May 27, 2024

@jakubgs
Member Author

jakubgs commented May 27, 2024

What's fascinating is that if we look at https://test.app.spiff.status.im/i/637, which started at 08:18:57 and failed at 08:19:42, and check the connector logs around the start time, we see:

[2024-05-27 06:18:57 +0000] [143966] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:19:43 +0000] [143697] [DEBUG] POST /connector-proxy/v1/do/http/GetRequest
172.17.3.1 - - [27/May/2024:06:19:43 +0000] "POST /connector-proxy/v1/do/http/GetRequest HTTP/1.0" 200 176 "-" "python-requests/2.31.0"
[2024-05-27 06:19:44 +0000] [143937] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:20:28 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143966)
[2024-05-27 06:20:28 +0000] [143966] [INFO] Worker exiting (pid: 143966)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:20:28 +0000] [161550] [INFO] Booting worker with pid: 161550
[2024-05-27 06:20:29 +0000] [143697] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:20:47 +0000] [161550] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:21:15 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143937)
[2024-05-27 06:21:15 +0000] [143937] [INFO] Worker exiting (pid: 143937)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:21:15 +0000] [161578] [INFO] Booting worker with pid: 161578
[2024-05-27 06:21:15 +0000] [161578] [DEBUG] Ignoring EPIPE
[2024-05-27 06:21:48 +0000] [161550] [DEBUG] Ignoring EPIPE
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
RESPONSE: 200
RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}
[2024-05-27 06:22:00 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143697)
[2024-05-27 06:22:00 +0000] [143697] [INFO] Worker exiting (pid: 143697)

We can see 3 new interesting error messages:

  • [1] [CRITICAL] WORKER TIMEOUT (pid:143966)
  • [DEBUG] Ignoring EPIPE
  • {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}

Most interestingly, the task says it failed at 08:19:42, but we see no error at that time, only later.
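
For context, the [1] [CRITICAL] WORKER TIMEOUT lines look like gunicorn's log format: the master process kills a worker that has not finished its request within the configured worker timeout. Assuming the connector proxy is indeed served by gunicorn, that limit is set via the timeout setting, for example in a Python config file; the values below are only illustrative, not the current configuration:

# gunicorn.conf.py -- illustrative settings, assuming the connector proxy runs
# under gunicorn (the "[CRITICAL] WORKER TIMEOUT" lines match its log format).
timeout = 90           # seconds a worker may spend on a single request before being killed
graceful_timeout = 30  # seconds a worker gets to finish up after a restart signal
workers = 4            # number of worker processes

If a worker is stuck long enough to hit this limit, the backend's own 45 second timeout has already fired, which would fit the ReadTimeout errors above.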

@madhurrya

madhurrya commented May 27, 2024

This is something I have noticed in the past and noticed again today. Normally I open the Status app only when I want to test the messages. That can be once every few days, or sometimes once a week. I have noticed that most of the time when I open the Status app, the instances start to error for some time.
Yesterday and this morning I created about 100+ instances and none of them errored. I didn't have the Status app open at that time. But later today I opened the Status app and then created those instances (which I shared above) and they started getting the error.
This is something I have observed in the past as well. Just sharing what I have observed; not sure whether it's a valid point or not.

I ran these after starting the Status app, and some of them errored.
image

I ran 50 instances of this in the morning before opening the Status app without any error. And another 100+ instances yesterday without any error.

image

@jakubgs
Member Author

jakubgs commented May 27, 2024

The database is locked and Ignoring EPIPE errors might be a red herring, as I can find only 4 instances of each in the connector logs:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'database is locked' /var/log/docker/spiff-workflow-connector/docker.log | wc -l
4

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'Ignoring EPIPE' /var/log/docker/spiff-workflow-connector/docker.log | wc -l    
4

But the WORKER TIMEOUT certainly seems more prevalent:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'WORKER TIMEOUT' /var/log/docker/spiff-workflow-connector/docker.log | wc -l
10

@jakubgs
Member Author

jakubgs commented May 27, 2024

And indeed, if I search for instances of WORKER TIMEOUT, I find next to them an error like this:

POST http://wakunode:8545/
{'jsonrpc': '2.0',
 'method': 'wakuext_sendOneToOneMessage',
 'params': [{'id': 'zQ3shwDqFxGXHvbCeQxTURrtSTCH4UtPHaEDu5LgNMpsjtqdF',
   'message': 'There was an exception running process model manage-procurement/procurement/requisition-order-management/request-goods-services for instance 355.\nOriginal Error:\nWorkflowExecutionServiceError("Error executing Service Task. Received error code \'ReadTimeout\' from service \'waku/SendMessage\' ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host=\'test.app.spiff.status.im\', port=443): Read timed out. (read timeout=45)")'}],
 'id': 1}

Crucial part:

Error executing Service Task. Received error code \'ReadTimeout\' from service \'waku/SendMessage\'
::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties.
::: The original message: HTTPSConnectionPool(host=\'test.app.spiff.status.im\', port=443): Read timed out. (read timeout=45)")'}],

@jakubgs
Member Author

jakubgs commented May 27, 2024

Another instance, started at 06:24:30, failed at 06:25:15: https://test.app.spiff.status.im/i/642

[2024-05-27 06:24:28 +0000] [161614] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:24:30 +0000] [161706] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:25:12 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:161578)
[2024-05-27 06:25:12 +0000] [161578] [INFO] Worker exiting (pid: 161578)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:25:13 +0000] [161742] [INFO] Booting worker with pid: 161742
[2024-05-27 06:25:13 +0000] [161742] [DEBUG] Ignoring EPIPE
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
RESPONSE: 200
RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"result":{"chats":[{"id":" ... }]}}
[2024-05-27 06:25:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:161614)

We can see WORKER TIMEOUT and Ignoring EPIPE go hand in hand.

@jakubgs
Member Author

jakubgs commented May 27, 2024

Interestingly, neither the Exception nor the HTTPError lines were reached even once, which means the send_message() method in the connector never fails and the failure is somewhere higher up.

@jakubgs
Member Author

jakubgs commented May 27, 2024

Here's an example of failure: https://test.app.spiff.status.im/process-instances/misc:qa:send-waku-message/803
And here are the logs:

2024-05-27T15:48:40.801101+00:00 [9] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
2024-05-27T15:48:53.401283+00:00 SendMessage(message='=====Waku message test from test.app===', message_type='wakuext_sendOneToOneMessage', recipient=['zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ'])
2024-05-27T15:48:53.401293+00:00 execute START
2024-05-27T15:48:53.401300+00:00 send_message START
2024-05-27T15:48:53.401305+00:00 POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
2024-05-27T15:48:53.401309+00:00 RESPONSE: 200
2024-05-27T15:48:53.401313+00:00 RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}
2024-05-27T15:48:53.401318+00:00 send_message END
2024-05-27T15:48:53.401322+00:00 execute END
2024-05-27T15:49:40.869893+00:00 [2024-05-27 15:49:40 +0000] [9] [DEBUG] Ignoring EPIPE

There are some other examples of database is locked errors as well. But the interesting part is that it returns right after the request.

There is no delay, so why would it cause a timeout?

@jakubgs
Member Author

jakubgs commented May 27, 2024

Possibly relevant issue:

@jakubgs
Member Author

jakubgs commented Jun 4, 2024

After migrating the app-test environment to DigitalOcean in:

The problems have subsided, for now at least. So I'm closing this issue for now.

@jakubgs jakubgs closed this as completed Jun 4, 2024