New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
k8s agent fails to tail step logs #2253
Comments
Is this still an issue? |
yes.... have the same problem (woodpecker 2.3.0).... making woodpecker-agent spin indefinitely using a lot of cpu... have to kill the agent frequently to make it work again.... it's annoying.... |
Can you get debug logs? |
I have the same issue (2.3.0 with k8s agent). This example workflow triggers this issue: steps:
import-gpg:
image: debian:stable-slim
commands:
- apt-get -qq update
- apt-get -qq install --yes --no-install-recommends ca-certificates wget gpg gpg-agent dirmngr
- wget --quiet https://downloads.apache.org/maven/KEYS
- gpg --import KEYS While importing the KEYS woodpecker agent logs: {"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-02-21T10:02:11Z","message":"grpc error: log(): code: Internal"} |
I wrote that above already: please provide debug logs if possible. They contain extra fields like the file and line where this is logged. Otherwise it's hard to help here (at least for me, maybe the people knowing more about k8s can tell you the issue). |
With WOODPECKER_DEBUG_PRETTY=true and WOODPECKER_LOG_LEVEL=trace i got: 1:07PM TRC src/pipeline/rpc/log_entry.go:82 > grpc write line: gpg: key 4F9BDB78DB596386: public key "Vincent Siveton <vsiveton@apache.org>" imported
gpg: key B95BBD3FA43C4492: 22 signatures not checked due to missing keys
gpg: key B95BBD3FA43C4492: public key "Carlos Sanchez <carlos.sanchez@mergere.com>" imported
gpg: key E181870DC625BAFB: public key "Maria Odea Ching <oching@apache.org>" imported
gpg: key D2A765937BA507E8: public key "Mauro Talevi <mauro@apache.org>" imported
gpg: key C92C5FEC70161C62: public key "Herve Boutemy <hboutemy@apache.org>" imported
gpg: key 873A8E86B4372146: public key "Olivier Lamy <olamy@apache.org>" imported
gpg: key 8F0FABD30F353251: public key "Dan Fabulich <dfabulich@apache.org>" imported
gpg: key 0B7B8BED064C851C: public key "Rapha�l Pi�roni <rafale@apache.org>" imported
gpg: key C2D879C8B0874707: public key "Milos Kleint <mkleint@apache.org>" imported
gpg: key 8FE7CB51365A46EF: public key "Mark Hobson <markh@apache.org>" imported
gpg: key A3F9CCC081C4177D: public key "Oleg Gusakov <ogusakov@apache.org>" imported
gpg: key 30E6F80434A72A7F: public key "John Dennis Casey <jdcasey@commonjava.org>" imported
gpg: key E8579C183571506F: public key "Rapha�l Pi�roni <rafale@apache.org>" imported
gpg: key A8AD2A7E9A25CE21: public key "Barrie Treloar <baerrach@apache.org>" imported
gpg: key 3B58205B9D7013A9: public key "Dan T. Tran <dantran@apache.org>" imported
step-uuid=01HQ5WD8SBBCTMPE81QTQJS272
1:07PM ERR src/agent/rpc/client_grpc.go:335 > grpc error: log(): code: Internal error="rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8" |
Proposal to fix woodpecker-ci#2253 We have observed several possibly-related issues on a Kubernetes backend: 1. Agents behave erractly when dealing with certain log payloads. A common observation here is that steps that produce a large volume of logs will cause some steps to be stuck "pending" forever. 2. Agents use way more CPU than should be expected, we often see 200-300 millicores of CPU per Workflow per agent (as reported on woodpecker-ci#2253). 3. We commonly see Agents displaying thousands of error lines about parsing logs, often with very close timestamps, which may explain issues 1 and 2 (as reported on woodpecker-ci#2253). ``` {"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"} {"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"} {"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"} ``` 4. We've also observed that agents will sometimes drop out of the worker queue, also as reported on woodpecker-ci#2253. Seeing as the logs point to `client_grpc.go:335`, this pull request fixes the issue by: 1. Removing codes.Internal from being a retryable GRPC status. Now agent GRPC calls that fail with codes. Internal will not be retried. There's not an agreement on what GRPC codes should be retried but Internal does not seem to be a common one to retry -- if ever. 2. Add a timeout of 30 seconds to any retries. Currently, the exponential retries have a maximum timeout of _15 minutes_. I assume this might be required by some other functions so Agents resume their operation in case the webserver restarts. Still this is likely the cause behind the large cpu increase as agents can be stuck trying thousands of requests for a large windown of time. The previous change alone should be enough to solve this issue but I think this might be a good idea to prevent similar problems from arising in the future.
This was a problem for me as well so I've attempted a fix here #3601 |
Indeed, WP doesn't print them. I've reproduced this with pipeline: skip_clone: true
steps:
logs:
image: alpine
commands:
- apk add openssl
- openssl rand 1000
- echo Done WP output:
wp-01hwa0ya0qqn1exr68j5ftj5tq.log Pod was in But I didn't notice high CPU usage (at least after I cancelled pipeline):
Tested on Woodpecker As you can see, Agent logs error even after pipeline was cancelled. Now my pipeline is in status |
I had the feeling we use base64 encoding for the log data, but I can't find it in the agent code anymore 🤔. Shouldn't base64-encoding the log data solve this issue? |
I have a feeling that the encoding happens on the server at some point. |
Same for me, but I have no idea where we do this 🙈 Searched for |
Nope: woodpecker/server/model/log.go Line 33 in d005773
postgres=# \c woodpecker_test
woodpecker_test=# \d log_entries
Table "public.log_entries"
Column | Type | Collation | Nullable | Default
---------+---------+-----------+----------+-----------------------------------------
id | bigint | | not null | nextval('log_entries_id_seq'::regclass)
step_id | bigint | | |
time | bigint | | |
line | integer | | |
data | bytea | | |
created | bigint | | |
type | integer | | |
Indexes:
"log_entries_pkey" PRIMARY KEY, btree (id)
"IDX_log_entries_step_id" btree (step_id)
woodpecker_test=# select data from log_entries where id=5;
data
------------------------------------
\x2b20676974206c66732066657463680a
(1 row)
woodpecker_test=# select convert_from(data, 'UTF-8') from log_entries where id=5;
convert_from
-----------------
+ git lfs fetch+
(1 row)
Yes, I think. |
Thanks for the help. Just had the feeling as we are getting the logs base64 encoded from the api 🤔
Anyways I guess the solution should be quite simple by base64 encoding in the agent code directly |
Should it be decoded after receiving / before storing in DB? Edit 1: If we do not want to store Base64 in DB, then I'm for solution mentioned by @fernandrone and @eliasscosta:
|
While doing some testing in #2072 I always got the |
Seems, I cannot reproduce this. But I've found other issues(?). Pipeline with 13MB "one"-line log: skip_clone: true
steps:
logs:
image: alpine
commands:
- apk add openssl
- openssl rand 10000000 | base64 -w 0
- echo Done Run number 182: Run number 184: Run number 185: Run number 187: Run number 188: Woodpecker 2.4.1, K3s 1.29.4. |
Component
agent
Describe the bug
For some of our repos, trivy scan step, triggers the agent to log this error. I noticed this not just with trivy but also with megalinter.
{ "level": "error", "error": "rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8", "time": "2023-08-20T07:18:46Z", "message": "grpc error: log(): code: Internal: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8" }
Trivy command executed as part of the step.
trivy fs \ --exit-code 1 \ --skip-dirs vendor,node_modules \ --report summary \ --severity CRITICAL \ --scanners vuln \ --ignore-unfixed \ --quiet \ .
System Info
Additional context
No response
Validations
next
version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]The text was updated successfully, but these errors were encountered: