
k8s agent fails to tail step logs #2253

Open
praveen-livspace opened this issue Aug 20, 2023 · 16 comments · May be fixed by #3601
Labels
backend/kubernetes bug Something isn't working
Comments

@praveen-livspace

praveen-livspace commented Aug 20, 2023

Component

agent

Describe the bug

For some of our repos, the trivy scan step triggers the agent to log this error. I noticed this not just with trivy but also with megalinter.

{ "level": "error", "error": "rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8", "time": "2023-08-20T07:18:46Z", "message": "grpc error: log(): code: Internal: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8" }

Trivy command executed as part of the step.

trivy fs \
  --exit-code 1 \
  --skip-dirs vendor,node_modules \
  --report summary \
  --severity CRITICAL \
  --scanners vuln \
  --ignore-unfixed \
  --quiet \
  .

System Info

{"source":"https://github.com/woodpecker-ci/woodpecker","version":"next-0c282e86e8"}

Additional context

No response

Validations

  • Read the Contributing Guidelines.
  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
  • Check that this is a concrete bug. For Q&A join our Discord Chat Server or the Matrix room.
@praveen-livspace praveen-livspace added the bug Something isn't working label Aug 20, 2023
@qwerty287 qwerty287 added feedback this needs feedback backend/kubernetes labels Feb 11, 2024
@qwerty287
Contributor

Is this still an issue?

@metawave

Yes, I have the same problem (Woodpecker 2.3.0). It makes the woodpecker-agent spin indefinitely, using a lot of CPU, and I have to kill the agent frequently to make it work again. It's annoying.

@qwerty287 qwerty287 removed the feedback this needs feedback label Feb 16, 2024
@qwerty287
Contributor

Can you get debug logs?

@sschnabe

I have the same issue (2.3.0 with the k8s agent). This example workflow triggers it:

steps:
  import-gpg:
    image: debian:stable-slim
    commands:
      - apt-get -qq update 
      - apt-get -qq install --yes --no-install-recommends ca-certificates wget gpg gpg-agent dirmngr
      - wget --quiet https://downloads.apache.org/maven/KEYS
      - gpg --import KEYS

While importing the KEYS, the woodpecker agent logs:

{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-02-21T10:02:11Z","message":"grpc error: log(): code: Internal"}

@qwerty287
Contributor

I already wrote that above: please provide debug logs if possible. They contain extra fields like the file and line where this is logged. Otherwise it's hard to help here (at least for me; maybe the people who know more about k8s can tell you the issue).

@sschnabe

With WOODPECKER_DEBUG_PRETTY=true and WOODPECKER_LOG_LEVEL=trace I got:

1:07PM TRC src/pipeline/rpc/log_entry.go:82 > grpc write line: gpg: key 4F9BDB78DB596386: public key "Vincent Siveton <vsiveton@apache.org>" imported
gpg: key B95BBD3FA43C4492: 22 signatures not checked due to missing keys
gpg: key B95BBD3FA43C4492: public key "Carlos Sanchez <carlos.sanchez@mergere.com>" imported
gpg: key E181870DC625BAFB: public key "Maria Odea Ching <oching@apache.org>" imported
gpg: key D2A765937BA507E8: public key "Mauro Talevi <mauro@apache.org>" imported
gpg: key C92C5FEC70161C62: public key "Herve Boutemy <hboutemy@apache.org>" imported
gpg: key 873A8E86B4372146: public key "Olivier Lamy <olamy@apache.org>" imported
gpg: key 8F0FABD30F353251: public key "Dan Fabulich <dfabulich@apache.org>" imported
gpg: key 0B7B8BED064C851C: public key "Rapha�l Pi�roni <rafale@apache.org>" imported
gpg: key C2D879C8B0874707: public key "Milos Kleint <mkleint@apache.org>" imported
gpg: key 8FE7CB51365A46EF: public key "Mark Hobson <markh@apache.org>" imported
gpg: key A3F9CCC081C4177D: public key "Oleg Gusakov <ogusakov@apache.org>" imported
gpg: key 30E6F80434A72A7F: public key "John Dennis Casey <jdcasey@commonjava.org>" imported
gpg: key E8579C183571506F: public key "Rapha�l Pi�roni <rafale@apache.org>" imported
gpg: key A8AD2A7E9A25CE21: public key "Barrie Treloar <baerrach@apache.org>" imported
gpg: key 3B58205B9D7013A9: public key "Dan T. Tran <dantran@apache.org>" imported
 step-uuid=01HQ5WD8SBBCTMPE81QTQJS272
1:07PM ERR src/agent/rpc/client_grpc.go:335 > grpc error: log(): code: Internal error="rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8"

@qwerty287 qwerty287 added this to the 2.x.x milestone Feb 21, 2024
fernandrone added a commit to quintoandar/woodpecker that referenced this issue Apr 8, 2024
Proposal to fix woodpecker-ci#2253

We have observed several possibly-related issues on a Kubernetes
backend:

1. Agents behave erratically when dealing with certain log payloads. A common
   observation here is that steps that produce a large volume of logs will cause
   some steps to be stuck "pending" forever.

2. Agents use way more CPU than should be expected; we often see 200-300
   millicores of CPU per Workflow per agent (as reported on woodpecker-ci#2253).

3. We commonly see Agents displaying thousands of error lines about
   parsing logs, often with very close timestamps, which may explain issues 1
   and 2 (as reported on woodpecker-ci#2253).

```
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
```

4. We've also observed that agents will sometimes drop out of the worker queue,
also as reported on woodpecker-ci#2253.

Seeing as the logs point to `client_grpc.go:335`, this pull request
fixes the issue by:

1. Removing codes.Internal from the retryable gRPC statuses. Agent gRPC calls
that fail with codes.Internal will no longer be retried. There is no agreement
on which gRPC codes should be retried, but Internal does not seem to be a
common one to retry -- if ever.

2. Adding a timeout of 30 seconds to any retries. Currently, the exponential
retries have a maximum timeout of _15 minutes_. I assume this might be
required by some other functions so that agents resume their operation in
case the webserver restarts. Still, this is likely the cause of the large CPU
increase, as agents can be stuck retrying thousands of requests over a large
window of time. The previous change alone should be enough to solve this
issue, but I think this is a good idea to prevent similar problems from
arising in the future.
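
A minimal sketch of those two ideas in Go (hypothetical helper names, not the actual PR code): errors with codes.Internal are returned immediately instead of being retried, and the retry loop gives up after a fixed window such as 30 seconds.

```go
package retry

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryable reports whether a gRPC error is worth retrying.
// codes.Internal is excluded on purpose: a log line with invalid UTF-8
// fails the same way on every attempt, so retrying only burns CPU.
func retryable(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable, codes.ResourceExhausted, codes.DeadlineExceeded:
		return true
	default:
		return false
	}
}

// callWithRetry retries fn with a one-second pause until it succeeds,
// the error becomes non-retryable, or maxElapsed (e.g. 30s) has passed.
func callWithRetry(ctx context.Context, maxElapsed time.Duration, fn func(context.Context) error) error {
	deadline := time.Now().Add(maxElapsed)
	for {
		err := fn(ctx)
		if err == nil || !retryable(err) || time.Now().After(deadline) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```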
@fernandrone
Contributor

This was a problem for me as well, so I've attempted a fix here: #3601

@zc-devs
Contributor

zc-devs commented Apr 25, 2024

#3601 (comment)

I haven't noticed any missing logs on our builds. However, I think the true test would be to run a build that outputs non-UTF-8 characters and see how Woodpecker behaves. Given the issue, I'd expect Woodpecker not to print them.

Indeed, WP doesn't print them.

I've reproduced this with this pipeline:

skip_clone: true
steps:
  logs:
    image: alpine
    commands:
      - apk add openssl
      - openssl rand 1000
      - echo Done

WP output:

+ apk add openssl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/3) Upgrading libcrypto3 (3.1.4-r5 -> 3.1.4-r6)
(2/3) Upgrading libssl3 (3.1.4-r5 -> 3.1.4-r6)
(3/3) Installing openssl (3.1.4-r6)
Executing busybox-1.36.1-r15.trigger
OK: 8 MiB in 16 packages
+ openssl rand 1000

wp-01hwa0ya0qqn1exr68j5ftj5tq.log

The pod was in the Succeeded state before the pipeline was cancelled.

But I didn't notice high CPU usage (at least after I cancelled the pipeline):

> kubectl top pods -n test-scm-cicd
NAME                                 CPU(cores)   MEMORY(bytes)
gitea-0                              3m           125Mi
woodpecker-agent-d9b864c76-lqdgj     1m           10Mi
woodpecker-server-75c9794774-nglv7   3m           29Mi

Tested on Woodpecker 2.3.0 in Kubernetes.
woodpecker-server.log
woodpecker-agent.log

As you can see, the agent logs the error even after the pipeline was cancelled.


Now my pipeline is in the killed status, but the step stays in running.

@anbraten
Member

I had the feeling we use base64 encoding for the log data, but I can't find it in the agent code anymore 🤔. Shouldn't base64-encoding the log data solve this issue?

@golyalpha

I have a feeling that the encoding happens on the server at some point.

@anbraten
Member

I have a feeling that the encoding happens on the server at some point.

Same for me, but I have no idea where we do this 🙈 I searched for encoding/base64 but can't find where we do this for the log entries at all. I'll try to find it using the debugger when I have some time. Any hints are more than welcome 😉

@zc-devs
Contributor

zc-devs commented Apr 25, 2024

we use base64 encoding for the log data

Nope:

Data []byte `json:"data" xorm:"LONGBLOB"`

postgres=# \c woodpecker_test
woodpecker_test=# \d log_entries
                             Table "public.log_entries"
 Column  |  Type   | Collation | Nullable |                 Default
---------+---------+-----------+----------+-----------------------------------------
 id      | bigint  |           | not null | nextval('log_entries_id_seq'::regclass)
 step_id | bigint  |           |          |
 time    | bigint  |           |          |
 line    | integer |           |          |
 data    | bytea   |           |          |
 created | bigint  |           |          |
 type    | integer |           |          |
Indexes:
    "log_entries_pkey" PRIMARY KEY, btree (id)
    "IDX_log_entries_step_id" btree (step_id)
woodpecker_test=# select data from log_entries where id=5;
                data
------------------------------------
 \x2b20676974206c66732066657463680a
(1 row)
woodpecker_test=# select convert_from(data, 'UTF-8') from log_entries where id=5;
  convert_from
-----------------
 + git lfs fetch+

(1 row)

Shouldn't base64-encoding the log data solve this issue?

Yes, I think so.

@anbraten
Member

Thanks for the help.

I just had the feeling because we get the logs base64-encoded from the API 🤔

logs?.forEach((line) => writeLog({ index: line.line, text: decode(line.data), time: line.time }));

writeLog({ index: line.line, text: decode(line.data), time: line.time });

Anyway, I guess the solution should be quite simple: base64-encode the log data directly in the agent code.
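
A rough sketch of that idea (hypothetical function names, assuming the log line currently travels in a proto3 string field): encode on the agent, decode on the server before storing or serving.

```go
package logenc

import "encoding/base64"

// encodeLine makes an arbitrary byte sequence safe to place in a proto3
// string field: base64 output is plain ASCII and therefore valid UTF-8.
func encodeLine(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// decodeLine reverses encodeLine on the server side, before the data is
// stored in the database or sent to the UI.
func decodeLine(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(s)
}
```

The trade-off is roughly 33% larger log payloads and the need for agent and server to agree on the encoding, which is why changing the field type itself is also discussed below.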

@zc-devs
Contributor

zc-devs commented Apr 25, 2024

base64 encoding in the agent code

Should it be decoded after receiving / before storing in DB?

Edit 1:

If we do not want to store Base64 in the DB, then I'm for the solution mentioned by @fernandrone and @eliasscosta:

edit the .proto and change the string fields to bytes
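
The difference is easy to demonstrate with the protobuf wrapper types (a standalone sketch, not Woodpecker's actual .proto): marshaling invalid UTF-8 through a proto3 string field fails, while a bytes field passes it through untouched.

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

func main() {
	// "Raphaël" encoded in Latin-1: the 0xEB byte is not valid UTF-8.
	raw := []byte{'R', 'a', 'p', 'h', 'a', 0xEB, 'l'}

	// string field: the Go protobuf marshaler validates UTF-8 and errors out.
	_, err := proto.Marshal(wrapperspb.String(string(raw)))
	fmt.Println("string field:", err) // error mentioning invalid UTF-8

	// bytes field: arbitrary data is accepted unchanged.
	_, err = proto.Marshal(wrapperspb.Bytes(raw))
	fmt.Println("bytes field:", err) // <nil>
}
```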

@anbraten
Member

While doing some testing in #2072 I always got the `grpc error: log(): code: Internal: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8` error when I was sending extremely long log lines.
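
For what it's worth, line length alone shouldn't matter to the marshaler; a long line only becomes invalid UTF-8 if it gets cut in the middle of a multi-byte rune. A tiny illustration of that failure mode (hypothetical byte-wise truncation, not necessarily what the agent does):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	line := "Piéroni" // 'é' occupies two bytes in UTF-8
	cut := line[:3]   // byte-wise truncation splits the 'é' rune

	fmt.Println(utf8.ValidString(line)) // true
	fmt.Println(utf8.ValidString(cut))  // false: ends with a partial rune
}
```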

@zc-devs
Contributor

zc-devs commented Apr 27, 2024

While doing some testing I always got the error when I was sending extremely long log lines

It seems I cannot reproduce this.

But I've found other (possible) issues. A pipeline with a 13 MB single-line log:

skip_clone: true
steps:
  logs:
    image: alpine
    commands:
      - apk add openssl
      - openssl rand 10000000 | base64 -w 0
      - echo Done

Run number 182:
I ran the pipeline; I don't remember its status. The Server pod restarted with a panic. After that I saw the full log (414 lines).
woodpecker-server.log
woodpecker-agent.log

Run number 184:
I ran the pipeline. The step stayed in the not-started state, so there were no logs. When I saw in the K3s logs that the pipeline had finished, I refreshed the web page: success, full log (415 lines).

Run number 185:
I ran the pipeline. The step showed up in the running state. In OpenLens I saw some pod (Server or Agent) flash orange, but there were no pod restarts. In the UI there were 274 lines of truncated log; after refreshing the page, the same.

woodpecker-agent-2.log

Run number 187:
The pipeline started with the step in the running status, then it moved to success. In the UI there were 290 lines of truncated logs; after refreshing the page, 289 lines.
woodpecker-server.log
woodpecker-agent.log

Run number 188:
Right after 187. I ran the pipeline, then went to the pipelines list page. The Server pod restarted (Reason: Error, exit code: 2). The Agent did not, but its log is truncated (maybe some Go self-healing). After the Server restarted, I see the pipeline and step in the success state. There is a full log with 415 lines.
woodpecker-server.log
woodpecker-agent.log

Woodpecker 2.4.1, K3s 1.29.4.
