Resolve failures on PSI infra to unblock PR tests Run #4606

Closed
2 tasks
prietyc123 opened this issue Apr 12, 2021 · 20 comments
Labels
area/infra Issues or PRs related to setting up or fixing things in infrastructure. Mostly CI infrastructure. area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering estimated-size/S (5-10) Rough sizing for Epics. Less then one sprint of work for one person kind/user-story An issue of user-story kind

Comments

@prietyc123
Contributor

/kind user-story
/area testing

User Story

As a user I want to run PR tests successfully on Linux PSI infra

Acceptance Criteria

  • Analyse the test failures on the OpenShift Linux PSI environment
  • Fix the issues and get a green job on PRs

Links

  • Feature Request: NA
  • Related issue: NA

/kind user-story

@openshift-ci-robot openshift-ci-robot added kind/user-story An issue of user-story kind area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering labels Apr 12, 2021
@kadel
Member

kadel commented Apr 13, 2021

As a user I want to run PR tests successfully on Linux PSI infra

Users don't care about PRs or tests; odo developers do.

But even odo developers should not need to run tests on PSI infra manually. This needs to be automated: tests should run automatically for every open PR.

@prietyc123
Contributor Author

prietyc123 commented Apr 14, 2021

This needs to be automated: tests should run automatically for every open PR.

Yes.

Scope of the issue:

  • Run tests on Jenkins for the Linux environment with PSI
  • Collect all the failures on the PSI Linux environment
  • Create related issues
  • Analyse the failures
  • Fix the issues one by one
  • Enable the job on PRs

@dharmit dharmit added the estimated-size/S (5-10) Rough sizing for Epics. Less then one sprint of work for one person label Apr 15, 2021
@prietyc123
Contributor Author

Issues faced on PSI:

✗  Waiting for component to start [5m] [WARNING x5: Failed]

[ssh:Fedora 32] [odo]  ✗  Failed to start component with name gbfxmm. Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for gbfxmm deployment roll out\nFor more information to help determine the cause of the error, re-run with '-v'.

[ssh:Fedora 32] [odo] See below for a list of failed events that occured more than 5 times during deployment:

[ssh:Fedora 32] [odo] 

[ssh:Fedora 32] [odo]  NAME                                      COUNT  REASON  MESSAGE                 

[ssh:Fedora 32] [odo] 

[ssh:Fedora 32] [odo]  gbfxmm-6b4df699f8-jvd7n.1676518e46a03792  5      Failed  Error: ImagePullBackOff 

[ssh:Fedora 32] [odo] 

[ssh:Fedora 32] [odo] 

This is the most frequent failure on PSI, tracked in #3256.
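
In case it helps whoever picks this up: the failing image and the exact pull error should be visible in the pod events. Something like the commands below should surface it (the pod name is taken from the log above; the namespace is a placeholder, since the test namespaces are generated per run):

$ oc describe pod gbfxmm-6b4df699f8-jvd7n -n <test-namespace>
$ oc get events -n <test-namespace> --field-selector reason=Failed --sort-by=.lastTimestamp   # only the Failed events, oldest first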

@prietyc123
Contributor Author

Pods are not coming up for the services on PSI

$ odo project create cmd-service-test463hgq -w -v4

Running odo with args [odo catalog list services]
[...]
[odo] Services available through Operators
[ssh:Fedora 32] [odo] NAME                                CRDs
[ssh:Fedora 32] [odo] etcdoperator.v0.9.4-clusterwide     EtcdCluster, EtcdBackup, EtcdRestore
[ssh:Fedora 32] [odo] service-binding-operator.v0.7.1     ServiceBinding

Running odo with args [odo create nodejs kkedbn]
[...]

Running odo with args [odo push]
[...]
[odo]  ✓  Changes successfully pushed to component

Running odo with args [odo catalog list services]
[...]
[ssh:Fedora 32] [odo] Services available through Operators
[ssh:Fedora 32] [odo] NAME                                CRDs
[ssh:Fedora 32] [odo] etcdoperator.v0.9.4-clusterwide     EtcdCluster, EtcdBackup, EtcdRestore
[ssh:Fedora 32] [odo] service-binding-operator.v0.7.1     ServiceBinding

Running odo with args [odo service create etcdoperator.v0.9.4-clusterwide/EtcdCluster --project cmd-service-test463hgq]
[...]
[odo]  ✓  Service "example" was created
[ssh:Fedora 32] [odo] You can now link the service to a component using 'odo link'; check 'odo link -h'

Running oc with args [oc get pods -n cmd-service-test463hgq]
[ssh:Fedora 32] [oc] NAME                   READY   STATUS     RESTARTS   AGE
[ssh:Fedora 32] [oc] example-c2k8kkn4hs     0/1     Init:0/1   0          0s
[ssh:Fedora 32] [oc] kkedbn-767745f-kxkv6   1/1     Running    0          <invalid>

[ssh:Fedora 32] Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]

[ssh:Fedora 32] [oc] "Pending"Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]

Since we don't see any failures on Prow jobs, this seems to be failing only on PSI. However, I need to debug it locally.

@prietyc123
Contributor Author

Pods are not coming up for the services on PSI
[...]
Since we don't see any failures on Prow jobs, this seems to be failing only on PSI. However, I need to debug it locally.

This particular scenario has been removed via #4650

@prietyc123
Contributor Author

prietyc123 commented Apr 27, 2021

Pods are not coming up for the services on PSI
[...]
Running oc with args [oc get pods -n cmd-service-test463hgq]
[ssh:Fedora 32] [oc] NAME                   READY   STATUS     RESTARTS   AGE
[ssh:Fedora 32] [oc] example-c2k8kkn4hs     0/1     Init:0/1   0          0s
[ssh:Fedora 32] [oc] kkedbn-767745f-kxkv6   1/1     Running    0          <invalid>

[ssh:Fedora 32] Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]

[ssh:Fedora 32] [oc] "Pending"Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]


Since we don't see any failures on Prow jobs, this seems to be failing only on PSI. However, I need to debug it locally.

I can see lots of failures in the Operator Hub tests due to pods stuck in Init:0/1 status.
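
To see which init container is stuck and why, the init container statuses can be dumped directly; a rough example using the pod and namespace from the log above:

$ oc describe pod example-c2k8kkn4hs -n cmd-service-test463hgq
$ oc get pod example-c2k8kkn4hs -n cmd-service-test463hgq -o jsonpath='{.status.initContainerStatuses[*].state}'   # waiting/running/terminated state of each init container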

@prietyc123
Contributor Author

odo push fails on the PSI cluster, possibly due to unavailability of the ReplicaSets

[odo push --context /tmp/387587333/projectDir]

[...]
[ssh:Fedora 32] [odo] I0426 08:56:00.285576  453879 deployments.go:141] Deployment Condition: {"type":"Available","status":"False","lastUpdateTime":"2021-04-26T08:55:37Z","lastTransitionTime":"2021-04-26T08:55:37Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."}

[ssh:Fedora 32] [odo] I0426 08:56:00.285604  453879 deployments.go:141] Deployment Condition: {"type":"Progressing","status":"True","lastUpdateTime":"2021-04-26T08:55:37Z","lastTransitionTime":"2021-04-26T08:55:37Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"wghnte-5666944cbb\" is progressing."}

[ssh:Fedora 32] [odo] I0426 08:56:00.285612  453879 deployments.go:152] Waiting for deployment "wghnte" rollout to finish: 0 of 1 updated replicas are available...

[ssh:Fedora 32] [odo] I0426 08:56:00.285620  453879 deployments.go:159] Waiting for deployment spec update to be observed...

[ssh:Fedora 32] [odo]  ✗  Waiting for component to start [5m]

[ssh:Fedora 32] [odo]  ✗  Failed to start component with name "wghnte". Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for wghnte deployment roll out

@kadel
Member

kadel commented Apr 27, 2021

odo push fails on the PSI cluster, possibly due to unavailability of the ReplicaSets

You need to check why this is happening. You should be able to see the reason in the Pod events.

Most often the Pod can't download images (or it is taking too long), or the PVC can't be created.
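
For example (the pod and namespace names are placeholders here, since each test run generates its own):

$ oc describe pod <failing-pod> -n <test-namespace>      # the Events section shows image pull or volume mount errors
$ oc get events -n <test-namespace> --sort-by=.lastTimestamp
$ oc get pvc -n <test-namespace>                         # check whether any PVC is stuck in Pending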

@prietyc123
Contributor Author

Running odo with args [odo create java-quarkus --starter --project e2e-devfile-test76tkd mvfcuw --context /tmp/862530431/projectDir]

[...]
[ssh:Fedora 32] [odo] I0426 09:01:01.653315  454013 util.go:422] path /tmp/862530431/projectDir/devfile.yaml doesn't exist, skipping it

[ssh:Fedora 32] [odo] I0426 09:01:01.653324  454013 preference.go:217] The path for preference file is /tmp/862530431/preference.yaml

[ssh:Fedora 32] [odo] I0426 09:01:01.653388  454013 util.go:748] Response will be cached in /tmp/odohttpcache for 1h0m0s
[ssh:Fedora 32] [odo] I0426 09:01:01.653582  454013 util.go:761] Cached response used.

[ssh:Fedora 32] [odo] Devfile Object Validation
[ssh:Fedora 32] [odo]  ✓  Checking devfile existence  ...
[...]
[ssh:Fedora 32] [odo]  ⚠  There are multiple projects in this devfile but none have been specified in --starter. Downloading the first: community

[ssh:Fedora 32] [odo]  ✗  Downloading starter project community from https://code.quarkus.io/d?e=io.quarkus%3Aquarkus-resteasy&e=io.quarkus%3Aquarkus-micrometer&e=io.quarkus%3Aquarkus-smallrye-health&e=io.quarkus%3Aquarkus-openshift&cn=devfile [1ms]

[ssh:Fedora 32] [odo]  ✗  Get "https://code.quarkus.io/d?e=io.quarkus%3Aquarkus-resteasy&e=io.quarkus%3Aquarkus-micrometer&e=io.quarkus%3Aquarkus-smallrye-health&e=io.quarkus%3Aquarkus-openshift&cn=devfile": dial tcp: lookup code.quarkus.io on 10.11.142.1:53: no such host

@prietyc123
Contributor Author

Issue #2877 has also been observed on PSI.

[ssh:Fedora 32] [odo] I0426 09:05:33.940279  455066 pods.go:67] Pod Conditions: {"type":"Initialized","status":"False","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z","reason":"ContainersNotInitialized","message":"containers with incomplete status: [copy-files-to-volume copy-supervisord]"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940288  455066 pods.go:67] Pod Conditions: {"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z","reason":"ContainersNotReady","message":"containers with unready status: [sb-jar-test-app]"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940301  455066 pods.go:67] Pod Conditions: {"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z","reason":"ContainersNotReady","message":"containers with unready status: [sb-jar-test-app]"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940308  455066 pods.go:67] Pod Conditions: {"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940466  455066 pods.go:72] Container Status: {"name":"sb-jar-test-app","state":{"waiting":{"reason":"PodInitializing"}},"lastState":{},"ready":false,"restartCount":0,"image":"image-registry.openshift-image-registry.svc:5000/openshift/java@sha256:13140221122ce8b52187914c889704b9e9d160062816ce036813075a4760745a","imageID":"","started":false}

[ssh:Fedora 32] [odo]  ✗  Waiting for component to start [4m]

[ssh:Fedora 32] [odo]  ✗  waited 4m0s but couldn't find running pod matching selector: 'deploymentconfig=sb-jar-test-app'

[ssh:Fedora 32] [odo] I0426 09:09:27.316161  455066 events.go:36] Quitting collect events
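
The init containers here are copy-files-to-volume and copy-supervisord (see the Pod condition above); their logs might show why initialisation hangs. A quick check, with the pod name and namespace as placeholders:

$ oc logs <pod-name> -c copy-files-to-volume -n <test-namespace>
$ oc logs <pod-name> -c copy-supervisord -n <test-namespace>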

@prietyc123
Contributor Author

[ssh:Fedora 32] Running oc with args [oc get pods -n cmd-service-test137ude]

[ssh:Fedora 32] [oc] NAME                                      READY   STATUS     RESTARTS   AGE

[ssh:Fedora 32] [oc] fedvky-cdkkv5gfw7                         0/1     Init:0/1   0          10s

[ssh:Fedora 32] [oc] nodejs-x854355369-onef-5bf7694948-kksw6   1/1     Running    0          2m19s

[ssh:Fedora 32] Running oc with args [oc get pods -n cmd-service-test137ude]

[ssh:Fedora 32] [oc] NAME                                      READY   STATUS     RESTARTS   AGE

[ssh:Fedora 32] [oc] fedvky-cdkkv5gfw7                         0/1     Init:0/1   0          <invalid>

[ssh:Fedora 32] [oc] nodejs-x854355369-onef-5bf7694948-kksw6   1/1     Running    0          65s

[ssh:Fedora 32] Running oc with args [oc get pods fedvky-cdkkv5gfw7 -o template="{{.status.phase}}" -n cmd-service-test137ude]

[ssh:Fedora 32] [oc] "Pending"Running oc with args [oc get pods fedvky-cdkkv5gfw7 -o template="{{.status.phase}}" -n cmd-service-test137ude]

@kadel
Member

kadel commented May 5, 2021

Can you please try adding this command to the test, after oc get pods is executed?

oc describe pods -n cmd-service-test137ude
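
Sorting the namespace events by time should also help correlate the Init:0/1 state with a concrete failure; for example:

$ oc get events -n cmd-service-test137ude --sort-by=.lastTimestamp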

@prietyc123
Contributor Author

prietyc123 commented May 7, 2021

[ssh:Fedora 32] Running oc with args [oc get pods -n cmd-service-test137ude]
[...]
[ssh:Fedora 32] [oc] fedvky-cdkkv5gfw7                         0/1     Init:0/1   0          <invalid>
[...]
[ssh:Fedora 32] [oc] "Pending"

The pod is stuck in the initialisation state due to ImagePullBackOff, and the Docker pull rate limit is causing it. Debugging in PR #4701.

Events:

[ssh:Fedora 32]   Type     Reason          Age        From               Message

[ssh:Fedora 32]   ----     ------          ----       ----               -------

[ssh:Fedora 32]   Normal   Scheduled       3m21s      default-scheduler  Successfully assigned cmd-service-test135edw/okamws-7s7kmsps7v to testocp47-b6pfr-worker-0-dfnxt

[ssh:Fedora 32]   Normal   AddedInterface  55s        multus             Add eth0 [<ip>/<port>]

[ssh:Fedora 32]   Normal   Pulling         54s        kubelet            Pulling image "busybox:1.28.0-glibc"

[ssh:Fedora 32]   Warning  Failed          <invalid>  kubelet            Failed to pull image "busybox:1.28.0-glibc": rpc error: 
code = Unknown desc = Error reading manifest 1.28.0-glibc in docker.io/library/busybox: 
toomanyrequests: You have reached your pull rate limit. You may increase 
the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

[ssh:Fedora 32]   Warning  Failed          <invalid>  kubelet            Error: ErrImagePull

[ssh:Fedora 32]   Normal   BackOff         <invalid>  kubelet            Back-off pulling image "busybox:1.28.0-glibc"

[ssh:Fedora 32]   Warning  Failed          <invalid>  kubelet            Error: ImagePullBackOff

[ssh:Fedora 32] 
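
To confirm that the PSI egress IP really is out of anonymous Docker Hub pulls, Docker's documented rate-limit check can be run from one of the workers (or any host sharing their egress). A rough sketch, assuming curl and jq are available there:

$ TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
$ curl -sI -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
# prints the ratelimit-limit and ratelimit-remaining headers for this source IP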

@prietyc123
Contributor Author

@kadel @dharmit Could you please share your thoughts on a possible solution for #4606 (comment)?

Failed to pull image "busybox:1.28.0-glibc": rpc error: 
code = Unknown desc = Error reading manifest 1.28.0-glibc in docker.io/library/busybox: 
toomanyrequests: You have reached your pull rate limit. You may increase 
the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

Note: This is specifically failing on PSI.
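
One option hinted at by the error message itself would be to pull as an authenticated Docker Hub user, e.g. via a pull secret linked in each test namespace. A rough sketch (the secret name and credentials are placeholders; we don't have such an account wired up today):

$ oc create secret docker-registry dockerhub-pull --docker-server=docker.io --docker-username=<user> --docker-password=<token> -n <test-namespace>
$ oc secrets link default dockerhub-pull --for=pull -n <test-namespace>

The kubelet would then pull images for pods in that namespace as the authenticated user, which lifts the anonymous rate limit.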

@prietyc123
Contributor Author

The java-openliberty devfile images are also failing due to the Docker rate limit issue.

Warning  Failed          80s (x3 over 4m2s)  kubelet, testocp47-b6pfr-worker-0-dfnxt  Failed to pull image "openliberty/application-stack:0.5": rpc error: code = Unknown desc = Error reading manifest 0.5 in docker.io/openliberty/application-stack: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
  Warning  Failed          80s (x3 over 4m2s)  kubelet, testocp47-b6pfr-worker-0-dfnxt  Error: ErrImagePull
  Normal   BackOff         54s (x4 over 4m1s)  kubelet, testocp47-b6pfr-worker-0-dfnxt  Back-off pulling image "openliberty/application-stack:0.5"

It seems we are testing the devfile images more than odo itself here, so IMO we can remove the tests for the images which use docker.io. Note: I am only talking about e2e-devfile-test.

@prietyc123
Contributor Author

The issues found so far are all PSI-related, not platform-specific, so I think we should change the title of the issue to better define the scope.

@prietyc123 prietyc123 changed the title Run pr test on linux psi Infra Resolve failures on PSI infra to unblock PR tests Run May 15, 2021
@prietyc123 prietyc123 added the area/infra Issues or PRs related to setting up or fixing things in infrastructure. Mostly CI infrastructure. label May 15, 2021
@prietyc123
Contributor Author

odo push fails due to #2877 and #3256 on PSI.

@prietyc123
Contributor Author

We have resolved the PSI failures from our end. The remaining problem is an infra issue, and the infra team is working on it. Unassigning myself from this issue.

@prietyc123 prietyc123 removed their assignment Jul 7, 2021
@anandrkskd
Contributor

Closing this issue as we are moving towards IBM Cloud OpenShift for tests.
/Close

@openshift-ci

openshift-ci bot commented Sep 29, 2021

@anandrkskd: Closing this issue.

In response to this:

Closing this issue as we are moving towards IBM Cloud OpenShift for tests.
/Close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Sep 29, 2021