Resolve failures on PSI infra to unblock PR tests Run #4606

Closed
2 tasks
prietyc123 opened this issue Apr 12, 2021 · 20 comments
Labels
area/infra Issues or PRs related to setting up or fixing things in infrastructure. Mostly CI infrastructure. area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering estimated-size/S (5-10) Rough sizing for Epics. Less then one sprint of work for one person kind/user-story An issue of user-story kind

Comments

@prietyc123
Contributor

/kind user-story
/area testing

User Story

As a user I want to run PR tests successfully on Linux PSI infra

Acceptance Criteria

  • Analyse the test failures on the OpenShift Linux PSI environment
  • Fix the issues and get a green job on PRs

Links

  • Feature Request: NA
  • Related issue: NA

/kind user-story

@openshift-ci-robot openshift-ci-robot added kind/user-story An issue of user-story kind area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering labels Apr 12, 2021
@kadel
Member

kadel commented Apr 13, 2021

As a user I want to run PR tests successfully on Linux PSI infra

Users don't care about PRs or tests; odo developers do.

But even odo developers should not need to run tests on PSI infra manually. This needs to be automated: tests should run automatically for every open PR.

@prietyc123
Contributor Author

prietyc123 commented Apr 14, 2021

This needs to be automated: tests should run automatically for every open PR.

Yes.

Scope of the issue:

  • Run tests on Jenkins for the Linux environment with PSI
  • Collect all the failures on the PSI Linux environment
  • Create related issues
  • Analyse the failures
  • Fix the issues one by one
  • Enable the job on PRs

@dharmit dharmit added the estimated-size/S (5-10) Rough sizing for Epics. Less then one sprint of work for one person label Apr 15, 2021
@prietyc123
Contributor Author

Issues faced on PSI:

✗  Waiting for component to start [5m] [WARNING x5: Failed]

[ssh:Fedora 32] [odo]  ✗  Failed to start component with name gbfxmm. Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for gbfxmm deployment roll out\nFor more information to help determine the cause of the error, re-run with '-v'.

[ssh:Fedora 32] [odo] See below for a list of failed events that occured more than 5 times during deployment:

[ssh:Fedora 32] [odo] 

[ssh:Fedora 32] [odo]  NAME                                      COUNT  REASON  MESSAGE                 

[ssh:Fedora 32] [odo] 

[ssh:Fedora 32] [odo]  gbfxmm-6b4df699f8-jvd7n.1676518e46a03792  5      Failed  Error: ImagePullBackOff 

[ssh:Fedora 32] [odo] 

[ssh:Fedora 32] [odo] 

This is the most frequent failure on PSI, tracked in #3256.
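
In case it helps whoever picks this up: the failing image and the exact pull error should be visible in the pod events. Something like the commands below should surface it (the pod name is taken from the log above; the namespace is a placeholder, since the test namespaces are generated per run):

$ oc describe pod gbfxmm-6b4df699f8-jvd7n -n <test-namespace>
$ oc get events -n <test-namespace> --field-selector reason=Failed --sort-by=.lastTimestamp   # only the Failed events, oldest first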

@prietyc123
Contributor Author

Pods are not coming up for the services on PSI

$ odo project create cmd-service-test463hgq -w -v4

Running odo with args [odo catalog list services]
[...]
[odo] Services available through Operators
[ssh:Fedora 32] [odo] NAME                                CRDs
[ssh:Fedora 32] [odo] etcdoperator.v0.9.4-clusterwide     EtcdCluster, EtcdBackup, EtcdRestore
[ssh:Fedora 32] [odo] service-binding-operator.v0.7.1     ServiceBinding

Running odo with args [odo create nodejs kkedbn]
[...]

Running odo with args [odo push]
[...]
[odo]  ✓  Changes successfully pushed to component

Running odo with args [odo catalog list services]
[...]
[ssh:Fedora 32] [odo] Services available through Operators
[ssh:Fedora 32] [odo] NAME                                CRDs
[ssh:Fedora 32] [odo] etcdoperator.v0.9.4-clusterwide     EtcdCluster, EtcdBackup, EtcdRestore
[ssh:Fedora 32] [odo] service-binding-operator.v0.7.1     ServiceBinding

Running odo with args [odo service create etcdoperator.v0.9.4-clusterwide/EtcdCluster --project cmd-service-test463hgq]
[...]
[odo]  ✓  Service "example" was created
[ssh:Fedora 32] [odo] You can now link the service to a component using 'odo link'; check 'odo link -h'

Running oc with args [oc get pods -n cmd-service-test463hgq]
[ssh:Fedora 32] [oc] NAME                   READY   STATUS     RESTARTS   AGE
[ssh:Fedora 32] [oc] example-c2k8kkn4hs     0/1     Init:0/1   0          0s
[ssh:Fedora 32] [oc] kkedbn-767745f-kxkv6   1/1     Running    0          <invalid>

[ssh:Fedora 32] Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]

[ssh:Fedora 32] [oc] "Pending"Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]

Since we don't see any failures on Prow jobs, this seems to be failing only on PSI. However, I need to debug it locally.

@prietyc123
Contributor Author

Pods are not coming up for the services on PSI
[...]
Since we don't see any failures on Prow jobs, this seems to be failing only on PSI. However, I need to debug it locally.

This particular scenario has been removed via #4650

@prietyc123
Contributor Author

prietyc123 commented Apr 27, 2021

Pods are not coming up for the services on PSI
[...]
Running oc with args [oc get pods -n cmd-service-test463hgq]
[ssh:Fedora 32] [oc] NAME                   READY   STATUS     RESTARTS   AGE
[ssh:Fedora 32] [oc] example-c2k8kkn4hs     0/1     Init:0/1   0          0s
[ssh:Fedora 32] [oc] kkedbn-767745f-kxkv6   1/1     Running    0          <invalid>

[ssh:Fedora 32] Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]

[ssh:Fedora 32] [oc] "Pending"Running oc with args [oc get pods example-c2k8kkn4hs -o template="{{.status.phase}}" -n cmd-service-test463hgq]


Since we don't see any failures on Prow jobs, this seems to be failing only on PSI. However, I need to debug it locally.

I can see lots of failures in the Operator Hub tests due to pods stuck in Init:0/1 status.
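
To see which init container is stuck and why, the init container statuses can be dumped directly; a rough example using the pod and namespace from the log above:

$ oc describe pod example-c2k8kkn4hs -n cmd-service-test463hgq
$ oc get pod example-c2k8kkn4hs -n cmd-service-test463hgq -o jsonpath='{.status.initContainerStatuses[*].state}'   # waiting/running/terminated state of each init container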

@prietyc123
Contributor Author

odo push fails on the PSI cluster, possibly due to unavailability of the ReplicaSets

[odo push --context /tmp/387587333/projectDir]

[...]
[ssh:Fedora 32] [odo] I0426 08:56:00.285576  453879 deployments.go:141] Deployment Condition: {"type":"Available","status":"False","lastUpdateTime":"2021-04-26T08:55:37Z","lastTransitionTime":"2021-04-26T08:55:37Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."}

[ssh:Fedora 32] [odo] I0426 08:56:00.285604  453879 deployments.go:141] Deployment Condition: {"type":"Progressing","status":"True","lastUpdateTime":"2021-04-26T08:55:37Z","lastTransitionTime":"2021-04-26T08:55:37Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"wghnte-5666944cbb\" is progressing."}

[ssh:Fedora 32] [odo] I0426 08:56:00.285612  453879 deployments.go:152] Waiting for deployment "wghnte" rollout to finish: 0 of 1 updated replicas are available...

[ssh:Fedora 32] [odo] I0426 08:56:00.285620  453879 deployments.go:159] Waiting for deployment spec update to be observed...

[ssh:Fedora 32] [odo]  ✗  Waiting for component to start [5m]

[ssh:Fedora 32] [odo]  ✗  Failed to start component with name "wghnte". Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for wghnte deployment roll out

@kadel
Member

kadel commented Apr 27, 2021

odo push fails on the PSI cluster, possibly due to unavailability of the ReplicaSets

You need to check why this is happening. You should be able to see the reason in the Pod events.

Most often the Pod can't download images (or it is taking too long), or the PVC can't be created.
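
For example (the pod and namespace names are placeholders here, since each test run generates its own):

$ oc describe pod <failing-pod> -n <test-namespace>      # the Events section shows image pull or volume mount errors
$ oc get events -n <test-namespace> --sort-by=.lastTimestamp
$ oc get pvc -n <test-namespace>                         # check whether any PVC is stuck in Pending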

@prietyc123
Contributor Author

Running odo with args [odo create java-quarkus --starter --project e2e-devfile-test76tkd mvfcuw --context /tmp/862530431/projectDir]

[...]
[ssh:Fedora 32] [odo] I0426 09:01:01.653315  454013 util.go:422] path /tmp/862530431/projectDir/devfile.yaml doesn't exist, skipping it

[ssh:Fedora 32] [odo] I0426 09:01:01.653324  454013 preference.go:217] The path for preference file is /tmp/862530431/preference.yaml

[ssh:Fedora 32] [odo] I0426 09:01:01.653388  454013 util.go:748] Response will be cached in /tmp/odohttpcache for 1h0m0s
[ssh:Fedora 32] [odo] I0426 09:01:01.653582  454013 util.go:761] Cached response used.

[ssh:Fedora 32] [odo] Devfile Object Validation
[ssh:Fedora 32] [odo]  ✓  Checking devfile existence  ...
[...]
[ssh:Fedora 32] [odo]  ⚠  There are multiple projects in this devfile but none have been specified in --starter. Downloading the first: community

[ssh:Fedora 32] [odo]  ✗  Downloading starter project community from https://code.quarkus.io/d?e=io.quarkus%3Aquarkus-resteasy&e=io.quarkus%3Aquarkus-micrometer&e=io.quarkus%3Aquarkus-smallrye-health&e=io.quarkus%3Aquarkus-openshift&cn=devfile [1ms]

[ssh:Fedora 32] [odo]  ✗  Get "https://code.quarkus.io/d?e=io.quarkus%3Aquarkus-resteasy&e=io.quarkus%3Aquarkus-micrometer&e=io.quarkus%3Aquarkus-smallrye-health&e=io.quarkus%3Aquarkus-openshift&cn=devfile": dial tcp: lookup code.quarkus.io on 10.11.142.1:53: no such host

@prietyc123
Contributor Author

Issue #2877 has also been observed on PSI.

[ssh:Fedora 32] [odo] I0426 09:05:33.940279  455066 pods.go:67] Pod Conditions: {"type":"Initialized","status":"False","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z","reason":"ContainersNotInitialized","message":"containers with incomplete status: [copy-files-to-volume copy-supervisord]"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940288  455066 pods.go:67] Pod Conditions: {"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z","reason":"ContainersNotReady","message":"containers with unready status: [sb-jar-test-app]"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940301  455066 pods.go:67] Pod Conditions: {"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z","reason":"ContainersNotReady","message":"containers with unready status: [sb-jar-test-app]"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940308  455066 pods.go:67] Pod Conditions: {"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-04-26T09:05:48Z"}

[ssh:Fedora 32] [odo] I0426 09:05:33.940466  455066 pods.go:72] Container Status: {"name":"sb-jar-test-app","state":{"waiting":{"reason":"PodInitializing"}},"lastState":{},"ready":false,"restartCount":0,"image":"image-registry.openshift-image-registry.svc:5000/openshift/java@sha256:13140221122ce8b52187914c889704b9e9d160062816ce036813075a4760745a","imageID":"","started":false}

[ssh:Fedora 32] [odo]  ✗  Waiting for component to start [4m]

[ssh:Fedora 32] [odo]  ✗  waited 4m0s but couldn't find running pod matching selector: 'deploymentconfig=sb-jar-test-app'

[ssh:Fedora 32] [odo] I0426 09:09:27.316161  455066 events.go:36] Quitting collect events
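
The init containers here are copy-files-to-volume and copy-supervisord (see the Pod condition above); their logs might show why initialisation hangs. A quick check, with the pod name and namespace as placeholders:

$ oc logs <pod-name> -c copy-files-to-volume -n <test-namespace>
$ oc logs <pod-name> -c copy-supervisord -n <test-namespace>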

@prietyc123
Contributor Author

[ssh:Fedora 32] Running oc with args [oc get pods -n cmd-service-test137ude]

[ssh:Fedora 32] [oc] NAME                                      READY   STATUS     RESTARTS   AGE

[ssh:Fedora 32] [oc] fedvky-cdkkv5gfw7                         0/1     Init:0/1   0          10s

[ssh:Fedora 32] [oc] nodejs-x854355369-onef-5bf7694948-kksw6   1/1     Running    0          2m19s

[ssh:Fedora 32] Running oc with args [oc get pods -n cmd-service-test137ude]

[ssh:Fedora 32] [oc] NAME                                      READY   STATUS     RESTARTS   AGE

[ssh:Fedora 32] [oc] fedvky-cdkkv5gfw7                         0/1     Init:0/1   0          <invalid>

[ssh:Fedora 32] [oc] nodejs-x854355369-onef-5bf7694948-kksw6   1/1     Running    0          65s

[ssh:Fedora 32] Running oc with args [oc get pods fedvky-cdkkv5gfw7 -o template="{{.status.phase}}" -n cmd-service-test137ude]

[ssh:Fedora 32] [oc] "Pending"Running oc with args [oc get pods fedvky-cdkkv5gfw7 -o template="{{.status.phase}}" -n cmd-service-test137ude]

@kadel
Member

kadel commented May 5, 2021

Can you please try adding this command to the test, after oc get pods is executed?

oc describe pods -n cmd-service-test137ude
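
Sorting the namespace events by time should also help correlate the Init:0/1 state with a concrete failure; for example:

$ oc get events -n cmd-service-test137ude --sort-by=.lastTimestamp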

@prietyc123
Contributor Author

prietyc123 commented May 7, 2021

[ssh:Fedora 32] Running oc with args [oc get pods -n cmd-service-test137ude]
[...]
[ssh:Fedora 32] [oc] fedvky-cdkkv5gfw7                         0/1     Init:0/1   0          <invalid>
[...]
[ssh:Fedora 32] [oc] "Pending"

The pod is stuck in the initialisation state due to ImagePullBackOff, and the Docker pull rate limit is causing it. Debugging in PR #4701.

Events:

[ssh:Fedora 32]   Type     Reason          Age        From               Message

[ssh:Fedora 32]   ----     ------          ----       ----               -------

[ssh:Fedora 32]   Normal   Scheduled       3m21s      default-scheduler  Successfully assigned cmd-service-test135edw/okamws-7s7kmsps7v to testocp47-b6pfr-worker-0-dfnxt

[ssh:Fedora 32]   Normal   AddedInterface  55s        multus             Add eth0 [<ip>/<port>]

[ssh:Fedora 32]   Normal   Pulling         54s        kubelet            Pulling image "busybox:1.28.0-glibc"

[ssh:Fedora 32]   Warning  Failed          <invalid>  kubelet            Failed to pull image "busybox:1.28.0-glibc": rpc error: 
code = Unknown desc = Error reading manifest 1.28.0-glibc in docker.io/library/busybox: 
toomanyrequests: You have reached your pull rate limit. You may increase 
the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

[ssh:Fedora 32]   Warning  Failed          <invalid>  kubelet            Error: ErrImagePull

[ssh:Fedora 32]   Normal   BackOff         <invalid>  kubelet            Back-off pulling image "busybox:1.28.0-glibc"

[ssh:Fedora 32]   Warning  Failed          <invalid>  kubelet            Error: ImagePullBackOff

[ssh:Fedora 32] 
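
To confirm that the PSI egress IP really is out of anonymous Docker Hub pulls, Docker's documented rate-limit check can be run from one of the workers (or any host sharing their egress). A rough sketch, assuming curl and jq are available there:

$ TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
$ curl -sI -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
# prints the ratelimit-limit and ratelimit-remaining headers for this source IP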

@prietyc123
Contributor Author

@kadel @dharmit Could you please share your thoughts on a possible solution for #4606 (comment)?

Failed to pull image "busybox:1.28.0-glibc": rpc error: 
code = Unknown desc = Error reading manifest 1.28.0-glibc in docker.io/library/busybox: 
toomanyrequests: You have reached your pull rate limit. You may increase 
the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

Note: This is specifically failing on PSI.
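
One option hinted at by the error message itself would be to pull as an authenticated Docker Hub user, e.g. via a pull secret linked in each test namespace. A rough sketch (the secret name and credentials are placeholders; we don't have such an account wired up today):

$ oc create secret docker-registry dockerhub-pull --docker-server=docker.io --docker-username=<user> --docker-password=<token> -n <test-namespace>
$ oc secrets link default dockerhub-pull --for=pull -n <test-namespace>

The kubelet would then pull images for pods in that namespace as the authenticated user, which lifts the anonymous rate limit.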

@prietyc123
Contributor Author

The java-openliberty devfile images are also failing due to the Docker rate limit issue.

Warning  Failed          80s (x3 over 4m2s)  kubelet, testocp47-b6pfr-worker-0-dfnxt  Failed to pull image "openliberty/application-stack:0.5": rpc error: code = Unknown desc = Error reading manifest 0.5 in docker.io/openliberty/application-stack: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
  Warning  Failed          80s (x3 over 4m2s)  kubelet, testocp47-b6pfr-worker-0-dfnxt  Error: ErrImagePull
  Normal   BackOff         54s (x4 over 4m1s)  kubelet, testocp47-b6pfr-worker-0-dfnxt  Back-off pulling image "openliberty/application-stack:0.5"

It seems we are testing the devfile images more than odo itself here, so IMO we can remove the tests for the images which use docker.io. Note: I am only talking about e2e-devfile-test.

@prietyc123
Contributor Author

The issues found so far are all PSI-related, not platform-specific, so I think we should change the title of the issue to better define the scope.

@prietyc123 prietyc123 changed the title Run pr test on linux psi Infra Resolve failures on PSI infra to unblock PR tests Run May 15, 2021
@prietyc123 prietyc123 added the area/infra Issues or PRs related to setting up or fixing things in infrastructure. Mostly CI infrastructure. label May 15, 2021
@prietyc123
Contributor Author

odo push fails due to #2877 and #3256 on PSI.

@prietyc123
Contributor Author

We have resolved the PSI failures from our end. The remaining problem is an infra issue, and the infra team is working on it. Unassigning myself from this issue.

@prietyc123 prietyc123 removed their assignment Jul 7, 2021
@anandrkskd
Contributor

Closing this issue as we are moving towards IBM Cloud OpenShift for tests.
/Close

@openshift-ci

openshift-ci bot commented Sep 29, 2021

@anandrkskd: Closing this issue.

In response to this:

Closing this issue as we are moving towards IBM Cloud OpenShift for tests.
/Close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Sep 29, 2021