Step using Kaniko build does not wait for PVC to be bound #403

Closed
afrittoli opened this issue Jan 17, 2019 · 13 comments
Labels: design, help wanted, kind/bug, meaty-juicy-coding-work

Comments

@afrittoli (Member)

Expected Behavior

The container associated with the step should be retried (up to a limit) until the PVC is bound, and it should succeed once the PVC is ready.

Actual Behavior

Kubernetes attempted to schedule the kaniko container several times, but it always failed, even after the PVC became available.

Running describe on the pod:

Name:               source-to-image-health-api-pod-3203a3
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               10.160.228.217/10.160.228.217
Start Time:         Thu, 17 Jan 2019 18:40:40 +0000
Labels:             build.knative.dev/buildName=source-to-image-health-api
Annotations:        kubernetes.io/psp: ibm-privileged-psp
                    sidecar.istio.io/inject: false
Status:             Failed
IP:                 172.30.56.70
Controlled By:      TaskRun/source-to-image-health-api
Init Containers:
  build-step-credential-initializer:

(...)

  build-step-build-and-push:
    Container ID:  containerd://825316d0575e5fb9d2925676ab804668b053d8c62576cec2e4330fc35e9ab7e8
    Image:         gcr.io/kaniko-project/executor
    Image ID:      gcr.io/kaniko-project/executor@sha256:a3e1a4ac0fc9625ce0eb2f74094e4e801cb1164c93a1dd6b8d2306f6efee7d9a
    Port:          <none>
    Host Port:     <none>
    Command:
      /tools/entrypoint
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 17 Jan 2019 18:40:45 +0000
      Finished:     Thu, 17 Jan 2019 18:45:04 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      HOME:                /builder/home
      ENTRYPOINT_OPTIONS:  {"args":["/kaniko/executor","--dockerfile=Dockerfile","--destination=registry.eu-gb.bluemix.net/andreaf/health-api","--context=/workspace/images/api"],"process_log":"/tools/process-log.txt","marker_file":"/tools/marker-file.txt"}
    Mounts:
      /builder/home from home (rw)
      /tools from tools (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lt726 (ro)
      /workspace from workspace (rw)
Containers:
  nop:
    Container ID:
    Image:          registry.ng.bluemix.net/knative/nop-8138dea5c7f2e77549dee5f965401dd9@sha256:d0b91e82083d283451d8ddc317f49178274d796f7c4b23e10139b53addca6f3f
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lt726 (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tools:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  source-to-image-health-api
    ReadOnly:   false
  workspace:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  home:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-lt726:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lt726
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From                     Message
  ----     ------            ----                ----                     -------
  Warning  FailedScheduling  11m (x25 over 12m)  default-scheduler        pod has unbound immediate PersistentVolumeClaims (repeated 3 times)
  Normal   Pulled            10m                 kubelet, 10.160.228.217  Container image "registry.ng.bluemix.net/knative/creds-init-4d3f1e062aee819de054755415941ee3@sha256:3e9b57c23fbdc3d1c5340423c93f4b667ac4127313d125544df57d08e27b16a6" already present on machine
  Normal   Created           10m                 kubelet, 10.160.228.217  Created container
  Normal   Started           10m                 kubelet, 10.160.228.217  Started container
  Normal   Pulled            10m                 kubelet, 10.160.228.217  Container image "registry.ng.bluemix.net/knative/git-init-afd2a379df7ac007f1e3a5fc75688a50@sha256:205a4d5563c616e6a351831deb9468f4fe5c1ef86333a88470e8b97b1c695678" already present on machine
  Normal   Created           10m                 kubelet, 10.160.228.217  Created container
  Normal   Started           10m                 kubelet, 10.160.228.217  Started container
  Normal   Pulled            10m                 kubelet, 10.160.228.217  Container image "gcr.io/k8s-prow/entrypoint@sha256:7c7cd8906ce4982ffee326218e9fc75da2d4896d53cabc9833b9cc8d2d6b2b8f" already present on machine
  Normal   Created           10m                 kubelet, 10.160.228.217  Created container
  Normal   Started           10m                 kubelet, 10.160.228.217  Started container
  Normal   Pulling           10m                 kubelet, 10.160.228.217  pulling image "gcr.io/kaniko-project/executor"
  Normal   Pulled            10m                 kubelet, 10.160.228.217  Successfully pulled image "gcr.io/kaniko-project/executor"
  Normal   Created           10m                 kubelet, 10.160.228.217  Created container
  Normal   Started           10m                 kubelet, 10.160.228.217  Started container

Running describe on the PVC:

Name:          source-to-image-health-api
Namespace:     default
StorageClass:  ibmc-file-bronze
Status:        Bound
Volume:        pvc-2a76fc3a-1a87-11e9-b6b7-8e54aedb0e1e
Labels:        region=us-south
               zone=sjc03
Annotations:   control-plane.alpha.kubernetes.io/leader:
                 {"holderIdentity":"15034e90-1a80-11e9-b015-f24ed2744150","leaseDurationSeconds":2100,"acquireTime":"2019-01-17T18:39:04Z","renewTime":"201...
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: ibm.io/ibmc-file
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
Events:        <none>
Mounted By:    source-to-image-health-api-pod-3203a3

Steps to Reproduce the Problem

  1. Define a Task and TaskRun following the tutorial in https://github.com/knative/build-pipeline/blob/master/docs/tutorial.md

  2. kubectl apply -f taskrun.yaml

Task:

apiVersion: pipeline.knative.dev/v1alpha1
kind: Task
metadata:
  name: source-to-image
spec:
  inputs:
    resources:
      - name: workspace
        type: git
    params:
      - name: pathToDockerFile
        description: The path to the dockerfile to build (relative to the context)
        default: Dockerfile
      - name: pathToContext
        description:
          The path to the build context, used by Kaniko - within the workspace
          (https://github.com/GoogleContainerTools/kaniko#kaniko-build-contexts).
          The git clone directory is set by the git-init container, which sets
          up the git input resource - see https://github.com/knative/build-pipeline/blob/master/pkg/reconciler/v1alpha1/taskrun/resources/pod.go#L107
        default: .
  outputs:
    resources:
      - name: builtImage
        type: image
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor
      command:
        - /kaniko/executor
      args:
        - --dockerfile=${inputs.params.pathToDockerFile}
        - --destination=${outputs.resources.builtImage.url}
        - --context=/workspace/${inputs.params.pathToContext}

TaskRun:

apiVersion: pipeline.knative.dev/v1alpha1
kind: TaskRun
metadata:
  name: source-to-image-health-api
spec:
  taskRef:
    name: source-to-image
  trigger:
    type: manual
  inputs:
    resources:
      - name: workspace
        resourceRef:
          name: health-helm-git
    params:
      - name: pathToContext
        value: images/api
  outputs:
    resources:
      - name: builtImage
        resourceRef:
          name: health-api-image

Additional Info

Kubernetes 1.12.3_1531
Knative build pipeline installed from master using ko: HEAD 41d513b
Using IKS and IBM Cloud Registry (the service account kaniko-build-controller is extended with the imagePullSecret to use the registry)

@afrittoli (Member, Author)

Service account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: build-pipeline-controller
  namespace: knative-build-pipeline
imagePullSecrets:
- name: bluemix-knative-build-pipeline-secret-regional
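
For reference, attaching the secret can also be done in place with kubectl patch - a minimal sketch, assuming the secret already exists in the knative-build-pipeline namespace (note that this kind of merge patch replaces the whole imagePullSecrets list, so include any secrets already on the account):

# hypothetical one-liner; the service account and secret names are the ones from the YAML above
kubectl patch serviceaccount build-pipeline-controller \
  --namespace knative-build-pipeline \
  --patch '{"imagePullSecrets": [{"name": "bluemix-knative-build-pipeline-secret-regional"}]}'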

afrittoli added a commit to afrittoli/health-helm that referenced this issue Jan 18, 2019
Add Pipeline and PipelineRun definition.
Since TaskRuns do not work because of tektoncd/pipeline#403,
I added a PVC YAML to pre-provision the PVC. As long as it has the
right name, the TaskRun picks it up, which is a decent workaround.
When the TaskRun is deleted, the PVC stays there too.

NOTE: This does not work as it is! Main open issues:
- kaniko is missing the credentials to push images
- the pipeline run does nothing
@afrittoli (Member, Author)

Pre-provisioning a PVC with the same name Knative would set up is a workaround for this issue.
With the PVC in place I get a working TaskRun.
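
For reference, a minimal sketch of such a pre-provisioned PVC - the name comes from the describe output above, while the storage class and requested size are assumptions based on the same output:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: source-to-image-health-api    # must match the name the TaskRun expects
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ibmc-file-bronze  # assumption: the IKS file storage class seen above
  resources:
    requests:
      storage: 20Gi                   # assumption: sized to match the bound volume above

Then wait until the claim reports Bound before creating the TaskRun:

kubectl get pvc source-to-image-health-api --namespace default -o jsonpath='{.status.phase}'
# repeat (or watch with: kubectl get pvc source-to-image-health-api -w) until the phase is Bound, then:
kubectl apply -f taskrun.yaml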

@afrittoli (Member, Author)

I could reproduce the same issue using a PipelineRun instead of a TaskRun.

@bobcatfish (Collaborator)

Thanks for all the detail, @afrittoli!! I wonder if we are starting to see differences in how the different cloud providers treat PVCs 🤔 AFAIK we don't see this behavior on GKE.

Pre-provisioning a PVC with the same name Knative would set up is a workaround for this issue.

I wonder if we could change the controller logic to wait for PVCs to be up and available before attempting to schedule the TaskRun pod

(In the long run, I'm thinking in one of our next sprints we need to get serious about testing on other cloud providers...)

@bobcatfish (Collaborator)

I wonder if we could change the controller logic to wait for PVCs to be up and available before attempting to schedule the TaskRun pod

Even though this doesn't seem to be the kubernetes way 🤔

@afrittoli a couple of follow-up questions:

  1. How long does it take for the PVC to become available?
  2. Any idea why the pod would start executing before the PVC is available? (This is definitely not the behavior I would expect.)

@bobcatfish added the kind/bug, help wanted, design, and meaty-juicy-coding-work labels on Jan 18, 2019
@afrittoli (Member, Author)

The PVC usually takes a little longer than 60s to be ready; I can get better numbers if you need them.

I wonder if this might be an issue on the k8s side. I took the latest version available in IKS; I could try with a cluster running an older version.

@afrittoli (Member, Author)

I tried with a cluster running k8s v1.10.11_1536 and didn't hit this issue there, so it might be a bug on the k8s side, or some behaviour that changed on the k8s side.

@afrittoli (Member, Author)

Looking at the change logs, there definitely have been changes in the PVC attach/detach logic in 1.11 and 1.12: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md

For instance: kubernetes/kubernetes#66863 might have an impact on knative.
Which version of k8s is used when testing on google cloud?

@bobcatfish (Collaborator)

Whoa, interesting! Thanks for tracking this down @afrittoli - I'm a bit confused by kubernetes/kubernetes#66863; it's not clear to me why we'd want to start running pods that need resources before those resources are actually usable 🤔 But that's neither here nor there - it looks like we're going to need to update our logic like you described regardless!

Which version of k8s is used when testing on google cloud?

It looks like it's 1.11.6 right now, from what I can tell in the logs:

I0122 00:37:18.358] 2019/01/22 00:37:18 process.go:153: Running: gcloud container clusters create --quiet --enable-autoscaling --min-nodes=1 --max-nodes=3 --scopes=cloud-platform --enable-basic-auth --no-issue-client-certificate --project=knative-boskos-20 --region=us-central1 --machine-type=n1-standard-4 --image-type=cos --num-nodes=1 --network=kbuild-pipeline-e2e-1087508932722692098 --cluster-version=1.11.6 kbuild-pipeline-e2e-1087508932722692098

@vdemeester (Member)

@afrittoli @bobcatfish is this still an issue?

@afrittoli (Member, Author)

@vdemeester I'm not sure; I stopped using the PVC in favour of the bucket a long time ago :P
I might not have that version of k8s at hand anymore, but I can try to reproduce on later versions of k8s.
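
For context, the bucket-based storage mentioned above is configured through the controller's artifact-bucket ConfigMap; a rough sketch follows, with the ConfigMap name and keys recalled from the Tekton install docs (the bucket URL and secret names are hypothetical, so adjust for your setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-artifact-bucket        # assumption: name per the Tekton install docs
  namespace: tekton-pipelines         # or knative-build-pipeline on older installs
data:
  location: gs://my-artifact-bucket                      # hypothetical bucket
  bucket.service.account.secret.name: my-bucket-creds    # hypothetical secret holding the service account key
  bucket.service.account.secret.key: service_account.json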

@bobcatfish (Collaborator)

I'm really surprised this isn't happening more often :O

@bobcatfish (Collaborator)

I haven't heard of this bothering anyone since it was originally opened, so I'm going to close it for now.
