
Reference the k8s provisioner instead of the outdated in-tree provisioner #2522

Merged

merged 1 commit into rook:master from reference-k8s-provisioner on Feb 8, 2019

Conversation

@travisn (Member) commented Jan 18, 2019

Signed-off-by: travisn <tnielsen@redhat.com>

Description of your changes:
The k8s external provisioner had been committed in-tree with rook due to dependency issues. That copy is outdated and causing issues such as hanging PVC requests. Now we reference the external provisioner code directly, so there is no longer a need to keep a copy in-tree. With this change we can create many PVCs without hitting a hang.
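For context, here is a minimal sketch of what consuming the library looks like, as opposed to carrying a copied-in controller. The package path, the NewProvisionController arguments, the Provisioner interface, and the provisioner name are assumptions based on the kubernetes-incubator/external-storage lib/controller API of this era, not a copy of Rook's actual code:

```go
// Hedged sketch, not Rook's actual code: wiring a provisioner against the
// external-storage controller library instead of a copied-in controller.
// Package path, constructor arguments, interface, and provisioner name are
// assumptions based on the lib/controller API of this era.
package main

import (
	"errors"
	"log"

	"github.com/kubernetes-incubator/external-storage/lib/controller"
	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// blockProvisioner stands in for Rook's flexvolume-backed provisioner.
type blockProvisioner struct{}

// Provision would create the backing storage and return the PV spec for a
// pending claim; stubbed here.
func (p *blockProvisioner) Provision(options controller.VolumeOptions) (*v1.PersistentVolume, error) {
	return nil, errors.New("not implemented in this sketch")
}

// Delete would remove the backing storage for a released PV; stubbed here.
func (p *blockProvisioner) Delete(volume *v1.PersistentVolume) error {
	return errors.New("not implemented in this sketch")
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	serverVersion, err := clientset.Discovery().ServerVersion()
	if err != nil {
		log.Fatal(err)
	}

	// The library owns the locking, retries, and PV lifecycle; the operator
	// only supplies the Provision/Delete implementation.
	pc := controller.NewProvisionController(clientset, "ceph.rook.io/block", &blockProvisioner{}, serverVersion.GitVersion)
	pc.Run(wait.NeverStop)
}
```

The point is that the locking, retry, and PV-lifecycle logic lives in the library, so fixes from the k8s storage community (like the hang addressed here) arrive through a dependency bump rather than another in-tree copy.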

The Rook go dependencies now include K8s v12.0 and the external provisioner v1.0, which were the most recent tags I could find that work together. The K8s v13.0 and provisioner v2.1 tags did compile, but at runtime there was an issue with conflicting flags in glog. It looks like glog has been removed from all k8s code in favor of the package k8s.io/klog, but the external provisioner has not yet been updated to shed glog. We will need that update before we can move to the latest versions.
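The runtime conflict itself is the standard Go flag package refusing duplicate registration on the same flag set. A contrived sketch (not Rook code; flag name illustrative) of the failure mode when two logging packages both try to own the same global flag:

```go
// Contrived demonstration of a glog-style flag conflict: two dependencies
// registering the same flag name on the global flag.CommandLine set cause a
// panic at startup ("flag redefined: logtostderr").
package main

import "flag"

func main() {
	// The first logging dependency registers its flags.
	flag.Bool("logtostderr", false, "registered by the first logging library")

	// A second dependency registering the same name panics immediately.
	flag.Bool("logtostderr", false, "registered again by a second logging library")

	flag.Parse()
}
```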

Which issue is resolved by this Pull Request:
Resolves #2487 #2007

Checklist:

  • Documentation has been updated, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.
  • Comments have been added or updated based on the standards set in CONTRIBUTING.md

@travisn force-pushed the reference-k8s-provisioner branch 3 times, most recently from 43fb038 to 6a4b19e on January 19, 2019 13:49
Reference the k8s provisioner instead of the outdated in-tree provisioner

Signed-off-by: travisn <tnielsen@redhat.com>
@jbw976 (Member) left a comment

Thanks for this Travis! It'll be great to get the benefits of the stability efforts on the provisioner from the k8s storage community.

My biggest concerns would be around compatibility:

  • were you able to test the scenario where we have existing PVs/PVCs created and then upgrade to a Rook build with this newer provisioner? do the existing PVs/PVCs remain OK?
  • were you able to diff the code from the updated provisioner vs the old outdated provisioner to see what functional changes were made? maybe look at the commits made in the new provisioner repository to see what changes have been made for clues about anything that may not be compatible? (there could be some commits in the old tree also that were made before the code was moved to its own repo)

@@ -66,6 +66,7 @@ rules:
# PVs and PVCs are managed by the Rook provisioner
- persistentvolumes
- persistentvolumeclaims
- endpoints
@jbw976 (Member)

is endpoints something that the updated external provisioner code is using?

@travisn (Member Author)

This was required with the change to the v12 client
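For readers wondering what the endpoints permission is typically used for in this era: one common consumer is an endpoints-based leader-election lock, where the lock is stored as annotations on an Endpoints object. Whether that is the exact code path that required the rule here is an assumption; the sketch below uses client-go's leaderelection and resourcelock packages with v12-era signatures, and the namespace, lock name, and identity are illustrative:

```go
// Hedged sketch (not confirmed to be the provisioner library's code path):
// endpoints-based leader election is one common reason code built against
// the v12-era client needs RBAC access to endpoints, since the lock lives as
// annotations on an Endpoints object. Names are illustrative; error handling
// is elided for brevity.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	clientset, _ := kubernetes.NewForConfig(cfg)

	// The lock object lives in the operator's namespace; the RBAC rule above
	// is what allows reading and updating it.
	lock := &resourcelock.EndpointsLock{
		EndpointsMeta: metav1.ObjectMeta{Namespace: "rook-ceph-system", Name: "rook-provisioner-leader"},
		Client:        clientset.CoreV1(),
		LockConfig:    resourcelock.ResourceLockConfig{Identity: "operator-pod-name"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* begin provisioning work */ },
			OnStoppedLeading: func() { /* lost the lock; stop work */ },
		},
	})
}
```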

@humblec (Contributor) commented Jan 22, 2019

> The K8s v13.0 tag and the provisioner v2.1 tags did compile. However, at runtime there was an issue with conflicting flags in glog. It looks like glog has been removed from all k8s code in favor of the package k8s.io/klog

@travisn This is true.

> It looks like glog has been removed from all k8s code in favor of the package k8s.io/klog. However, the external provisioner has not yet been updated to shed glog. We will need that update before we can get on the latest version.

It's happening, and the PR is already out: kubernetes-retired/external-storage#1106 :)

@travisn (Member Author) left a comment

@jbw976 the upgrade integration test does succeed: previously provisioned mounts continue to work after the upgrade. I don't see the upgrade as being an issue so far, but I'll dig into the diff and see what else might impact us. So far it's been much more stable in testing than what we had.

@travisn (Member Author) commented Feb 5, 2019

@humblec Is the klog change also in progress for https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner? I see the change has merged in kubernetes-retired/external-storage#1106, though rook is not referencing that one.

@travisn (Member Author) commented Feb 5, 2019

@jbw976 After looking through the external provisioner library, here are my observations:

  • A full diff between the old and new provisioner is 1750 lines changed.
  • The diff showed a lot of changes around locks, timing, and creation of PVCs, which would account for the significant improvement we've seen while testing in a cluster where we pushed to 2K PVCs.
  • I did not see incompatible changes, but there was a lot of change to dig through. At least no API changes were apparent.
  • The upgrade test does check that a PVC continues to work after upgrade.

@travisn mentioned this pull request on Feb 5, 2019
@travisn (Member Author) commented Feb 6, 2019

@jbw976 I also tested that failover and fencing are working with these changes. When a pod is failed over to another node, I see messages such as the following in the agent log until the old pod releases the lock:

2019-02-06 20:18:05.767231 I | flexvolume: volume attachment record rook-ceph-system/pvc-bd2fe8bc-2a4b-11e9-b15a-08002709ffb0 exists for pod: default/wordpress-mysql-6887bf844f-xnf52
2019-02-06 20:18:05.773319 E | flexdriver: Attach volume replicapool/pvc-bd2fe8bc-2a4b-11e9-b15a-08002709ffb0 failed: failed to attach volume pvc-bd2fe8bc-2a4b-11e9-b15a-08002709ffb0 for pod default/wordpress-mysql-6887bf844f-g2hr8. Volume is already attached by pod default/wordpress-mysql-6887bf844f-xnf52. Status Running

After the old pod releases the lock, the new pod comes up fine.

@jbw976 (Member) commented Feb 7, 2019

@travisn RE: those "Volume is already attached" messages, how long did you see them? Are we talking seconds or minutes? Any difference from the failover scenario without these changes?

@travisn (Member Author) commented Feb 7, 2019

@jbw976 No differences that I could see. I tried a couple of scenarios: in one it took many minutes for the old pod to terminate, in another only a matter of seconds. Both times, I saw the "already attached" message in the log until the lock was freed, and then the new pod started up. I also hit #1507, where we need to determine how to handle fencing when a node is completely lost, but that's a separate issue from this change.

@jbw976 (Member) left a comment

Awesome, thank you for your thoroughness in vetting this change and verifying there are no breaking changes. Great work @travisn! 🎉

@travisn travisn merged commit 8a2001c into rook:master Feb 8, 2019
@travisn travisn deleted the reference-k8s-provisioner branch February 8, 2019 02:49
Successfully merging this pull request may close these issues.

Quickly creating many PVCs gets stuck after binding only a few PVCs