Helmify the monitoring #723
Conversation
Due to the large number of files included in this PR, here is an explanation of what is going on here:
1. All the infrastructure is code.
2. Kubernetes is upgraded to 1.17.
3. Automatic cluster monitoring.
4. Automatic plan dashboards.
5. Not really automatic plan downloads.
There are a couple of dashboards with missing queries -- they didn't survive the upgrade. I'll save those for the demo tomorrow so I can show how dashboard import and export works.
-f ./sidecar.yaml
echo "Install Redis..."
Not sure why Redis is removed here?
@coryschwartz I am not sure why we are moving all community-supported helm charts into our testground repo. We don't have the resources to maintain all these helm charts or to fork them. Are we modifying them? Why not use the community-provided helm charts directly, rather than copying them into our repo? Overall I like that we are getting rid of
This was just for version control, so we can roll back to previous images, etc. if something is found not to be working. I don't have a strong preference for placing them in a single chart like this, and it would work fine just to replace the values.

There is some benefit to having them in a single chart -- you can describe the dependency tree. For example, the serviceMonitor is a CRD created by the prometheus operator, and the weave network serviceMonitor would fail if the operator were not installed first. I haven't added weave to the testground-infra, so the dependency is implicit and might be confusing. I would also move the sidecar and weave into the chart, so that the whole infra is described.

With regard to the additional chore for the updates, I agree with you; that is the downside of this approach. The way I created these was just by running
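That dependency tree can be written down explicitly in an umbrella chart's Chart.yaml (helm 3 style). A sketch, with chart names, versions, and repository URLs that are illustrative only, not the PR's actual pins:

```yaml
# Hypothetical Chart.yaml for an umbrella testground-infra chart.
# Names, versions, and repository URLs are assumptions for illustration.
apiVersion: v2
name: testground-infra
version: 0.1.0
dependencies:
  - name: prometheus-operator
    version: "8.x.x"
    repository: "https://kubernetes-charts.storage.googleapis.com"
  - name: redis
    version: "10.x.x"
    repository: "https://charts.bitnami.com/bitnami"
```

With this layout, `helm dependency update` pulls the subcharts, and install ordering concerns like "the operator must exist before its serviceMonitor CRDs are used" live in one place.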
I think this is more reviewable now that I've removed the huge number of sub-charts. In its place, I added a small bash script
@@ -41,8 +41,8 @@
         "styles": null
       },
       {
-        "datasource": "prometheus",
         "editable": false,
+        "datasource": "Prometheus",
When I initially uploaded this to the bitnami grafana, it couldn't find the datasource "prometheus", so I went through and capitalized these.
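The name a dashboard references has to match the provisioned datasource exactly, and the match is case-sensitive. A sketch of a Grafana datasource provisioning file where the name would be `Prometheus` (the in-cluster URL is an assumption, not necessarily this chart's actual service address):

```yaml
# Hypothetical Grafana datasource provisioning file.
# The URL below is an assumed in-cluster Prometheus address.
apiVersion: 1
datasources:
  - name: Prometheus      # must match the "datasource" fields in dashboards
    type: prometheus
    access: proxy
    url: http://prometheus-operated.monitoring.svc:9090
```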
"colorScheme": "interpolateOranges",
"exponent": 0.5,
"mode": "spectrum"
"CustomPanel": {
These changes were generated with the dashboard importing tool, just uploading them to the new grafana and downloading them again.
@@ -4,7 +4,7 @@ metadata:
   name: weave-net
   labels:
     k8s-app: weave-net
-  namespace: monitoring
+  namespace: default
Didn't re-create the monitoring namespace.
@@ -0,0 +1,8 @@
+apiVersion: v1
Creating a configmap containing all of the dashboards in /dashboards.

Something to think about here is that there is a maximum size a configmap can have, which is limited by etcd; we probably cannot grow this configmap beyond 1MB.

At present, the total size of all dashboards is about 80k, so we are still well away from this limit. If we do get close to the 1MB limit, we will have to change this template a little: rather than one configmap with all dashboards, we would create one configmap per dashboard, so that the 1MB limit applies to each dashboard individually. As long as they all have the appropriate label, this works fine.
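If that split ever becomes necessary, a per-dashboard configmap could be generated with a Helm template along these lines. This is a sketch: the `grafana_dashboard` label and the `dashboards/` path are assumptions based on common sidecar-discovery setups, not necessarily this chart's exact names.

```yaml
# Hypothetical sketch: one ConfigMap per dashboard file, so each
# dashboard gets its own ~1MB etcd budget instead of sharing one.
{{- range $path, $bytes := .Files.Glob "dashboards/*.json" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-{{ base $path | trimSuffix ".json" | lower }}
  labels:
    grafana_dashboard: "1"   # label the dashboard sidecar watches for (assumed)
data:
  {{ base $path }}: |-
{{ $bytes | toString | indent 4 }}
{{- end }}
```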
pkg/runner/cluster_k8s.go
@@ -342,14 +342,16 @@ func (c *ClusterK8sRunner) healthcheckRedis() (redisCheck api.HealthcheckItem) {
 		redisCheck.Message = err.Error()
 		return
 	}
-	if len(pods.Items) != 1 {
-		redisCheck.Message = fmt.Sprintf("expected 1 redis pod. found %d.", len(pods.Items))
+	// one master, and slaves.
For simplicity's sake, we should probably run just one Redis master, without slaves, for the time being. Having seen benchmarks together with @raulk, we don't seem to be anywhere near the scale where we need multiple Redis containers, and I think this would just increase complexity for now, considering that we expect strong consistency from the sync service.
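A minimal values override for a single-master deployment might look like this. The key names assume the bitnami/redis chart of that era (with redis as a subchart of the umbrella chart); they should be checked against the chart version actually pinned here:

```yaml
# Hypothetical values override for a single Redis master, no slaves.
# Key names assume the bitnami/redis chart (~v10); verify against the
# version actually used in testground-infra.
redis:
  cluster:
    enabled: false     # one master only; no replicas to reconcile
  usePassword: false   # sync service connects without auth (assumed)
```

With a single master there is exactly one pod to health-check, which also keeps the `healthcheckRedis` logic above simple.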
This is failing on my machine with:
I think we are missing a call to
for repo in "${thirdparty[@]}"
do
  helm pull "$repo" --untar --untardir ./testground-infra/charts/
We are mixing our source code with something we pull here, which will result in a not-clean repo. We should probably pull these into a gitignored directory.
My intention was to eventually add these to source control so we have the ability to roll back in the event of a bad update. I hadn't intended to run update_helm_thirdparty.sh during normal installation, just when we want to pull in updates to those charts.
After manually adding the

The environment variable is set correctly for the container:
Make sure that Redis is one node (so that we have strong consistency) and that network/ping-pong continues to work.
Also make sure that ./install.sh runs flawlessly and installs monitoring and dashboards.
oh, I had intended to change the redis host only for kubernetes, not for local runners! Alright, disabling redis cluster mode, and will correct the other suggestions as well.
oh, right! So there is a difference between helm 2 and helm 3 CRD installation. These charts are fully helm-3 capable, so we can skip (or ignore) those messages. I added an option to disable the CRD install hook, but since we are using helm 3, the CRDs do still get installed.
I got the ping-pong test working. I'm adding the charts back in. If there is disagreement about keeping these in source control, I don't mind removing them in the future.
@coryschwartz the network/ping-pong test is not working for me. I am not sure if we have some obscure bug that only manifests on occasion, or whether the Redis configs in this PR triggered it (we are changing the DNS name of Redis here, so it might be the case that we have a bug that this change is triggering).
@coryschwartz we need all Testground services to be on the

Otherwise when a testplan is started, it has access only to the

Alternatively we have to somehow fix the networking to always attach new pods to
On my cluster right now I see:
Multus CNI has a default network, but CNI Genie doesn't seem to have one, so it seems to attach pods randomly to a given network if there are no specific
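For reference, CNI Genie normally selects the network from a pod annotation, so one way to avoid the random attachment is to pin every Testground pod explicitly. A sketch, where the `cni` annotation key is CNI Genie's convention and the `flannel` plugin name is an assumption about this cluster's setup:

```yaml
# Hypothetical pod spec pinning the network via a CNI Genie annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    cni: "flannel"   # ask CNI Genie for the flannel network (assumed plugin name)
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```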
I am not sure what problem this addressed. It should be just a simple set of

I think the community is doing a great job at keeping important charts like those we use maintained, without introducing problems, so that we can pretty much always rely on the latest version.
@coryschwartz this seems to be reverting all the configs we have been applying to Redis:

We want Redis to occupy a full node, and this is an easy way to make sure that happens. Obviously it should be improved at some point with annotations and affinity, but for now it does the trick.
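One simple way to reserve a full node for Redis is a nodeSelector plus resource requests sized close to the node's allocatable capacity. A sketch only: the label name and the resource numbers below are made up for illustration, not the values this PR reverted:

```yaml
# Hypothetical values override reserving a whole node for the Redis master.
# The node label and resource sizes are illustrative, not the actual config.
master:
  nodeSelector:
    testground.node.role: "redis"   # assumed label applied to a dedicated node
  resources:
    requests:
      cpu: "3500m"     # close to the node's allocatable CPU
      memory: "12Gi"   # close to the node's allocatable memory
```

Requesting most of the node's capacity means the scheduler cannot co-locate other large pods with Redis, which approximates node exclusivity until proper taints/affinity are added.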
Fix: #594 (upgrade to kubernetes 1.17)
Partially fix now-re-opened: #228 (effective cluster monitoring)
In this PR, I have removed the kops addon, which used APIs that were deprecated in kubernetes 1.16 -- in particular, the apps/v1beta2 APIs, which are missing in 1.17 and are still required by the kops addon.
In its place, I've constructed a helm chart which consists of the prometheus operator and pushgateway, with appropriate value changes.
I have upgraded the kubernetes version in this PR to the latest version (1.17), and the new dashboards provided by this work better than the buggy kops addon dashboards.
At present, I haven't migrated the custom dashboards, but I'll do that in this PR before it's finished.
@Robmat05 this is a draft PR, not ready to review yet, but open for comments about the strategy and direction.