Conversation

@jryb (Contributor) commented Dec 8, 2020

Problem

When applying an updated CRD, the operator should check the environment variables and push those changes to the running pods. For example, if the defaultsUrl parameter is changed in the CRD yaml and applied, the change is accepted; however, it is not reflected in the stateful set or the pod itself:

$ kubectl apply -f standalone6.yml 
standalone.enterprise.splunk.com/s6 created
$ kubectl describe pod/splunk-s6-standalone-0 
Name:         splunk-s6-standalone-0
Namespace:    default
...
Containers:
  splunk:
    Container ID:   containerd://0982abc2b231545a364eb1da6f4ac3d04358002b75942cb0c01609c7ac5bbe08
    Image:          docker.io/splunk/splunk:8.1.0
...
    Environment:
      SPLUNK_HOME:                        /opt/splunk
      SPLUNK_START_ARGS:                  --accept-license
      SPLUNK_DEFAULTS_URL:                /mnt/apps/defaults_apps2.yml,/mnt/splunk-secrets/default.yml
...

Edit standalone6.yml:

  defaultsUrl: /mnt/apps/defaults_apps.yml

Reapply:

$ kubectl apply -f standalone6.yml 
standalone.enterprise.splunk.com/s6 configured
$ kubectl describe pod/splunk-s6-standalone-0 
Name:         splunk-s6-standalone-0
Namespace:    default
Containers:
  splunk:
    Container ID:   containerd://0982abc2b231545a364eb1da6f4ac3d04358002b75942cb0c01609c7ac5bbe08
    Image:          docker.io/splunk/splunk:8.1.0
...
    Environment:
      SPLUNK_HOME:                        /opt/splunk
      SPLUNK_START_ARGS:                  --accept-license
      SPLUNK_DEFAULTS_URL:                /mnt/apps/defaults_apps2.yml,/mnt/splunk-secrets/default.yml
      SPLUNK_HOME_OWNERSHIP_ENFORCEMENT:  false
      SPLUNK_ROLE:                        splunk

History

This change was actually put in place and then removed during the development of the monitoring console integration, for separate reasons ( 74abe92#diff-f051fa8775a356c08c90ac411f88b50e7455ec17e8f3241a810b64eb7f00a747 ). It was removed because it was no longer needed for the MC ( 2ca799f ); however, it is still needed for app installation via defaults.yaml.

Solution

When merging the changes from the pod spec, check the environment variables. If they have changed, mark the specs as differing.
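
A minimal sketch of that comparison, assuming a hypothetical helper name and a standalone setup (the actual change lives in the operator's pod-spec merge logic, e.g. MergePodUpdates, and may differ):

```go
// Hypothetical illustration only: compare container environment variables
// between the current and desired pod specs and report whether they differ.
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// hasContainerEnvChanged reports whether any container's Env differs
// between the current and revised pod specs.
func hasContainerEnvChanged(current, revised *corev1.PodSpec) bool {
	if len(current.Containers) != len(revised.Containers) {
		return true
	}
	for i := range current.Containers {
		if !reflect.DeepEqual(current.Containers[i].Env, revised.Containers[i].Env) {
			return true
		}
	}
	return false
}

func main() {
	current := &corev1.PodSpec{Containers: []corev1.Container{{
		Name: "splunk",
		Env: []corev1.EnvVar{{
			Name:  "SPLUNK_DEFAULTS_URL",
			Value: "/mnt/apps/defaults_apps2.yml,/mnt/splunk-secrets/default.yml",
		}},
	}}}
	revised := current.DeepCopy()
	revised.Containers[0].Env[0].Value = "/mnt/apps/defaults_apps.yml,/mnt/splunk-secrets/default.yml"

	// A difference here should mark the specs as differing so the
	// StatefulSet (and its pods) picks up the new environment.
	fmt.Println("env changed:", hasContainerEnvChanged(current, revised))
}
```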

Testing

Validated that the defaultsUrl parameter is propagated to the pods using the method described above. Also validated that the stateful set parameters are updated and that the correct apps are installed when a new defaults.yaml is applied.

@jryb force-pushed the bugfix/CSPL-592 branch 2 times, most recently from 6bf02d5 to 7eb7436 on December 8, 2020 22:50
@romain-bellanger (Contributor) commented Dec 9, 2020

Thanks a lot for the PR! Just commenting to cross-reference, as this partially addresses issue #126.

@romain-bellanger (Contributor) commented:

Only for awareness... I don't mean to disturb the merge of this PR, which is really needed on our side:
I think with the last commit "Fix MC pod reconcile issues", the MC won't work anymore for multisite clusters. This was anyway very tricky... multisite requires the property multisite_master to be passed to ansible (ref. documentation). To get this parameter right, an SHC had to trigger the creation of the MC (not a ClusterMaster) and pass its defaultsUrl, pointing to a file containing this property, to the MC.

@smohan-splunk (Contributor) commented:

> Only for awareness... I don't mean to disturb the merge of this PR, which is really needed on our side:
> I think with the last commit "Fix MC pod reconcile issues", the MC won't work anymore for multisite clusters. This was anyway very tricky... multisite requires the property multisite_master to be passed to ansible (ref. documentation). To get this parameter right, an SHC had to trigger the creation of the MC (not a ClusterMaster) and pass its defaultsUrl, pointing to a file containing this property, to the MC.

@romain-bellanger Thank you for the thorough review and testing of the changes so far. We will create an internal bug for the MC issues with multisite clusters. Just FYI: in the future, we plan to remove the MC's dependency on Ansible.

@kashok-splunk (Contributor) commented:

Hi @romain-bellanger, this should not affect the MC with a multisite cluster, as we never use defaultsUrl to provide the multisite_master info to the MC. On creation or update of the MC, we query the CM and check whether it has multisite configuration enabled; if so, we pass the required parameters to the MC through its (the MC's) configmap: https://github.com/splunk/splunk-operator/blob/develop/pkg/splunk/enterprise/monitoringconsole.go#L82 cc @smohan-splunk
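
For illustration, a rough sketch of that pattern (the helper and ConfigMap keys below are hypothetical; the real logic is in the monitoringconsole.go file linked above):

```go
// Hypothetical sketch only: when the cluster master reports multisite
// enabled, pass the related settings to the MC through its ConfigMap
// rather than through an inherited defaultsUrl.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func applyMultisiteSettings(mcConfigMap *corev1.ConfigMap, multisiteEnabled bool, clusterMasterURL string) {
	if !multisiteEnabled {
		return
	}
	if mcConfigMap.Data == nil {
		mcConfigMap.Data = map[string]string{}
	}
	// Hypothetical keys, surfaced to the MC pod as environment variables.
	mcConfigMap.Data["SPLUNK_MULTISITE_MASTER"] = clusterMasterURL
	mcConfigMap.Data["SPLUNK_SITE"] = "site0"
}

func main() {
	cm := &corev1.ConfigMap{}
	applyMultisiteSettings(cm, true, "https://splunk-example-cluster-master-service:8089")
	fmt.Println(cm.Data)
}
```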

@jryb (Contributor, Author) commented Dec 10, 2020

> Hi @romain-bellanger, this should not affect the MC with a multisite cluster, as we never use defaultsUrl to provide the multisite_master info to the MC. On creation or update of the MC, we query the CM and check whether it has multisite configuration enabled; if so, we pass the required parameters to the MC through its (the MC's) configmap: https://github.com/splunk/splunk-operator/blob/develop/pkg/splunk/enterprise/monitoringconsole.go#L82 cc @smohan-splunk

I verified this with a multisite cluster and the MC comes up correctly:

NAME                                  READY   STATUS    RESTARTS   AGE
splunk-default-monitoring-console-0   1/1     Running   0          2m53s
splunk-example-cluster-master-0       1/1     Running   0          4m2s
splunk-example-site1-indexer-0        1/1     Running   0          2m52s
splunk-example-site1-indexer-1        1/1     Running   0          2m52s
splunk-example-site2-indexer-0        1/1     Running   0          2m51s
splunk-example-site2-indexer-1        1/1     Running   0          2m51s
splunk-example-site3-indexer-0        1/1     Running   0          2m50s
splunk-example-site3-indexer-1        1/1     Running   0          2m50s
splunk-operator-7d868fb689-sc5r8      1/1     Running   0          27m

@romain-bellanger (Contributor) commented:

Hi,

@kashok-splunk I trust you about the ConfigMap, which I think is rather used for the configuration of the MC app. The impact I was speaking about is related to the clustering configuration of the monitoring console as a search head. I've done many tests around this, and I could see, for instance, the MC failing to start at the multisite clustering step when inheriting its defaultsUrl from a ClusterMaster resource, as the cluster master must have its multisite_master parameter set to localhost. Please see the impact on server.conf below.

@jryb after seeing the result of your test, I tried the same. I also observed the pod starting and being reported as ready in Kubernetes. However, looking at the splunkd logs, I had streaming errors:

12-11-2020 09:05:39.731 +0000 ERROR ClusteringMgr - Master has multisite enabled but the search head is missing the 'multisite' attribute.
12-11-2020 09:05:39.731 +0000 ERROR CMSearchHead - 'Master has multisite enabled but the search head is missing the 'multisite' attribute.' for master=https://splunk-hosiad-dev-cluster-master-service:8089

And Splunk was reporting a red state on Search Head Connectivity (here from health.log, but the same is visible from the UI):

12-11-2020 09:16:09.232 +0000 INFO  PeriodicHealthReporter - feature="Search Head Connectivity" color=red indicator="master_connectivity" boolean_indicator=true measured_value=true reason="Cannot connect to master node (https://splunk-hosiad-dev-cluster-master-service:8089). This may be caused by network problem or master node is down." node_type=indicator node_path=splunkd.indexer_clustering.search_head_connectivity.master_connectivity

Looking at the server.conf, this is what I have when my MC inherits its defaultsUrl from the SearchHeadCluster resource:

>oc rsh splunk-splunk-dev-monitoring-console-0 grep -A4 'clustering\|clustermaster:' etc/system/local/server.conf
[clustering]
master_uri = clustermaster:splunk-hosiad-dev-cluster-master-service:8089
mode = searchhead

[clustermaster:splunk-hosiad-dev-cluster-master-service:8089]
master_uri = https://splunk-hosiad-dev-cluster-master-service:8089
multisite = true
pass4SymmKey = XXXXXXXXXXXXXXXX
site = site0

And then with the commit from this PR preventing the inheritance of defaultsUrl:

>oc rsh splunk-splunk-dev-monitoring-console-0 grep -A4 'clustering\|clustermaster:' etc/system/local/server.conf
[clustering]
master_uri = https://splunk-hosiad-dev-cluster-master-service:8089
mode = searchhead
pass4SymmKey = XXXXXXXXXXXXXXXX

Additional impact noticed during the test:

We have requirements for full encryption in public cloud, especially for connections such as the DMC UI, through which cluster admin credentials are sent. So we use defaultsUrl to pass the parameter splunk.http_enableSSL: true to ansible. I think this commit also removes the possibility of configuring SSL encryption on the MC UI.

I'm sorry if these comments are causing some disruption to the merge of this PR; that's really not the goal. I just mean to share information about possible impacts, in case other people experience them, and about gaps to fix later around the MC. We very much need the first commit of this PR and are thankful for this change! We don't have any good proposal to avoid such impacts; as discussed, the only way would probably be to create a dedicated CRD for the MC. On our side, we patch the operator so that MC StatefulSet creation is only triggered by the SHC, giving it stable specs with correct parameters, but that's not suitable for all users.

When applying an updated CRD, check the environment variables and push those changes to the running pods.
@jryb (Contributor, Author) commented Dec 11, 2020

Thanks for the comments and diligent testing, @romain-bellanger. It sounds like blocking the defaultsUrl from being passed to the MC pod would prevent splunk.http_enableSSL: true, or any other defaults.yml config, from being passed in. Because of that, I've changed the fix to only check for differences in pod envs on non-MC pods. This keeps the MC's current behavior, still allowing env variables such as defaultsUrl or other defaults to be passed without recycling the pod if the env changes. However, if a change is made to the defaultsUrl of another CR, this change will be noticed and propagated to the pod, allowing new apps to be installed, etc. I've added a new log entry to track when the MC's pod env changes, since we do not recycle the pod:

{"level":"info","ts":1607716203.4881542,"logger":"splunk.reconcile.MergePodUpdates","msg":"Ignoring Pod Container Envs differences for MC pods","name":"splunk-default-monitoring-console","name":"splunk-default-monitoring-console"}

I tried this out on a deployment with multiple standalones with differing defaultsUrl parameters, and it worked correctly. I also verified this with a multisite cluster, with @kashok-splunk's help, and confirmed the MC there is running with no health report errors.

@romain-bellanger (Contributor) commented:

Thanks a lot @jryb! I agree it's probably the best solution in the current situation.

@smohan-splunk self-requested a review December 14, 2020 23:15
@smohan-splunk merged commit e428a21 into develop Dec 14, 2020