Conversation

@jryb (Contributor) commented Dec 8, 2020

Problem

When applying an updated CRD, the operator should check the environment variables and push those changes to the running pods. For example, if the defaultsUrl parameter is changed in the CRD yaml and applied, the change is accepted; however, it is not reflected in the stateful set or the pod itself:

$ kubectl apply -f standalone6.yml 
standalone.enterprise.splunk.com/s6 created
$ kubectl describe pod/splunk-s6-standalone-0 
Name:         splunk-s6-standalone-0
Namespace:    default
...
Containers:
  splunk:
    Container ID:   containerd://0982abc2b231545a364eb1da6f4ac3d04358002b75942cb0c01609c7ac5bbe08
    Image:          docker.io/splunk/splunk:8.1.0
...
    Environment:
      SPLUNK_HOME:                        /opt/splunk
      SPLUNK_START_ARGS:                  --accept-license
      SPLUNK_DEFAULTS_URL:                /mnt/apps/defaults_apps2.yml,/mnt/splunk-secrets/default.yml
...

Edit standalone6.yml:

  defaultsUrl: /mnt/apps/defaults_apps.yml

Reapply:

$ kubectl apply -f standalone6.yml 
standalone.enterprise.splunk.com/s6 configured
$ kubectl describe pod/splunk-s6-standalone-0 
Name:         splunk-s6-standalone-0
Namespace:    default
Containers:
  splunk:
    Container ID:   containerd://0982abc2b231545a364eb1da6f4ac3d04358002b75942cb0c01609c7ac5bbe08
    Image:          docker.io/splunk/splunk:8.1.0
...
    Environment:
      SPLUNK_HOME:                        /opt/splunk
      SPLUNK_START_ARGS:                  --accept-license
      SPLUNK_DEFAULTS_URL:                /mnt/apps/defaults_apps2.yml,/mnt/splunk-secrets/default.yml
      SPLUNK_HOME_OWNERSHIP_ENFORCEMENT:  false
      SPLUNK_ROLE:                        splunk

History

This change was actually put in place and then removed during the development of the monitoring console integration, for separate reasons ( 74abe92#diff-f051fa8775a356c08c90ac411f88b50e7455ec17e8f3241a810b64eb7f00a747 ). It was removed because it was no longer needed for the MC ( 2ca799f ); however, it is still needed for app installation via defaults.yaml.

Solution

When merging the changes from the pod spec, check the environment variables. If they have changed, mark the specs as differing.
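
A minimal sketch of that comparison, assuming a hypothetical helper name and a standalone setup (the actual change lives in the operator's pod-spec merge logic, e.g. MergePodUpdates, and may differ):

```go
// Hypothetical illustration only: compare container environment variables
// between the current and desired pod specs and report whether they differ.
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// hasContainerEnvChanged reports whether any container's Env differs
// between the current and revised pod specs.
func hasContainerEnvChanged(current, revised *corev1.PodSpec) bool {
	if len(current.Containers) != len(revised.Containers) {
		return true
	}
	for i := range current.Containers {
		if !reflect.DeepEqual(current.Containers[i].Env, revised.Containers[i].Env) {
			return true
		}
	}
	return false
}

func main() {
	current := &corev1.PodSpec{Containers: []corev1.Container{{
		Name: "splunk",
		Env: []corev1.EnvVar{{
			Name:  "SPLUNK_DEFAULTS_URL",
			Value: "/mnt/apps/defaults_apps2.yml,/mnt/splunk-secrets/default.yml",
		}},
	}}}
	revised := current.DeepCopy()
	revised.Containers[0].Env[0].Value = "/mnt/apps/defaults_apps.yml,/mnt/splunk-secrets/default.yml"

	// A difference here should mark the specs as differing so the
	// StatefulSet (and its pods) picks up the new environment.
	fmt.Println("env changed:", hasContainerEnvChanged(current, revised))
}
```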

Testing

Validated that the defaultsUrl parameter is propagated to the pods using the method described above. Also validated that the stateful set parameters are updated and that the correct apps are installed when a new defaults.yaml is applied.

@jryb force-pushed the bugfix/CSPL-592 branch 2 times, most recently from 6bf02d5 to 7eb7436 on December 8, 2020 22:50
@romain-bellanger (Contributor) commented Dec 9, 2020

Thanks a lot for the PR! Just commenting to cross-reference, as this partially addresses issue #126.

@romain-bellanger (Contributor) commented:

Only for awareness... I don't mean to disturb the merge of this PR, which is really needed on our side:
I think with the last commit "Fix MC pod reconcile issues", the MC won't work anymore for multisite clusters. This was anyway very tricky... multisite requires the property multisite_master to be passed to ansible (ref. documentation). To get this parameter right, an SHC had to trigger the creation of the MC (not a ClusterMaster) and pass its defaultsUrl, pointing to a file containing this property, to the MC.

@smohan-splunk (Contributor) commented:

> Only for awareness... I don't mean to disturb the merge of this PR, which is really needed on our side:
> I think with the last commit "Fix MC pod reconcile issues", the MC won't work anymore for multisite clusters. This was anyway very tricky... multisite requires the property multisite_master to be passed to ansible (ref. documentation). To get this parameter right, an SHC had to trigger the creation of the MC (not a ClusterMaster) and pass its defaultsUrl, pointing to a file containing this property, to the MC.

@romain-bellanger Thank you for the thorough review and testing of the changes so far. We will create an internal bug for the MC issues with multisite clusters. Just FYI: in the future, we plan to remove the MC's dependency on Ansible.

@kashok-splunk (Contributor) commented:

Hi @romain-bellanger, this should not affect the MC with a multisite cluster, as we never use defaultsUrl to provide the multisite_master info to the MC. On creation or update of the MC, we query the CM and check whether it has multisite configuration enabled; if so, we pass the required parameters to the MC through its (the MC's) configmap: https://github.com/splunk/splunk-operator/blob/develop/pkg/splunk/enterprise/monitoringconsole.go#L82 cc @smohan-splunk
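
For illustration, a rough sketch of that pattern (the helper and ConfigMap keys below are hypothetical; the real logic is in the monitoringconsole.go file linked above):

```go
// Hypothetical sketch only: when the cluster master reports multisite
// enabled, pass the related settings to the MC through its ConfigMap
// rather than through an inherited defaultsUrl.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func applyMultisiteSettings(mcConfigMap *corev1.ConfigMap, multisiteEnabled bool, clusterMasterURL string) {
	if !multisiteEnabled {
		return
	}
	if mcConfigMap.Data == nil {
		mcConfigMap.Data = map[string]string{}
	}
	// Hypothetical keys, surfaced to the MC pod as environment variables.
	mcConfigMap.Data["SPLUNK_MULTISITE_MASTER"] = clusterMasterURL
	mcConfigMap.Data["SPLUNK_SITE"] = "site0"
}

func main() {
	cm := &corev1.ConfigMap{}
	applyMultisiteSettings(cm, true, "https://splunk-example-cluster-master-service:8089")
	fmt.Println(cm.Data)
}
```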

@jryb (Contributor, Author) commented Dec 10, 2020

> Hi @romain-bellanger, this should not affect the MC with a multisite cluster, as we never use defaultsUrl to provide the multisite_master info to the MC. On creation or update of the MC, we query the CM and check whether it has multisite configuration enabled; if so, we pass the required parameters to the MC through its (the MC's) configmap: https://github.com/splunk/splunk-operator/blob/develop/pkg/splunk/enterprise/monitoringconsole.go#L82 cc @smohan-splunk

I verified this with a multisite cluster and the MC comes up correctly:

NAME                                  READY   STATUS    RESTARTS   AGE
splunk-default-monitoring-console-0   1/1     Running   0          2m53s
splunk-example-cluster-master-0       1/1     Running   0          4m2s
splunk-example-site1-indexer-0        1/1     Running   0          2m52s
splunk-example-site1-indexer-1        1/1     Running   0          2m52s
splunk-example-site2-indexer-0        1/1     Running   0          2m51s
splunk-example-site2-indexer-1        1/1     Running   0          2m51s
splunk-example-site3-indexer-0        1/1     Running   0          2m50s
splunk-example-site3-indexer-1        1/1     Running   0          2m50s
splunk-operator-7d868fb689-sc5r8      1/1     Running   0          27m

@romain-bellanger (Contributor) commented:

Hi,

@kashok-splunk I trust you about the ConfigMap, which I think is rather used for the configuration of the MC app. The impact I was speaking about is related to the clustering configuration of the monitoring console as a search head. I've done many tests around this, and I could see, for instance, the MC failing to start at the multisite clustering step when inheriting its defaultsUrl from a ClusterMaster resource, as the cluster master must have its multisite_master parameter set to localhost. Please see the impact on server.conf below.

@jryb after seeing the result of your test, I tried the same. I also observed the pod starting and being reported as ready in Kubernetes. However, looking at the splunkd logs, I had streaming errors:

12-11-2020 09:05:39.731 +0000 ERROR ClusteringMgr - Master has multisite enabled but the search head is missing the 'multisite' attribute.
12-11-2020 09:05:39.731 +0000 ERROR CMSearchHead - 'Master has multisite enabled but the search head is missing the 'multisite' attribute.' for master=https://splunk-hosiad-dev-cluster-master-service:8089

And Splunk was reporting a red state on Search Head Connectivity (here from health.log, but the same is visible from the UI):

12-11-2020 09:16:09.232 +0000 INFO  PeriodicHealthReporter - feature="Search Head Connectivity" color=red indicator="master_connectivity" boolean_indicator=true measured_value=true reason="Cannot connect to master node (https://splunk-hosiad-dev-cluster-master-service:8089). This may be caused by network problem or master node is down." node_type=indicator node_path=splunkd.indexer_clustering.search_head_connectivity.master_connectivity

Looking at the server.conf, this is what I have when my MC inherits its defaultsUrl from the SearchHeadCluster resource:

>oc rsh splunk-splunk-dev-monitoring-console-0 grep -A4 'clustering\|clustermaster:' etc/system/local/server.conf
[clustering]
master_uri = clustermaster:splunk-hosiad-dev-cluster-master-service:8089
mode = searchhead

[clustermaster:splunk-hosiad-dev-cluster-master-service:8089]
master_uri = https://splunk-hosiad-dev-cluster-master-service:8089
multisite = true
pass4SymmKey = XXXXXXXXXXXXXXXX
site = site0

And then with the commit from this PR preventing the inheritance of defaultsUrl:

>oc rsh splunk-splunk-dev-monitoring-console-0 grep -A4 'clustering\|clustermaster:' etc/system/local/server.conf
[clustering]
master_uri = https://splunk-hosiad-dev-cluster-master-service:8089
mode = searchhead
pass4SymmKey = XXXXXXXXXXXXXXXX

Additional impact noticed during the test:

We have requirements for full encryption in public cloud, especially for connections such as the DMC UI, through which cluster admin credentials are sent. So we use defaultsUrl to pass the parameter splunk.http_enableSSL: true to ansible. I think this commit also removes the possibility of configuring SSL encryption on the MC UI.

I'm sorry if these comments are causing some disruption to the merge of this PR; that's really not the goal. I just mean to share information about possible impacts, in case other people experience them, and about gaps to fix later around the MC. We very much need the first commit of this PR and are thankful for this change! We don't have any good proposal to avoid such impacts; as discussed, the only way would probably be to create a dedicated CRD for the MC. On our side, we patch the operator so that MC StatefulSet creation is only triggered by the SHC, giving it stable specs with correct parameters, but that's not suitable for all users.

When applying an updated CRD, check the environment variables and push those changes to the running pods.
@jryb (Contributor, Author) commented Dec 11, 2020

Thanks for the comments and diligent testing, @romain-bellanger. It sounds like blocking the defaultsUrl from being passed to the MC pod would prevent splunk.http_enableSSL: true, or any other defaults.yml config, from being passed in. Because of that, I've changed the fix to only check for differences in pod envs on non-MC pods. This keeps the MC's current behavior, still allowing env variables such as defaultsUrl or other defaults to be passed without recycling the pod if the env changes. However, if a change is made to the defaultsUrl of another CR, this change will be noticed and propagated to the pod, allowing new apps to be installed, etc. I've added a new log entry to track when the MC's pod env changes, since we do not recycle the pod:

{"level":"info","ts":1607716203.4881542,"logger":"splunk.reconcile.MergePodUpdates","msg":"Ignoring Pod Container Envs differences for MC pods","name":"splunk-default-monitoring-console","name":"splunk-default-monitoring-console"}

I tried this out on a deployment with multiple standalones with differing defaultsUrl parameters, and it worked correctly. I also verified this with a multisite cluster, with @kashok-splunk's help, and confirmed the MC there is running with no health report errors.

@romain-bellanger (Contributor) commented:

Thanks a lot @jryb! I agree it's probably the best solution in the current situation.

@smohan-splunk self-requested a review December 14, 2020 23:15
@smohan-splunk merged commit e428a21 into develop Dec 14, 2020