Logs may not be searchable #431
@yukionec Deleting the pod should not impact which log messages are available in OpenSearch. Once a log message has been collected and saved in OpenSearch, it should remain available until it is deleted based on the log retention period setting (3 days, by default). I suspect the log messages from the deleted pod never made it into OpenSearch for some reason.

The Viya Monitoring solution deploys a Fluent Bit pod on every Kubernetes node in the cluster. This pod collects the log messages from every pod running on that node and sends them to OpenSearch. I suspect that the Fluent Bit pod on one or more of the Kubernetes nodes stopped working for some reason. That would result in log messages from the pods running on that node NOT being sent to OpenSearch.

You can check on the health of the Fluent Bit pods in a couple of ways. You can use kubectl to confirm there is a (healthy) Fluent Bit pod running on every Kubernetes node in your cluster. The Fluent Bit pods will (by default) be running in the 'logging' namespace and have names that look like v4m-fb-xxxxx. You can also review the "Logging - Fluent Bit" dashboard in Grafana. In particular, look at the "Retries and Failures" metrics on that dashboard. Having a few retries and failures is not a problem, but if you see a significant number of failures, that means a large number of log messages are not getting into OpenSearch. There is a second chart showing the "Retries and Failures" metrics on the bottom row of the dashboard. That version of the chart shows these metrics by node, which can help you identify the Kubernetes node where Fluent Bit is having a problem. Sometimes deleting the Fluent Bit pod that is having problems and letting it restart itself will resolve the problem.

It sounds like you might have some log messages from these pods but not all of them. If that's the case, I suggest you look at the LOGSOURCE field. If the LOGSOURCE field has a value of "KUBE_EVENT", the message shown is not actually a log message; it is a Kubernetes event. Kubernetes events are collected via a slightly different process. So, even if Fluent Bit is not running on a Kubernetes node, you will see Kubernetes events for the pods running on that node (which will have a LOGSOURCE of "KUBE_EVENT"), but there will be no "real" log messages collected from those pods.

I hope that helps. I suspect you will find that the Fluent Bit pods on one or more of your Kubernetes nodes were not running or were not healthy. Resolving that should ensure that all of the log messages are collected and sent to OpenSearch. Please take a look and let us know if that fits what you are seeing.

Regards,
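As a rough sketch of the kubectl check described above (the `logging` namespace and the `v4m-fb-` pod-name prefix come from this thread; your deployment may differ):

```sh
# Count the Kubernetes nodes in the cluster
kubectl get nodes --no-headers | wc -l

# List the Fluent Bit pods; -o wide shows which node each pod runs on,
# so you can spot nodes that do not have a (Running/Ready) Fluent Bit pod
kubectl -n logging get pods -o wide | grep v4m-fb-
```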
Hi Greg,
Also, we are planning an operation that detects when a pod named sas-* goes down and raises an alert notification. Is there any problem with designing an alert against such a Kubernetes event log using an Elastic Alerting monitor?
Hi Yukio,
If the Fluent Bit pod that is supposed to be running on that node is down (or up but not working properly), no log messages are collected/processed/sent to OpenSearch. It sounds like that's what you are seeing. Kubernetes Events (LOGSOURCE="KUBE_EVENT") follow a slightly different path: they are captured by a single Event Router pod, and the Fluent Bit pod on the node where Event Router is running picks them up and sends them to OpenSearch.
This means that as long as the Event Router pod is running on a Kubernetes node with a healthy Fluent Bit pod, all Kubernetes events that are generated will show up in OpenSearch. So, if you are seeing Kubernetes Events, the Fluent Bit pod on the Kubernetes node where Event Router is running is probably healthy. That's good news. But it does NOT mean every Fluent Bit pod is healthy.

To check on the health of the various Fluent Bit pods, there are a few things you can do. The easiest is probably to run a kubectl get pods command against the 'logging' namespace and confirm that every Fluent Bit pod is Running and ready.

As I mentioned earlier, you can also look at the Logging - Fluent Bit dashboard in Grafana to check on the health of the Fluent Bit pods. Even if all of the Fluent Bit pods are "up", if you see a flat line for the Out metric in the Message Rate (In/Out) chart near the top of that dashboard, it indicates that the Fluent Bit pods could NOT send messages to OpenSearch. If that's the case, the problem may be on the OpenSearch side. So, once again, check the pod logs (for OpenSearch) and/or run a kubectl get pods command to confirm the OpenSearch pods themselves are healthy.

I hope that makes sense.
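One way to check the OpenSearch side directly is to query its cluster-health endpoint. A minimal sketch, assuming the OpenSearch service in the logging namespace is named `opensearch-cluster-master`, listens on the usual port 9200, and uses basic authentication (the service name and credentials in your deployment may differ):

```sh
# Forward the OpenSearch HTTP port to your workstation
kubectl -n logging port-forward svc/opensearch-cluster-master 9200:9200 &

# "green" or "yellow" status means OpenSearch is accepting writes;
# "red" suggests indexing problems that would block Fluent Bit
curl -sk -u admin:<password> "https://localhost:9200/_cluster/health?pretty"
```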
Hi @gsmith-sas
I confirmed that the STATUS of Fluent Bit Pod (v4m-fb-XXXXX) is Running as shown below.
Also, when checking the logs of the Fluent Bit pods (v4m-fb-XXXXX) with the command "kubectl logs --timestamps=true v4m-fb-XXXXX -n logging", I confirmed that some Fluent Bit pods output many messages like the one below.
I determined this state to be unhealthy based on the information at https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit. Therefore, I also checked fluent/fluent-bit#5145 and, in order to investigate the cause of the unhealthy state, I changed Log_Level from info to debug in the [SERVICE] section of fluent-bit_config.configmap_open.yaml and applied the ConfigMap with the following command.
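For reference, the Fluent Bit health check described on that documentation page is exposed through Fluent Bit's built-in HTTP server (port 2020 by default). A minimal sketch, assuming the HTTP server and health check are enabled in the [SERVICE] section of your configuration:

```sh
# Forward the Fluent Bit monitoring port from one of the pods
kubectl -n logging port-forward pod/v4m-fb-XXXXX 2020:2020 &

# HTTP 200 means healthy; HTTP 500 means the health-check error/retry
# thresholds have been exceeded
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:2020/api/v1/health
```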
However, when checking with "kubectl -n logging describe configmap fb-fluent-bit-config", I see that parts other than Log_Level have also changed (for example, the Labels field).
Could you please tell me about the following?
The WARNING messages (i.e. "[ warn] [engine] failed to flush chunk ~") indicate that Fluent Bit encountered a problem when trying to send a log message to OpenSearch. As the message says, it will retry sending that message, and it will do this multiple times. In our default configuration, Fluent Bit will try again up to 5 times, which can take up to ~40 minutes. If the problem was caused by an intermittent problem with OpenSearch, one of the retries will generally succeed. So, this message by itself is not necessarily something to worry about. However, if all 5 attempts fail, Fluent Bit will discard the log message and that log message will never get to OpenSearch. You can see if this happens by looking at the Failures metric in the Grafana dashboard "Logging - Fluent Bit". If the Failures metric is always "0", it means no log messages have been lost.

The easiest way to restore the original configuration of Fluent Bit is to return to the original version of the fluent-bit_config.configmap_open.yaml file. If the only change you made was to change Log_Level from "info" to "debug", undo that change, save the file, and redeploy the Fluent Bit pods by running the deploy_fluentbit_open.sh script in the logging/bin directory of your copy of the project repo. We update the project repo frequently, so if you clone the repo again, we might have a newer version available. If you want to ensure you redeploy the same version, you can use the git checkout command to select which version you want to deploy (e.g. git checkout 1.2.7). The "stable" branch of the repo always has the latest release. The "main" branch is our active development branch, so it is always changing. Therefore, you will almost always want to deploy from the "stable" branch or using a tag pointing to an older version.

From the kubectl get pods output you shared, I can see that most of the Fluent Bit pods have restarted 3 or 4 times over the last 8 hours. That is unusual. It might be because you modified the configuration and redeployed multiple times, or it could mean that the Fluent Bit pods are running into a problem. The log messages from Fluent Bit are collected and sent to OpenSearch. However, we store them in a different OpenSearch index: the log messages from Fluent Bit and the other components of Viya Monitoring are stored in indices with names fitting the pattern viya_ops-* rather than the viya_logs-* pattern used for the SAS Viya log messages. You can review those log messages by switching which index pattern you are using on the Discover page within OpenSearch Dashboards.

Please let me know what problem you are trying to solve. Your original question asked about some "missing" log messages. I explained that these log messages might be missing because the Fluent Bit pods running on one or more of your Kubernetes nodes were not healthy, and described how you might determine whether that was the case.
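To make the revert-and-redeploy steps above concrete, here is a minimal sketch; the release tag, the local repo path, and the location of the ConfigMap source file within the repo are illustrative, while the logging/bin/deploy_fluentbit_open.sh script path is as described above:

```sh
# From your local copy of the monitoring project repo
cd /path/to/your/copy/of/the/repo      # placeholder path

# Check out the release you originally deployed (tag is an example),
# or the "stable" branch for the latest release
git checkout 1.2.7                     # or: git checkout stable

# Discard any local edits to the Fluent Bit ConfigMap source file
# (the path within the repo may differ in your version)
git checkout -- logging/fb/fluent-bit_config.configmap_open.yaml

# Redeploy the Fluent Bit pods with the restored configuration
logging/bin/deploy_fluentbit_open.sh
```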
In general, there should be no need to alter the Fluent Bit configuration. So, I'd like to understand what you are seeing and what you are trying to do. Regards,
Hi @gsmith-sas
For Fluent Bit, I was able to revert to the original configuration by running the deploy_fluentbit_open.sh script.
After reconfirming with the procedure below, the [Failures] metric in [Logging - Fluent Bit] on the Grafana dashboard was 0 and no log messages were lost, so I determined that the Fluent Bit pods were healthy. After restarting the entire SAS environment, I deleted the "sas-xxx" pods using the following procedure and checked for their logs on Discover.
Based on the above results, no logs are currently being lost.
@UshimaruYoshihiro I believe you are correct: looking for the Kubernetes event is probably the most reliable way of determining whether a pod was deleted. This is because the pod itself may not be able to write out any log messages indicating this before it is terminated. I will close this issue now. Please feel free to open a new issue if you run into a new problem or have additional questions. Regards,
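As an illustration of that approach, here is a sketch of a query one could run against OpenSearch from the command line. The viya_logs-* index pattern, the KUBE_EVENT value, and the pod name come from this thread; the exact field names, the service name used for the port-forward, and the credentials are assumptions that may differ in your deployment:

```sh
# Forward the OpenSearch HTTP port (service name is an assumption)
kubectl -n logging port-forward svc/opensearch-cluster-master 9200:9200 &

# Look for Kubernetes events that mention the deleted pod by name
curl -sk -u admin:<password> \
  "https://localhost:9200/viya_logs-*/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
        "query": {
          "bool": {
            "must": [
              { "match": { "logsource": "KUBE_EVENT" } },
              { "match": { "message": "sas-connect-spawner-5646fdb7d9-8vcmr" } }
            ]
          }
        }
      }'
```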
We dropped the pod, but the log doesn't exist.
After deleting a pod with "kubectl -n <namespace> delete pod <pod-name>", I searched in Elasticsearch's Discover by specifying the ID of the deleted pod. Some logs are found and some are not.
(1) 11/10 13:30 sas-connect-58bb448bbc-mkrmw : exist
(2) 11/10 13:40 sas-web-data-access-746b5bb9f7-mbmgz : exist
(3) 11/10 18:45 sas-connect-spawner-5646fdb7d9-8vcmr : not exist
(4) 11/10 18:50 sas-web-data-access-746b5bb9f7-8nx5m : not exist
Can you tell me if there is a possible cause?