Logs may not be searchable #431
@yukionec Deleting the pod should not impact which log messages are available in OpenSearch. Once a log message has been collected and saved in OpenSearch, it should remain available until it is deleted based on the log retention period setting (3 days, by default). I suspect the log messages from the deleted pod never made it into OpenSearch for some reason.

The Viya Monitoring solution deploys a Fluent Bit pod on every Kubernetes node in the cluster. This pod collects the log messages from every pod running on that node and sends them to OpenSearch. I suspect that the Fluent Bit pod on one or more of the Kubernetes nodes stopped working for some reason. That would result in log messages from the pods running on that node NOT being sent to OpenSearch.

You can check on the health of the Fluent Bit pods in a couple of ways. You can use kubectl to confirm there is a (healthy) Fluent Bit pod running on every Kubernetes node in your cluster. The Fluent Bit pods will (by default) be running in the 'logging' namespace and have names that look like v4m-fb-xxxxx. You can also review the "Logging - Fluent Bit" dashboard in Grafana. In particular, look at the "Retries and Failures" metrics on that dashboard. Having a few retries and failures is not a problem, but if you see a significant number of failures, that means a large number of log messages are not getting into OpenSearch. There is a second chart showing the "Retries and Failures" metrics on the bottom row of the dashboard. That version of the chart shows these metrics by node, which can help you identify the Kubernetes node where Fluent Bit is having a problem. Sometimes deleting the Fluent Bit pod that is having problems and letting it restart itself will resolve the problem.

It sounds like you might have some log messages from these pods but not all of them. If that's the case, I suggest you look at the LOGSOURCE field. If the LOGSOURCE field has a value of "KUBE_EVENT", the message shown is not actually a log message; it is a Kubernetes event. Kubernetes events are collected via a slightly different process. So, even if Fluent Bit is not running on a Kubernetes node, you will see Kubernetes events for the pods running on that node (which will have a LOGSOURCE of "KUBE_EVENT"), but there will be no "real" log messages collected from those pods.

I hope that helps. I suspect you will find that the Fluent Bit pods on one or more of your Kubernetes nodes were not running or were not healthy. Resolving that should ensure that all of the log messages are collected and sent to OpenSearch. Please take a look and let us know if that fits what you are seeing.

Regards,
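As a rough sketch of the kubectl check described above (the `logging` namespace and the `v4m-fb-` pod-name prefix come from this thread; your deployment may differ):

```sh
# Count the Kubernetes nodes in the cluster
kubectl get nodes --no-headers | wc -l

# List the Fluent Bit pods; -o wide shows which node each pod runs on,
# so you can spot nodes that do not have a (Running/Ready) Fluent Bit pod
kubectl -n logging get pods -o wide | grep v4m-fb-
```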
Hi Greg,
Also, we are planning an operation that detects when a pod named sas-* goes down and raises an alert notification. Is there any problem with designing an alert against such a Kubernetes event log using an Elastic Alerting monitor?
Hi Yukio,
If the Fluent Bit pod that is supposed to be running on that node is down (or up but not working properly), no log messages are collected/processed/sent to OpenSearch. It sounds like that's what you are seeing. Kubernetes Events (LOGSOURCE="KUBE_EVENT") follow a slightly different path: they are captured by a single Event Router pod, and the Fluent Bit pod on the node where Event Router is running picks them up and sends them to OpenSearch.
This means that as long as the Event Router pod is running on a Kubernetes node with a healthy Fluent Bit pod, all Kubernetes events that are generated will show up in OpenSearch. So, if you are seeing Kubernetes Events, the Fluent Bit pod on the Kubernetes node where Event Router is running is probably healthy. That's good news. But it does NOT mean every Fluent Bit pod is healthy.

To check on the health of the various Fluent Bit pods, there are a few things you can do. The easiest is probably to run a kubectl get pods command against the 'logging' namespace and confirm that every Fluent Bit pod is Running and ready.

As I mentioned earlier, you can also look at the Logging - Fluent Bit dashboard in Grafana to check on the health of the Fluent Bit pods. Even if all of the Fluent Bit pods are "up", if you see a flat line for the Out metric in the Message Rate (In/Out) chart near the top of that dashboard, it indicates that the Fluent Bit pods could NOT send messages to OpenSearch. If that's the case, the problem may be on the OpenSearch side. So, once again, check the pod logs (for OpenSearch) and/or run a kubectl get pods command to confirm the OpenSearch pods themselves are healthy.

I hope that makes sense.
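One way to check the OpenSearch side directly is to query its cluster-health endpoint. A minimal sketch, assuming the OpenSearch service in the logging namespace is named `opensearch-cluster-master`, listens on the usual port 9200, and uses basic authentication (the service name and credentials in your deployment may differ):

```sh
# Forward the OpenSearch HTTP port to your workstation
kubectl -n logging port-forward svc/opensearch-cluster-master 9200:9200 &

# "green" or "yellow" status means OpenSearch is accepting writes;
# "red" suggests indexing problems that would block Fluent Bit
curl -sk -u admin:<password> "https://localhost:9200/_cluster/health?pretty"
```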
Hi @gsmith-sas
I confirmed that the STATUS of Fluent Bit Pod (v4m-fb-XXXXX) is Running as shown below.
Also, when checking the logs of the Fluent Bit pods (v4m-fb-XXXXX) with the command "kubectl logs --timestamps=true v4m-fb-XXXXX -n logging", I confirmed that some Fluent Bit pods output many messages like the one below.
I determined this state to be unhealthy based on the information at https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit. Therefore, I also checked fluent/fluent-bit#5145 and, in order to investigate the cause of the unhealthy state, I changed Log_Level from info to debug in the [SERVICE] section of fluent-bit_config.configmap_open.yaml and applied the ConfigMap with the following command.
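For reference, the Fluent Bit health check described on that documentation page is exposed through Fluent Bit's built-in HTTP server (port 2020 by default). A minimal sketch, assuming the HTTP server and health check are enabled in the [SERVICE] section of your configuration:

```sh
# Forward the Fluent Bit monitoring port from one of the pods
kubectl -n logging port-forward pod/v4m-fb-XXXXX 2020:2020 &

# HTTP 200 means healthy; HTTP 500 means the health-check error/retry
# thresholds have been exceeded
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:2020/api/v1/health
```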
However, when checking with "kubectl -n logging describe configmap fb-fluent-bit-config", I see that parts other than Log_Level have also changed (for example, the Labels field).
Could you please tell me about the following?
The WARNING messages (i.e. "[ warn] [engine] failed to flush chunk ~") indicate that Fluent Bit encountered a problem when trying to send a log message to OpenSearch. As the message says, it will retry sending that message, and it will do this multiple times. In our default configuration, Fluent Bit will try again up to 5 times, which can take up to ~40 minutes. If the problem was caused by an intermittent problem with OpenSearch, one of the retries will generally succeed. So, this message by itself is not necessarily something to worry about. However, if all 5 attempts fail, Fluent Bit will discard the log message and that log message will never get to OpenSearch. You can see if this happens by looking at the Failures metric in the Grafana dashboard "Logging - Fluent Bit". If the Failures metric is always "0", it means no log messages have been lost.

The easiest way to restore the original configuration of Fluent Bit is to return to the original version of the fluent-bit_config.configmap_open.yaml file. If the only change you made was to change Log_Level from "info" to "debug", undo that change, save the file, and redeploy the Fluent Bit pods by running the deploy_fluentbit_open.sh script in the logging/bin directory of your copy of the project repo. We update the project repo frequently, so if you clone the repo again, we might have a newer version available. If you want to ensure you redeploy the same version, you can use the git checkout command to select which version you want to deploy (e.g. git checkout 1.2.7). The "stable" branch of the repo always has the latest release. The "main" branch is our active development branch, so it is always changing. Therefore, you will almost always want to deploy from the "stable" branch or using a tag pointing to an older version.

From the kubectl get pods output you shared, I can see that most of the Fluent Bit pods have restarted 3 or 4 times over the last 8 hours. That is unusual. It might be because you modified the configuration and redeployed multiple times, or it could mean that the Fluent Bit pods are running into a problem. The log messages from Fluent Bit are collected and sent to OpenSearch. However, we store them in a different OpenSearch index: the log messages from Fluent Bit and the other components of Viya Monitoring are stored in indices with names fitting the pattern viya_ops-* rather than the viya_logs-* pattern used for the SAS Viya log messages. You can review those log messages by switching which index pattern you are using on the Discover page within OpenSearch Dashboards.

Please let me know what problem you are trying to solve. Your original question asked about some "missing" log messages. I explained that these log messages might be missing because the Fluent Bit pods running on one or more of your Kubernetes nodes were not healthy, and described how you might determine whether that was the case.
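To make the revert-and-redeploy steps above concrete, here is a minimal sketch; the release tag, the local repo path, and the location of the ConfigMap source file within the repo are illustrative, while the logging/bin/deploy_fluentbit_open.sh script path is as described above:

```sh
# From your local copy of the monitoring project repo
cd /path/to/your/copy/of/the/repo      # placeholder path

# Check out the release you originally deployed (tag is an example),
# or the "stable" branch for the latest release
git checkout 1.2.7                     # or: git checkout stable

# Discard any local edits to the Fluent Bit ConfigMap source file
# (the path within the repo may differ in your version)
git checkout -- logging/fb/fluent-bit_config.configmap_open.yaml

# Redeploy the Fluent Bit pods with the restored configuration
logging/bin/deploy_fluentbit_open.sh
```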
In general, there should be no need to alter the Fluent Bit configuration. So, I'd like to understand what you are seeing and what you are trying to do. Regards,
Hi @gsmith-sas
For Fluent Bit, I was able to revert to the original configuration by running the deploy_fluentbit_open.sh script.
After reconfirming with the procedure below, the [Failures] metric in [Logging - Fluent Bit] on the Grafana dashboard was 0 and no log messages were lost, so I determined that the Fluent Bit pods were healthy. After restarting the entire SAS environment, I deleted the "sas-xxx" pods using the following procedure and checked for their logs on Discover.
Based on the above results, no logs are currently being lost.
@UshimaruYoshihiro I believe you are correct: looking for the Kubernetes event is probably the most reliable way of determining whether a pod was deleted. This is because the pod itself may not be able to write out any log messages indicating this before it is terminated. I will close this issue now. Please feel free to open a new issue if you run into a new problem or have additional questions. Regards,
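As an illustration of that approach, here is a sketch of a query one could run against OpenSearch from the command line. The viya_logs-* index pattern, the KUBE_EVENT value, and the pod name come from this thread; the exact field names, the service name used for the port-forward, and the credentials are assumptions that may differ in your deployment:

```sh
# Forward the OpenSearch HTTP port (service name is an assumption)
kubectl -n logging port-forward svc/opensearch-cluster-master 9200:9200 &

# Look for Kubernetes events that mention the deleted pod by name
curl -sk -u admin:<password> \
  "https://localhost:9200/viya_logs-*/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
        "query": {
          "bool": {
            "must": [
              { "match": { "logsource": "KUBE_EVENT" } },
              { "match": { "message": "sas-connect-spawner-5646fdb7d9-8vcmr" } }
            ]
          }
        }
      }'
```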
We dropped the pod, but the log doesn't exist.
After deleting a pod with "kubectl -n <namespace> delete pod <pod-name>", I searched in Elasticsearch's Discover by specifying the ID of the deleted pod. Some logs are found and some are not.
(1) 11/10 13:30 sas-connect-58bb448bbc-mkrmw : exist
(2) 11/10 13:40 sas-web-data-access-746b5bb9f7-mbmgz : exist
(3) 11/10 18:45 sas-connect-spawner-5646fdb7d9-8vcmr : not exist
(4) 11/10 18:50 sas-web-data-access-746b5bb9f7-8nx5m : not exist
Can you tell me if there is a possible cause?