<h1>Kubernetes Healthcheck Runbook</h1>
<p>This runbook retrieves all of your Kubernetes pods, reads their logs, and outputs any WARNING log entries from the last hour.</p>
<p>&nbsp;</p>
<ul>
<li>Step 1: Get all of the pods.</li>
<li>Step 2: Get all of the logs for each pod.</li>
<li>Step 3: Parse the logs for warnings in the last hour.</li>
<li>Step 4: If there are warnings, send a Slack alert.</li>
</ul>

<p>Step 1 requires the namespace, which is taken from the input parameters.</p>
<p>&nbsp;</p>
<p>This step queries the namespace and returns a list of pods in the output variable 'podList'.</p>
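<p>As a minimal sketch of what "returning a list of pods" involves, the snippet below extracts the NAME column from the text that <code>kubectl get pods</code> prints. It is independent of the Task framework; the sample output and the helper name <code>parse_pod_names</code> are made up for illustration and are not part of the runbook.</p>

```python
# Hypothetical sample of `kubectl get pods -n <namespace>` output (made up for illustration).
SAMPLE_OUTPUT = """\
NAME                        READY   STATUS    RESTARTS   AGE
elasticsearch-master-0      1/1     Running   0          5d
kibana-7d9f8c6b5-x2x9q      1/1     Running   2          5d
"""


def parse_pod_names(kubectl_output: str) -> list:
    """Return the NAME column from whitespace-delimited `kubectl get pods` output."""
    # Drop blank lines so trailing newlines do not produce empty rows.
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    if "NAME" not in header:
        return []
    name_index = header.index("NAME")
    # Every row after the header holds one pod.
    return [line.split()[name_index] for line in lines[1:]]


pod_list = parse_pod_names(SAMPLE_OUTPUT)
print(pod_list)
```

<p>An empty result (for example, a namespace with no pods) yields an empty list, so later steps have nothing to iterate over.</p>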

In [24]:
from pydantic import BaseModel, Field
import pandas as pd
import io

from beartype import beartype
@beartype
def k8s_kubectl_list_pods_printer(data: list):
    if data is None:
        return

    print("POD List:")

    for pod in data:
        print(f"\t {pod}")

@beartype
def k8s_kubectl_list_pods(handle, k8s_cli_string: str, namespace: str) -> list:
    """k8s_kubectl_list_pods executes the given kubectl command

        :type handle: object
        :param handle: Object returned from the Task validate method

        :type k8s_cli_string: str
        :param k8s_cli_string: kubectl get pods -n {namespace}.

        :type namespace: str
        :param namespace: Namespace.

        :rtype: list, the names of the pods found in the namespace.
    """
    k8s_cli_string = k8s_cli_string.format(namespace=namespace)
    result = handle.run_native_cmd(k8s_cli_string)
    df = pd.read_fwf(io.StringIO(result.stdout))
    all_pods = []
    for index, row in df.iterrows():
        all_pods.append(row['NAME'])
    return all_pods


task = Task(Workflow())
task.configure(inputParamsJson='''{
    "k8s_cli_string": "\\"kubectl get pods -n {namespace}\\"",
    "namespace": "namespace"
    }''')
task.configure(outputName="podList")
task.configure(printOutput=True)
(err, hdl, args) = task.validate(vars=vars())
if err is None:
    task.execute(k8s_kubectl_list_pods, lego_printer=k8s_kubectl_list_pods_printer, hdl=hdl, args=args)

<p>Step 2 takes the list of pods, 'podList', from Step 1, together with the namespace input parameter, and obtains the logs for all of the pods.</p>
<p>&nbsp;</p>
<p>We use the Iterator to iterate through the list.&nbsp; This can take a while if you have a lot of pods.</p>
<p>&nbsp;</p>
<p>The output is saved in a dict called 'allTheLogs'.</p>
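<p>Outside the Task framework, the iteration can be sketched roughly as follows. This assumes the resulting dict is keyed by pod name; the helper names (<code>fetch_logs_with_kubectl</code>, <code>collect_logs</code>, <code>fake_fetch</code>) are hypothetical, and the stand-in fetcher lets the sketch run without a cluster.</p>

```python
import subprocess
from typing import Callable, Dict, List


def fetch_logs_with_kubectl(pod_name: str, namespace: str) -> str:
    """Run `kubectl logs <pod> -n <namespace>` and return stdout (assumes kubectl is on PATH)."""
    result = subprocess.run(
        ["kubectl", "logs", pod_name, "-n", namespace],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def collect_logs(pods: List[str], namespace: str,
                 fetch: Callable[[str, str], str] = fetch_logs_with_kubectl) -> Dict[str, str]:
    """Build a dict that maps each pod name to its log text."""
    return {pod: fetch(pod, namespace) for pod in pods}


# Stand-in fetcher (hypothetical) so the sketch runs without a cluster.
def fake_fetch(pod_name: str, namespace: str) -> str:
    return f"log output for {pod_name} in {namespace}"


all_the_logs = collect_logs(["pod-a", "pod-b"], "default", fetch=fake_fetch)
print(all_the_logs["pod-a"])
```

<p>Because logs are fetched one pod at a time, the total run time grows with the number of pods.</p>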

In [25]:
from pydantic import BaseModel, Field
from pprint import pprint

from beartype import beartype
@beartype
def k8s_kubectl_get_logs_printer(data: str):
    if data is None:
        return

    print("Logs:")

    pprint (data)

@beartype
def k8s_kubectl_get_logs(handle, k8s_cli_string: str, pod_name: str, namespace:str) -> str:
    """k8s_kubectl_get_logs executes the given kubectl command

        :type handle: object
        :param handle: Object returned from the Task validate method

        :type k8s_cli_string: str
        :param k8s_cli_string: kubectl logs {pod_name} -n {namespace}.

        :type pod_name: str
        :param pod_name: Pod Name.

        :type namespace: str
        :param namespace: Namespace.

        :rtype: String, Output of the command in python string format or Empty String in case of Error.
    """
    k8s_cli_string = k8s_cli_string.format(pod_name=pod_name, namespace=namespace)
    result = handle.run_native_cmd(k8s_cli_string)
    data = result.stdout
    return data


task = Task(Workflow())
task.configure(continueOnError=False)
task.configure(inputParamsJson='''{
    "k8s_cli_string": "\\"kubectl logs {pod_name} -n {namespace}\\"",
    "namespace": "namespace",
    "pod_name": "iter_item"
    }''')
task.configure(iterJson='''{
    "iter_enabled": true,
    "iter_list_is_const": false,
    "iter_list": "podList",
    "iter_parameter": "pod_name"
    }''')
task.configure(outputName="allTheLogs")

task.configure(printOutput=True)
(err, hdl, args) = task.validate(vars=vars())
if err is None:
    task.execute(k8s_kubectl_get_logs, lego_printer=k8s_kubectl_get_logs_printer, hdl=hdl, args=args)

<p>'allTheLogs' can be a large dict.&nbsp; This step loops through each pod's log and extracts any WARNING messages.</p>
<p>The input parameter hoursToExamine sets how many hours back to look for warnings.</p>
<p>&nbsp;</p>
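<p>The filtering idea can be tried in isolation with the sketch below. It uses a regular expression instead of the fixed character offsets in the runbook cell, and it assumes log lines shaped like <code>[2024-01-01T12:00:00,123][WARN ][component] message</code>. The function name <code>recent_warnings</code> and the sample log are made up for illustration.</p>

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "[2024-01-01T12:00:00,123][WARN ][component] message"
WARN_PATTERN = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})[^\]]*\]\[WARN[^\n]*"
)


def recent_warnings(log: str, now: datetime, hours: int) -> dict:
    """Return {timestamp: warning line} for warnings newer than `now - hours`."""
    cutoff = now - timedelta(hours=hours)
    found = {}
    for match in WARN_PATTERN.finditer(log):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if stamp > cutoff:
            found[match.group(1)] = match.group(0)
    return found


# Made-up sample log for illustration.
sample_log = (
    "[2024-01-01T09:00:00,000][WARN ][o.e.c] old warning\n"
    "[2024-01-01T11:30:00,000][INFO ][o.e.c] routine message\n"
    "[2024-01-01T11:45:00,000][WARN ][o.e.c] recent warning\n"
)
now = datetime(2024, 1, 1, 12, 0, 0)
recent = recent_warnings(sample_log, now, hours=1)
print(recent)
```

<p>Passing <code>now</code> explicitly keeps the function deterministic, which makes it easier to test than calling <code>datetime.now()</code> inside it.</p>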

In [26]:
import re
from datetime import datetime, timedelta

# Collect all warnings.
# Only report warnings found in the last hoursToExamine hours.
timeDiff = datetime.now() - timedelta(hours=hoursToExamine)
# If there are warnings that are OK to suppress, add them to this list.
stringsToIgnore = ["arerqewreqwr"]
# This will hold all the warnings, keyed by instance and then by timestamp.
warning_text_all = {}

# Specific issues we can deal with
primaryShardIsNotActive = False

# We've collected a bunch of logs; loop through them looking for warnings.
for instance in allTheLogs:
    log = allTheLogs[instance]
    # Find the position of every '[WARN' in the log
    warning_start = [m.start() for m in re.finditer(re.escape('[WARN'), log)]

    for i in warning_start:
        # The timestamp is expected in the 19 characters that end 5 characters before '[WARN'
        warningtime = log[max(i-24, 0):max(i-5, 0)]
        issue = log[i:i+400]
        try:
            warningtimeDT = datetime.strptime(warningtime, '%Y-%m-%dT%H:%M:%S')
        except ValueError:
            # Skip warnings whose timestamp cannot be parsed
            continue
        if warningtimeDT > timeDiff:
            # Skip warnings that contain any of the strings to ignore
            if not any(s in issue for s in stringsToIgnore):
                # Keep every warning for this instance, not just the most recent one
                warning_text_all.setdefault(instance, {})[warningtime] = issue
                # Test for specific issues
                if "primary shard is not active Timeout" in issue:
                    primaryShardIsNotActive = True

print(warning_text_all, len(warning_text_all))

<p>Only send a Slack message if there is a problem.</p>
<p>&nbsp;</p>
<p>To facilitate this, we use the Start Condition:</p>
<p><code>len(warning_text_all) &gt; 0</code></p>
<p>If there are warnings, a Slack message is sent. If there are no warnings, no message is sent.</p>
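<p>The same gate can be sketched in plain Python. This assumes <code>warning_text_all</code> is a dict that maps each pod to a dict of timestamped warnings; the helper names <code>format_warnings</code> and <code>maybe_alert</code> are hypothetical, and a list stands in for the Slack client so the sketch runs without credentials.</p>

```python
def format_warnings(warning_text_all: dict) -> str:
    """Flatten {pod: {timestamp: warning}} into one plain-text message."""
    lines = []
    for pod, warnings_by_time in warning_text_all.items():
        for timestamp, issue in warnings_by_time.items():
            lines.append(f"{pod} {timestamp}: {issue.strip()}")
    return "\n".join(lines)


def maybe_alert(warning_text_all: dict, send) -> bool:
    """Call send(message) only when at least one warning was collected."""
    if len(warning_text_all) > 0:
        send(format_warnings(warning_text_all))
        return True
    return False


# Stand-in sender (hypothetical) that records messages instead of calling Slack.
sent = []
maybe_alert({}, sent.append)
maybe_alert({"pod-a": {"2024-01-01T11:45:00": "[WARN ] disk usage high"}}, sent.append)
print(sent)
```

<p>Converting the dict to a string before sending may also be useful here, since the Slack action's <code>message</code> parameter is annotated as <code>str</code>.</p>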

In [27]:
##
# Copyright (c) 2021 unSkript, Inc
# All rights reserved.
##

import pprint

from pydantic import BaseModel, Field
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

pp = pprint.PrettyPrinter(indent=2)

from beartype import beartype
@beartype
def slack_post_message_printer(output):
    if output is not None:
        pprint.pprint(output)
    else:
        return


@beartype
def slack_post_message(
        handle: WebClient,
        channel: str,
        message: str) -> str:

    try:
        response = handle.chat_postMessage(
            channel=channel,
            text=message)
        return f"Successfully Sent Message on Channel: #{channel}"
    except SlackApiError as e:
        pp.pprint(
            f"Failed sending message to slack channel {channel}, Error: {e.response['error']}")
        if e.response['error'] == 'channel_not_found':
            raise Exception('Channel Not Found')
        elif e.response['error'] == 'duplicate_channel_not_found':
            raise Exception('Channel associated with the message_id not valid')
        elif e.response['error'] == 'not_in_channel':
            raise Exception('Cannot post message to channel user is not in')
        elif e.response['error'] == 'is_archived':
            raise Exception('Channel has been archived')
        elif e.response['error'] == 'msg_too_long':
            raise Exception('Message text is too long')
        elif e.response['error'] == 'no_text':
            raise Exception('Message text was not provided')
        elif e.response['error'] == 'restricted_action':
            raise Exception('Workspace preference prevents user from posting')
        elif e.response['error'] == 'restricted_action_read_only_channel':
            raise Exception('Cannot Post message, read-only channel')
        elif e.response['error'] == 'team_access_not_granted':
            raise Exception('The token used is not granted access to the workspace')
        elif e.response['error'] == 'not_authed':
            raise Exception('No Authentication token provided')
        elif e.response['error'] == 'invalid_auth':
            raise Exception('Some aspect of Authentication cannot be validated. Request denied')
        elif e.response['error'] == 'access_denied':
            raise Exception('Access to a resource specified in the request denied')
        elif e.response['error'] == 'account_inactive':
            raise Exception('Authentication token is for a deleted user')
        elif e.response['error'] == 'token_revoked':
            raise Exception('Authentication token for a deleted user has been revoked')
        elif e.response['error'] == 'no_permission':
            raise Exception('The workspace token used does not have necessary permission to send message')
        elif e.response['error'] == 'ratelimited':
            raise Exception('The request has been ratelimited. Retry sending message later')
        elif e.response['error'] == 'service_unavailable':
            raise Exception('The service is temporarily unavailable')
        elif e.response['error'] == 'fatal_error':
            raise Exception('The server encountered catastrophic error while sending message')
        elif e.response['error'] == 'internal_error':
            raise Exception('The server could not complete operation, likely due to transient issue')
        elif e.response['error'] == 'request_timeout':
            raise Exception('Sending message error via POST: either message was missing or truncated')
        else:
            raise Exception(f'Failed Sending Message to slack channel {channel} Error: {e.response["error"]}')

    except Exception as e:
        print("\n\n")
        pp.pprint(
            f"Failed sending message to slack channel {channel}, Error: {e.__str__()}")
        return f"Unable to send message on {channel}"


task = Task(Workflow())
task.configure(inputParamsJson='''{
 
    "message": "warning_text_all"
    }''')
task.configure(conditionsJson='''{
    "condition_enabled": true,
    "condition_cfg": "len(warning_text_all) >0",
    "condition_result": true
    }''')
task.configure(printOutput=True)
(err, hdl, args) = task.validate(vars=vars())
if err is None:
    task.execute(slack_post_message, lego_printer=slack_post_message_printer, hdl=hdl, args=args)