In [45]:
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
import os
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
from langchain_community.vectorstores import FAISS
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
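from IPython.display import display, Markdown

# Read environment variables (such as the Hugging Face API token) from a local .env file, if one exists.
load_dotenv()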


In [46]:
path = os.path.join("..", "docs") 
loader = DirectoryLoader(
    path,
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True 
)

documents = loader.load()

100%|██████████| 20/20 [00:00<00:00, 35.93it/s]


In [47]:

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# No text splitter is used at this point because the documents are small.
# Generate embeddings and store them in a FAISS vector index.
vector_store = FAISS.from_documents(documents=documents, embedding=embedding_model)
vector_store.save_local("faiss_index")
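
The cell above embeds each file whole, which works while the documents stay small (the model card for `all-MiniLM-L6-v2` states that input longer than 256 word pieces is truncated by default). If the documents grow, they would need to be split before embedding. The block below is a minimal sketch of fixed-size chunking with overlap in plain Python; `split_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration and are not part of this notebook or of LangChain, whose own text splitters would be the more usual choice.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    chunk boundary still appears intact in one of the two chunks.
    (Hypothetical helper for illustration only.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "Check the pod events. Inspect the probe settings. Review the logs."
    for chunk in split_text(sample, chunk_size=30, overlap=5):
        print(repr(chunk))
```

Splitting on characters is the crudest option; splitting on Markdown headings or paragraphs would usually keep each runbook section together.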

In [48]:
llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="conversational"
) # type: ignore

chat_model = ChatHuggingFace(llm=llm)

In [49]:
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", """You are an expert Site Reliability Engineer. 
        Your task is to create a clear, step-by-step troubleshooting runbook based on the provided context. 
        The output must be actionable, not just summary text. For example: 
Input: 
“I’m seeing increased memory usage on pod X” 
Output: 
○ Step 1: Check container memory limits and requests using kubectl describe pod
○ Step 2: Compare historical memory usage from Prometheus or Datadog 
○ Step 3: Look for memory leaks in the application logs 
○ Step 4: Consider restarting the pod if memory is consistently breaching limits"""),
        
        ("human", """Here is the context from our internal documents:
        ---
        {context}
        ---
        Now, please generate a runbook for the following problem: {question}""")
    ]
)
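
Because the documents are not split, each retrieved document goes into `{context}` whole, so the prompt can grow large. The block below is a minimal sketch of joining retrieved texts under a character budget; `build_context`, `max_chars`, and the default of 6000 are assumptions made for illustration, not values taken from this notebook or from the model's documentation.

```python
def build_context(texts: list[str], max_chars: int = 6000, sep: str = "\n\n") -> str:
    """Join texts in order, stopping before the result would exceed max_chars.

    Texts are assumed to arrive most-relevant first, so the least relevant
    ones are the first to be dropped. (Hypothetical helper for illustration.)
    """
    selected: list[str] = []
    used = 0
    for text in texts:
        # A separator is only needed between entries, not before the first one.
        cost = len(text) + (len(sep) if selected else 0)
        if used + cost > max_chars:
            break
        selected.append(text)
        used += cost
    return sep.join(selected)


if __name__ == "__main__":
    docs = ["Liveness probe notes.", "HPA troubleshooting notes.", "OOMKilled notes."]
    print(build_context(docs, max_chars=50))
```

A character budget is only a rough proxy for the model's token limit; counting tokens with the model's tokenizer would be more precise.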

In [50]:
# Load the index from the same path it was saved to above.
db = FAISS.load_local(
    "faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True,  # acceptable here only because the index was created locally
)


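while True:
    query = input("Enter your query here: ").strip()
    if query.lower() == "exit":
        break
    if not query:
        continue  # ignore empty input instead of sending an empty question

    # Retrieve the three documents closest to the query embedding.
    query_embedding = embedding_model.embed_query(query)
    similar_docs = db.similarity_search_by_vector(query_embedding, k=3)
    context = "\n\n".join(doc.page_content for doc in similar_docs)

    # Pass the raw string: a HumanMessage object would be formatted into the
    # template as its full string representation, not just the question text.
    prompt = prompt_template.invoke({"question": query, "context": context})
    display(Markdown(chat_model.invoke(prompt).content))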

**Runbook for "My pod keeps restarting and the events show 'liveness probe failed'."**

**Step 1: Check Liveness Probe Configuration**

1. Run `kubectl describe pod <pod-name>` to view the events and configuration of the pod.
2. Look for the liveness probe configuration under the "Readiness Gates" or "Liveness" section.
3. Check the probe's timeout, period, and command to determine if it's too aggressive.

**Step 2: Verify Downstream Dependencies**

1. Inspect the application code to determine if it's making any downstream calls.
2. Check if the downstream call is slow and if it's causing the liveness probe to fail.
3. Consider implementing a mechanism to cache or throttle these calls to improve performance.

**Step 3: Analyze Application Logs**

1. Run `kubectl logs <pod-name>` to view the application logs.
2. Search for errors or exceptions related to the liveness probe failure.
3. Check for any signs of memory leaks or resource exhaustion.

**Step 4: Consider Increasing Liveness Probe Timeout**

1. If the liveness probe is too aggressive, consider increasing the timeout or period.
2. Run `kubectl patch deployment <deployment-name> --patch '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","livenessProbe":{"timeoutSeconds":5,"periodSeconds":10}}]}}}}'` to update the liveness probe configuration.
3. Monitor the pod's behavior and adjust the configuration as needed.

**Step 5: Force a Rolling Restart (Optional)**

1. If the problem persists after increasing the liveness probe timeout, consider forcing a rolling restart of the deployment.
2. Run `kubectl rollout restart deployment/<deployment-name>` to restart all pods in the deployment.
3. Monitor the pod's behavior and adjust the configuration as needed.

**Step 6: Verify HPA Configuration (Optional)**

1. If the HPA is not scaling the deployment correctly, check the HPA configuration.
2. Run `kubectl describe hpa <hpa-name>` to view the HPA configuration.
3. Verify that the HPA is configured correctly and that the metrics-server is running correctly.

**Step 7: Re-Install Metrics-Server (Optional)**

1. If the metrics-server is not running correctly, re-install it.
2. Run `kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.6.1/components.yaml` to re-install the metrics-server.
3. Verify that the HPA is now scaling the deployment correctly.

Based on the provided context and the problem description, here's a step-by-step troubleshooting runbook:

**Problem:** Pod memory usage is consistently high, potentially causing memory pressure on the node and pod eviction.

**Runbook:**

○ **Step 1: Check container memory limits and requests using kubectl describe pod**

 Run the following command to check the pod's configuration:
```bash
kubectl describe pod <pod_name>
```
Look for the "Resources" section and verify that the container has memory limits and requests set. Make a note of the values.

○ **Step 2: Compare historical memory usage from Prometheus or Datadog**

Access the Prometheus or Datadog UI to view historical memory usage data for the pod. Compare the current memory usage to previous values to identify any patterns or anomalies.

○ **Step 3: Look for memory leaks in the application logs**

 Review the application logs to check for any signs of memory leaks. Look for error messages or patterns that suggest excessive memory usage.

○ **Step 4: Check if the pod is configured to use excessive resources**

 Verify that the pod's configuration is not intentionally set to use excessive resources. Check the pod's YAML file and the deployment or replica set configuration to ensure that the memory limits and requests are set correctly.

○ **Step 5: Consider restarting the pod if memory is consistently breaching limits**

 If the memory usage is consistently high and causing issues, consider restarting the pod to reset its memory usage. Use the following command to restart the pod:
```bash
kubectl rollout restart deployment <deployment_name>
```
Replace `<deployment_name>` with the actual name of the deployment associated with the pod.

○ **Step 6: Review and adjust resource requests and limits**

 Based on the previous steps, review and adjust the resource requests and limits for the pod as necessary. Make sure to set the requests equal to the limits for critical workloads to get a Guaranteed QoS class.

○ **Step 7: Run a memory leak detection tool (optional)**

 If the issue persists, consider running a memory leak detection tool, such as Valgrind or AddressSanitizer, to identify any memory leaks in the application code.

By following these steps, you should be able to troubleshoot and resolve the issue with high pod memory usage.