In [1]:
%pip install google-generativeai




In [2]:

import os
import google.generativeai as genai
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import smtplib
import base64

In [3]:
# Set your Gemini API key: replace the placeholder, or export GEMINI_API_KEY before starting the notebook
os.environ.setdefault("GEMINI_API_KEY", "YOUR_API_KEY")
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Initialize the model
model = genai.GenerativeModel("gemini-2.5-flash")


In [4]:
# Example test query
response = model.generate_content("Hello Gemini, are you working?")
print(response.text)


Hello! Yes, I am here and ready to assist you. How can I help you today?


In [5]:
# Sample log data
log_data = """
2025-09-18 02:10:21 INFO Starting server on port 8080
2025-09-18 02:10:25 WARNING High memory usage detected: 85%
2025-09-18 02:10:27 ERROR Database connection failed
2025-09-18 02:10:31 INFO Retrying database connection
2025-09-18 02:10:35 ERROR Database connection failed
2025-09-18 02:10:40 CRITICAL Service unavailable due to repeated DB failures
"""

In [6]:
with open("system_logs.txt", "w") as f:
    f.write(log_data)


In [7]:
# Read logs
with open("system_logs.txt", "r") as f:
    logs = f.read()
print("Raw logs:\n", logs)


Raw logs:
 
2025-09-18 02:10:21 INFO Starting server on port 8080
2025-09-18 02:10:25 WARNING High memory usage detected: 85%
2025-09-18 02:10:27 ERROR Database connection failed
2025-09-18 02:10:31 INFO Retrying database connection
2025-09-18 02:10:35 ERROR Database connection failed
2025-09-18 02:10:40 CRITICAL Service unavailable due to repeated DB failures
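
Before sending logs to the model, it can help to know how many lines of each level they contain. The following is a minimal sketch, assuming every line follows the "date time LEVEL message" layout of the sample logs above; `LOG_LINE` and `parse_logs` are hypothetical helper names, not part of any library.

```python
import re
from collections import Counter

# Assumed line layout, taken from the sample logs: "<date> <time> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

def parse_logs(text):
    """Return a list of (timestamp, level, message) tuples; skip lines that do not match."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            records.append(match.groups())
    return records

sample = """
2025-09-18 02:10:21 INFO Starting server on port 8080
2025-09-18 02:10:25 WARNING High memory usage detected: 85%
2025-09-18 02:10:27 ERROR Database connection failed
2025-09-18 02:10:31 INFO Retrying database connection
2025-09-18 02:10:35 ERROR Database connection failed
2025-09-18 02:10:40 CRITICAL Service unavailable due to repeated DB failures
"""

records = parse_logs(sample)
level_counts = Counter(level for _, level, _ in records)
print(level_counts)
```

The counts give a quick sanity check on what the model is being asked to explain, for example that the sample contains two ERROR lines and one CRITICAL line.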



In [8]:
# Log explanation
prompt = f"""
You are an expert IT assistant who helps analyze server logs.
Here are some logs:
{logs}

Please explain in simple terms what is happening.
"""
response = model.generate_content(prompt)
print(response.text)

Alright, let's take a look at these logs. Here's a simple breakdown of what's happening:

1.  **Server Starts Up (02:10:21):** Everything begins normally, the server successfully starts.
4.  **Attempted Recovery (02:10:31):** The system tries to be smart and automatically attempts to reconnect to the database.
5.  **Database Fails Again (02:10:35):** Unfortunately, the retry attempt also **fails**.
6.  **Service Goes Down (02:10:40):** Because the server couldn't connect to its database even after trying twice, it has given up. The system has reached a **critical state**, and the **entire service is now unavailable** to users.

**In essence:** Your server started, quickly ran into memory problems, then completely lost its ability to talk to its database. Because the database is crucial, the entire service has now stopped working. The high memory usage is a strong candidate for *why* the database connection is failing.


In [9]:
# Severity classification
classification_prompt = f"""
You are an expert Site Reliability Engineer.
Read the following logs and classify the incident severity as:
- P1 (Critical): service down or customer impact
- P2 (High): service degraded, limited impact
- P3 (Low): minor issue, warning only

Explain rationale behind final classification.

Logs:
{logs}
"""
response = model.generate_content(classification_prompt)
print(response.text)


**Incident Severity: P1 (Critical)**

**Rationale:**

The logs clearly indicate a critical service outage, which aligns directly with the P1 definition of "service down or customer impact."

Here's a breakdown of the escalation:

2.  **`ERROR Database connection failed` (repeated twice)**: This indicates a fundamental failure in a core component. The service cannot perform its function without database access. Repeated failures, especially after a retry, point to a persistent and serious issue.
3.  **`CRITICAL Service unavailable due to repeated DB failures`**: This is the definitive statement. The service explicitly declares itself "unavailable," directly confirming that it is down and customers are impacted. The `CRITICAL` log level further emphasizes the severity from the application's perspective.

The combination of persistent database connection failures and the final declaration of "Service unavailable" leaves no doubt that the service is non-functional, leading to a direct and 

In [10]:
# Summary + root cause
summary_prompt = f"""
You are an expert Incident Response Assistant.
Read the following logs and provide:
1. A short summary (2-3 sentences) in plain English.
2. Possible root cause hints (bullet points).

Logs:
{logs}
"""
response = model.generate_content(summary_prompt)
print(response.text)


Here's a summary and possible root cause hints based on the logs:

### 1. Short Summary

The server started successfully, but within seconds, it reported critically high memory usage. This was immediately followed by repeated failures to establish a database connection, ultimately leading to the service becoming unavailable.

### 2. Possible Root Cause Hints

*   **Memory Exhaustion on Application Host:** The high memory usage (85%) detected very early suggests the application server might be running out of resources. This could be due to a memory leak in the application, insufficient allocated memory for the server, or another process on the same host consuming excessive resources, thus preventing the application from properly connecting to the database.
*   **Database Server Issues:** The database server itself might be down, unresponsive, or overloaded, reaching its connection limits, which would prevent the application from establishing a connection.
*   **Network Connectivity Prob

In [11]:
def send_email_alert(message: str):
    sender_email = "itskillsupgrade@gmail.com"
    receiver_email = "itskillsupgrade@gmail.com"
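    # Read the Google App Password from the environment instead of hardcoding it
    # (GMAIL_APP_PASSWORD is an assumed variable name; set it before running)
    app_password = os.environ.get("GMAIL_APP_PASSWORD", "")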

    subject = "Incident Alert from Gemini"

    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = receiver_email
    msg['Subject'] = subject
    msg.attach(MIMEText(message, 'plain'))

    try:
        # The context manager closes the SMTP connection even if login or send fails
        with smtplib.SMTP("smtp.gmail.com", 587) as server:
            server.starttls()
            server.login(sender_email, app_password)
            server.sendmail(sender_email, receiver_email, msg.as_string())
        print("[EMAIL SENT] " + message)
    except Exception as e:
        print("[ERROR] Failed to send email:", e)

In [12]:
# Escalation workflow
def escalation_workflow(severity: str, summary: str):
    if severity == "P3":
        return "[NO ACTION] Logged only."
    elif severity == "P2":
        send_email_alert(f"[P2 Incident] {summary}")
        return "[ACTION] Emailed ops team."
    elif severity == "P1":
        send_email_alert(f"[P1 CRITICAL] {summary} | Escalating to manager.")
        return "[ACTION] Emailed ops + manager. War room escalation triggered."
    else:
        return "[UNKNOWN] No matching workflow."
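
To exercise the escalation branches without sending real email, one option is to pass the alert sender in as a parameter. This is only a sketch of that idea: `escalate` and its `notify` argument are hypothetical names, and the branching and messages are copied from the workflow above.

```python
from typing import Callable

def escalate(severity: str, summary: str, notify: Callable[[str], None] = print) -> str:
    """Same branching as the escalation workflow, with the alert sender passed in."""
    if severity == "P3":
        return "[NO ACTION] Logged only."
    if severity == "P2":
        notify(f"[P2 Incident] {summary}")
        return "[ACTION] Emailed ops team."
    if severity == "P1":
        notify(f"[P1 CRITICAL] {summary} | Escalating to manager.")
        return "[ACTION] Emailed ops + manager. War room escalation triggered."
    return "[UNKNOWN] No matching workflow."

# Dry run: collect alerts in a list instead of sending email
sent = []
result = escalate("P1", "Database unavailable due to repeated failures", notify=sent.append)
print(result)
print(sent)
```

In the notebook, passing `notify=send_email_alert` would restore the real email behavior.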

In [13]:
incident_logs = """
2025-09-18 02:10:27 ERROR Database connection failed
2025-09-18 02:10:35 ERROR Database connection failed
2025-09-18 02:10:40 CRITICAL Service unavailable due to repeated DB failures
"""

classification_prompt = f"""
You are an expert incident classifier.
Classify the severity as P1, P2, or P3 and give a 1-line summary.

Logs:
{incident_logs}
"""
response = model.generate_content(classification_prompt)
ai_output = response.text
print("Gemini Output:", ai_output)


Gemini Output: P1: Service is unavailable due to repeated database connection failures.


In [14]:
# Example severity and summary (hardcoded here; in practice, parse them from ai_output)
severity = "P1"
summary = "Database unavailable due to repeated failures"

result = escalation_workflow(severity, summary)
print(result)


[EMAIL SENT] [P1 CRITICAL] Database unavailable due to repeated failures | Escalating to manager.
[ACTION] Emailed ops + manager. War room escalation triggered.
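
The severity and summary above are typed in by hand. A minimal sketch of pulling them out of the model's text instead is shown below, assuming the reply contains a P1, P2, or P3 label followed by the summary, as in the "Gemini Output" line earlier. `extract_severity` is a hypothetical helper, and model replies are not guaranteed to follow this shape.

```python
import re

def extract_severity(ai_output: str):
    """Return (severity, summary) parsed from model text, or ("UNKNOWN", text) if no label is found."""
    match = re.search(r"\b(P[123])\b", ai_output)
    if not match:
        return "UNKNOWN", ai_output.strip()
    severity = match.group(1)
    # Treat whatever follows the label (minus separator characters) as the summary
    summary = ai_output[match.end():].strip(" :-*\n")
    return severity, summary

severity, summary = extract_severity(
    "P1: Service is unavailable due to repeated database connection failures."
)
print(severity, "|", summary)
```

The first label found wins, so a reply that repeats the whole scale ("P1, P2, or P3") before giving its answer would be misread. An "UNKNOWN" result falls through to the workflow's final branch, so no alert is sent on an unparseable reply.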


In [15]:
# Large logs example: one error per minute for an hour (range(60) keeps the minute field valid)
large_logs = "\n".join([
    f"2025-09-18 02:{i:02d}:00 ERROR Database connection failed"
    for i in range(60)
]) + "\n2025-09-18 03:00:00 CRITICAL Service unavailable"

with open("large_logs.txt", "w") as f:
    f.write(large_logs)

with open("large_logs.txt", "r") as f:
    logs = f.read()
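
Large logs like these are mostly one message repeated, so they can be shortened before being placed in a prompt. The sketch below groups lines by message and keeps the first timestamp, last timestamp, and count. `collapse_repeats` is a hypothetical helper, and it assumes each line starts with a date and a time separated by single spaces, as in the logs in this notebook.

```python
def collapse_repeats(text: str) -> str:
    """Group lines by message (everything after the timestamp); report first/last time and count."""
    groups = {}  # message -> (first timestamp, last timestamp, count), in first-seen order
    for line in text.splitlines():
        parts = line.split(" ", 2)  # date, time, "LEVEL message"
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        timestamp, message = f"{parts[0]} {parts[1]}", parts[2]
        first, _, count = groups.get(message, (timestamp, timestamp, 0))
        groups[message] = (first, timestamp, count + 1)
    return "\n".join(
        f"{first} .. {last} x{count} {message}"
        for message, (first, last, count) in groups.items()
    )

# Sample with the same layout as the large logs in this notebook
sample = "\n".join(
    f"2025-09-18 02:{i:02d}:00 ERROR Database connection failed" for i in range(60)
) + "\n2025-09-18 03:00:00 CRITICAL Service unavailable"

compact = collapse_repeats(sample)
print(compact)
```

The compact text keeps the time span and repetition count, which are the facts a summary needs, while using far fewer tokens than the raw lines.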


In [16]:

summary_prompt = f"""
You are an expert incident responder.
Read the following logs and provide:
1. A 3-sentence summary in plain English.
2. The final incident severity (P1, P2, P3).
3. Root cause hints in bullet points.

Logs:
{logs}
"""
response = model.generate_content(summary_prompt)
print(response.text)

Here's the incident analysis:

---

1.  **3-sentence summary in plain English:**
    Starting at 02:00:00, the system experienced continuous "Database connection failed" errors every minute for a full hour. This prolonged and unresolvable issue with database connectivity escalated to a complete system outage. By 03:00:00, the service was deemed "CRITICAL Service unavailable," indicating a total loss of functionality.

2.  **Final incident severity:**
    P1 (Critical)

3.  **Root cause hints:**
    *   **Database Server Status:** The database server itself may be down, crashed, or unresponsive.
    *   **Network Connectivity:** Issues with network paths, firewalls, or routing preventing the application from reaching the database.
    *   **Database Resource Exhaustion:** The database might have run out of resources such as available connections, disk space, or memory.
    *   **Authentication/Authorization:** Incorrect, expired, or revoked database credentials used by the application.


In [18]:
# Optional: Analyze image (Gemini multimodal)
image_path = "/content/Screenshot 2025-10-15 183200.png"

def read_image_bytes(image_path):
    # The request takes raw image bytes, so no base64 round trip is needed
    with open(image_path, "rb") as image_file:
        return image_file.read()

image_bytes = read_image_bytes(image_path)

response = model.generate_content([
    "Analyze this graph and explain what's happening.",
    {"mime_type": "image/png", "data": image_bytes}
])

print(response.text)

This graph displays the memory utilization and swap usage of a system over approximately 20 minutes, from 05:40 to 06:00 on September 21, 2025.

Here's a breakdown of what's happening:

1.  **Initial State (05:40 - ~05:44):**
    *   **Memory Utilization:** The system is using approximately **8.5-9 GB of "used" memory** (teal/blue) for active applications and the OS. There's also about **4-4.5 GB of "cached" memory** (orange), which the OS uses to store frequently accessed data for faster retrieval. "Buffer" memory (green) is negligible. The total physical memory being utilized is around **13-13.5 GB**.
    *   **Swap Utilization:** During this period, swap usage (red line) is very low, around **9-10 MB**. This indicates that the system has ample physical RAM and doesn't need to offload data to the slower disk-based swap space.

2.  **Increased Memory Pressure (~05:44 - ~05:59):**
    *   **Memory Utilization:** Around 05:44, there's a noticeable increase in **"used" memory**, rising t