This notebook reads all the content from a folder of wikitext files. The files are Root Cause Analysis (RCA) reports focusing on technical incidents related to connectivity issues and DDoS attacks affecting Wikimedia sites and cloud services.

Dataset link: https://wikitech.wikimedia.org/w/index.php?title=Category:Incident_documentation&pageuntil=Incidents/2025

In [None]:
pip install langchain pinecone-client sentence-transformers tqdm langchain-community mwparserfromhell ollama


Step 0: Install all the requirements:

In [2]:
!pip install langchain-community


In [14]:
# Function to clean wikitext
import mwparserfromhell 

def clean_wikitext(content):
    # Parse wikitext and remove unwanted tags and formatting
    wikicode = mwparserfromhell.parse(content)
    text = wikicode.strip_code()  # Removes all wikitext formatting
    return text

In [None]:
import os
from tqdm import tqdm
from pinecone import Pinecone, ServerlessSpec
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  # Set this environment variable to your API key
index_name = "rca-reports"

# Ensure the index exists before connecting to it (it can also be created from the Pinecone portal)
dimension = 768
existing_indexes = [idx["name"] for idx in pc.list_indexes()]
if index_name not in existing_indexes:
    # Assumption: a serverless index; adjust cloud and region to match your Pinecone project
    pc.create_index(name=index_name, dimension=dimension, spec=ServerlessSpec(cloud="aws", region="us-east-1"))
index = pc.Index(index_name)

# Load open-source embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Path to RCA Reports folder
folder_path = "RCA_Reports"
documents = []

for filename in tqdm(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, filename)
    if os.path.isfile(file_path) and file_path.endswith(('.txt', '.md', '.wikitext')):
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            cleaned_content = clean_wikitext(content)
            documents.append({"id": filename, "content": cleaned_content})

# Generate embeddings and prepare upsert payload
vectors = []
for doc in tqdm(documents):
    embedding = embedding_model.embed_query(doc["content"])  # Generate 768-dimensional embedding
    vectors.append(
        {
            "id": doc["id"],
            "values": embedding,
            "metadata": {"filename": doc["id"], "content": doc["content"]}
        }
    )

# Upsert vectors into the Pinecone index
index.upsert(vectors=vectors, namespace="rca-namespace")

print(f"Successfully upserted {len(vectors)} vectors to Pinecone.")


100%|██████████| 200/200 [00:00<00:00, 683.87it/s]
100%|██████████| 199/199 [00:10<00:00, 18.15it/s]


Successfully upserted 199 vectors to Pinecone.
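
The cell above embeds each report as a single string and sends every vector in one `upsert` call. Sentence-transformers models truncate input beyond their maximum sequence length, and Pinecone limits the size of a single upsert request, so long reports and larger folders may need to be split first. The following is a minimal sketch, not part of the original notebook: `split_into_chunks`, `batched`, and the values `max_words=200`, `overlap=20`, and `batch_size=2` are hypothetical names and sizes chosen only for illustration, and no Pinecone call is made.

```python
from typing import Iterator, List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # This window already reached the end of the text
    return chunks


def batched(items: List[dict], batch_size: int = 50) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy text standing in for one cleaned report
sample_text = " ".join(f"word{i}" for i in range(450))
chunks = split_into_chunks(sample_text)

# One record per chunk; the id scheme "<filename>#chunk<i>" is an assumption
records = [
    {"id": f"report.wikitext#chunk{i}", "content": chunk}
    for i, chunk in enumerate(chunks)
]

for batch in batched(records, batch_size=2):
    # In the notebook, each record would first get a "values" embedding, and the
    # batch would then be passed to index.upsert(vectors=batch, namespace="rca-namespace")
    print([record["id"] for record in batch])
```

Splitting by words is only a rough proxy for the model's token limit; a tokenizer-based splitter would be more precise.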


In [None]:
# Perform a vector search on the Pinecone database and return the top 5 results (change top_k as needed)

import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from pinecone import Pinecone

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  # Set this environment variable to your API key
index = pc.Index("rca-reports")

# Load open-source embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Perform semantic search
def semantic_search(query, top_k=5):
    # Generate embedding for the query
    query_embedding = embedding_model.embed_query(query)
    
    # Query Pinecone index
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        namespace="rca-namespace"
    )
    
    # Process results
    search_results = []
    for result in results["matches"]:
        search_results.append({
            "id": result["id"],
            "score": result["score"],  # Similarity score
            "metadata": result["metadata"]  # Metadata (filename and content)
        })
    
    return search_results

# Example search
query = "What is the most common cause of incidents in Wikimedia's infrastructure?"
results = semantic_search(query)

# Print results
for result in results:
    print(f"ID: {result['id']}")
    print(f"Score: {result['score']}")
    print(f"Filename: {result['metadata']['filename']}")
    print(f"Content: {result['metadata']['content'][:500]}...")  # Display first 500 chars of content
    print("\n---\n")
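
Conceptually, `index.query` ranks the stored vectors by their similarity to the query vector and returns the `top_k` best matches. The sketch below illustrates that idea with a brute-force dot-product ranking over toy 3-dimensional vectors. It is an illustration only: `brute_force_top_k`, the document ids, and the vectors are hypothetical, the similarity metric an index actually uses depends on how it was created, and a managed vector database does not necessarily scan every vector this way.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(
    query: List[float], stored: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real ones from the notebook have 768 dimensions
stored_vectors = {
    "network-outage": [0.9, 0.1, 0.0],
    "db-overload": [0.1, 0.8, 0.2],
    "ddos-attack": [0.7, 0.0, 0.6],
}
query_vector = [1.0, 0.0, 0.5]

matches = brute_force_top_k(query_vector, stored_vectors, top_k=2)
for doc_id, score in matches:
    print(doc_id, round(score, 2))
```

A higher score means a closer match under the chosen metric, which is how the `score` field in the search results should be read.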



In [5]:
print(len(vectors[0]["values"]))  # Sanity check: embedding dimension of the first upserted vector

768
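
The printed length matches the `dimension = 768` used for the index. Pinecone rejects vectors whose length differs from the index dimension, so it can help to check the whole payload before upserting. The following is a small sketch under assumptions: `find_dimension_mismatches` and the toy records are hypothetical and only mirror the shape of the `vectors` list built earlier.

```python
from typing import Dict, List


def find_dimension_mismatches(vectors: List[Dict], expected_dim: int) -> List[str]:
    """Return the ids of records whose embedding length differs from expected_dim."""
    return [v["id"] for v in vectors if len(v["values"]) != expected_dim]


# Toy records: one with the expected size, one produced by a different (smaller) model
toy_vectors = [
    {"id": "ok.wikitext", "values": [0.0] * 768},
    {"id": "short.wikitext", "values": [0.0] * 384},
]

bad_ids = find_dimension_mismatches(toy_vectors, expected_dim=768)
print(bad_ids)
```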


In [22]:
# To run this query, install Ollama on your system so the LLM runs locally, then set the model name below to whichever model you have available locally

# Pass the retrieved context to the model in a single prompt that answers the user query

import ollama

message = {'role': 'user', 'content': f'You are an RCA expert who understands incidents, focusing on technical incidents related to connectivity issues and DDoS attacks affecting Wikimedia sites and cloud services. Use only the context provided to answer the user question; if the context does not contain the answer, do not respond. Answer like an expert, with a summary of at most 5 sentences. Context: {results}. User Query: {query}'}

for part in ollama.chat(model='mistral', messages=[message], stream=True):
    print(part['message']['content'], end='', flush=True)

 The most common cause of incidents in Wikimedia's infrastructure, as evidenced from the provided log, appears to be a combination of factors such as high load on the database (potentially due to large image tables or queries like GlobalImageLinks), internal state corruption, and external factors like network issues. However, it's important to note that these are not definitive causes and further investigation might reveal other contributing factors. The incident also highlights the importance of having a robust monitoring system and being prepared for rapid response with the right personnel available during critical periods.
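
The prompt in the cell above interpolates the raw `results` list, so the model receives Python dictionary syntax together with the full text of every match. One option is to build a compact context string first. The following is a minimal sketch under assumptions: `format_context`, `max_chars_per_doc`, and the sample results are hypothetical and only mirror the shape returned by `semantic_search`.

```python
from typing import Dict, List


def format_context(results: List[Dict], max_chars_per_doc: int = 500) -> str:
    """Render search results as numbered, truncated excerpts for a prompt."""
    sections = []
    for rank, result in enumerate(results, start=1):
        meta = result["metadata"]
        excerpt = meta["content"][:max_chars_per_doc].strip()
        header = f"[{rank}] {meta['filename']} (score {result['score']:.3f})"
        sections.append(f"{header}\n{excerpt}")
    return "\n\n".join(sections)


# Invented sample data in the same shape as the semantic_search output
sample_results = [
    {
        "id": "a.wikitext",
        "score": 0.8123,
        "metadata": {
            "filename": "a.wikitext",
            "content": "Database overload caused elevated latency. " * 20,
        },
    },
    {
        "id": "b.wikitext",
        "score": 0.7001,
        "metadata": {
            "filename": "b.wikitext",
            "content": "A network link failure caused packet loss.",
        },
    },
]

context = format_context(sample_results, max_chars_per_doc=80)
print(context)
```

The resulting string could then take the place of `{results}` in the prompt, which keeps the prompt shorter and lets the model cite a report by its number.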