By focusing on precision and recall, we can objectively measure how well our retrieval system performs and support several initial use cases early on.

$$ \text{Precision} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Retrieved Items}} $$ 


$$ \text{Recall} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Relevant Items}} $$ 

Precision and recall are two metrics that help us understand the trade-offs in our retrieval system. They are relatively inexpensive and straightforward to compute, which means we can iterate faster and establish a strong foundation for our RAG application.


1. Precision measures how many of the retrieved results are actually useful. If your system retrieves ten documents but only five matter, that's 50% precision. Low precision means your system wastes time on irrelevant information. Your LLM processes useless documents, and your answers suffer.
2. Recall tells us if we're missing anything important. If twenty relevant documents exist but you only find ten, that's 50% recall. Low recall means incomplete answers. Your users miss critical information they need to make decisions.
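
To make the two definitions concrete, here is a minimal sketch that computes both metrics over the top-k results of a ranked list. The function name `precision_recall_at_k` and the document IDs are hypothetical and chosen only for illustration; the sketch also assumes the ranked list contains no duplicate IDs.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    ranked_ids: List[str], relevant_ids: Set[str], k: int
) -> Tuple[float, float]:
    """Compute precision and recall over the top-k retrieved IDs."""
    top_k = ranked_ids[:k]
    # A "hit" is a retrieved ID that is also labelled relevant
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranked retrieval output and ground-truth labels
ranked = ["d1", "d2", "x1", "d3", "x2", "x3"]
relevant = {"d1", "d2", "d3", "d4"}

for k in (2, 4, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

In this toy data, growing k from 2 to 6 raises recall from 0.50 to 0.75 while precision falls from 1.00 to 0.50, which is the trade-off the two metrics expose.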

In [1]:
import os
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
import json

# Load API keys (e.g. COHERE_API_KEY) from a local .env file, if one exists
load_dotenv()


# Base document chunk model
class MiningDocumentChunk(BaseModel):
    chunk_id: str
    text: str
    equipment_model: str  # e.g., "TH663i", "LH517i"
    equipment_type: Literal["truck", "loader", "drill", "crusher"]
    document_type: Literal["manual", "specs", "maintenance", "safety", "parts"]
    section: Optional[str] = None  # e.g., "Engine", "Hydraulics", "Electrical"
    
    class Config:
        frozen = True

# Synthetic question model with mining context
class MiningQuestion(BaseModel):
    chain_of_thought: str  # Reasoning about why this question is relevant
    question: str
    equipment_model: str  # Which equipment this question is about
    question_type: Literal[
        "technical_specs",
        "maintenance",
        "safety",
        "operation",
        "troubleshooting"
    ]
    expected_doc_types: List[str] = Field(
        description="Types of documents that should contain the answer"
    )

# Evaluation pair model
class MiningChunkEval(BaseModel):
    chunk_id: str
    question: str
    chunk: str
    relevance_score: Optional[float] = Field(
        default=None,
        ge=0.0,
        le=1.0,
        description="How relevant this chunk is to the question (0-1)"
    )
    equipment_match: bool = Field(
        description="Whether this chunk is about the same equipment as the question"
    )
    contains_answer: bool = Field(
        description="Whether this chunk contains information that answers the question"
    )

In [2]:
import json
from pydantic import BaseModel
from typing import Dict, List
from asyncio import Semaphore, timeout  # asyncio.timeout requires Python 3.11+
from tqdm.asyncio import tqdm_asyncio
from tenacity import retry, stop_after_attempt, wait_fixed
import asyncio
import logging
import cohere

class Question(BaseModel):
    chain_of_thought: str
    question: str

class MiningQuestionGenerator:
    def __init__(self, client: cohere.ClientV2, max_concurrent: int = 3):
        self.client = client
        self.semaphore = Semaphore(max_concurrent)
        self.logger = logging.getLogger(__name__)

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_fixed(2),
        retry_error_callback=lambda retry_state: retry_state.outcome.result()
    )
    async def _generate_single_question(self, chunk: Dict[str, str]) -> Question:
        """Generate a single question with retry logic"""
        async with self.semaphore:  # Rate limit concurrent requests
            try:
                async with timeout(30):  # Timeout after 30 seconds
                    prompt = f"""
                    Generate a realistic question that could be answered using the following mining equipment documentation snippet.

                    Documentation Snippet:
                    {chunk['text']}

                    Equipment Context:
                    - Type: {chunk['equipment_type']}
                    - Model: {chunk['equipment_model']}
                    - Document Type: {chunk['document_type']}

                    Rules:
                    - If there are specific measurements, capacities, or values in the snippet, try to make the question more general
                    - The question should be at most 2 sentences long
                    - If there are maintenance intervals or time periods mentioned, consider varying them slightly
                    - The question must be answerable using the documentation snippet or with minimal modification
                    - Focus on practical, operator-relevant questions
                    - For safety documentation, emphasize critical procedures and requirements

                    Format your response strictly as a JSON object with these exact keys:
                    {{
                        "chain_of_thought": "Brief explanation of why this question is relevant for operators",
                        "question": "The actual question"
                    }}
                    """

                    response = await asyncio.to_thread(
                        self.client.chat,
                        model="command-r-plus-08-2024",
                        messages=[
                            {
                                "role": "system",
                                "content": "You are a mining equipment operator"
                            },
                            {
                                "role": "user",
                                "content": prompt
                            }
                        ],
                        response_format={"type": "json_object"},
                    )
                response_text = response.message.content[0].text

                # Parse JSON from response
                try:
                    data = json.loads(response_text)
                    return Question(
                        chain_of_thought=data["chain_of_thought"],
                        question=data["question"]
                    )
                except json.JSONDecodeError as e:
                    self.logger.error(f"JSON parsing error: {str(e)}")
                    self.logger.error(f"Response text: {response_text}")
                    raise
                except Exception as e:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise

            except Exception as e:
                self.logger.error(f"Error generating question: {str(e)}")
                raise

    async def generate_questions(self, chunks: List[Dict[str, str]]) -> List[tuple[Dict[str, str], Question]]:
        """Generate questions for multiple chunks in parallel; returns (chunk, question) pairs"""
        async def process_chunk(chunk: Dict[str, str]) -> tuple[Dict[str, str], Question | None]:
            try:
                question = await self._generate_single_question(chunk)
                return chunk, question
            except Exception as e:
                self.logger.error(f"Failed to process chunk {chunk.get('equipment_model')}: {str(e)}")
                return chunk, None

        # Process chunks with progress bar
        results = await tqdm_asyncio.gather(
            *[process_chunk(chunk) for chunk in chunks],
            desc="Generating questions"
        )
        print(results)

        # Filter out failed generations
        successful_results = [(chunk, question) for chunk, question in results if question is not None]
        
        if len(successful_results) < len(chunks):
            self.logger.warning(
                f"Generated {len(successful_results)} questions out of {len(chunks)} chunks"
            )

        return successful_results

In [3]:

# Initialize Cohere client
client = cohere.ClientV2(os.getenv("COHERE_API_KEY"))
generator = MiningQuestionGenerator(client)


# Load the document chunks; the context manager closes the file handle
with open("data/technical_manuals.json", "r") as f:
    chunks = json.load(f)


results = await generator.generate_questions(chunks)

for chunk, question in results:
    print(f"\nEquipment: {chunk['equipment_model']}")
    print(f"Document Type: {chunk['document_type']}")
    print(f"Chain of Thought: {question.chain_of_thought}")
    print(f"Generated Question: {question.question}")

Generating questions: 100%|██████████| 71/71 [00:53<00:00,  1.34it/s]

[({'text': 'The TH860 underground truck is designed with a reinforced chassis for handling heavy loads and rough terrain in mining operations.', 'equipment_type': 'truck', 'equipment_model': 'TH860', 'document_type': 'specs'}, Question(chain_of_thought="Understanding the design features of mining equipment is crucial for operators to ensure they are using the machinery as intended and maximizing its potential. This question focuses on a key design aspect that directly impacts the truck's performance and suitability for specific mining tasks.", question="How does the TH860 truck's chassis design contribute to its overall functionality and durability in underground mining environments?")), ({'text': "Operators must ensure the LH410 loader's bucket pins are lubricated every 250 operating hours to prevent wear and tear.", 'equipment_type': 'loader', 'equipment_model': 'LH410', 'document_type': 'maintenance'}, Question(chain_of_thought="This question is crucial for operators to understand r




In [4]:
import braintrust
from typing import List, Tuple
from pydantic import BaseModel


braintrust.login(
    api_key=os.environ.get("BRAINTRUST_API_KEY")
)

def insert_questions_to_braintrust(results: List[Tuple[dict, Question]]):
    """Insert generated questions into Braintrust"""
    
    # Initialize Braintrust Dataset
    dataset = braintrust.init_dataset(
        project="industrial_rag",
        name="Equipment-Questions-V1"
    )

    # Insert questions row by row
    for chunk, question in results:
        if question:  # Check if question was generated successfully
            dataset.insert(
                input=question.question,
                expected=[chunk['text']],  # The chunk text is our expected content
                metadata={
                    # NOTE: id(chunk) is only unique within this session, so this ID is not stable across runs
                    "chunk_id": f"{chunk['equipment_model']}-{chunk['document_type']}-{id(chunk)}",
                    "chunk": chunk['text'],
                    "equipment_model": chunk['equipment_model'],
                    "equipment_type": chunk['equipment_type'],
                    "document_type": chunk['document_type']
                }
            )

    # Print summary
    print(dataset.summarize())
    return dataset


# Insert into Braintrust
dataset = insert_questions_to_braintrust(results)


Total records: 81 (71 new or updated records)
See results for all datasets in industrial_rag at https://www.braintrust.dev/app/shubham/p/industrial_rag
See results for Equipment-Questions-V1 at https://www.braintrust.dev/app/shubham/p/industrial_rag/datasets/Equipment-Questions-V1
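
With each question stored alongside the chunk it was generated from, retrieval can be scored directly. The following is a minimal sketch of recall@k over such pairs, assuming exactly one expected chunk per question (so recall@k reduces to the fraction of questions whose source chunk appears in the top k). `keyword_retrieve` is a hypothetical stand-in retriever based on word overlap, not part of the notebook, Cohere, or Braintrust, and the two questions are made up for illustration.

```python
import re
from typing import Callable, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retrieve(corpus: List[str], query: str, k: int) -> List[str]:
    """Hypothetical retriever: rank chunks by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return ranked[:k]


def recall_at_k(
    eval_pairs: List[Tuple[str, str]],
    retrieve: Callable[[str, int], List[str]],
    k: int,
) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not eval_pairs:
        return 0.0
    hits = sum(
        1 for question, expected in eval_pairs if expected in retrieve(question, k)
    )
    return hits / len(eval_pairs)


# Two chunks taken from the generation output above
corpus = [
    "The TH860 underground truck is designed with a reinforced chassis for "
    "handling heavy loads and rough terrain in mining operations.",
    "Operators must ensure the LH410 loader's bucket pins are lubricated every "
    "250 operating hours to prevent wear and tear.",
]

# Hypothetical (question, expected chunk text) pairs
eval_pairs = [
    ("What kind of chassis does the TH860 truck have?", corpus[0]),
    ("How often should the LH410 bucket pins be lubricated?", corpus[1]),
]

score = recall_at_k(eval_pairs, lambda q, k: keyword_retrieve(corpus, q, k), k=1)
print(f"recall@1 = {score:.2f}")
```

Swapping the lambda for a call into a real retriever (embedding search, BM25, or a reranker) leaves `recall_at_k` unchanged, which keeps comparisons between retrieval strategies cheap.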
