# 03 - Knowledge Layer

**CLARISSA's** knowledge layer provides the contextual intelligence that transforms a generic LLM into a reservoir simulation expert. This notebook covers:

1. Vector database architecture with pgvector
2. Embedding generation for technical documentation
3. Semantic search for keyword assistance
4. Reservoir analog database for intelligent defaults
5. Hybrid search strategies (semantic + keyword)
6. Knowledge ingestion pipelines

---

In [None]:
# Colab Setup — API Keys & Dependencies
import sys, os
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install -q openai
    from google.colab import userdata
    # Set keys from Colab Secrets (Settings → Secrets → Add)
    for key in ['ANTHROPIC_API_KEY', 'OPENAI_API_KEY', 'GITLAB_TOKEN']:
        try:
            os.environ[key] = userdata.get(key)
        except Exception:
            pass  # Secret missing or notebook access not granted
    # Fallback: manual input
    if not os.environ.get('ANTHROPIC_API_KEY'):
        import getpass
        for key in ['ANTHROPIC_API_KEY', 'OPENAI_API_KEY', 'GITLAB_TOKEN']:
            if not os.environ.get(key):
                try:
                    os.environ[key] = getpass.getpass(f'{key}: ')
                except Exception:
                    pass  # Leave unset; reported as ✗ below

print(f'Environment: {"Colab" if IN_COLAB else "Local"}')
for k in ['ANTHROPIC_API_KEY','OPENAI_API_KEY','GITLAB_TOKEN']:
    print(f'  {k}: {"✓" if os.environ.get(k) else "✗"}')

## 1. Why a Knowledge Layer?

LLMs have general knowledge but lack:

- **Current simulator documentation** (OPM Flow specifics)
- **Keyword syntax details** (exact formats, defaults)
- **Reservoir analogs** (typical values for Permian, Bakken, etc.)
- **User corrections** (learned fixes from past sessions)

The Knowledge Layer fills these gaps through **Retrieval-Augmented Generation (RAG)**:

```
User Query → Embed → Search Vector DB → Retrieve Context → LLM + Context → Response
```
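
The flow above can be sketched end to end with an in-memory store. This is a minimal illustration, not CLARISSA's implementation: `toy_embed`, `retrieve`, and `build_prompt` are hypothetical names, the hashed bag-of-words vector stands in for a real embedding model, and a Python list stands in for the vector database.

```python
import zlib
from typing import List

import numpy as np

DIM = 256  # toy dimensionality, chosen only to keep hash collisions rare

def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

DOCS = [
    "WELSPECS defines well specifications including name and location",
    "COMPDAT specifies well completion data and perforations",
    "DIMENS specifies the dimensions of the simulation grid",
]
# Stands in for the vector DB: (text, embedding) pairs
INDEX = [(doc, toy_embed(doc)) for doc in DOCS]

def retrieve(query: str, k: int = 2) -> List[str]:
    """Embed the query and return the k most similar documents."""
    q = toy_embed(query)
    scored = sorted(INDEX, key=lambda item: float(q @ item[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

def build_prompt(query: str, context: List[str]) -> str:
    """Combine retrieved context and the user query into one LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"

query = "how do I set the dimensions of the grid"
context = retrieve(query)
prompt = build_prompt(query, context)
print(prompt)
```

The last step, sending `prompt` to the LLM, is omitted because it depends on the provider.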

## 2. Database Schema

We use PostgreSQL with the **pgvector** extension for efficient similarity search.

In [None]:
# SQL Schema for CLARISSA Knowledge Layer

SCHEMA_SQL = '''
-- ============================================================
-- CLARISSA Knowledge Layer Schema
-- PostgreSQL + pgvector
-- ============================================================

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- ============================================================
-- 1. Simulator Knowledge Base
-- General documentation, tutorials, FAQs
-- ============================================================
CREATE TABLE simulator_knowledge (
    id SERIAL PRIMARY KEY,
    
    -- Content
    source_type VARCHAR(50) NOT NULL,  -- 'manual', 'keyword_ref', 'tutorial', 'faq'
    source_file VARCHAR(255),
    section_title VARCHAR(500),
    content TEXT NOT NULL,
    
    -- Embedding (1536 dimensions for OpenAI ada-002 or similar)
    embedding vector(1536),
    
    -- Metadata for filtering
    keywords TEXT[],              -- Associated keywords mentioned
    related_keywords TEXT[],      -- Keywords this content helps explain
    simulator VARCHAR(50),        -- 'opm', 'eclipse', 'both'
    
    -- Timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for efficient search
CREATE INDEX idx_simknow_embedding ON simulator_knowledge 
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_simknow_keywords ON simulator_knowledge USING GIN (keywords);
CREATE INDEX idx_simknow_source ON simulator_knowledge (source_type);

-- Full-text search index
CREATE INDEX idx_simknow_fts ON simulator_knowledge 
    USING GIN (to_tsvector('english', content));

-- ============================================================
-- 2. ECLIPSE Keywords Reference
-- Structured keyword information
-- ============================================================
CREATE TABLE eclipse_keywords (
    id SERIAL PRIMARY KEY,
    
    -- Keyword identification
    keyword VARCHAR(20) UNIQUE NOT NULL,
    section VARCHAR(20) NOT NULL,  -- 'RUNSPEC', 'GRID', 'PROPS', etc.
    
    -- Documentation
    description TEXT NOT NULL,
    syntax_template TEXT,          -- Example syntax
    parameters JSONB,              -- Parameter definitions
    examples JSONB,                -- Usage examples
    
    -- Relationships
    required_keywords TEXT[],      -- Keywords that must also be present
    incompatible_keywords TEXT[],  -- Keywords that conflict
    related_keywords TEXT[],       -- Semantically related keywords
    
    -- OPM Flow compatibility
    opm_supported BOOLEAN DEFAULT true,
    opm_notes TEXT,                -- Compatibility notes
    
    -- Embedding for semantic search
    description_embedding vector(1536),
    
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_keywords_section ON eclipse_keywords (section);
CREATE INDEX idx_keywords_opm ON eclipse_keywords (opm_supported);
CREATE INDEX idx_keywords_embedding ON eclipse_keywords 
    USING ivfflat (description_embedding vector_cosine_ops) WITH (lists = 50);

-- ============================================================
-- 3. Reservoir Analogs Database
-- Typical property values by basin/formation
-- ============================================================
CREATE TABLE reservoir_analogs (
    id SERIAL PRIMARY KEY,
    
    -- Location/identification
    name VARCHAR(100) NOT NULL,
    basin VARCHAR(100) NOT NULL,
    formation VARCHAR(100),
    region VARCHAR(100),           -- e.g., 'Delaware Basin', 'Midland Basin'
    country VARCHAR(100) DEFAULT 'USA',
    
    -- Rock properties (ranges)
    permeability_min FLOAT,        -- mD
    permeability_max FLOAT,
    permeability_typical FLOAT,
    
    porosity_min FLOAT,            -- fraction
    porosity_max FLOAT,
    porosity_typical FLOAT,
    
    -- Depth and pressure
    depth_min FLOAT,               -- ft TVD
    depth_max FLOAT,
    depth_typical FLOAT,
    pressure_gradient FLOAT,       -- psi/ft (normal ~0.465)
    temperature_gradient FLOAT,    -- °F/100ft
    
    -- Fluid properties
    api_gravity FLOAT,             -- °API
    gor FLOAT,                     -- scf/stb (gas-oil ratio)
    water_salinity FLOAT,          -- ppm TDS
    
    -- Recovery factors
    primary_rf FLOAT,              -- Primary recovery factor
    waterflood_rf FLOAT,           -- Waterflood recovery factor
    
    -- Additional properties as JSON
    properties JSONB,
    
    -- Searchable description
    description TEXT,
    description_embedding vector(1536),
    
    -- Source/quality
    data_source VARCHAR(255),
    confidence VARCHAR(20),        -- 'high', 'medium', 'low'
    
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_analogs_basin ON reservoir_analogs (basin);
CREATE INDEX idx_analogs_formation ON reservoir_analogs (formation);
CREATE INDEX idx_analogs_embedding ON reservoir_analogs 
    USING ivfflat (description_embedding vector_cosine_ops) WITH (lists = 50);

-- ============================================================
-- 4. User Corrections Database
-- Learn from user feedback
-- ============================================================
CREATE TABLE user_corrections (
    id SERIAL PRIMARY KEY,
    
    -- Context
    session_id VARCHAR(50),
    user_id VARCHAR(100),
    
    -- The correction
    original_response TEXT NOT NULL,
    corrected_response TEXT NOT NULL,
    correction_type VARCHAR(50),   -- 'factual', 'syntax', 'physics', 'preference'
    
    -- What triggered this
    query_context TEXT,
    keywords_involved TEXT[],
    
    -- Embedding for retrieval
    context_embedding vector(1536),
    
    -- Status
    incorporated BOOLEAN DEFAULT false,  -- Has this been learned?
    verified BOOLEAN DEFAULT false,      -- Has an expert verified?
    
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_corrections_embedding ON user_corrections 
    USING ivfflat (context_embedding vector_cosine_ops) WITH (lists = 50);
CREATE INDEX idx_corrections_type ON user_corrections (correction_type);

-- ============================================================
-- 5. Conversation Sessions (for context)
-- ============================================================
CREATE TABLE conversation_sessions (
    id VARCHAR(50) PRIMARY KEY,
    
    -- Session state (JSON blob)
    state JSONB NOT NULL DEFAULT '{}',
    
    -- Current deck being built
    current_deck TEXT,
    deck_valid BOOLEAN DEFAULT false,
    
    -- Timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_activity TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- ============================================================
-- 6. Generated Decks History
-- ============================================================
CREATE TABLE generated_decks (
    id SERIAL PRIMARY KEY,
    
    session_id VARCHAR(50) REFERENCES conversation_sessions(id),
    
    -- The deck
    deck_content TEXT NOT NULL,
    deck_hash VARCHAR(64),         -- SHA256 for deduplication
    
    -- Specification used
    specification JSONB,
    
    -- Validation results
    validation_result JSONB,
    
    -- Simulation results (if run)
    simulation_job_id VARCHAR(50),
    simulation_result JSONB,
    
    -- What assumptions were made
    assumptions TEXT[],
    analog_used VARCHAR(100),
    
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_decks_session ON generated_decks (session_id);
CREATE INDEX idx_decks_hash ON generated_decks (deck_hash);
'''

print("Schema SQL defined")
print(f"Total length: {len(SCHEMA_SQL)} characters")
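
The indexes above use `vector_cosine_ops`, and the queries in this notebook turn pgvector's cosine distance operator (`<=>`) into a similarity with `1 - distance`. The NumPy sketch below mirrors that arithmetic outside the database; `cosine_distance` is an illustrative helper, not part of pgvector.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

sim_ab = 1.0 - cosine_distance(a, b)  # 45 degrees apart
sim_ac = 1.0 - cosine_distance(a, c)  # orthogonal
sim_aa = 1.0 - cosine_distance(a, a)  # same direction

print(f"{sim_ab:.3f} {sim_ac:.3f} {sim_aa:.3f}")  # 0.707 0.000 1.000
```

Read this way, a similarity threshold of 0.7 is the same as requiring a cosine distance below 0.3.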

## 3. Embedding Generation

We need to convert text into vector embeddings for semantic search.

In [None]:
from typing import List, Optional
from dataclasses import dataclass
import numpy as np
from abc import ABC, abstractmethod

@dataclass
class EmbeddingResult:
    """Result of embedding generation"""
    text: str
    embedding: np.ndarray
    model: str
    dimensions: int

class EmbeddingProvider(ABC):
    """Abstract base for embedding providers"""
    
    @abstractmethod
    def embed(self, text: str) -> np.ndarray:
        pass
    
    @abstractmethod
    def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        pass
    
    @property
    @abstractmethod
    def dimensions(self) -> int:
        pass

class OpenAIEmbeddings(EmbeddingProvider):
    """
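    OpenAI text-embedding-ada-002 (or newer models).
    
    Hosted API: requires an API key and is billed per token. The dimension
    is fixed at 1536 below, which matches ada-002; adjust it before passing
    a model with a different output size.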
    """
    
    def __init__(self, api_key: str, model: str = "text-embedding-ada-002"):
        self.api_key = api_key
        self.model = model
        self._dimensions = 1536  # ada-002 dimensions
    
    @property
    def dimensions(self) -> int:
        return self._dimensions
    
    def embed(self, text: str) -> np.ndarray:
        """Embed a single text"""
        import openai
        
        client = openai.OpenAI(api_key=self.api_key)
        response = client.embeddings.create(
            model=self.model,
            input=text
        )
        return np.array(response.data[0].embedding)
    
    def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        """Embed multiple texts efficiently"""
        import openai
        
        client = openai.OpenAI(api_key=self.api_key)
        response = client.embeddings.create(
            model=self.model,
            input=texts
        )
        return [np.array(item.embedding) for item in response.data]

class SentenceTransformerEmbeddings(EmbeddingProvider):
    """
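    Local embeddings using sentence-transformers.
    
    Free and runs locally. The models listed below produce 384- or
    768-dimensional vectors, so the vector(1536) columns in the schema
    would need to be resized to match before storing these embeddings.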
    """
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
        self._model = None
    
    @property
    def model(self):
        if self._model is None:
            from sentence_transformers import SentenceTransformer
            self._model = SentenceTransformer(self.model_name)
        return self._model
    
    @property
    def dimensions(self) -> int:
        # Depends on model
        model_dims = {
            "all-MiniLM-L6-v2": 384,
            "all-mpnet-base-v2": 768,
            "multi-qa-mpnet-base-dot-v1": 768,
        }
        return model_dims.get(self.model_name, 384)
    
    def embed(self, text: str) -> np.ndarray:
        return self.model.encode(text)
    
    def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        embeddings = self.model.encode(texts)
        return [emb for emb in embeddings]

class MockEmbeddings(EmbeddingProvider):
    """
    Mock embeddings for testing.
    
    Generates deterministic random vectors based on text hash.
    """
    
    def __init__(self, dimensions: int = 1536):
        self._dimensions = dimensions
    
    @property
    def dimensions(self) -> int:
        return self._dimensions
    
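    def embed(self, text: str) -> np.ndarray:
        # Derive the seed from a stable digest. The built-in hash() is salted
        # per process for strings, so it is not reproducible across runs.
        import hashlib
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:4], "big")  # fits in 32 bits
        rng = np.random.RandomState(seed)
        vec = rng.randn(self._dimensions)
        # Normalize to unit length
        return vec / np.linalg.norm(vec)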
    
    def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        return [self.embed(t) for t in texts]

# Example usage
mock_embedder = MockEmbeddings(dimensions=1536)

test_texts = [
    "WELSPECS defines well specifications including name and location",
    "COMPDAT specifies well completion data and perforations",
    "The Permian Basin has typical permeability of 50-200 mD"
]

embeddings = mock_embedder.embed_batch(test_texts)

print(f"Generated {len(embeddings)} embeddings")
print(f"Dimensions: {embeddings[0].shape}")
print(f"Sample values: {embeddings[0][:5]}")
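
Before an embedding can be stored in one of the `vector(1536)` columns, it has to be sent in a form PostgreSQL understands. pgvector accepts a text literal of the form `[x1,x2,...]`, so one option is to serialize the array yourself. This is an assumption, since the notebook does not show its insert code, and `to_pgvector_literal` is a hypothetical helper name.

```python
import numpy as np

def to_pgvector_literal(vec: np.ndarray, expected_dim: int) -> str:
    """Serialize a 1-D array as a pgvector text literal, e.g. '[0.1,0.2]'."""
    if vec.ndim != 1 or vec.shape[0] != expected_dim:
        raise ValueError(f"expected shape ({expected_dim},), got {vec.shape}")
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

literal = to_pgvector_literal(np.array([0.5, -1.0, 2.0]), expected_dim=3)
print(literal)  # [0.5,-1.0,2.0]
```

The dimension check catches a mismatch early: a 384-dimensional vector from `all-MiniLM-L6-v2` cannot be stored in a `vector(1536)` column.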

## 4. Knowledge Service

`KnowledgeService` is the main interface CLARISSA uses to query the knowledge layer.

In [None]:
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional, Tuple
from enum import Enum
import json

@dataclass
class SearchResult:
    """A single search result from the knowledge base"""
    content: str
    source_type: str
    similarity: float
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class KeywordInfo:
    """Structured information about an ECLIPSE keyword"""
    keyword: str
    section: str
    description: str
    syntax_template: Optional[str] = None
    parameters: List[Dict[str, Any]] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)
    opm_supported: bool = True
    opm_notes: Optional[str] = None
    related_keywords: List[str] = field(default_factory=list)

@dataclass
class AnalogData:
    """Reservoir analog data for default values"""
    name: str
    basin: str
    formation: Optional[str] = None
    
    # Rock properties
    permeability_typical: Optional[float] = None
    porosity_typical: Optional[float] = None
    
    # Depth/pressure
    depth_typical: Optional[float] = None
    pressure_gradient: Optional[float] = None
    
    # Fluids
    api_gravity: Optional[float] = None
    gor: Optional[float] = None
    
    confidence: str = "medium"
    
    def to_defaults(self) -> Dict[str, Any]:
        """Convert to dictionary of default values for deck generation"""
        defaults = {}
        # Compare against None so that a legitimate 0.0 value is not dropped
        if self.permeability_typical is not None:
            defaults['permx'] = self.permeability_typical
        if self.porosity_typical is not None:
            defaults['poro'] = self.porosity_typical
        if self.depth_typical is not None:
            defaults['top_depth'] = self.depth_typical
        if self.pressure_gradient is not None and self.depth_typical is not None:
            defaults['datum_pressure'] = self.pressure_gradient * self.depth_typical
        if self.api_gravity is not None:
            defaults['api_gravity'] = self.api_gravity
        return defaults

class KnowledgeService:
    """
    Main interface to CLARISSA's knowledge layer.
    
    Provides:
    - Semantic search over documentation
    - Keyword lookup and assistance
    - Analog-based default values
    - Hybrid search (semantic + keyword)
    """
    
    def __init__(
        self,
        db_connection,  # asyncpg connection (methods use $n placeholders and fetch/fetchrow)
        embedder: EmbeddingProvider
    ):
        self.db = db_connection
        self.embedder = embedder
    
    async def search_documentation(
        self,
        query: str,
        limit: int = 5,
        source_types: Optional[List[str]] = None,
        min_similarity: float = 0.7
    ) -> List[SearchResult]:
        """
        Semantic search over simulator documentation.
        
        Args:
            query: Natural language query
            limit: Maximum results to return
            source_types: Filter by source ('manual', 'tutorial', etc.)
            min_similarity: Minimum cosine similarity threshold
        """
        # Generate query embedding
        query_embedding = self.embedder.embed(query)
        
        # Build SQL query
        sql = """
            SELECT 
                content,
                source_type,
                section_title,
                keywords,
                1 - (embedding <=> $1::vector) as similarity
            FROM simulator_knowledge
            WHERE 1 - (embedding <=> $1::vector) > $2
        """
        
        params = [query_embedding.tolist(), min_similarity]
        
        if source_types:
            sql += " AND source_type = ANY($3)"
            params.append(source_types)
        
        sql += " ORDER BY similarity DESC LIMIT $" + str(len(params) + 1)
        params.append(limit)
        
        # Execute query
        rows = await self.db.fetch(sql, *params)
        
        results = []
        for row in rows:
            results.append(SearchResult(
                content=row['content'],
                source_type=row['source_type'],
                similarity=row['similarity'],
                metadata={
                    'section_title': row['section_title'],
                    'keywords': row['keywords']
                }
            ))
        
        return results
    
    async def get_keyword_info(self, keyword: str) -> Optional[KeywordInfo]:
        """
        Get detailed information about an ECLIPSE keyword.
        """
        sql = """
            SELECT *
            FROM eclipse_keywords
            WHERE keyword = $1
        """
        
        row = await self.db.fetchrow(sql, keyword.upper())
        
        if not row:
            return None
        
        return KeywordInfo(
            keyword=row['keyword'],
            section=row['section'],
            description=row['description'],
            syntax_template=row['syntax_template'],
            parameters=row['parameters'] or [],
            examples=row['examples'] or [],
            opm_supported=row['opm_supported'],
            opm_notes=row['opm_notes'],
            related_keywords=row['related_keywords'] or []
        )
    
    async def find_similar_keywords(
        self,
        description: str,
        limit: int = 5
    ) -> List[Tuple[str, float]]:
        """
        Find keywords by semantic similarity to a description.
        
        Useful when user describes what they want but doesn't know the keyword.
        """
        query_embedding = self.embedder.embed(description)
        
        sql = """
            SELECT 
                keyword,
                1 - (description_embedding <=> $1::vector) as similarity
            FROM eclipse_keywords
            WHERE opm_supported = true
            ORDER BY similarity DESC
            LIMIT $2
        """
        
        rows = await self.db.fetch(sql, query_embedding.tolist(), limit)
        return [(row['keyword'], row['similarity']) for row in rows]
    
    async def find_analog(
        self,
        description: str,
        limit: int = 3
    ) -> List[AnalogData]:
        """
        Find reservoir analogs by semantic matching.
        
        Example: "Permian Basin tight oil formation" → Delaware Basin analog
        """
        query_embedding = self.embedder.embed(description)
        
        sql = """
            SELECT *,
                1 - (description_embedding <=> $1::vector) as similarity
            FROM reservoir_analogs
            ORDER BY similarity DESC
            LIMIT $2
        """
        
        rows = await self.db.fetch(sql, query_embedding.tolist(), limit)
        
        analogs = []
        for row in rows:
            analogs.append(AnalogData(
                name=row['name'],
                basin=row['basin'],
                formation=row['formation'],
                permeability_typical=row['permeability_typical'],
                porosity_typical=row['porosity_typical'],
                depth_typical=row['depth_typical'],
                pressure_gradient=row['pressure_gradient'],
                api_gravity=row['api_gravity'],
                gor=row['gor'],
                confidence=row['confidence']
            ))
        
        return analogs
    
    async def get_analog_defaults(
        self,
        basin: str,
        formation: Optional[str] = None
    ) -> Optional[Dict[str, Any]]:
        """
        Get default values for a specific basin/formation.
        """
        sql = """
            SELECT *
            FROM reservoir_analogs
            WHERE LOWER(basin) LIKE $1
        """
        params = [f"%{basin.lower()}%"]
        
        if formation:
            sql += " AND LOWER(formation) LIKE $2"
            params.append(f"%{formation.lower()}%")
        
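        # Rank confidence explicitly: sorting the text column alphabetically
        # with DESC would put 'medium' first and 'high' last
        sql += """
            ORDER BY CASE confidence
                WHEN 'high' THEN 3
                WHEN 'medium' THEN 2
                WHEN 'low' THEN 1
                ELSE 0
            END DESC
            LIMIT 1
        """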
        
        row = await self.db.fetchrow(sql, *params)
        
        if not row:
            return None
        
        analog = AnalogData(
            name=row['name'],
            basin=row['basin'],
            formation=row['formation'],
            permeability_typical=row['permeability_typical'],
            porosity_typical=row['porosity_typical'],
            depth_typical=row['depth_typical'],
            pressure_gradient=row['pressure_gradient'],
            api_gravity=row['api_gravity'],
            gor=row['gor'],
            confidence=row['confidence']
        )
        
        return analog.to_defaults()
    
    async def hybrid_search(
        self,
        query: str,
        keywords: Optional[List[str]] = None,
        limit: int = 5
    ) -> List[SearchResult]:
        """
        Combine semantic search with keyword filtering.
        
        This is often more effective than pure semantic search
        for technical documentation.
        """
        query_embedding = self.embedder.embed(query)
        
        # Two-stage scoring: semantic similarity + keyword match
        sql = """
            WITH semantic_results AS (
                SELECT 
                    id,
                    content,
                    source_type,
                    section_title,
                    keywords,
                    1 - (embedding <=> $1::vector) as semantic_score
                FROM simulator_knowledge
                WHERE 1 - (embedding <=> $1::vector) > 0.5
            )
            SELECT 
                content,
                source_type,
                section_title,
                keywords,
                semantic_score,
                CASE 
                    WHEN $2::text[] IS NULL THEN 0
                    WHEN keywords && $2::text[] THEN 0.3
                    ELSE 0
                END as keyword_boost,
                semantic_score + CASE 
                    WHEN $2::text[] IS NULL THEN 0
                    WHEN keywords && $2::text[] THEN 0.3
                    ELSE 0
                END as combined_score
            FROM semantic_results
            ORDER BY combined_score DESC
            LIMIT $3
        """
        
        rows = await self.db.fetch(
            sql,
            query_embedding.tolist(),
            keywords,
            limit
        )
        
        results = []
        for row in rows:
            results.append(SearchResult(
                content=row['content'],
                source_type=row['source_type'],
                similarity=row['combined_score'],
                metadata={
                    'section_title': row['section_title'],
                    'keywords': row['keywords'],
                    'semantic_score': row['semantic_score'],
                    'keyword_boost': row['keyword_boost']
                }
            ))
        
        return results
    
    async def record_correction(
        self,
        session_id: str,
        original: str,
        corrected: str,
        correction_type: str,
        context: Optional[str] = None
    ):
        """
        Record a user correction for learning.
        """
        context_text = context or original
        context_embedding = self.embedder.embed(context_text)
        
        sql = """
            INSERT INTO user_corrections (
                session_id,
                original_response,
                corrected_response,
                correction_type,
                query_context,
                context_embedding
            ) VALUES ($1, $2, $3, $4, $5, $6)
        """
        
        await self.db.execute(
            sql,
            session_id,
            original,
            corrected,
            correction_type,
            context,
            context_embedding.tolist()
        )

print("KnowledgeService class defined")
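
The scoring in `hybrid_search` can be reproduced in plain Python to see how the keyword boost reorders results. The sketch assumes the same constants as the SQL above (a 0.5 semantic cutoff and a flat 0.3 boost on any keyword overlap); the `Candidate` type and the sample scores are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    title: str
    semantic_score: float
    keywords: List[str]

def combined_score(c: Candidate, query_keywords: Optional[List[str]]) -> float:
    """Semantic score plus a flat 0.3 boost when any keyword overlaps."""
    overlap = bool(query_keywords) and bool(set(c.keywords) & set(query_keywords))
    return c.semantic_score + (0.3 if overlap else 0.0)

def rank(cands: List[Candidate], query_keywords: Optional[List[str]],
         limit: int = 5, cutoff: float = 0.5) -> List[Candidate]:
    """Drop weak semantic matches, then order by the combined score."""
    kept = [c for c in cands if c.semantic_score > cutoff]
    kept.sort(key=lambda c: combined_score(c, query_keywords), reverse=True)
    return kept[:limit]

candidates = [
    Candidate("Well control overview", 0.82, ["WCONPROD"]),
    Candidate("WELSPECS reference", 0.65, ["WELSPECS", "COMPDAT"]),
    Candidate("Grid basics", 0.40, ["DIMENS"]),
]
ranked = rank(candidates, ["WELSPECS"])
print([c.title for c in ranked])
# ['WELSPECS reference', 'Well control overview']
```

The boosted entry overtakes a stronger semantic match, and "Grid basics" is filtered out before boosting. The combined score can exceed 1.0, so it should be treated as a ranking score rather than a cosine similarity.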

## 5. Reservoir Analog Database

Pre-populated analog data for common basins and formations.

In [None]:
# Sample reservoir analogs database
# In production, this would be much more extensive

RESERVOIR_ANALOGS = [
    {
        "name": "Permian Basin - Delaware",
        "basin": "Permian Basin",
        "formation": "Wolfcamp",
        "region": "Delaware Basin",
        "country": "USA",
        "permeability_min": 0.001,
        "permeability_max": 0.1,
        "permeability_typical": 0.01,  # Very tight
        "porosity_min": 0.04,
        "porosity_max": 0.12,
        "porosity_typical": 0.08,
        "depth_min": 6000,
        "depth_max": 12000,
        "depth_typical": 8500,
        "pressure_gradient": 0.50,  # Slightly overpressured
        "temperature_gradient": 1.5,
        "api_gravity": 42,
        "gor": 1200,
        "water_salinity": 150000,
        "primary_rf": 0.08,
        "waterflood_rf": 0.15,
        "description": "Permian Basin Delaware sub-basin Wolfcamp tight oil unconventional shale play",
        "confidence": "high",
        "data_source": "EIA/USGS compilations"
    },
    {
        "name": "Permian Basin - Midland",
        "basin": "Permian Basin",
        "formation": "Spraberry",
        "region": "Midland Basin",
        "country": "USA",
        "permeability_min": 0.1,
        "permeability_max": 10,
        "permeability_typical": 1.0,
        "porosity_min": 0.06,
        "porosity_max": 0.14,
        "porosity_typical": 0.10,
        "depth_min": 5000,
        "depth_max": 9000,
        "depth_typical": 7000,
        "pressure_gradient": 0.46,
        "temperature_gradient": 1.4,
        "api_gravity": 38,
        "gor": 800,
        "water_salinity": 100000,
        "primary_rf": 0.10,
        "waterflood_rf": 0.25,
        "description": "Permian Basin Midland sub-basin Spraberry conventional tight oil",
        "confidence": "high",
        "data_source": "EIA/USGS compilations"
    },
    {
        "name": "Bakken Formation",
        "basin": "Williston Basin",
        "formation": "Bakken",
        "region": "North Dakota",
        "country": "USA",
        "permeability_min": 0.001,
        "permeability_max": 0.05,
        "permeability_typical": 0.005,
        "porosity_min": 0.03,
        "porosity_max": 0.09,
        "porosity_typical": 0.06,
        "depth_min": 9000,
        "depth_max": 11000,
        "depth_typical": 10000,
        "pressure_gradient": 0.52,  # Overpressured
        "temperature_gradient": 1.8,
        "api_gravity": 42,
        "gor": 1500,
        "water_salinity": 250000,
        "primary_rf": 0.06,
        "waterflood_rf": 0.12,
        "description": "Bakken shale tight oil unconventional horizontal well development",
        "confidence": "high",
        "data_source": "NDIC/USGS"
    },
    {
        "name": "Eagle Ford Shale",
        "basin": "Western Gulf Basin",
        "formation": "Eagle Ford",
        "region": "South Texas",
        "country": "USA",
        "permeability_min": 0.0001,
        "permeability_max": 0.01,
        "permeability_typical": 0.001,
        "porosity_min": 0.03,
        "porosity_max": 0.10,
        "porosity_typical": 0.06,
        "depth_min": 4000,
        "depth_max": 14000,
        "depth_typical": 8000,
        "pressure_gradient": 0.48,
        "temperature_gradient": 1.6,
        "api_gravity": 45,
        "gor": 2000,
        "water_salinity": 80000,
        "primary_rf": 0.05,
        "waterflood_rf": 0.10,
        "description": "Eagle Ford shale play South Texas oil and condensate window",
        "confidence": "high",
        "data_source": "RRC Texas/EIA"
    },
    {
        "name": "North Sea Brent",
        "basin": "North Sea",
        "formation": "Brent Group",
        "region": "UK Continental Shelf",
        "country": "UK",
        "permeability_min": 50,
        "permeability_max": 2000,
        "permeability_typical": 500,
        "porosity_min": 0.15,
        "porosity_max": 0.28,
        "porosity_typical": 0.22,
        "depth_min": 7000,
        "depth_max": 12000,
        "depth_typical": 9000,
        "pressure_gradient": 0.45,
        "temperature_gradient": 1.5,
        "api_gravity": 38,
        "gor": 600,
        "water_salinity": 35000,
        "primary_rf": 0.25,
        "waterflood_rf": 0.50,
        "description": "North Sea Brent province conventional sandstone reservoir waterflooded",
        "confidence": "high",
        "data_source": "OGA UK"
    },
    {
        "name": "Ghawar Field Analog",
        "basin": "Arabian Basin",
        "formation": "Arab-D",
        "region": "Eastern Province",
        "country": "Saudi Arabia",
        "permeability_min": 100,
        "permeability_max": 5000,
        "permeability_typical": 1000,
        "porosity_min": 0.15,
        "porosity_max": 0.30,
        "porosity_typical": 0.25,
        "depth_min": 5000,
        "depth_max": 8000,
        "depth_typical": 6500,
        "pressure_gradient": 0.46,
        "temperature_gradient": 1.2,
        "api_gravity": 34,
        "gor": 500,
        "water_salinity": 180000,
        "primary_rf": 0.30,
        "waterflood_rf": 0.60,
        "description": "Middle East giant carbonate reservoir Arab formation analog",
        "confidence": "medium",
        "data_source": "Published literature"
    },
    {
        "name": "Generic Sandstone",
        "basin": "Generic",
        "formation": "Sandstone",
        "region": "Global",
        "country": "Global",
        "permeability_min": 10,
        "permeability_max": 1000,
        "permeability_typical": 100,
        "porosity_min": 0.10,
        "porosity_max": 0.25,
        "porosity_typical": 0.18,
        "depth_min": 5000,
        "depth_max": 10000,
        "depth_typical": 7500,
        "pressure_gradient": 0.465,
        "temperature_gradient": 1.5,
        "api_gravity": 35,
        "gor": 500,
        "water_salinity": 50000,
        "primary_rf": 0.15,
        "waterflood_rf": 0.35,
        "description": "Generic conventional sandstone reservoir default values",
        "confidence": "low",
        "data_source": "Textbook values"
    }
]

print(f"Defined {len(RESERVOIR_ANALOGS)} reservoir analogs")

# Display summary
for analog in RESERVOIR_ANALOGS:
    print(f"\n{analog['name']}:")
    print(f"  Perm: {analog['permeability_typical']} mD")
    print(f"  Porosity: {analog['porosity_typical']*100:.0f}%")
    print(f"  Depth: {analog['depth_typical']} ft")
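
Because these values end up as simulation defaults, it can help to sanity-check each record before ingesting it. The sketch below is one possible check and is not part of the notebook's pipeline: `validate_analog` and `RANGE_FIELDS` are hypothetical names, and the sample record repeats a subset of the Generic Sandstone values defined above.

```python
from typing import Any, Dict, List

# Properties that are stored as (min, typical, max) triplets in each record
RANGE_FIELDS = ("permeability", "porosity", "depth")


def validate_analog(analog: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in RANGE_FIELDS:
        lo = analog.get(f"{field}_min")
        typ = analog.get(f"{field}_typical")
        hi = analog.get(f"{field}_max")
        if None in (lo, typ, hi):
            problems.append(f"{field}: incomplete min/typical/max triplet")
        elif not lo <= typ <= hi:
            problems.append(f"{field}: expected {lo} <= {typ} <= {hi}")
    # Recovery factors are stored as fractions, not percentages
    for rf in ("primary_rf", "waterflood_rf"):
        value = analog.get(rf)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{rf}: {value} is not a fraction between 0 and 1")
    return problems


# Subset of the Generic Sandstone record from RESERVOIR_ANALOGS
sample = {
    "name": "Generic Sandstone",
    "permeability_min": 10, "permeability_typical": 100, "permeability_max": 1000,
    "porosity_min": 0.10, "porosity_typical": 0.18, "porosity_max": 0.25,
    "depth_min": 5000, "depth_typical": 7500, "depth_max": 10000,
    "primary_rf": 0.15, "waterflood_rf": 0.35,
}

print(validate_analog(sample))  # → []
```

Running the same check over every entry in `RESERVOIR_ANALOGS` before ingestion would catch swapped min/max values or a porosity entered as a percentage.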

## 6. ECLIPSE Keywords Database

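Each keyword is stored as a structured record (section, syntax template, parameters, examples, and dependencies) so that CLARISSA can give precise syntax help and report OPM Flow support.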

In [None]:
# Sample ECLIPSE keywords database
ECLIPSE_KEYWORDS = [
    {
        "keyword": "DIMENS",
        "section": "RUNSPEC",
        "description": "Specifies the dimensions of the simulation grid (NX, NY, NZ)",
        "syntax_template": "DIMENS\n  NX NY NZ /",
        "parameters": [
            {"name": "NX", "type": "int", "description": "Number of cells in X direction"},
            {"name": "NY", "type": "int", "description": "Number of cells in Y direction"},
            {"name": "NZ", "type": "int", "description": "Number of cells in Z direction"}
        ],
        "examples": [
            "DIMENS\n  10 10 3 /",
            "DIMENS\n  100 100 20 /"
        ],
        "required_keywords": [],
        "incompatible_keywords": [],
        "related_keywords": ["DX", "DY", "DZ", "TOPS"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "WELSPECS",
        "section": "SCHEDULE",
        "description": "Defines well specifications including name, group, head location, and preferred phase",
        "syntax_template": "WELSPECS\n  'WELL' 'GROUP' I J DEPTH 'PHASE' /\n/",
        "parameters": [
            {"name": "WELL", "type": "string", "description": "Well name (max 8 chars)"},
            {"name": "GROUP", "type": "string", "description": "Group name"},
            {"name": "I", "type": "int", "description": "I-index of well head"},
            {"name": "J", "type": "int", "description": "J-index of well head"},
            {"name": "DEPTH", "type": "float", "description": "Reference depth for BHP"},
            {"name": "PHASE", "type": "string", "description": "Preferred phase (OIL/WATER/GAS)"}
        ],
        "examples": [
            "WELSPECS\n  'PROD1' 'G1' 5 5 8335 'OIL' /\n/",
            "WELSPECS\n  'INJ1' 'INJ' 1 1 8335 'WATER' /\n  'PROD1' 'PROD' 10 10 8335 'OIL' /\n/"
        ],
        "required_keywords": ["COMPDAT"],
        "incompatible_keywords": [],
        "related_keywords": ["COMPDAT", "WCONPROD", "WCONINJE"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "COMPDAT",
        "section": "SCHEDULE",
        "description": "Specifies well completion data including perforation intervals and connection properties",
        "syntax_template": "COMPDAT\n  'WELL' I J K1 K2 'STATUS' SAT CF DIAM KH S D /\n/",
        "parameters": [
            {"name": "WELL", "type": "string", "description": "Well name"},
            {"name": "I", "type": "int", "description": "I-index (or 0 for well head)"},
            {"name": "J", "type": "int", "description": "J-index (or 0 for well head)"},
            {"name": "K1", "type": "int", "description": "Top layer of completion"},
            {"name": "K2", "type": "int", "description": "Bottom layer of completion"},
            {"name": "STATUS", "type": "string", "description": "OPEN or SHUT"},
            {"name": "DIAM", "type": "float", "description": "Wellbore diameter (ft)"}
        ],
        "examples": [
            "COMPDAT\n  'PROD1' 5 5 1 3 'OPEN' 2* 0.5 /\n/"
        ],
        "required_keywords": ["WELSPECS"],
        "incompatible_keywords": [],
        "related_keywords": ["WELSPECS", "WCONPROD", "WCONINJE"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "WCONPROD",
        "section": "SCHEDULE",
        "description": "Sets production well controls including rate targets and pressure limits",
        "syntax_template": "WCONPROD\n  'WELL' 'STATUS' 'MODE' ORAT WRAT GRAT LRAT RESV BHP /\n/",
        "parameters": [
            {"name": "WELL", "type": "string", "description": "Well name"},
            {"name": "STATUS", "type": "string", "description": "OPEN, SHUT, or AUTO"},
            {"name": "MODE", "type": "string", "description": "Control mode (ORAT/WRAT/GRAT/LRAT/BHP)"},
            {"name": "ORAT", "type": "float", "description": "Oil rate target (STB/day)"},
            {"name": "BHP", "type": "float", "description": "Minimum BHP limit (psia)"}
        ],
        "examples": [
            "WCONPROD\n  'PROD1' 'OPEN' 'ORAT' 1000 4* 500 /\n/"
        ],
        "required_keywords": ["WELSPECS", "COMPDAT"],
        "incompatible_keywords": ["WCONINJE"],
        "related_keywords": ["WELSPECS", "COMPDAT", "WCONINJE"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "WCONINJE",
        "section": "SCHEDULE",
        "description": "Sets injection well controls for water or gas injection",
        "syntax_template": "WCONINJE\n  'WELL' 'TYPE' 'STATUS' 'MODE' RATE RESV BHP /\n/",
        "parameters": [
            {"name": "WELL", "type": "string", "description": "Well name"},
            {"name": "TYPE", "type": "string", "description": "Injection type (WATER/GAS)"},
            {"name": "STATUS", "type": "string", "description": "OPEN or SHUT"},
            {"name": "MODE", "type": "string", "description": "Control mode (RATE/BHP)"},
            {"name": "RATE", "type": "float", "description": "Injection rate target"},
            {"name": "BHP", "type": "float", "description": "Maximum BHP limit (psia)"}
        ],
        "examples": [
            "WCONINJE\n  'INJ1' 'WATER' 'OPEN' 'RATE' 5000 1* 6000 /\n/"
        ],
        "required_keywords": ["WELSPECS", "COMPDAT"],
        "incompatible_keywords": ["WCONPROD"],
        "related_keywords": ["WELSPECS", "COMPDAT", "WCONPROD"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "EQUIL",
        "section": "SOLUTION",
        "description": "Defines equilibration data for initialization including datum, contacts, and pressures",
        "syntax_template": "EQUIL\n  DATUM PDAT WOC PCOW GOC PCOG RSVD RVVD ACC /",
        "parameters": [
            {"name": "DATUM", "type": "float", "description": "Datum depth (ft)"},
            {"name": "PDAT", "type": "float", "description": "Pressure at datum (psia)"},
            {"name": "WOC", "type": "float", "description": "Water-oil contact depth (ft)"},
            {"name": "GOC", "type": "float", "description": "Gas-oil contact depth (ft)"}
        ],
        "examples": [
            "EQUIL\n  8400 4000 8500 0 0 0 /"
        ],
        "required_keywords": ["DIMENS", "PORO", "PERMX"],
        "incompatible_keywords": [],
        "related_keywords": ["SWOF", "SGOF", "PVTO", "PVTW"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "SWOF",
        "section": "PROPS",
        "description": "Water-oil relative permeability and capillary pressure table",
        "syntax_template": "SWOF\n  SW KRW KROW PCOW\n  ... /",
        "parameters": [
            {"name": "SW", "type": "float", "description": "Water saturation"},
            {"name": "KRW", "type": "float", "description": "Water relative permeability"},
            {"name": "KROW", "type": "float", "description": "Oil relative permeability (water-oil)"},
            {"name": "PCOW", "type": "float", "description": "Water-oil capillary pressure"}
        ],
        "examples": [
            "SWOF\n  0.2 0.0 1.0 0\n  0.5 0.15 0.3 0\n  0.8 0.35 0.0 0 /"
        ],
        "required_keywords": ["WATER", "OIL"],
        "incompatible_keywords": [],
        "related_keywords": ["SGOF", "SWFN", "SOF3"],
        "opm_supported": True,
        "opm_notes": None
    },
    {
        "keyword": "TSTEP",
        "section": "SCHEDULE",
        "description": "Advances simulation time by specified timestep sizes",
        "syntax_template": "TSTEP\n  N*DT ... /",
        "parameters": [
            {"name": "DT", "type": "float", "description": "Timestep size (days)"},
            {"name": "N", "type": "int", "description": "Number of repetitions"}
        ],
        "examples": [
            "TSTEP\n  30 30 30 30 /",
            "TSTEP\n  12*30 /"
        ],
        "required_keywords": [],
        "incompatible_keywords": [],
        "related_keywords": ["DATES", "TUNING"],
        "opm_supported": True,
        "opm_notes": None
    }
]

print(f"Defined {len(ECLIPSE_KEYWORDS)} keyword entries")

for kw in ECLIPSE_KEYWORDS:
    print(f"  {kw['keyword']:12s} ({kw['section']}) - OPM: {'✓' if kw['opm_supported'] else '✗'}")
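
One way the structured fields might be used is to look up a syntax template and to flag missing dependencies in a deck. The sketch below is illustrative only: `build_index` and `missing_requirements` are hypothetical helpers, and `SAMPLE_KEYWORDS` is an abbreviated copy of two entries from `ECLIPSE_KEYWORDS`.

```python
from typing import Any, Dict, List, Set

# Two abbreviated entries in the same shape as ECLIPSE_KEYWORDS
SAMPLE_KEYWORDS = [
    {
        "keyword": "WELSPECS",
        "section": "SCHEDULE",
        "syntax_template": "WELSPECS\n  'WELL' 'GROUP' I J DEPTH 'PHASE' /\n/",
        "required_keywords": ["COMPDAT"],
    },
    {
        "keyword": "WCONPROD",
        "section": "SCHEDULE",
        "syntax_template": "WCONPROD\n  'WELL' 'STATUS' 'MODE' ORAT WRAT GRAT LRAT RESV BHP /\n/",
        "required_keywords": ["WELSPECS", "COMPDAT"],
    },
]


def build_index(entries: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map keyword name to its record for constant-time lookup."""
    return {entry["keyword"]: entry for entry in entries}


def missing_requirements(
    deck_keywords: Set[str], index: Dict[str, Dict[str, Any]]
) -> Dict[str, List[str]]:
    """For each known keyword in the deck, list required keywords the deck lacks."""
    missing = {}
    for kw in sorted(deck_keywords):
        entry = index.get(kw)
        if entry is None:
            continue  # keyword not in the reference data, nothing to check
        absent = [req for req in entry["required_keywords"] if req not in deck_keywords]
        if absent:
            missing[kw] = absent
    return missing


index = build_index(SAMPLE_KEYWORDS)
print(index["WCONPROD"]["syntax_template"])
print(missing_requirements({"WELSPECS", "WCONPROD"}, index))
# → {'WCONPROD': ['COMPDAT'], 'WELSPECS': ['COMPDAT']}
```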

## 7. Knowledge Ingestion Pipeline

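The ingestion pipeline populates the knowledge base in three stages: documentation is split into chunks, each chunk is embedded, and the result is written to the database. The same ingester also loads the keyword and analog records defined in the previous sections.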

In [None]:
from typing import List, Dict, Any
from dataclasses import dataclass
import re

@dataclass
class DocumentChunk:
    """A chunk of documentation ready for embedding"""
    content: str
    source_type: str
    source_file: str
    section_title: str
    keywords: List[str]

class DocumentProcessor:
    """
    Process documentation files into chunks for embedding.
    """
    
    # ECLIPSE keywords to detect in text
    KEYWORD_PATTERN = re.compile(r'\b([A-Z]{2,10})\b')
    KNOWN_KEYWORDS = {
        'RUNSPEC', 'GRID', 'PROPS', 'SOLUTION', 'SUMMARY', 'SCHEDULE',
        'DIMENS', 'DX', 'DY', 'DZ', 'TOPS', 'PERMX', 'PERMY', 'PERMZ',
        'PORO', 'NTG', 'SWOF', 'SGOF', 'PVTO', 'PVDO', 'PVTW', 'PVDG',
        'EQUIL', 'WELSPECS', 'COMPDAT', 'WCONPROD', 'WCONINJE', 'TSTEP'
    }
    
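    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        # The overlap must be smaller than the chunk, or _split_section never advances
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap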
    
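    def extract_keywords(self, text: str) -> List[str]:
        """Extract the known ECLIPSE keywords mentioned in text, without duplicates"""
        found = self.KEYWORD_PATTERN.findall(text)
        # dict.fromkeys drops repeats while preserving first-seen order
        return list(dict.fromkeys(kw for kw in found if kw in self.KNOWN_KEYWORDS))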
    
    def chunk_markdown(self, content: str, source_file: str) -> List[DocumentChunk]:
        """
        Split markdown documentation into chunks.
        
        Respects heading boundaries when possible.
        """
        chunks = []
        
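        # Split on level 1-3 headings. With one capture group, re.split returns
        # [text, heading, text, heading, ...], so headings sit at odd indices.
        # MULTILINE anchors also catch a heading on the first line of the file
        # and headings that directly follow one another.
        sections = re.split(r'^(#{1,3}[ \t]+[^\n]+)$', content, flags=re.MULTILINE)
        
        current_title = "Introduction"
        current_content = ""
        
        for i, section in enumerate(sections):
            if i % 2 == 1:
                # This is a heading
                if current_content.strip():
                    # Save previous section
                    chunks.extend(self._split_section(
                        current_content, current_title, source_file, 'manual'
                    ))
                # Remove only the leading '#' markers, not a trailing '#' in the title
                current_title = section.lstrip('#').strip()
                current_content = ""
            else:
                current_content += section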
        
        # Don't forget last section
        if current_content.strip():
            chunks.extend(self._split_section(
                current_content, current_title, source_file, 'manual'
            ))
        
        return chunks
    
    def _split_section(
        self,
        content: str,
        title: str,
        source_file: str,
        source_type: str
    ) -> List[DocumentChunk]:
        """Split a section into chunks with overlap"""
        chunks = []
        
        # Simple character-based chunking
        # In production, use sentence-aware chunking
        text = content.strip()
        
        if len(text) <= self.chunk_size:
            # Small enough for single chunk
            chunks.append(DocumentChunk(
                content=text,
                source_type=source_type,
                source_file=source_file,
                section_title=title,
                keywords=self.extract_keywords(text)
            ))
        else:
            # Split with overlap
            start = 0
            while start < len(text):
                end = start + self.chunk_size
                chunk_text = text[start:end]
                
                chunks.append(DocumentChunk(
                    content=chunk_text,
                    source_type=source_type,
                    source_file=source_file,
                    section_title=title,
                    keywords=self.extract_keywords(chunk_text)
                ))
                
                if end >= len(text):
                    # Reached the end; another pass would only repeat the overlap
                    break
                start = end - self.chunk_overlap
        
        return chunks

class KnowledgeIngester:
    """
    Ingest documents into the knowledge base.
    """
    
    def __init__(
        self,
        db_connection,
        embedder: EmbeddingProvider,
        processor: DocumentProcessor = None
    ):
        self.db = db_connection
        self.embedder = embedder
        self.processor = processor or DocumentProcessor()
    
    async def ingest_markdown(self, filepath: str):
        """
        Ingest a markdown documentation file.
        """
        from pathlib import Path
        
        path = Path(filepath)
        # Read as UTF-8 explicitly so the result does not depend on the platform default
        content = path.read_text(encoding='utf-8')
        
        # Process into chunks
        chunks = self.processor.chunk_markdown(content, path.name)
        
        print(f"Processing {len(chunks)} chunks from {path.name}")
        
        # Generate embeddings in batches
        batch_size = 10
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i:i+batch_size]
            texts = [c.content for c in batch]
            embeddings = self.embedder.embed_batch(texts)
            
            # Insert into database
            for chunk, embedding in zip(batch, embeddings):
                await self._insert_chunk(chunk, embedding)
        
        print(f"Ingested {len(chunks)} chunks")
    
    async def _insert_chunk(self, chunk: DocumentChunk, embedding):
        """Insert a single chunk into the database"""
        sql = """
            INSERT INTO simulator_knowledge (
                source_type, source_file, section_title,
                content, embedding, keywords
            ) VALUES ($1, $2, $3, $4, $5, $6)
        """
        
        await self.db.execute(
            sql,
            chunk.source_type,
            chunk.source_file,
            chunk.section_title,
            chunk.content,
            embedding.tolist(),
            chunk.keywords
        )
    
    async def ingest_keywords(self, keywords: List[Dict[str, Any]]):
        """
        Ingest keyword reference data.
        """
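        import json  # serializes the structured fields; imported once, outside the loop
        
        print(f"Ingesting {len(keywords)} keywords")
        
        # Upsert: a re-ingested keyword overwrites every stored field,
        # not only the description and its embedding
        sql = """
            INSERT INTO eclipse_keywords (
                keyword, section, description, syntax_template,
                parameters, examples, required_keywords,
                incompatible_keywords, related_keywords,
                opm_supported, opm_notes, description_embedding
            ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12)
            ON CONFLICT (keyword) DO UPDATE SET
                section = EXCLUDED.section,
                description = EXCLUDED.description,
                syntax_template = EXCLUDED.syntax_template,
                parameters = EXCLUDED.parameters,
                examples = EXCLUDED.examples,
                required_keywords = EXCLUDED.required_keywords,
                incompatible_keywords = EXCLUDED.incompatible_keywords,
                related_keywords = EXCLUDED.related_keywords,
                opm_supported = EXCLUDED.opm_supported,
                opm_notes = EXCLUDED.opm_notes,
                description_embedding = EXCLUDED.description_embedding
        """
        
        for kw in keywords:
            # Generate embedding for description
            embedding = self.embedder.embed(kw['description'])
            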
            await self.db.execute(
                sql,
                kw['keyword'],
                kw['section'],
                kw['description'],
                kw.get('syntax_template'),
                json.dumps(kw.get('parameters', [])),
                json.dumps(kw.get('examples', [])),
                kw.get('required_keywords', []),
                kw.get('incompatible_keywords', []),
                kw.get('related_keywords', []),
                kw.get('opm_supported', True),
                kw.get('opm_notes'),
                embedding.tolist()
            )
        
        print(f"Ingested {len(keywords)} keywords")
    
    async def ingest_analogs(self, analogs: List[Dict[str, Any]]):
        """
        Ingest reservoir analog data.
        """
        print(f"Ingesting {len(analogs)} reservoir analogs")
        
        for analog in analogs:
            # Generate embedding for description
            embedding = self.embedder.embed(analog['description'])
            
            sql = """
                INSERT INTO reservoir_analogs (
                    name, basin, formation, region, country,
                    permeability_min, permeability_max, permeability_typical,
                    porosity_min, porosity_max, porosity_typical,
                    depth_min, depth_max, depth_typical,
                    pressure_gradient, temperature_gradient,
                    api_gravity, gor, water_salinity,
                    primary_rf, waterflood_rf,
                    description, description_embedding,
                    data_source, confidence
                ) VALUES (
                    $1, $2, $3, $4, $5, $6, $7, $8, $9, $10,
                    $11, $12, $13, $14, $15, $16, $17, $18, $19,
                    $20, $21, $22, $23, $24, $25
                )
            """
            
            await self.db.execute(
                sql,
                analog['name'],
                analog['basin'],
                analog.get('formation'),
                analog.get('region'),
                analog.get('country', 'USA'),
                analog.get('permeability_min'),
                analog.get('permeability_max'),
                analog.get('permeability_typical'),
                analog.get('porosity_min'),
                analog.get('porosity_max'),
                analog.get('porosity_typical'),
                analog.get('depth_min'),
                analog.get('depth_max'),
                analog.get('depth_typical'),
                analog.get('pressure_gradient'),
                analog.get('temperature_gradient'),
                analog.get('api_gravity'),
                analog.get('gor'),
                analog.get('water_salinity'),
                analog.get('primary_rf'),
                analog.get('waterflood_rf'),
                analog['description'],
                embedding.tolist(),
                analog.get('data_source'),
                analog.get('confidence', 'medium')
            )
        
        print(f"Ingested {len(analogs)} reservoir analogs")

print("Knowledge ingestion classes defined")
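
The comment in `_split_section` notes that production use would call for sentence-aware chunking. The following is a minimal sketch of that idea under stated assumptions, not the notebook's implementation: `chunk_by_sentence` is a hypothetical function, it uses a naive regex splitter (abbreviations such as "e.g." will be split incorrectly), and it measures overlap in whole sentences rather than characters.

```python
import re
from typing import List


def chunk_by_sentence(text: str, chunk_size: int = 1000, overlap_sentences: int = 1) -> List[str]:
    """Pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept intact as its own chunk.
    """
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    def joined_len(parts: List[str]) -> int:
        return len(" ".join(parts))

    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and joined_len(current + [sentence]) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last sentences forward as overlap, but only if the
            # new sentence still fits alongside them
            carry = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current = carry if joined_len(carry + [sentence]) <= chunk_size else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "WELSPECS defines each well. COMPDAT defines its completions. "
    "WCONPROD sets producer controls. TSTEP advances time."
)
for chunk in chunk_by_sentence(text, chunk_size=70, overlap_sentences=1):
    print(repr(chunk))
```

With `chunk_size=70` the sample text yields three chunks, each sharing one sentence with its neighbour, so a retrieved chunk never starts or ends mid-sentence.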

## Summary: Knowledge Layer

### Components Built

1. **Database Schema** - PostgreSQL + pgvector for semantic search
2. **Embedding Providers** - OpenAI, SentenceTransformers, Mock
3. **KnowledgeService** - Main interface for CLARISSA
4. **Analog Database** - Default values by basin/formation
5. **Keyword Reference** - Structured ECLIPSE keyword info
6. **Ingestion Pipeline** - Document → Chunks → Embeddings → DB

### How CLARISSA Uses This

```python
# User says: "Build me a Permian Basin waterflood model"

# 1. Find analog
analogs = await knowledge.find_analog("Permian Basin")
defaults = analogs[0].to_defaults()
# → {'permx': 0.01, 'poro': 0.08, 'top_depth': 8500, ...}

# 2. User asks: "How do I define producer controls?"
results = await knowledge.search_documentation("producer controls")
# → Returns docs about WCONPROD

# 3. Get keyword syntax
kw_info = await knowledge.get_keyword_info("WCONPROD")
# → Structured syntax, examples, parameters
```

### Next Notebook

In **04_LLM_Conversation.ipynb**, we'll cover:
- Slot extraction from natural language
- Clarification request generation
- Conversation state management
- Prompt engineering for CLARISSA