# Metadata and Data Catalogs

**Metadata** is "data about data" – it provides context, meaning, and governance information that makes data assets discoverable, understandable, and trustworthy. A **Data Catalog** is a centralized inventory that organizes and manages metadata, enabling users to find, understand, and govern data across an organization.

---

## Why Metadata Matters

| Challenge | How Metadata Helps |
|-----------|-------------------|
| Data silos | Provides unified visibility across systems |
| Lack of trust | Documents lineage, quality, and ownership |
| Compliance risk | Tracks sensitive data and access policies |
| Slow discovery | Enables search and self-service analytics |

## Types of Metadata

Metadata can be categorized into three primary types:

### 1. Technical Metadata
Describes the **structure and format** of data assets.

| Attribute | Description | Example |
|-----------|-------------|--------|
| Schema | Table/column definitions | `customer_id INT PRIMARY KEY` |
| Data Types | Column data types | `VARCHAR(255)`, `TIMESTAMP` |
| Constraints | Keys, indexes, partitions | `FOREIGN KEY`, `UNIQUE INDEX` |
| Storage Location | Physical location | `s3://bucket/table/` |
| File Format | Serialization format | Parquet, Avro, JSON |
| Row Count | Volume statistics | `1,250,000 rows` |

### 2. Business Metadata
Provides **context and meaning** for business users.

| Attribute | Description | Example |
|-----------|-------------|--------|
| Business Name | Human-readable name | "Customer Lifetime Value" |
| Description | Plain-language explanation | "Total revenue per customer" |
| Domain/Category | Business classification | Finance, Marketing, Sales |
| Owner | Responsible team/person | "Data Engineering Team" |
| Tags | Searchable labels | `#PII`, `#Revenue`, `#Critical` |
| Glossary Terms | Standardized definitions | "Churn Rate", "ARR" |

### 3. Operational Metadata
Captures **runtime and process** information.

| Attribute | Description | Example |
|-----------|-------------|--------|
| Lineage | Data flow and transformations | Source → ETL → Target |
| Last Updated | Freshness timestamp | `2026-02-01 14:30:00 UTC` |
| Job Statistics | ETL run metrics | Duration, records processed |
| Access Logs | Usage patterns | Query frequency, top users |
| Quality Metrics | Data quality scores | Completeness: 98.5% |
| SLA Status | Pipeline health | On-time delivery rate |
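
To make the three categories concrete, the sketch below groups them into one record per dataset and derives a simple freshness check from the operational part. The class names, field names, and the 24-hour threshold are illustrative assumptions, not the schema of any particular catalog.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class TechnicalMetadata:
    location: str            # e.g. "s3://bucket/table/"
    file_format: str         # e.g. "parquet"
    columns: dict            # column name -> data type
    row_count: int = 0


@dataclass
class BusinessMetadata:
    business_name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)


@dataclass
class OperationalMetadata:
    last_updated: datetime   # timezone-aware freshness timestamp
    completeness_pct: float  # share of non-null values, 0-100


@dataclass
class CatalogEntry:
    """Hypothetical catalog record combining the three metadata types."""
    technical: TechnicalMetadata
    business: BusinessMetadata
    operational: OperationalMetadata

    def is_stale(self, now: datetime, max_age: timedelta = timedelta(hours=24)) -> bool:
        """Return True when the last update is older than max_age."""
        return now - self.operational.last_updated > max_age


entry = CatalogEntry(
    technical=TechnicalMetadata(
        location="s3://bucket/table/",
        file_format="parquet",
        columns={"customer_id": "INT", "revenue": "DOUBLE"},
        row_count=1_250_000,
    ),
    business=BusinessMetadata(
        business_name="Customer Lifetime Value",
        description="Total revenue per customer",
        owner="Data Engineering Team",
        tags=["PII", "Revenue"],
    ),
    operational=OperationalMetadata(
        last_updated=datetime(2026, 2, 1, 14, 30, tzinfo=timezone.utc),
        completeness_pct=98.5,
    ),
)

# Use a fixed "now" so the result is reproducible
check_time = datetime(2026, 2, 3, 9, 0, tzinfo=timezone.utc)
print(entry.is_stale(check_time))
```

Keeping the three groups as separate objects mirrors how they are usually produced: crawlers fill in the technical part, people curate the business part, and pipelines update the operational part.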

## Data Catalog: Features and Importance

A **Data Catalog** serves as the single reference point for metadata about an organization's data assets: what exists, where it lives, who owns it, and how it may be used.

### Core Features

```
┌─────────────────────────────────────────────────────────────────┐
│                      DATA CATALOG                               │
├─────────────────┬─────────────────┬─────────────────────────────┤
│   Discovery     │   Governance    │      Collaboration          │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ • Search        │ • Lineage       │ • Comments & Reviews        │
│ • Browse        │ • Access Control│ • Ratings & Endorsements    │
│ • Filtering     │ • Classification│ • Shared Collections        │
│ • Recommendations│ • Audit Trails │ • Knowledge Sharing         │
└─────────────────┴─────────────────┴─────────────────────────────┘
```

### Key Capabilities

| Capability | Description |
|------------|-------------|
| **Automated Ingestion** | Crawlers that extract metadata from sources |
| **Search & Discovery** | Full-text and faceted search across assets |
| **Data Lineage** | Visual representation of data flow |
| **Business Glossary** | Standardized terminology definitions |
| **Data Classification** | PII/sensitive data tagging |
| **Access Management** | Role-based permissions |
| **APIs & Integrations** | Programmatic access and tool connectivity |

### Business Value

- **Reduce time-to-insight**: Analysts spend less time searching for and validating data
- **Improve data quality**: Visibility enables proactive fixes
- **Ensure compliance**: Track sensitive data for GDPR, CCPA, HIPAA
- **Enable self-service**: Reduce dependency on IT for data access
- **Foster trust**: Clear ownership and lineage build confidence

## Data Catalog Tools Comparison

### Popular Data Catalog Solutions

| Tool | Type | Best For | Key Strengths |
|------|------|----------|---------------|
| **AWS Glue Data Catalog** | Cloud-native | AWS ecosystem | Tight Athena/Redshift integration |
| **Apache Atlas** | Open-source | Hadoop ecosystem | Deep Hadoop lineage support |
| **DataHub** | Open-source | Modern data stack | Extensible, active community |
| **Atlan** | Commercial | Enterprise collaboration | User experience, AI features |

---

### AWS Glue Data Catalog

The central metadata repository for AWS analytics services.

```
┌──────────────────────────────────────────────────────────────┐
│                  AWS GLUE DATA CATALOG                       │
├──────────────────────────────────────────────────────────────┤
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│  │ Athena  │   │ Redshift│   │  EMR    │   │Lake Form│      │
│  │         │   │ Spectrum│   │         │   │ ation   │      │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘      │
│       │             │             │             │           │
│       └─────────────┴─────────────┴─────────────┘           │
│                         │                                    │
│              ┌──────────▼──────────┐                        │
│              │   Glue Data Catalog │                        │
│              │  • Databases        │                        │
│              │  • Tables           │                        │
│              │  • Partitions       │                        │
│              │  • Connections      │                        │
│              └──────────┬──────────┘                        │
│                         │                                    │
│              ┌──────────▼──────────┐                        │
│              │    Glue Crawlers    │                        │
│              └──────────┬──────────┘                        │
│                         │                                    │
│    ┌────────────┬───────┴────────┬────────────┐             │
│    ▼            ▼                ▼            ▼             │
│  ┌────┐      ┌─────┐         ┌──────┐     ┌──────┐         │
│  │ S3 │      │ RDS │         │Redshift│   │DynamoDB│        │
│  └────┘      └─────┘         └──────┘     └──────┘         │
└──────────────────────────────────────────────────────────────┘
```

**Key Features:**
- Automatic schema discovery via Crawlers
- Hive-compatible metastore
- Integration with Lake Formation for fine-grained access
- Pay-per-use pricing model

### Apache Atlas

Open-source metadata management and governance framework for Hadoop.

```
┌────────────────────────────────────────────────────────────┐
│                     APACHE ATLAS                           │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌─────────────────┐     ┌──────────────────────────────┐ │
│  │  Type System    │     │      Core Services           │ │
│  │  • Entities     │     │  • Metadata Store            │ │
│  │  • Classifications│   │  • Search & Indexing         │ │
│  │  • Relationships │    │  • Lineage Engine            │ │
│  └─────────────────┘     │  • Notification System       │ │
│                          └──────────────────────────────┘ │
│                                                            │
│  ┌─────────────────────────────────────────────────────┐  │
│  │              Integration Hooks                       │  │
│  │   Hive │ Sqoop │ Storm │ Falcon │ Kafka │ NiFi      │  │
│  └─────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────┘
```

**Key Features:**
- Extensible type system for custom metadata
- Native lineage tracking for Hadoop ecosystem
- Classification propagation (e.g., PII tags flow downstream)
- REST API for programmatic access
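
Classification propagation can be pictured as a graph traversal: a tag applied to one dataset is copied to everything downstream of it. The sketch below is a plain breadth-first search over a hypothetical lineage graph; the dataset names and the `propagate_classification` helper are made up for illustration, and this is not how Atlas implements the feature internally.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to its direct downstream datasets
lineage = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["analytics.customer_ltv", "analytics.churn"],
    "raw.products": ["analytics.product_sales"],
    "analytics.customer_ltv": [],
    "analytics.churn": [],
    "analytics.product_sales": [],
}


def propagate_classification(graph, source, tag, classifications):
    """Attach `tag` to `source` and to every dataset reachable downstream of it."""
    queue = deque([source])
    visited = {source}
    while queue:
        node = queue.popleft()
        classifications.setdefault(node, set()).add(tag)
        for child in graph.get(node, []):
            if child not in visited:  # guard against cycles and repeated visits
                visited.add(child)
                queue.append(child)
    return classifications


tags = propagate_classification(lineage, "raw.customers", "PII", {})
for dataset in sorted(tags):
    print(dataset, sorted(tags[dataset]))
```

Datasets that do not descend from `raw.customers`, such as `analytics.product_sales`, stay untagged.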

### DataHub (LinkedIn Open Source)

Extensible, open-source data catalog built for the modern data stack.

```
┌─────────────────────────────────────────────────────────────┐
│                        DATAHUB                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────────────────────────────────────┐  │
│   │                   Frontend (React)                   │  │
│   │   Search │ Browse │ Lineage │ Governance │ Profiles │  │
│   └────────────────────────┬────────────────────────────┘  │
│                            │                                │
│   ┌────────────────────────▼────────────────────────────┐  │
│   │               GraphQL / REST API                     │  │
│   └────────────────────────┬────────────────────────────┘  │
│                            │                                │
│   ┌────────────────────────▼────────────────────────────┐  │
│   │          Metadata Service (GMS)                      │  │
│   │   • Entity Registry  • Aspect Store  • Search Index │  │
│   └────────────────────────┬────────────────────────────┘  │
│                            │                                │
│   ┌─────────┬──────────────┼──────────────┬─────────────┐  │
│   │         │              │              │             │  │
│   ▼         ▼              ▼              ▼             ▼  │
│ MySQL  Elasticsearch    Kafka       Neo4j (opt)   MCE/MAE │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐  │
│   │              Ingestion Framework                     │  │
│   │ Snowflake│BigQuery│dbt│Airflow│Spark│Looker│Tableau │  │
│   └─────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

**Key Features:**
- 50+ native integrations
- Real-time metadata updates via Kafka
- GraphQL API for flexible queries
- dbt integration for transformation lineage
- Active open-source community
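
DataHub identifies each dataset with a URN that encodes the platform, the dataset name, and the environment; the Python SDK builds it with `make_dataset_urn`. The sketch below reproduces that textual layout without the SDK, to show what the identifier contains. `build_dataset_urn` and `parse_dataset_urn` are hypothetical helpers written for this notebook, and the parsing regex is a simplification.

```python
import re


def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Assemble a dataset URN string in DataHub's textual layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Simplified pattern: platform up to the first comma, env after the last comma
_URN_PATTERN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([A-Z_]+)\)$"
)


def parse_dataset_urn(urn: str) -> dict:
    """Split a dataset URN back into platform, name, and environment."""
    match = _URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Not a dataset URN: {urn}")
    platform, name, env = match.groups()
    return {"platform": platform, "name": name, "env": env}


urn = build_dataset_urn("snowflake", "analytics.public.customer_metrics")
print(urn)
print(parse_dataset_urn(urn))
```

In real ingestion code, prefer the SDK's own builder so the identifiers stay consistent with what DataHub's connectors emit.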

### Atlan

Enterprise data catalog with emphasis on collaboration and user experience.

```
┌─────────────────────────────────────────────────────────────┐
│                         ATLAN                               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │   Search    │  │  Lineage    │  │   Collaboration     │ │
│  │  • AI-powered│ │  • Column   │  │   • Slack-like      │ │
│  │  • Natural   │  │  • Impact   │  │   • @mentions       │ │
│  │    language  │  │  • Bi-direct│  │   • Announcements   │ │
│  └─────────────┘  └─────────────┘  └─────────────────────┘ │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Active Metadata Platform                │   │
│  │  • Playbooks (automation)  • Personas (custom views) │   │
│  │  • Policies (governance)   • Insights (analytics)    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                  Integrations                        │   │
│  │  Snowflake│Databricks│BigQuery│Redshift│Tableau│dbt │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```

**Key Features:**
- AI-powered search and recommendations
- Playbooks for automated governance workflows
- Column-level lineage
- Slack/Teams integration for notifications
- SOC 2 Type II certified

## Python Code for Metadata Extraction

Let's explore practical examples of extracting metadata using Python.

In [None]:
# Example 1: Extract metadata from a Pandas DataFrame
import pandas as pd
from datetime import datetime

def extract_dataframe_metadata(df: pd.DataFrame, name: str = "dataset") -> dict:
    """
    Extract comprehensive metadata from a pandas DataFrame.
    
    Returns technical, statistical, and quality metadata.
    """
    metadata = {
        "name": name,
        "extraction_timestamp": datetime.now().isoformat(),
        
        # Technical metadata
        "technical": {
            "row_count": len(df),
            "column_count": len(df.columns),
            # Cast the NumPy integer to a plain int so it serializes as a JSON number
            "memory_usage_bytes": int(df.memory_usage(deep=True).sum()),
            "columns": []
        },
        
        # Quality metadata
        "quality": {
            "completeness": {},
            "uniqueness": {},
            "total_null_count": int(df.isnull().sum().sum())
        }
    }
    
    # Extract column-level metadata
    for col in df.columns:
        col_meta = {
            "name": col,
            "dtype": str(df[col].dtype),
            "nullable": bool(df[col].isnull().any()),
            "null_count": int(df[col].isnull().sum()),
            "unique_count": int(df[col].nunique()),
            "sample_values": df[col].dropna().head(3).tolist()
        }
        
        # Add statistics for numeric columns
        if pd.api.types.is_numeric_dtype(df[col]):
            col_meta["statistics"] = {
                "min": float(df[col].min()) if not pd.isna(df[col].min()) else None,
                "max": float(df[col].max()) if not pd.isna(df[col].max()) else None,
                "mean": float(df[col].mean()) if not pd.isna(df[col].mean()) else None,
                "std": float(df[col].std()) if not pd.isna(df[col].std()) else None
            }
        
        metadata["technical"]["columns"].append(col_meta)
        
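        # Quality metrics as percentages (0.0 for an empty DataFrame, avoiding division by zero)
        total_rows = len(df)
        metadata["quality"]["completeness"][col] = round(
            float(1 - df[col].isnull().sum() / total_rows) * 100, 2
        ) if total_rows else 0.0
        metadata["quality"]["uniqueness"][col] = round(
            float(df[col].nunique() / total_rows) * 100, 2
        ) if total_rows else 0.0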
    
    return metadata

# Demo with sample data
sample_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "revenue": [1500.50, 2300.00, 890.25, 3400.75, 1200.00],
    "signup_date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05", "2024-05-01"])
})

import json

metadata = extract_dataframe_metadata(sample_df, "customer_metrics")
print("=== DataFrame Metadata ===")
print(json.dumps(metadata, indent=2, default=str))

In [None]:
# Example 2: Extract metadata from SQL database using SQLAlchemy
from sqlalchemy import create_engine, inspect
from typing import Dict, Any

def extract_database_metadata(connection_string: str) -> Dict[str, Any]:
    """
    Extract metadata from a SQL database.
    
    Returns schema information including tables, columns, 
    primary keys, and foreign keys.
    """
    engine = create_engine(connection_string)
    inspector = inspect(engine)
    
    db_metadata = {
        "database_type": engine.dialect.name,
        "schemas": []
    }
    
    for schema_name in inspector.get_schema_names():
        schema_info = {
            "name": schema_name,
            "tables": []
        }
        
        for table_name in inspector.get_table_names(schema=schema_name):
            # Get columns
            columns = []
            for col in inspector.get_columns(table_name, schema=schema_name):
                default = col.get("default")
                columns.append({
                    "name": col["name"],
                    "type": str(col["type"]),
                    "nullable": col.get("nullable", True),
                    # Compare against None so falsy defaults such as "" are kept
                    "default": str(default) if default is not None else None
                })
            
            # Get primary key
            pk = inspector.get_pk_constraint(table_name, schema=schema_name)
            
            # Get foreign keys
            fks = inspector.get_foreign_keys(table_name, schema=schema_name)
            
            # Get indexes
            indexes = inspector.get_indexes(table_name, schema=schema_name)
            
            table_info = {
                "name": table_name,
                "columns": columns,
                "primary_key": pk.get("constrained_columns", []),
                "foreign_keys": [
                    {
                        "columns": fk["constrained_columns"],
                        # Render as "table(col1, col2)" instead of embedding a Python list repr
                        "references": f"{fk['referred_table']}({', '.join(fk['referred_columns'])})"
                    }
                    for fk in fks
                ],
                "indexes": [
                    {"name": idx["name"], "columns": idx["column_names"], "unique": idx["unique"]}
                    for idx in indexes
                ]
            }
            
            schema_info["tables"].append(table_info)
        
        db_metadata["schemas"].append(schema_info)
    
    return db_metadata

# Example usage (commented out - requires actual database)
# metadata = extract_database_metadata("postgresql://user:pass@localhost:5432/mydb")
# print(json.dumps(metadata, indent=2))

print("Database metadata extraction function defined.")
print("Usage: extract_database_metadata('postgresql://user:pass@host:port/db')")
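
The function above needs SQLAlchemy and a reachable database. For a version that runs anywhere, the sketch below does the same kind of extraction against an in-memory SQLite database using only the standard library. The `extract_sqlite_metadata` helper and the two sample tables are assumptions made for this demo; the `PRAGMA` statements are SQLite-specific and do not carry over to other databases.

```python
import sqlite3


def extract_sqlite_metadata(conn: sqlite3.Connection) -> dict:
    """Collect table, column, and foreign-key metadata from a SQLite connection."""
    tables = {}
    table_rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()

    for (table_name,) in table_rows:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {
                "name": row[1],
                "type": row[2],
                "nullable": not row[3],
                "primary_key": row[5] > 0,
            }
            for row in conn.execute(f'PRAGMA table_info("{table_name}")')
        ]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        foreign_keys = [
            {"column": row[3], "references": f"{row[2]}.{row[4]}"}
            for row in conn.execute(f'PRAGMA foreign_key_list("{table_name}")')
        ]
        tables[table_name] = {"columns": columns, "foreign_keys": foreign_keys}

    return tables


# Demo schema (hypothetical tables, created in memory)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount REAL
    );
""")

sqlite_meta = extract_sqlite_metadata(conn)
conn.close()

for table, info in sqlite_meta.items():
    print(table, [c["name"] for c in info["columns"]], info["foreign_keys"])
```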

In [None]:
# Example 3: AWS Glue Data Catalog interaction (boto3)
import json

# Note: Requires AWS credentials configured
# pip install boto3

class GlueCatalogClient:
    """
    Client for interacting with AWS Glue Data Catalog.
    """
    
    def __init__(self, region_name: str = "us-east-1"):
        try:
            import boto3
            self.client = boto3.client('glue', region_name=region_name)
        except ImportError:
            print("boto3 not installed. Run: pip install boto3")
            self.client = None
    
    def list_databases(self) -> list:
        """List all databases in the Glue catalog."""
        if not self.client:
            return []
        
        databases = []
        paginator = self.client.get_paginator('get_databases')
        
        for page in paginator.paginate():
            for db in page['DatabaseList']:
                databases.append({
                    "name": db['Name'],
                    "description": db.get('Description', ''),
                    "location": db.get('LocationUri', ''),
                    "create_time": str(db.get('CreateTime', ''))
                })
        
        return databases
    
    def get_table_metadata(self, database: str, table: str) -> dict:
        """Get detailed metadata for a specific table."""
        if not self.client:
            return {}
        
        response = self.client.get_table(DatabaseName=database, Name=table)
        table_data = response['Table']
        
        return {
            "name": table_data['Name'],
            "database": database,
            "description": table_data.get('Description', ''),
            "location": table_data.get('StorageDescriptor', {}).get('Location', ''),
            "input_format": table_data.get('StorageDescriptor', {}).get('InputFormat', ''),
            "output_format": table_data.get('StorageDescriptor', {}).get('OutputFormat', ''),
            "columns": [
                {
                    "name": col['Name'],
                    "type": col['Type'],
                    "comment": col.get('Comment', '')
                }
                for col in table_data.get('StorageDescriptor', {}).get('Columns', [])
            ],
            "partition_keys": [
                {"name": pk['Name'], "type": pk['Type']}
                for pk in table_data.get('PartitionKeys', [])
            ],
            "table_type": table_data.get('TableType', ''),
            "create_time": str(table_data.get('CreateTime', '')),
            "update_time": str(table_data.get('UpdateTime', ''))
        }
    
    def search_tables(self, search_text: str, max_results: int = 10) -> list:
        """Search tables across all databases."""
        if not self.client:
            return []
        
        response = self.client.search_tables(
            SearchText=search_text,
            MaxResults=max_results
        )
        
        return [
            {
                "database": t['DatabaseName'],
                "table": t['Name'],
                "description": t.get('Description', '')
            }
            for t in response.get('TableList', [])
        ]

# Example usage (requires AWS credentials)
print("AWS Glue Catalog client class defined.")
print("")
print("Usage example:")
print("  glue = GlueCatalogClient(region_name='us-east-1')")
print("  databases = glue.list_databases()")
print("  table_meta = glue.get_table_metadata('my_database', 'my_table')")

In [None]:
# Example 4: DataHub metadata ingestion using Python SDK
# pip install acryl-datahub

def datahub_emit_dataset_example():
    """
    Example of emitting dataset metadata to DataHub.
    
    Requires: pip install acryl-datahub
    """
    try:
        from datahub.emitter.mce_builder import make_dataset_urn
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DatasetPropertiesClass
        
        # Initialize emitter (connects to DataHub GMS)
        emitter = DatahubRestEmitter("http://localhost:8080")
        
        # Create dataset URN
        dataset_urn = make_dataset_urn(
            platform="snowflake",
            name="analytics.public.customer_metrics"
        )
        
        # Dataset properties (business metadata)
        properties = DatasetPropertiesClass(
            name="Customer Metrics",
            description="Aggregated customer metrics including LTV and churn risk",
            customProperties={
                "owner": "data-engineering",
                "domain": "Finance",
                "pii": "true",
                "refresh_frequency": "daily"
            },
            tags=["production", "critical", "pii"]
        )
        
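        # Emit to DataHub (left commented out: it needs a reachable DataHub instance).
        # emit_mcp takes one MetadataChangeProposalWrapper, not separate keyword arguments:
        # from datahub.emitter.mcp import MetadataChangeProposalWrapper
        # emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))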
        
        print("DataHub emission example prepared.")
        return dataset_urn, properties
        
    except ImportError:
        print("acryl-datahub not installed.")
        print("Install with: pip install acryl-datahub")
        return None, None

# Show the code structure
print("=== DataHub Python SDK Example ===")
print("""
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Connect to DataHub
emitter = DatahubRestEmitter("http://localhost:8080")

# Create dataset URN
dataset_urn = make_dataset_urn(
    platform="snowflake",
    name="analytics.public.customer_metrics"
)

# Emit metadata: wrap the aspect in a MetadataChangeProposalWrapper
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            name="Customer Metrics",
            description="Customer LTV and churn metrics"
        )
    )
)
""")

## Data Discovery and Search

Effective data discovery enables users to find relevant data assets quickly. Modern catalogs provide multiple discovery mechanisms:

### Discovery Approaches

```
┌─────────────────────────────────────────────────────────────────┐
│                    DATA DISCOVERY METHODS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │
│  │   SEARCH        │  │    BROWSE       │  │   RECOMMEND    │  │
│  ├─────────────────┤  ├─────────────────┤  ├────────────────┤  │
│  │ • Keyword       │  │ • By Domain     │  │ • Popular      │  │
│  │ • Semantic      │  │ • By Owner      │  │ • Similar      │  │
│  │ • Filters       │  │ • By Tag        │  │ • Trending     │  │
│  │ • Faceted       │  │ • By Source     │  │ • Personalized │  │
│  └─────────────────┘  └─────────────────┘  └────────────────┘  │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │
│  │   LINEAGE       │  │   GLOSSARY      │  │   GOVERNANCE   │  │
│  ├─────────────────┤  ├─────────────────┤  ├────────────────┤  │
│  │ • Upstream      │  │ • Term search   │  │ • By policy    │  │
│  │ • Downstream    │  │ • Definitions   │  │ • Compliance   │  │
│  │ • Impact        │  │ • Related terms │  │ • Certified    │  │
│  └─────────────────┘  └─────────────────┘  └────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
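
Faceted search, listed under SEARCH above, pairs a result list with counts per attribute value so users can narrow results step by step. A minimal sketch, assuming assets are plain dictionaries; the sample assets and the `facet_counts` / `apply_facets` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical sample assets
assets = [
    {"name": "customer_transactions", "platform": "snowflake", "domain": "Finance", "certified": True},
    {"name": "product_catalog", "platform": "postgres", "domain": "E-Commerce", "certified": True},
    {"name": "user_events", "platform": "s3", "domain": "Analytics", "certified": False},
    {"name": "invoice_lines", "platform": "snowflake", "domain": "Finance", "certified": False},
]


def facet_counts(items, facet_fields):
    """Count how many items fall under each value of each facet field."""
    return {name: Counter(item[name] for item in items) for name in facet_fields}


def apply_facets(items, selected):
    """Keep only items whose fields equal every selected facet value."""
    return [item for item in items if all(item[k] == v for k, v in selected.items())]


counts = facet_counts(assets, ["platform", "domain"])
print(dict(counts["platform"]))

narrowed = apply_facets(assets, {"platform": "snowflake", "certified": True})
print([a["name"] for a in narrowed])
```

A catalog UI would typically recompute the counts after each selection so the remaining facets reflect only the narrowed result set.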

### Search Best Practices

| Practice | Description |
|----------|-------------|
| **Rich descriptions** | Write clear, searchable descriptions |
| **Consistent tagging** | Use standardized tag taxonomy |
| **Business terms** | Link to glossary terms |
| **Ownership** | Assign clear data owners |
| **Certification** | Mark trusted, verified datasets |
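
These practices are easier to sustain when they are checked automatically. The sketch below audits one asset record against the five rows of the table. The tag taxonomy, the dictionary keys, and the five-word description threshold are assumptions chosen for illustration.

```python
# Hypothetical tag taxonomy; a real one would come from the governance team
ALLOWED_TAGS = {"pii", "finance", "daily", "critical", "revenue"}


def audit_asset(asset: dict, min_description_words: int = 5) -> list:
    """Return human-readable findings; an empty list means the asset passes."""
    findings = []

    # Rich descriptions
    if len(asset.get("description", "").split()) < min_description_words:
        findings.append("description is missing or too short")

    # Consistent tagging
    unknown_tags = sorted(set(asset.get("tags", [])) - ALLOWED_TAGS)
    if unknown_tags:
        findings.append(f"tags outside taxonomy: {unknown_tags}")

    # Business terms
    if not asset.get("glossary_terms"):
        findings.append("no glossary terms linked")

    # Ownership
    if not asset.get("owner"):
        findings.append("no owner assigned")

    # Certification
    if not asset.get("certified", False):
        findings.append("not certified")

    return findings


well_documented = {
    "name": "customer_transactions",
    "description": "Daily customer transaction records with payment details",
    "tags": ["finance", "daily", "pii"],
    "glossary_terms": ["Churn Rate"],
    "owner": "finance-team",
    "certified": True,
}
poorly_documented = {"name": "tmp_export", "description": "temp", "tags": ["misc"], "owner": ""}

print(audit_asset(well_documented))
for finding in audit_asset(poorly_documented):
    print("-", finding)
```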

In [None]:
# Example 5: Simple in-memory data catalog with search
from dataclasses import dataclass, field
from typing import List, Optional, Dict
from datetime import datetime
import re

@dataclass
class DataAsset:
    """Represents a data asset in the catalog."""
    id: str
    name: str
    description: str
    platform: str  # e.g., "snowflake", "s3", "postgres"
    schema: str
    owner: str
    tags: List[str] = field(default_factory=list)
    domain: str = ""
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)
    columns: List[Dict] = field(default_factory=list)
    certified: bool = False
    pii: bool = False


class SimpleDataCatalog:
    """A simple in-memory data catalog with search capabilities."""
    
    def __init__(self):
        self.assets: Dict[str, DataAsset] = {}
        self.search_index: Dict[str, set] = {}  # term -> asset_ids
    
    def register_asset(self, asset: DataAsset) -> None:
        """Register a new data asset in the catalog."""
        self.assets[asset.id] = asset
        self._index_asset(asset)
        print(f"Registered: {asset.name}")
    
    def _index_asset(self, asset: DataAsset) -> None:
        """Build search index for the asset."""
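        # Index name, description, owner, domain, platform, and tags
        terms = set()
        
        # Keep whole tokens ("user_events") and their alphanumeric parts ("user", "events"),
        # so queries match underscored names and hyphenated tags such as "real-time"
        for text in [asset.name, asset.description, asset.owner, asset.domain, asset.platform, *asset.tags]:
            lowered = text.lower()
            terms.update(re.findall(r'\w+', lowered))
            terms.update(re.findall(r'[a-z0-9]+', lowered))
        
        terms.update(t.lower() for t in asset.tags)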
        
        # Add to inverted index
        for term in terms:
            if term not in self.search_index:
                self.search_index[term] = set()
            self.search_index[term].add(asset.id)
    
    def search(self, query: str, filters: Optional[Dict] = None) -> List[DataAsset]:
        """
        Search for assets matching the query.
        
        Args:
            query: Search terms
            filters: Optional filters (platform, domain, certified, pii)
        """
        query_terms = re.findall(r'\w+', query.lower())
        
        if not query_terms:
            matching_ids = set(self.assets.keys())
        else:
            # Find assets matching all query terms (AND logic)
            matching_ids = None
            for term in query_terms:
                term_matches = self.search_index.get(term, set())
                if matching_ids is None:
                    matching_ids = term_matches.copy()
                else:
                    matching_ids &= term_matches
            
            matching_ids = matching_ids or set()
        
        # Apply filters
        results = [self.assets[aid] for aid in matching_ids]
        
        if filters:
            if "platform" in filters:
                results = [a for a in results if a.platform == filters["platform"]]
            if "domain" in filters:
                results = [a for a in results if a.domain == filters["domain"]]
            if "certified" in filters:
                results = [a for a in results if a.certified == filters["certified"]]
            if "pii" in filters:
                results = [a for a in results if a.pii == filters["pii"]]
        
        return results
    
    def browse_by_domain(self) -> Dict[str, List[str]]:
        """Browse assets grouped by domain."""
        domains = {}
        for asset in self.assets.values():
            domain = asset.domain or "Uncategorized"
            if domain not in domains:
                domains[domain] = []
            domains[domain].append(asset.name)
        return domains
    
    def get_asset(self, asset_id: str) -> Optional[DataAsset]:
        """Get a specific asset by ID."""
        return self.assets.get(asset_id)


# Demo the catalog
catalog = SimpleDataCatalog()

# Register sample assets
catalog.register_asset(DataAsset(
    id="ds-001",
    name="customer_transactions",
    description="Daily customer transaction records with payment details",
    platform="snowflake",
    schema="analytics.finance",
    owner="finance-team",
    tags=["transactions", "finance", "daily"],
    domain="Finance",
    certified=True,
    pii=True
))

catalog.register_asset(DataAsset(
    id="ds-002",
    name="product_catalog",
    description="Master product catalog with SKU and pricing information",
    platform="postgres",
    schema="ecommerce.products",
    owner="product-team",
    tags=["products", "pricing", "master-data"],
    domain="E-Commerce",
    certified=True
))

catalog.register_asset(DataAsset(
    id="ds-003",
    name="user_events",
    description="Clickstream and user behavior events from web and mobile",
    platform="s3",
    schema="raw.events",
    owner="analytics-team",
    tags=["events", "clickstream", "real-time"],
    domain="Analytics"
))

print("\n=== Search Examples ===")
print("\nSearch 'customer':")
for asset in catalog.search("customer"):
    print(f"  - {asset.name} ({asset.platform})")

print("\nSearch 'finance' with certified=True filter:")
for asset in catalog.search("finance", filters={"certified": True}):
    print(f"  - {asset.name} (certified: {asset.certified})")

print("\nBrowse by domain:")
for domain, assets in catalog.browse_by_domain().items():
    print(f"  {domain}: {assets}")

## Key Takeaways

### Summary

| Concept | Key Points |
|---------|------------|
| **Metadata Types** | Technical (schema), Business (context), Operational (runtime) |
| **Data Catalog** | Centralized inventory for discovery, governance, and collaboration |
| **Tool Selection** | Choose based on ecosystem and needs (AWS→Glue, Hadoop→Atlas, enterprise features→Atlan, otherwise→DataHub) |
| **Discovery** | Enable search, browse, and recommendations for self-service |

### Best Practices

1. **Automate metadata collection** – Use crawlers and integrations to keep metadata fresh
2. **Define ownership** – Every dataset needs a clear owner and steward
3. **Establish a business glossary** – Standardize terminology across the organization
4. **Track lineage** – Understand data flow for impact analysis and debugging
5. **Classify sensitive data** – Tag PII/sensitive data for compliance
6. **Enable collaboration** – Allow comments, ratings, and knowledge sharing
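
As a starting point for practice 5, sensitive columns can be flagged by matching column names against known patterns. The patterns and the `classify_columns` helper below are illustrative assumptions; name matching alone produces false positives and misses, so real classifiers usually inspect sample values as well.

```python
import re

# Hypothetical name patterns, one per PII category
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)_?name|^name$", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social_security", re.IGNORECASE),
}


def classify_columns(column_names):
    """Map each column that looks sensitive to the PII categories it matches."""
    flagged = {}
    for column in column_names:
        labels = [label for label, pattern in PII_PATTERNS.items() if pattern.search(column)]
        if labels:
            flagged[column] = labels
    return flagged


columns = ["customer_id", "name", "email_address", "revenue", "phone_number", "last_name", "product_name"]
flagged = classify_columns(columns)
for column, labels in flagged.items():
    print(f"{column}: {labels}")
```

The resulting labels can then be written back to the catalog as tags, which is what makes them usable for access policies and compliance reports.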

### Tool Selection Guide

```
                    ┌─────────────────────────────────┐
                    │     Which catalog to use?       │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │   Using AWS ecosystem heavily?  │
                    └────────────────┬────────────────┘
                           Yes │           │ No
                    ┌──────────▼──┐  ┌─────▼──────────┐
                    │  AWS Glue   │  │ Hadoop-based?  │
                    │  Data       │  └────────┬───────┘
                    │  Catalog    │     Yes │     │ No
                    └─────────────┘  ┌──────▼──┐ ┌─▼──────────┐
                                     │ Apache  │ │ Enterprise │
                                     │ Atlas   │ │ features?  │
                                     └─────────┘ └────┬───────┘
                                                Yes │     │ No
                                               ┌────▼──┐ ┌─▼──────┐
                                               │ Atlan │ │DataHub │
                                               └───────┘ └────────┘
```
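
The same decision tree, written as a function. This only encodes the diagram above; the parameter names are made up here, and a real selection would weigh more factors (budget, team skills, existing tooling).

```python
def recommend_catalog(uses_aws_heavily: bool, hadoop_based: bool, needs_enterprise_features: bool) -> str:
    """Walk the decision tree from top to bottom and return the suggested tool."""
    if uses_aws_heavily:
        return "AWS Glue Data Catalog"
    if hadoop_based:
        return "Apache Atlas"
    if needs_enterprise_features:
        return "Atlan"
    return "DataHub"


print(recommend_catalog(uses_aws_heavily=False, hadoop_based=False, needs_enterprise_features=True))
```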

### Further Reading

- [DataHub Documentation](https://datahubproject.io/docs/)
- [AWS Glue Data Catalog Guide](https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html)
- [Apache Atlas Architecture](https://atlas.apache.org/Architecture.html)
- [Atlan website and blog](https://atlan.com/)