[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/02_Data_Ingestion.ipynb)

# Data Ingestion - Comprehensive Guide

## Overview

This notebook is a guide to Semantica's data ingestion capabilities. It walks through the submodules, classes, and helper functions available in the `semantica.ingest` module.

**Documentation**: [Ingest API Reference](https://semantica.readthedocs.io/reference/ingest/)

### Table of Contents

1.  **Unified Ingestion**: `ingest` function
2.  **File Ingestion**: `FileIngestor`, `FileTypeDetector`, `CloudStorageIngestor`
3.  **Web Ingestion**: `WebIngestor`, `ContentExtractor`, `SitemapCrawler`, `RobotsChecker`
4.  **Feed Ingestion**: `FeedIngestor`, `FeedMonitor`
5.  **Stream Ingestion**: `StreamIngestor`, `StreamMonitor`
6.  **Repository Ingestion**: `RepoIngestor`, `CodeExtractor`, `GitAnalyzer`
7.  **Email Ingestion**: `EmailIngestor`, `AttachmentProcessor`
8.  **Database Ingestion**: `DBIngestor`, `DatabaseConnector`
9.  **MCP Ingestion**: `MCPIngestor`
10. **Configuration**: `IngestConfig`

## Installation

Install Semantica with all dependencies:

```bash
pip install "semantica[all]"
```

---

## 1. Unified Ingestion

The `ingest` function is the main entry point for quick data loading. It automatically detects the source type.


In [1]:
!pip install semantica
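
To make "automatically detects the source type" concrete, here is a minimal sketch of how a source string could be classified before dispatching to a specific ingestor. It uses only the standard library. The helper name `guess_source_type` and the labels it returns are hypothetical, and this is not Semantica's actual detection logic.

```python
import os
from urllib.parse import urlparse


def guess_source_type(source: str) -> str:
    """Hypothetical helper: classify a source string by its shape."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Remote Git repositories conventionally end in ".git"
        if parsed.path.lower().endswith(".git"):
            return "repo"
        return "web"
    if os.path.isdir(source):
        return "directory"
    # Anything else is treated as a path to a single file
    return "file"


for candidate in ("https://github.com/psf/requests.git",
                  "https://en.wikipedia.org/wiki/Artificial_intelligence",
                  ".",
                  "notes.txt"):
    print(f"{candidate} -> {guess_source_type(candidate)}")
```

A real dispatcher would likely need more cases (feeds, databases, streams), which cannot be told apart from the URL shape alone.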






## 2. File Ingestion

Detailed control over file processing using `FileIngestor` and helper classes.


In [4]:
import os
import tempfile
from semantica.ingest import FileIngestor, FileTypeDetector, CloudStorageIngestor

# Ensure dependencies from previous cells are available
if 'temp_dir' not in locals():
    temp_dir = tempfile.mkdtemp()
    print(f"Created temporary directory: {temp_dir}")

if 'sample_file' not in locals():
    sample_file = os.path.join(temp_dir, "sample_large.txt")

if not os.path.exists(sample_file):
    # Create a sample file with a lot of info
    with open(sample_file, 'w') as f:
        f.write("# Semantica Data Ingestion Guide\n\n")
        f.write("Semantica is a powerful framework for semantic data processing.\n")
        # ... (more content) ...
    print(f"Created sample file: {sample_file}")

# --- FileTypeDetector ---
detector = FileTypeDetector()
detected_type = detector.detect_type(sample_file)
print(f"Detected Type: {detected_type}")
# ...

Detected Type: txt
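
For intuition about what a file type detector has to do, the sketch below checks a file's leading bytes against a few well-known signatures and falls back to the extension. It is a standard-library illustration only; the function `sniff_file_type` and the signature table are assumptions made for this example, not the internals of `FileTypeDetector`.

```python
import os
import tempfile

# A few well-known file signatures ("magic bytes"); illustrative, not exhaustive.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",  # also the container format of docx/xlsx files
}


def sniff_file_type(path: str) -> str:
    """Hypothetical helper: prefer content signatures, then the extension."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    for signature, label in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return label
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    return extension or "unknown"


demo_dir = tempfile.mkdtemp()

plain_path = os.path.join(demo_dir, "notes.txt")
with open(plain_path, "w", encoding="utf-8") as fh:
    fh.write("plain text")

# A PDF header saved under a misleading .txt extension
mislabeled_path = os.path.join(demo_dir, "report.txt")
with open(mislabeled_path, "wb") as fh:
    fh.write(b"%PDF-1.7\n")

plain_type = sniff_file_type(plain_path)
mislabeled_type = sniff_file_type(mislabeled_path)
print(plain_type, mislabeled_type)
```

Checking content before the extension means a mislabeled file is still routed to the right parser.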


## 3. Web Ingestion

Scraping and crawling with `WebIngestor`, `ContentExtractor`, `SitemapCrawler`, and `RobotsChecker`.


In [None]:
import requests
from semantica.ingest import WebIngestor, ContentExtractor, SitemapCrawler, RobotsChecker

# --- ContentExtractor ---
# Demonstrating extraction from a real, content-rich web page
extractor = ContentExtractor()
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
try:
    # Wikipedia requires a User-Agent header
    headers = {'User-Agent': 'Semantica/1.0 (Education/Example)'}
    # A timeout keeps the cell from hanging if the server does not respond
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Treat HTTP error statuses as failures
    html_content = response.text
    print(f"Fetched content from {url}")
except requests.RequestException as e:
    print(f"Failed to fetch {url}: {e}")
    # Fallback content
    html_content = "<html><body><h1>Hello World</h1><p>This is a test.</p><a href='/link'>Link</a></body></html>"

text = extractor.extract_text(html_content)
links = extractor.extract_links(html_content, base_url=url)
print(f"Extracted Text (excerpt): {text[:200]}...")
print(f"Found {len(links)} links")

# --- RobotsChecker ---
# Initialize with user agent
checker = RobotsChecker(user_agent="SemanticaBot")
# Check if we can fetch a specific page (e.g. Wikipedia Special pages are often restricted)
check_url = "https://en.wikipedia.org/wiki/Special:Search"
can_fetch = checker.can_fetch(check_url)
print(f"Can fetch {check_url}? {can_fetch}")

# --- WebIngestor ---
# Configure WebIngestor to be polite but allow the demo to run
web_ingestor = WebIngestor(
    delay=1.0,
    user_agent="Semantica/1.0 (Education/Example)",
    respect_robots=False  # Disabled for this demo to ensure Wikipedia access
)
try:
    web_content = web_ingestor.ingest_url(url)
    print(f"Web Content Title: {web_content.title}")
except Exception as e:
    print(f"Web ingest failed: {e}")

# --- SitemapCrawler ---
crawler = SitemapCrawler()
try:
    # Using FastAPI documentation sitemap as a clean, technical example
    sitemap_url = "https://fastapi.tiangolo.com/sitemap.xml"
    urls = crawler.parse_sitemap(sitemap_url)
    print(f"Found {len(urls)} URLs in sitemap: {sitemap_url}")
except Exception as e:
    print(f"Sitemap crawl failed: {e}")
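
The text and link extraction step above can be approximated with the standard library alone. The following sketch runs offline on the fallback HTML from the cell; the class `TextAndLinkParser` and the base URL are hypothetical, and `ContentExtractor` may work quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextAndLinkParser(HTMLParser):
    """Hypothetical helper: collect visible text and absolute link targets."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only nodes
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


sample_html = ("<html><body><h1>Hello World</h1><p>This is a test.</p>"
               "<a href='/link'>Link</a></body></html>")
parser = TextAndLinkParser(base_url="https://example.com/docs/")
parser.feed(sample_html)
parser.close()

extracted_text = " ".join(parser.text_parts)
print(extracted_text)
print(parser.links)
```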

## 4. Feed Ingestion

Consuming RSS/Atom feeds with `FeedIngestor` and monitoring with `FeedMonitor`.


In [None]:
from semantica.ingest import FeedIngestor, FeedMonitor
import time

# --- FeedIngestor ---
feed_ingestor = FeedIngestor()
# Using Lilian Weng's AI Blog RSS feed as a reliable source
feed_url = "https://lilianweng.github.io/index.xml"
try:
    feed_data = feed_ingestor.ingest_feed(feed_url)
    print(f"Feed Title: {feed_data.title}")
    if feed_data.items:
        print(f"Latest Post: {feed_data.items[0].title}")
except Exception as e:
    print(f"Feed ingest failed: {e}")

# --- FeedMonitor ---
def feed_callback(feed_url, new_items):
    print(f"Feed Updated: {feed_url} with {len(new_items)} new items")

monitor = FeedMonitor(check_interval=5)
try:
    monitor.add_feed(feed_url)
    monitor.set_update_callback(feed_callback)
    monitor.start_monitoring()
    time.sleep(2) # Let it run briefly
    monitor.stop_monitoring()
except Exception as e:
    print(f"Feed monitor failed: {e}")
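
Conceptually, feed ingestion parses the feed XML into items, and monitoring compares each poll against the items already seen. The sketch below shows both steps on an inline RSS 2.0 document using only the standard library. The helpers `parse_rss` and `new_items` are hypothetical and are not part of `FeedIngestor` or `FeedMonitor`.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>Second post</title><link>https://example.com/2</link></item>
    <item><title>First post</title><link>https://example.com/1</link></item>
  </channel>
</rss>"""


def parse_rss(xml_text: str) -> dict:
    """Hypothetical helper: extract the channel title and its items."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [
        {"title": item.findtext("title", default=""),
         "link": item.findtext("link", default="")}
        for item in channel.findall("item")
    ]
    return {"title": channel.findtext("title", default=""), "items": items}


def new_items(items: list, seen_links: set) -> list:
    """Return items whose link has not been seen, and remember them."""
    fresh = [item for item in items if item["link"] not in seen_links]
    seen_links.update(item["link"] for item in fresh)
    return fresh


feed = parse_rss(SAMPLE_RSS)
seen = set()
first_poll = new_items(feed["items"], seen)   # every item is new on the first poll
second_poll = new_items(feed["items"], seen)  # nothing changed, so nothing is new
print(feed["title"], len(first_poll), len(second_poll))
```

Atom feeds use namespaced `entry` elements instead of `item`, so a production parser needs a second code path for them.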


## 5. Stream Ingestion

Real-time processing with `StreamIngestor` and `StreamMonitor`.


In [None]:
from semantica.ingest import StreamIngestor, StreamMonitor

stream_ingestor = StreamIngestor()

# --- Kafka Processor ---
# Note: This requires a running Kafka instance. We wrap it in try-except for the demo.
kafka_config = {"bootstrap_servers": ["localhost:9092"]}
try:
    kafka_processor = stream_ingestor.ingest_kafka("my-topic", **kafka_config)
    print("Kafka processor initialized.")
except Exception as e:
    print(f"Kafka ingest skipped (requires active broker): {e}")

# --- RabbitMQ Processor ---
# Note: This requires a running RabbitMQ instance. We wrap it in try-except for the demo.
try:
    rabbitmq_processor = stream_ingestor.ingest_rabbitmq("my-queue", "amqp://guest:guest@localhost:5672/")
    print("RabbitMQ processor initialized.")
except Exception as e:
    print(f"RabbitMQ ingest skipped (requires active broker): {e}")

# --- Stream Monitor ---
monitor = stream_ingestor.monitor
health = monitor.check_health()
print(f"Stream Health: {health['overall']}")
print(f"Processors: {list(health['processors'].keys())}")
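
Since the Kafka and RabbitMQ calls need a live broker, the consume-and-monitor pattern can be tried locally with an in-memory queue instead. This is a sketch under stated assumptions: `QueueProcessor` and this `check_health` function are hypothetical stand-ins, and the report only mirrors the `overall` and `processors` keys read in the cell above, not Semantica's actual health schema.

```python
import queue
import threading


class QueueProcessor:
    """Hypothetical stand-in for a broker consumer: drains a queue on a worker thread."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.messages = queue.Queue()
        self.processed = 0
        self.errors = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self.messages.put(None)  # Sentinel tells the worker to exit
        self._thread.join()

    def _run(self):
        while True:
            message = self.messages.get()
            if message is None:
                break
            try:
                self.handler(message)
                self.processed += 1
            except Exception:
                self.errors += 1


def check_health(processors):
    """Summarize per-processor counters into one report."""
    report = {p.name: {"processed": p.processed, "errors": p.errors} for p in processors}
    overall = "healthy" if all(p.errors == 0 for p in processors) else "degraded"
    return {"overall": overall, "processors": report}


received = []
processor = QueueProcessor("my-topic", received.append)
processor.start()
for payload in ("a", "b", "c"):
    processor.messages.put(payload)
processor.stop()  # Returns after all queued messages are handled

health_report = check_health([processor])
print(health_report)
```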


## 6. Repository Ingestion

Analyzing codebases with `RepoIngestor`, `CodeExtractor`, and `GitAnalyzer`.


In [None]:
from semantica.ingest import RepoIngestor, CodeExtractor, GitAnalyzer
from pathlib import Path
import os

# --- CodeExtractor ---
code_extractor = CodeExtractor()
py_code = "class MyClass:\n    def my_method(self):\n        pass"
# Note: Using internal method _extract_structure for demonstration on string input
structure = code_extractor._extract_structure(py_code, language="python")
print(f"Classes: {structure.get('classes')}")
print(f"Functions: {structure.get('functions')}")

# --- RepoIngestor ---
repo_ingestor = RepoIngestor()
try:
    # Ingesting a public repository (requests) for reliable demonstration
    repo_data = repo_ingestor.ingest_repository("https://github.com/psf/requests.git")
    # Accessing repo info from the returned dictionary
    repo_info = repo_data.get('repository_info', {})
    print(f"Ingested Repo URL: {repo_info.get('url')}")
    # Show the first 5 branches; fall back to an empty list if the key is missing
    print(f"Branches: {(repo_info.get('branches') or [])[:5]}...")
    repo_ingestor.cleanup() # Clean up temp files
except Exception as e:
    print(f"Repo ingest failed: {e}")

# --- GitAnalyzer ---
try:
    # Initialize analyzer
    analyzer = GitAnalyzer()
    
    # Use current directory for demonstration
    current_path = Path(".")
    
    # Metrics calculation
    metrics = analyzer.calculate_metrics(current_path)
    print(f"Total Files (recursive): {metrics.get('total_files')}")
    print(f"Total Lines: {metrics.get('total_lines')}")
except Exception as e:
    print(f"Git analysis failed: {e}")
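
For Python sources, the kind of structure extraction shown above can be reproduced with the standard `ast` module, which avoids relying on a private method. This is a minimal sketch; the function `extract_structure` is a hypothetical name and only approximates what `CodeExtractor` reports.

```python
import ast


def extract_structure(source: str) -> dict:
    """Hypothetical helper: list class and function names found in Python source."""
    tree = ast.parse(source)
    classes, functions = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            classes.append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Methods are included, since they are function definitions too
            functions.append(node.name)
    return {"classes": classes, "functions": functions}


py_code = "class MyClass:\n    def my_method(self):\n        pass"
code_structure = extract_structure(py_code)
print(code_structure)
```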


## 7. Email Ingestion

Processing emails with `EmailIngestor` and `AttachmentProcessor`.


In [None]:
from semantica.ingest import EmailIngestor, AttachmentProcessor
import tempfile
import os

# Look up the system temp directory (AttachmentProcessor manages its own temp dir)
temp_dir = tempfile.gettempdir()

# --- AttachmentProcessor ---
att_processor = AttachmentProcessor()
dummy_content = b"PDF Content"
# process_attachment saves the file and returns metadata, including the saved path
att_info = att_processor.process_attachment(dummy_content, "doc.pdf", "application/pdf")
print(f"Saved attachment to: {att_info.get('saved_path')}")

# --- EmailIngestor ---
email_ingestor = EmailIngestor()
try:
    # Note: These are placeholder credentials, so the connection is expected to fail.
    # The try block lets the notebook continue past the failure.
    email_ingestor.connect_imap("imap.gmail.com", "user", "pass")
    emails = email_ingestor.ingest_mailbox("INBOX", max_emails=5)
    print(f"Fetched {len(emails)} emails")
except Exception as e:
    print(f"Email ingest skipped (Auth required): {e}")

# Clean up any temp files created by the attachment processor
att_processor.cleanup_attachments()
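
Because the IMAP call needs real credentials, the parsing half of email ingestion can be explored offline instead. The sketch below builds a message in memory and reads it back with Python's standard `email` package. The addresses and subject are made up for illustration, and none of this reflects how `EmailIngestor` is implemented.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Build a small message with one attachment (all values are illustrative)
outgoing = EmailMessage()
outgoing["From"] = "alice@example.com"
outgoing["To"] = "bob@example.com"
outgoing["Subject"] = "Quarterly report"
outgoing.set_content("Report attached.")
outgoing.add_attachment(b"PDF Content", maintype="application",
                        subtype="pdf", filename="doc.pdf")

# Serialize to raw bytes, which is the form an IMAP fetch would return
raw_bytes = outgoing.as_bytes()

# Parse the raw bytes back into a structured message
parsed = BytesParser(policy=policy.default).parsebytes(raw_bytes)
body = parsed.get_body(preferencelist=("plain",)).get_content()
attachments = [
    {"filename": part.get_filename(),
     "content_type": part.get_content_type(),
     "size": len(part.get_content())}
    for part in parsed.iter_attachments()
]

print(parsed["Subject"])
print(body.strip())
print(attachments)
```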


## 8. Database Ingestion

Connecting to SQL databases with `DBIngestor` and `DatabaseConnector`.


In [None]:
from semantica.ingest import DBIngestor, DatabaseConnector
import sqlite3
import os
import tempfile

# Setup SQLite DB in temp dir
temp_dir = tempfile.gettempdir()
db_path = os.path.join(temp_dir, "test.db")
if os.path.exists(db_path):
    os.remove(db_path)

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE items (id INT, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'Item 1'), (2, 'Item 2')")
conn.commit()
conn.close()

# --- DatabaseConnector ---
connector = DatabaseConnector()
# Open the connection with the connector's connect() method
engine = connector.connect(f"sqlite:///{db_path}")
# engine.name for sqlite is 'sqlite'
print(f"Connected to DB Driver: {engine.name}")
connector.disconnect()

# --- DBIngestor ---
db_ingestor = DBIngestor()
# export_table returns a single TableData object for the named table
table_data = db_ingestor.export_table(f"sqlite:///{db_path}", table_name="items")
print(f"Table: {table_data.table_name}")
print(f"Rows: {table_data.row_count}")
print(f"Data: {table_data.rows}")
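
To see what a table export involves underneath, the same `items` table can be read with the standard `sqlite3` module. This is a sketch only: the function `export_table` below is a hypothetical plain-Python helper that returns a dict, not the `DBIngestor.export_table` method or its `TableData` type.

```python
import sqlite3


def export_table(conn: sqlite3.Connection, table_name: str) -> dict:
    """Hypothetical helper: read every row of a table into a list of dicts."""
    # Identifiers cannot be bound as SQL parameters, so validate the name first
    known = {row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table_name not in known:
        raise ValueError(f"Unknown table: {table_name!r}")
    cursor = conn.execute(f'SELECT * FROM "{table_name}"')
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return {"table_name": table_name, "columns": columns,
            "row_count": len(rows), "rows": rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INT, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'Item 1'), (2, 'Item 2')")
exported = export_table(conn, "items")
conn.close()

print(exported["columns"], exported["row_count"])
print(exported["rows"])
```

Checking the table name against `sqlite_master` guards against SQL injection through the identifier, which matters if table names ever come from user input.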


## 9. MCP Ingestion

Integrating with Model Context Protocol servers using `MCPIngestor`.


In [None]:
from semantica.ingest import MCPIngestor
import logging

# --- MCPIngestor ---
mcp_ingestor = MCPIngestor()

# Public Daemon MCP Server
# Source: https://danielmiessler.com/p/daemon-mcp-server
mcp_server_url = "https://mcp.daemon.danielmiessler.com"

try:
    print(f"Connecting to public MCP server: {mcp_server_url}...")
    
    # This server supports standard JSON-RPC over HTTP
    mcp_ingestor.connect("daemon_server", url=mcp_server_url)

    # 1. List Available Tools
    print("\n--- Available Tools ---")
    tools = mcp_ingestor.list_available_tools("daemon_server")
    # Print the first 5 tools to avoid clutter
    for tool in tools[:5]:
        print(f"- {tool.name}: {tool.description or 'No description'}")
    if len(tools) > 5:
        print(f"... and {len(tools) - 5} more.")

    # 2. Call Tool (get_about)
    tool_name = "get_about"
    print(f"\n--- Calling Tool '{tool_name}' ---")
    
    result = mcp_ingestor.ingest_tool_output("daemon_server", tool_name, {})
    
    # Parse content
    content = result.content.get('content', [])
    if content and isinstance(content, list):
        for block in content:
            if block.get('type') == 'text':
                # Truncate if too long
                text = block.get('text', '')
                preview = text[:200] + "..." if len(text) > 200 else text
                print(f"Result: {preview}")
    else:
        print(f"Raw Result: {result.content}")

except Exception as e:
    print(f"MCP Ingestion failed: {e}")
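
MCP messages are JSON-RPC 2.0, and the protocol defines `tools/list` and `tools/call` methods for the two steps shown above. The sketch below builds those request payloads and parses a canned response offline. The helper names and the sample response are illustrative assumptions; a real session also needs an `initialize` handshake and a transport, which `MCPIngestor` presumably handles.

```python
import json
from itertools import count

_request_ids = count(1)


def jsonrpc_request(method: str, params: dict = None) -> dict:
    """Hypothetical helper: build a JSON-RPC 2.0 request object."""
    request = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


def text_blocks(response: dict) -> list:
    """Collect the text of every 'text' content block in a tool result."""
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown error"))
    content = response.get("result", {}).get("content", [])
    return [block.get("text", "") for block in content if block.get("type") == "text"]


list_request = jsonrpc_request("tools/list")
call_request = jsonrpc_request("tools/call", {"name": "get_about", "arguments": {}})
print(json.dumps(call_request, indent=2))

# Canned response for illustration; not actual output from the Daemon server
sample_response = {
    "jsonrpc": "2.0",
    "id": call_request["id"],
    "result": {"content": [{"type": "text", "text": "About text"}]},
}
tool_texts = text_blocks(sample_response)
print(tool_texts)
```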


## 10. Configuration

Managing ingestion settings with `IngestConfig`.


In [None]:
from semantica.ingest import IngestConfig, ingest_config

# Global config
print(f"Default Source Type: {ingest_config.get('default_source_type')}")

# Custom config instance
config = IngestConfig()
config.set("max_file_size", 1024 * 1024)  # 1 MiB
print(f"Max File Size: {config.get('max_file_size')} bytes")
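
One common way to structure `get`/`set` configuration like this is to layer explicit overrides on top of environment variables and built-in defaults. The sketch below illustrates that lookup order. The class `LayeredConfig`, the `INGEST_` prefix, and the default values are all hypothetical, and `IngestConfig` may resolve settings differently.

```python
import os


class LayeredConfig:
    """Hypothetical config: explicit set() > environment variable > default."""

    def __init__(self, defaults: dict, env_prefix: str = "INGEST_", environ=None):
        self._defaults = dict(defaults)
        self._overrides = {}
        self._env_prefix = env_prefix
        # Allow injecting a mapping so the lookup can be tested without real env vars
        self._environ = os.environ if environ is None else environ

    def set(self, key: str, value) -> None:
        self._overrides[key] = value

    def get(self, key: str, default=None):
        if key in self._overrides:
            return self._overrides[key]
        env_value = self._environ.get(self._env_prefix + key.upper())
        if env_value is not None:
            return env_value  # Environment values are always strings
        return self._defaults.get(key, default)


demo_config = LayeredConfig(
    defaults={"max_file_size": 10 * 1024 * 1024},
    environ={"INGEST_TIMEOUT": "30"},
)
size_before = demo_config.get("max_file_size")
demo_config.set("max_file_size", 1024 * 1024)
size_after = demo_config.get("max_file_size")
timeout = demo_config.get("timeout")
print(size_before, size_after, timeout)
```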
