# Multi-Source Data Ingestion

## Overview

This notebook demonstrates how to ingest data from multiple sources (files, web pages, feeds, streams, and databases) and process the resulting documents through a unified pipeline.

### Learning Objectives

- Ingest data from files, web pages, RSS/Atom feeds, streams, and databases
- Combine documents from multiple sources into a single collection
- Process the combined documents through a unified pipeline

---

## Unified Processing Pipeline

Semantica provides specialized ingestors for different data sources, all producing a unified document format that can be processed together.
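
As a rough illustration of what a unified document format could look like, the sketch below defines a hypothetical `UnifiedDocument` dataclass and two hypothetical adapter functions (`from_file_record`, `from_db_row`). None of these names come from Semantica, and its actual document classes may look quite different. The point is only that every source is mapped onto the same small set of fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class UnifiedDocument:
    """Hypothetical common record; not Semantica's actual document class."""
    source_type: str   # e.g. "file", "web", "feed", "stream", "db"
    source_id: str     # path, URL, topic name, or table name
    content: str       # extracted text
    metadata: Dict[str, Any] = field(default_factory=dict)


def from_file_record(path: str, text: str) -> UnifiedDocument:
    # Hypothetical adapter: wrap file text in the common record
    return UnifiedDocument("file", path, text, {"length": len(text)})


def from_db_row(table: str, row: Dict[str, Any], text_column: str) -> UnifiedDocument:
    # Hypothetical adapter: keep the non-text columns as metadata
    meta = {k: v for k, v in row.items() if k != text_column}
    return UnifiedDocument("db", table, str(row[text_column]), meta)


example_docs = [
    from_file_record("notes.txt", "Quarterly summary."),
    from_db_row("articles", {"id": 7, "body": "Release notes."}, "body"),
]
for doc in example_docs:
    print(doc.source_type, doc.source_id, len(doc.content))
```

Because both adapters return the same type, downstream code can iterate over `example_docs` without checking where each document came from.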

---

## Step 1: Ingest from Files

Start by ingesting documents from local files or directories.


In [None]:
from semantica.ingest import FileIngestor
from pathlib import Path

file_ingestor = FileIngestor()

sample_file = Path("sample_file.txt")
sample_file.write_text("Sample file content for ingestion demonstration.")

try:
    # The result is treated as a single document object here, not a list
    file_docs = file_ingestor.ingest_file(sample_file, read_content=True)
    print("✓ File ingested successfully!")
    print(f"  Document: {getattr(file_docs, 'name', 'N/A')}")
except Exception as e:
    print(f"✗ Error ingesting file: {e}")
    file_docs = None  # falsy, so Step 6 skips it


## Step 2: Ingest from Web

Ingest content from web pages using the `WebIngestor`.


In [None]:
from semantica.ingest import WebIngestor

web_ingestor = WebIngestor()

print("Web ingestion example:")
print("  web_docs = web_ingestor.ingest('https://example.com')")
print("\nNote: Actual web ingestion requires valid URLs and network access")
web_docs = []
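
The cell above only prints the call. To give a feel for what turning a page into document text involves, here is a minimal standard-library sketch that strips markup from an inline HTML string, so it runs without network access. It does not use `WebIngestor` and is not a description of how Semantica extracts text. `TextExtractor` and `html_to_text` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


sample_html = """
<html><head><title>Example</title><style>p {color: red;}</style></head>
<body><h1>Heading</h1><p>First paragraph.</p><script>var x = 1;</script></body></html>
"""
page_text = html_to_text(sample_html)
print(page_text)
```

A real ingestor would also need to fetch the page, handle encodings and redirects, and respect crawling rules; this sketch covers only the markup-to-text step.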


## Step 3: Ingest from Feeds

Ingest content from RSS/Atom feeds using the `FeedIngestor`.


In [None]:
from semantica.ingest import FeedIngestor

feed_ingestor = FeedIngestor()

print("Feed ingestion example:")
print("  feed_docs = feed_ingestor.ingest('https://example.com/feed.xml')")
print("\nNote: Actual feed ingestion requires valid feed URLs")
feed_docs = []
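
For readers who want to see what feed entries look like once they are turned into documents, the following sketch parses an inline RSS 2.0 string with the standard library. It is an illustration only: it does not call `FeedIngestor`, it does not handle Atom (which uses XML namespaces), and the function name `rss_items_to_docs` and the dictionary keys are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/posts/1</link>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/posts/2</link>
      <description>Summary of the second post.</description>
    </item>
  </channel>
</rss>
"""


def rss_items_to_docs(rss_xml: str) -> list:
    """Map each RSS <item> to a plain dict with a common set of keys."""
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "source": "feed",
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return docs


example_feed_docs = rss_items_to_docs(SAMPLE_RSS)
for doc in example_feed_docs:
    print(doc["title"], "->", doc["url"])
```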


## Step 4: Ingest from Streams

Ingest real-time data from streams using the `StreamIngestor`.


In [None]:
from semantica.ingest import StreamIngestor

stream_ingestor = StreamIngestor()

print("Stream ingestion example:")
print("  stream_docs = stream_ingestor.ingest(stream_source)")
print("\nNote: Stream ingestion requires configured stream sources (Kafka, RabbitMQ, etc.)")
stream_docs = []
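
Stream sources differ from the others in that messages arrive over time rather than all at once. The sketch below stands in for a broker with an in-memory `queue.Queue` and groups messages into small batches. It is not how `StreamIngestor` works and involves no Kafka or RabbitMQ client; `drain_in_batches`, the `None` sentinel, and the dictionary layout are assumptions made for illustration.

```python
import queue
from typing import Iterator, List


def drain_in_batches(source: queue.Queue, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of up to batch_size message documents until a None sentinel arrives."""
    batch = []
    while True:
        message = source.get()
        if message is None:          # sentinel: the producer is done
            break
        batch.append({"source": "stream", "content": message})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                  # flush the final partial batch


messages = queue.Queue()
for text in ["event 1", "event 2", "event 3"]:
    messages.put(text)
messages.put(None)

batches = list(drain_in_batches(messages, batch_size=2))
print([len(b) for b in batches])
```

Batching like this lets the later parsing step work on lists of documents, the same shape the other sources produce.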


## Step 5: Ingest from Databases

Ingest data from databases using the `DBIngestor`.


In [None]:
from semantica.ingest import DBIngestor

print("Database ingestion example:")
print("  db_ingestor = DBIngestor(connection_string='...')")
print("  db_docs = db_ingestor.ingest(query='SELECT * FROM table')")
print("\nNote: Database ingestion requires valid connection strings and queries")
db_docs = []
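
To make the query-to-documents idea concrete without a real connection string, this sketch uses an in-memory SQLite database from the standard library. It does not use `DBIngestor`, whose constructor and return type are not shown in this notebook beyond the printed example. The `articles` table, the `rows_to_docs` helper, and the choice of one column as the document text are all assumptions.

```python
import sqlite3


def rows_to_docs(conn: sqlite3.Connection, query: str, text_column: str) -> list:
    """Run a query and turn each row into a dict with text content plus metadata."""
    conn.row_factory = sqlite3.Row   # rows become addressable by column name
    docs = []
    for row in conn.execute(query):
        record = dict(row)
        content = record.pop(text_column)
        docs.append({"source": "db", "content": content, "metadata": record})
    return docs


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [("Intro", "Welcome text."), ("Update", "Release notes.")],
)
example_db_docs = rows_to_docs(
    conn, "SELECT id, title, body FROM articles ORDER BY id", "body"
)
conn.close()

for doc in example_db_docs:
    print(doc["metadata"]["title"], "->", doc["content"])
```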


## Step 6: Unified Processing

Combine all documents from different sources and process them through a unified pipeline.


In [None]:
from semantica.parse import DocumentParser

all_docs = []
if file_docs:
    all_docs.append(file_docs)
all_docs.extend(web_docs)
all_docs.extend(feed_docs)
all_docs.extend(stream_docs)
all_docs.extend(db_docs)

print(f"Total documents from all sources: {len(all_docs)}")

parser = DocumentParser()

parsed_docs = []
if all_docs:
    for doc in all_docs:
        # Skip documents that carry no text content
        content = getattr(doc, 'content', None)
        if not content:
            continue
        try:
            parsed_docs.append(parser.parse_document(content))
        except Exception as e:
            # Report the failure and continue with the remaining documents
            print(f"\n✗ Error processing document: {e}")

    print(f"\n✓ Processed {len(parsed_docs)} documents through unified pipeline")
else:
    print("\nNote: Add documents from various sources to see unified processing")

# Clean up the temporary sample file
try:
    if sample_file.exists():
        sample_file.unlink()
except OSError:
    pass
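
The combining step in this notebook mixes `append` (for a single document) with `extend` (for lists). One way to avoid tracking which is which is to normalize every ingestor result to a list first. The sketch below shows that idea with plain dictionaries standing in for documents; `as_list` and `combine` are hypothetical helpers, not part of Semantica.

```python
from collections import Counter
from itertools import chain


def as_list(result):
    """Normalize an ingestor result (None, a single document, or a sequence) to a list."""
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


def combine(*results):
    """Flatten any mix of single documents, lists, and None into one list."""
    return list(chain.from_iterable(as_list(r) for r in results))


file_result = {"source": "file", "content": "Sample file content."}  # single document
web_result = []                                                       # nothing ingested
feed_result = [
    {"source": "feed", "content": "First post."},
    {"source": "feed", "content": "Second post."},
]

combined = combine(file_result, web_result, feed_result, None)
per_source = Counter(doc["source"] for doc in combined)
print(len(combined), dict(per_source))
```

Counting documents per source, as `per_source` does here, is a cheap sanity check that every ingestor contributed what was expected before parsing begins.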
