# Data Ingestion

## Overview

This notebook shows how to ingest data with Semantica's ingestion modules. It covers five source types: files, web pages, databases, real-time streams, and RSS/Atom feeds.

### Learning Objectives

- Use `FileIngestor` to load files from local and cloud storage
- Use `WebIngestor` to scrape and crawl web content
- Use `DBIngestor` to extract data from databases
- Use `StreamIngestor` for real-time data streams
- Use `FeedIngestor` to process RSS/Atom feeds

---

## Step 1: File Ingestion

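Ingest a single file from the local filesystem or cloud storage. The example below writes a sample text file to a temporary directory and ingests it with `read_content=True`, so the file's text is available on the returned object.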


In [None]:
from semantica.ingest import FileIngestor
import tempfile
import os

file_ingestor = FileIngestor()

# Create a temporary directory with one sample file to ingest.
# temp_dir is reused in Step 2.
temp_dir = tempfile.mkdtemp()
sample_file = os.path.join(temp_dir, "sample.txt")

with open(sample_file, 'w') as f:
    f.write("Apple Inc. is a technology company. Tim Cook is the CEO.")

# read_content=True loads the file's text in addition to its metadata.
file_object = file_ingestor.ingest_file(sample_file, read_content=True)

print(f"Ingested file: {file_object.name}")
print(f"File type: {file_object.file_type}")
print(f"Size: {file_object.size} bytes")
print(f"Content preview: {file_object.content[:50]}...")
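
The attributes printed above (name, type, size, content) correspond to information the standard library can also report. The following is a minimal sketch of that idea using only `pathlib` and `mimetypes`; it does not reflect how `FileIngestor` is implemented, and the `metadata` dictionary and its keys are hypothetical names chosen for illustration.

```python
import mimetypes
import tempfile
from pathlib import Path

# Write a sample file to a fresh temporary directory.
demo_path = Path(tempfile.mkdtemp()) / "sample.txt"
demo_path.write_text(
    "Apple Inc. is a technology company. Tim Cook is the CEO.",
    encoding="utf-8",
)

# Collect basic metadata (hypothetical structure, not Semantica's file object).
metadata = {
    "name": demo_path.name,
    "extension": demo_path.suffix.lstrip("."),
    "mime_type": mimetypes.guess_type(demo_path.name)[0],
    "size": demo_path.stat().st_size,  # size in bytes
    "content": demo_path.read_text(encoding="utf-8"),
}

print(f"Name: {metadata['name']}")
print(f"Extension: {metadata['extension']}")
print(f"MIME type: {metadata['mime_type']}")
print(f"Size: {metadata['size']} bytes")
```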


## Step 2: Directory Ingestion

Ingest every file in a directory with a single call. This cell reuses `file_ingestor` and `temp_dir` from Step 1, so run that cell first.


In [None]:
# Add a second file so the directory contains more than one document.
file2 = os.path.join(temp_dir, "doc2.txt")
with open(file2, 'w') as f:
    f.write("Microsoft Corporation is a technology company. Satya Nadella is the CEO.")

# recursive=False restricts ingestion to the top level of temp_dir.
file_objects = file_ingestor.ingest_directory(temp_dir, recursive=False, read_content=True)

print(f"Ingested {len(file_objects)} files from directory")
for file_obj in file_objects:
    print(f"  - {file_obj.name} ({file_obj.file_type})")
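
The `recursive=False` argument above limits ingestion to the top level of the directory. To make that distinction concrete, here is a small sketch using only `pathlib`. It is not Semantica's implementation; `list_files` is a hypothetical helper written for this illustration.

```python
import tempfile
from pathlib import Path


def list_files(directory, recursive=False):
    """Return sorted file paths under `directory` (hypothetical helper)."""
    root = Path(directory)
    # rglob("*") descends into subdirectories; glob("*") stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file())


# Build a small tree: two top-level files and one nested file.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.txt").write_text("top-level file")
(demo_dir / "b.txt").write_text("another top-level file")
(demo_dir / "nested").mkdir()
(demo_dir / "nested" / "c.txt").write_text("nested file")

flat = list_files(demo_dir, recursive=False)
deep = list_files(demo_dir, recursive=True)

print(f"Non-recursive: {[p.name for p in flat]}")
print(f"Recursive:     {[p.name for p in deep]}")
```

The non-recursive listing skips `nested/c.txt`, while the recursive one includes it.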


## Step 3: Web Ingestion

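Fetch a web page and extract its content. The call needs network access, so it is wrapped in `try`/`except` to keep the notebook running when you are offline.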


In [None]:
from semantica.ingest import WebIngestor

web_ingestor = WebIngestor()

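try:
    # Fetch the page and report its URL, title, and extracted text length.
    web_content = web_ingestor.ingest_url("https://example.com")
    print(f"Ingested web page: {web_content.url}")
    print(f"Title: {web_content.title}")
    print(f"Content length: {len(web_content.text)} characters")
except Exception as e:
    # Any failure (for example, no network) is reported instead of raised.
    print(f"Web ingestion failed (requires internet access): {e}")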
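
Because the cell above depends on network access, it can help to see the title and text extraction step on a fixed HTML string. The sketch below uses the standard library's `html.parser` and runs offline. It only illustrates the general idea of pulling a title and visible text out of HTML; `TitleAndTextParser` and `sample_html` are hypothetical names, and `WebIngestor` may work quite differently.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> and visible text of an HTML document (illustrative)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            # Keep only non-empty text that is outside <script> and <style>.
            self._chunks.append(data.strip())

    @property
    def text(self):
        return " ".join(self._chunks)


sample_html = """
<html>
  <head><title>Example Domain</title><style>body { margin: 0; }</style></head>
  <body><h1>Example Domain</h1><p>This domain is for use in examples.</p></body>
</html>
"""

parser = TitleAndTextParser()
parser.feed(sample_html)
parser.close()

print(f"Title: {parser.title}")
print(f"Content length: {len(parser.text)} characters")
```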


## Step 4: Database Ingestion

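Ingest rows from a database table or from the result of a query. The cell below only instantiates `DBIngestor`; connection settings depend on your database and are not configured here.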


In [None]:
from semantica.ingest import DBIngestor

db_ingestor = DBIngestor()

print("DBIngestor initialized")
print("To use: Configure database connection and call ingest_table() or ingest_query()")
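
The cell above stops short of an actual connection. To show the difference between reading a whole table and reading a query result, here is a self-contained sketch using the standard library's `sqlite3` with an in-memory database. It does not call `DBIngestor`, whose connection parameters and method signatures are not shown in this notebook; the `companies` table and its columns are made up for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts by column name

conn.execute("CREATE TABLE companies (name TEXT, ceo TEXT)")
conn.executemany(
    "INSERT INTO companies (name, ceo) VALUES (?, ?)",
    [("Apple Inc.", "Tim Cook"), ("Microsoft Corporation", "Satya Nadella")],
)
conn.commit()

# Table-style ingestion: read every row of the table.
table_rows = [
    dict(row)
    for row in conn.execute("SELECT name, ceo FROM companies ORDER BY name")
]

# Query-style ingestion: read only matching rows, using a bound parameter.
query_rows = [
    dict(row)
    for row in conn.execute(
        "SELECT name FROM companies WHERE ceo = ?", ("Tim Cook",)
    )
]
conn.close()

print(f"Table rows: {len(table_rows)}")
print(f"Query rows: {query_rows}")
```

Binding values with `?` placeholders rather than formatting them into the SQL string avoids injection problems when the filter comes from user input.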


## Step 5: Stream Ingestion

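Ingest data from real-time streams such as message queues. The cell below only instantiates `StreamIngestor`; a running stream source is needed before any messages can be consumed.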


In [None]:
from semantica.ingest import StreamIngestor

stream_ingestor = StreamIngestor()

print("StreamIngestor initialized")
print("To use: Configure stream source (Kafka, RabbitMQ, etc.) and start consuming")
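
Running a broker such as Kafka or RabbitMQ is outside the scope of this notebook, but the consume loop at the heart of stream ingestion can be sketched with the standard library. In the example below a `queue.Queue` stands in for the stream and a background thread stands in for the producer. The `produce` and `consume` functions and the `_STOP` sentinel are hypothetical and are not part of `StreamIngestor`.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the stream


def produce(out_queue, messages):
    """Put each message on the queue, then signal completion."""
    for message in messages:
        out_queue.put(message)
    out_queue.put(_STOP)


def consume(in_queue, handler, timeout=1.0):
    """Pass each message to `handler` until the sentinel arrives.

    Returns the number of messages handled.
    """
    processed = 0
    while True:
        try:
            message = in_queue.get(timeout=timeout)
        except queue.Empty:
            break  # stop if the stream stays quiet longer than `timeout`
        if message is _STOP:
            break
        handler(message)
        processed += 1
    return processed


stream = queue.Queue()
received = []

producer = threading.Thread(
    target=produce, args=(stream, ["event-1", "event-2", "event-3"])
)
producer.start()
count = consume(stream, received.append)
producer.join()

print(f"Consumed {count} messages: {received}")
```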


## Step 6: Feed Ingestion

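Ingest an RSS or Atom feed and list its items. Like Step 3, this call needs network access and is wrapped in `try`/`except`.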


In [None]:
from semantica.ingest import FeedIngestor

feed_ingestor = FeedIngestor()

try:
    # Fetch the feed and report its title and item count.
    feed_data = feed_ingestor.ingest_feed("https://feeds.feedburner.com/oreilly/radar")
    print(f"Ingested feed: {feed_data.title}")
    print(f"Items: {len(feed_data.items)}")
    if feed_data.items:
        print(f"First item: {feed_data.items[0].title}")
except Exception as e:
    # Any failure (for example, no network) is reported instead of raised.
    print(f"Feed ingestion failed (requires internet access): {e}")
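
To see what a feed's title and items look like without a network connection, the sketch below parses a small RSS 2.0 document with the standard library's `xml.etree.ElementTree`. The feed content is invented for the example, and the resulting `feed_title` and `items` variables are illustrative only; they are not the objects returned by `FeedIngestor`.

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS 2.0 document.
sample_rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>
"""

# In RSS 2.0 the feed metadata and the items live under <channel>.
channel = ET.fromstring(sample_rss).find("channel")
feed_title = channel.findtext("title")
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in channel.findall("item")
]

print(f"Feed: {feed_title}")
print(f"Items: {len(items)}")
if items:
    print(f"First item: {items[0]['title']}")
```

Atom feeds use a different layout (namespaced `<feed>` and `<entry>` elements), so this sketch applies to RSS 2.0 only.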


## Summary

You've learned how to ingest data from multiple sources:

- **FileIngestor**: Local files and directories
- **WebIngestor**: Web pages and URLs
- **DBIngestor**: Database tables and queries
- **StreamIngestor**: Real-time data streams
- **FeedIngestor**: RSS/Atom feeds

Next: Learn how to parse the ingested data in the Document_Parsing notebook.
