96 changes: 96 additions & 0 deletions libs/admin-api-lib/README.md
@@ -0,0 +1,96 @@
# admin-api-lib

Document lifecycle orchestration for the STACKIT RAG template. This library exposes a FastAPI-compatible admin surface that receives raw user content, coordinates extraction, summarisation, chunking, and storage, and finally hands normalized information pieces to the core RAG API.

It powers the [`services/admin-backend`](https://github.com/stackitcloud/rag-template/tree/main/services/admin-backend) deployment and is the primary integration point for operators managing their document corpus.

## Responsibilities

1. **Ingestion** – Accept files or external sources from the admin UI or API clients.
2. **Extraction** – Call `extractor-api-lib` to obtain normalized information pieces.
3. **Enhancement** – Summarize and enrich content using LLMs and tracing hooks from `rag-core-lib`.
4. **Chunking** – Split content via recursive or semantic strategies before vectorization.
5. **Persistence** – Store raw assets in S3-compatible storage and push processed chunks to `rag-core-api`.
6. **Status tracking** – Keep track of upload progress and expose document status endpoints backed by KeyDB/Redis.
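
To illustrate the chunking step (item 4), here is a minimal sketch of how a maximum size and an overlap interact. It is a plain fixed-size sliding window, not the recursive or semantic strategy the library ships, and `chunk_text` is a hypothetical helper that does not exist in `admin-api-lib`.

```python
def chunk_text(text: str, max_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most `max_size` characters.

    Consecutive windows share `overlap` characters, so content cut at a
    boundary still appears intact in one of the two neighbouring chunks.
    The default values are illustrative assumptions.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_size")

    step = max_size - overlap  # how far the window advances each iteration
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

A larger overlap yields more chunks for the same input, which increases storage and embedding cost but reduces the chance that a relevant passage is split across two vectors.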

## Feature highlights

- Ready-to-wire dependency-injector container with sensible defaults for S3 storage, KeyDB status tracking, and background tasks.
- Pluggable chunkers (`recursive` vs `semantic`) and summariser implementations with shared retry/backoff controls.
- Rich Pydantic request/response models covering uploads, non-file sources, and document status queries.
- Thin endpoint implementations that can be swapped or extended while keeping the public API stable.
- Structured tracing (Langfuse) and logging that mirror the behaviour of the chat backend.

## Installation

```bash
pip install admin-api-lib
```

Requires Python 3.13 and `rag-core-lib`.

## Module tour

- `dependency_container.py` – Configures and wires dependency-injection providers. Override registrations here to customise behaviour.
- `api_endpoints/` & `impl/api_endpoints/` – Endpoints + abstractions for file uploads, source uploads, deletions, document status, and reference retrieval.
- `apis/` – Admin API abstractions and implementations.
- `chunker/` & `impl/chunker/` – Abstractions + default text/semantic chunkers and chunker type selection class.
- `extractor_api_client/` & `rag_backend_client/` – Generated OpenAPI clients to talk to the extractor and rag core API services.
- `file_services/` & `impl/file_services/` – Abstract and default S3 interface.
- `summarizer/` & `impl/summarizer/` – Interfaces and LangChain-based summariser that leverage shared retry logic.
- `information_enhancer/` & `impl/information_enhancer/` – Abstractions plus page and summary enhancers, which are coordinated by a general enhancer.
- `impl/key_db/` – KeyDB/Redis client implementation for document status tracking.
- `impl/mapper/` – Mapper between extractor documents and LangChain documents.
- `impl/settings/` – Configuration settings for dependency injection container components.
- `prompt_templates/` – Default summarisation prompt shipped with the template.
- `utils/` – Utility functions and classes.

## Endpoints provided

- `POST /upload_file` – Uploads user-selected files.
- `POST /upload_source` – Uploads user-selected non-file sources.
- `DELETE /documents/{identification}` – Deletes a document from the system.
- `GET /document_reference/{identification}` – Retrieves a document reference.
- `GET /all_documents_status` – Retrieves the status of all documents.

Refer to [`libs/README.md`](../README.md#2-admin-api-lib) for in-depth API documentation.

## Configuration overview

All settings are powered by `pydantic-settings`, so you can use environment variables or instantiate classes manually:

- `S3_*` (`S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, `S3_BUCKET`) – configure storage for raw uploads.
- `DOCUMENT_EXTRACTOR_HOST` – base URL of the extractor service.
- `RAG_API_HOST` – base URL of the rag-core API.
- `CHUNKER_CLASS_TYPE_CHUNKER_TYPE` – choose `recursive` (default) or `semantic` chunking.
- `CHUNKER_*` (`CHUNKER_MAX_SIZE`, `CHUNKER_OVERLAP`, `CHUNKER_BREAKPOINT_THRESHOLD_TYPE`, …) – fine-tune chunking behaviour.
- `SUMMARIZER_MAXIMUM_INPUT_SIZE`, `SUMMARIZER_MAXIMUM_CONCURRENCY`, `SUMMARIZER_MAX_RETRIES`, etc. – tune summariser limits and retry behaviour.
- `SOURCE_UPLOADER_TIMEOUT` – adjust how long non-file source ingestions wait before timing out.
- `USECASE_KEYVALUE_HOST` / `USECASE_KEYVALUE_PORT` – configure the KeyDB/Redis instance that persists document status.

The Helm chart forwards these values through `adminBackend.envs.*`, keeping deployments declarative. Local development can rely on `.env` as described in the repository root README.
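
As a rough illustration of how environment variables map onto typed settings, the sketch below reads three of the chunker variables listed above. It uses only the standard library instead of `pydantic-settings`; `ChunkerConfig` and `load_chunker_config` are hypothetical names, and the numeric defaults are assumptions rather than the library's actual defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkerConfig:
    """Hypothetical container for chunker settings (not the library's class)."""

    chunker_type: str
    max_size: int
    overlap: int


def load_chunker_config(env: Mapping[str, str] = os.environ) -> ChunkerConfig:
    """Build a ChunkerConfig from environment-style key/value pairs."""
    # `recursive` is the documented default; only two values are accepted.
    chunker_type = env.get("CHUNKER_CLASS_TYPE_CHUNKER_TYPE", "recursive")
    if chunker_type not in ("recursive", "semantic"):
        raise ValueError(f"Unsupported chunker type: {chunker_type!r}")

    return ChunkerConfig(
        chunker_type=chunker_type,
        max_size=int(env.get("CHUNKER_MAX_SIZE", "1000")),  # assumed default
        overlap=int(env.get("CHUNKER_OVERLAP", "100")),  # assumed default
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in unit tests without touching the process environment.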

## Typical usage

```python
from admin_api_lib.main import app as perfect_admin_app
```

The admin frontend (`services/frontend` → Admin app) and automation scripts talk to these endpoints to manage the corpus. Downstream, `rag-core-api` receives the processed information pieces and stores them in the vector database.

## Extending the library

1. Implement a new interface (e.g., `Chunker`, `Summarizer`, `FileService`).
2. Register it in `dependency_container.py` or override via dependency-injector in your service.
3. Update or add API endpoints if you expose new capabilities.
4. Cover the new behaviour with pytest-based unit tests under `libs/admin-api-lib/tests`.

Because components depend on interfaces defined here, downstream services can swap behavior without modifying the public API surface.
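
The following sketch shows the general pattern behind this: calling code is written against an abstract interface, so an alternative implementation can be dropped in. The `FileService` class here is a simplified stand-in, and its `upload`/`download` method names and signatures are assumptions, not the library's actual interface.

```python
from abc import ABC, abstractmethod


class FileService(ABC):
    """Simplified stand-in for a storage interface (method names are assumed)."""

    @abstractmethod
    def upload(self, name: str, data: bytes) -> None:
        """Persist `data` under `name`."""

    @abstractmethod
    def download(self, name: str) -> bytes:
        """Return the bytes previously stored under `name`."""


class InMemoryFileService(FileService):
    """Keeps files in a dict; handy as a fake storage provider in tests."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def upload(self, name: str, data: bytes) -> None:
        self._files[name] = data

    def download(self, name: str) -> bytes:
        return self._files[name]


def store_and_verify(file_service: FileService, name: str, data: bytes) -> bool:
    """Caller that only knows the interface, not the concrete backend."""
    file_service.upload(name, data)
    return file_service.download(name) == data
```

In a real service the concrete class would be registered in the dependency-injector container rather than passed by hand; the hand-wired version above only illustrates why the calling code does not need to change.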

## Contributing

Ensure new endpoints or adapters remain thin and defer to [`rag-core-lib`](../rag-core-lib/) for shared logic. Run `poetry run pytest` and the configured linters before opening a PR. For further instructions see the [Contributing Guide](https://github.com/stackitcloud/rag-template/blob/main/CONTRIBUTING.md).

## License

Licensed under the project license. See the root [`LICENSE`](https://github.com/stackitcloud/rag-template/blob/main/LICENSE) file.
4 changes: 2 additions & 2 deletions libs/admin-api-lib/poetry.lock


13 changes: 11 additions & 2 deletions libs/admin-api-lib/pyproject.toml
@@ -4,10 +4,19 @@ build-backend = "poetry.core.masonry.api"

[tool.poetry]
name = "admin-api-lib"
version = "1.0.1"
version = "v3.2.1"
description = "The admin backend is responsible for the document management. This includes deletion, upload and returning the source document."
authors = ["STACKIT Data and AI Consulting <data-ai-consulting@stackit.cloud>"]
authors = [
"STACKIT GmbH & Co. KG <data-ai@stackit.cloud>",
]
maintainers = [
"Andreas Klos <andreas.klos@stackit.cloud>",
]
packages = [{ include = "admin_api_lib", from = "src" }]
readme = "README.md"
license = "Apache-2.0"
repository = "https://github.com/stackitcloud/rag-template"
homepage = "https://pypi.org/project/admin-api-lib"

[tool.flake8]
exclude= [".eggs", "./libs/*", "./src/admin_api_lib/models/*", "./src/admin_api_lib/rag_backend_client/*", "./src/admin_api_lib/extractor_api_client/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py"]
94 changes: 94 additions & 0 deletions libs/extractor-api-lib/README.md
@@ -0,0 +1,94 @@
# extractor-api-lib

Content ingestion layer for the STACKIT RAG template. This library exposes a FastAPI extraction service that ingests raw documents (files or remote sources), extracts their content, converts it to an internal representation, and hands the output to [`admin-api-lib`](../admin-api-lib/).

## Responsibilities

- Receive binary uploads and remote source descriptors from the admin backend.
- Route each request through the appropriate extractor (file, sitemap, Confluence, etc.).
- Convert extracted fragments into the shared `InformationPiece` schema expected by downstream services.

## Feature highlights

- **Broad format coverage** – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
- **Consistent output schema** – Information pieces are returned in a unified structure with content type (`TEXT`, `TABLE`, `IMAGE`) and metadata.
- **Swappable extractors** – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
- **Production-grade plumbing** – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.
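
To make the "consistent output schema" concrete, here is a minimal sketch of what such a record could look like. The three content types come from the list above; the field names (`page_content`, `type`, `metadata`) and the use of a dataclass are assumptions for illustration, not the library's actual model definition.

```python
from dataclasses import dataclass, field
from enum import Enum


class ContentType(str, Enum):
    """Content types named in this README."""

    TEXT = "TEXT"
    TABLE = "TABLE"
    IMAGE = "IMAGE"


@dataclass
class InformationPiece:
    """Illustrative shape of one extracted fragment (field names are assumed)."""

    page_content: str
    type: ContentType
    metadata: dict[str, str] = field(default_factory=dict)


# Example: a table extracted from a PDF, with hypothetical metadata keys.
piece = InformationPiece(
    page_content="| year | revenue |\n|---|---|\n| 2024 | 10 |",
    type=ContentType.TABLE,
    metadata={"document": "report.pdf", "page": "3"},
)
```

Because every extractor emits the same structure, downstream steps such as chunking and summarisation do not need format-specific branches.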

## Installation

```bash
pip install extractor-api-lib
```

Python 3.13 is required. OCR and computer-vision features expect system packages such as `ffmpeg`, `poppler-utils`, and `tesseract` (see `services/document-extractor/README.md` for the full list).

## Module tour

- `dependency_container.py` – Central dependency-injector wiring. Override providers here to plug in custom extractors, endpoints etc.
- `api_endpoints/` & `impl/api_endpoints/` – Thin FastAPI endpoint abstractions and implementations for file and source (such as Confluence and sitemap) extractors.
- `apis/` – Extractor API abstractions and implementations.
- `extractors/` & `impl/extractors/` – Format-specific logic (PDF, DOCX, PPTX, XML, EPUB, Confluence, sitemap) packaged behind the `InformationExtractor`/`InformationFileExtractor` interfaces.
- `mapper/` & `impl/mapper/` – Abstractions and implementations that map between LangChain documents and the internal and external information piece representations.
- `file_services/` – Default S3-compatible storage adapter; replace it if you store files elsewhere.
- `impl/settings/` – Configuration settings for dependency injection container components.
- `table_converter/` & `impl/table_converter/` – Abstractions and implementations to convert `pandas.DataFrame` to markdown and vice versa.
- `impl/types/` – Enums for content, extractor, and file types.
- `impl/utils/` – Helper functions for hashed datetimes, sitemap crawling, header injection, and custom metadata parsing.

## Endpoints provided

- `POST /extract_from_file` – Downloads the file from S3, extracts its contents, and returns normalized `InformationPiece` records.
- `POST /extract_from_source` – Pulls from remote sources (Confluence, sitemap) using credentials and further optional kwargs.

Both endpoints stream their results back to `admin-api-lib`, which takes care of enrichment and persistence.

## How the file extraction endpoint works

1. Download the file from S3
2. Choose a suitable file extractor based on the filename ending
3. Extract the content from the file
4. Map the internal representation to the external schema
5. Return the final output
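
The five steps above can be sketched roughly as follows. Everything in this block is hypothetical: the `FileExtractor` dataclass, the stub extractors, and the `download` callable stand in for the real extractor classes and the S3 file service, and the output dictionaries only approximate the external schema.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class FileExtractor:
    """Hypothetical extractor: the suffixes it supports plus an extract function."""

    suffixes: tuple[str, ...]
    extract: Callable[[bytes], list[str]]


# Stub extractors that only report what they would have processed.
FILE_EXTRACTORS = [
    FileExtractor((".pdf",), lambda data: [f"pdf text ({len(data)} bytes)"]),
    FileExtractor((".docx", ".pptx"), lambda data: [f"office text ({len(data)} bytes)"]),
]


def choose_file_extractor(filename: str) -> FileExtractor:
    """Step 2: pick the first extractor that supports the filename ending."""
    suffix = Path(filename).suffix.lower()
    for extractor in FILE_EXTRACTORS:
        if suffix in extractor.suffixes:
            return extractor
    raise ValueError(f"No extractor registered for {suffix!r}")


def extract_from_file(filename: str, download: Callable[[str], bytes]) -> list[dict[str, str]]:
    data = download(filename)                    # 1. download the file
    extractor = choose_file_extractor(filename)  # 2. choose the extractor
    internal = extractor.extract(data)           # 3. extract the content
    # 4. map the internal representation to an external-style record
    external = [{"type": "TEXT", "page_content": text} for text in internal]
    return external                              # 5. return the final output
```

Supporting a new format in this sketch means appending one more `FileExtractor` to the list; the selection and mapping code stays unchanged.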

## How the source extraction endpoint works

1. Choose a suitable source extractor based on the source type
2. Pull the source content using the provided credentials and parameters
3. Extract the content from the source
4. Map the internal representation to the external schema
5. Return the final output
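
A comparable sketch for source extraction uses a dictionary keyed by source type. The extractor functions and the parameter keys (`web_path`, `url`, `space_key`) are made up for illustration and are not taken from the library's request model.

```python
from typing import Callable

SourceExtractor = Callable[[dict[str, str]], list[str]]


def extract_sitemap(params: dict[str, str]) -> list[str]:
    """Stub: a real extractor would crawl the sitemap at params['web_path']."""
    return [f"page listed in {params['web_path']}"]


def extract_confluence(params: dict[str, str]) -> list[str]:
    """Stub: a real extractor would authenticate and pull pages from a space."""
    return [f"page from space {params['space_key']} at {params['url']}"]


# Hypothetical registry mapping a source type to its extractor.
SOURCE_EXTRACTORS: dict[str, SourceExtractor] = {
    "sitemap": extract_sitemap,
    "confluence": extract_confluence,
}


def extract_from_source(source_type: str, params: dict[str, str]) -> list[dict[str, str]]:
    # 1. choose the extractor for the requested source type
    try:
        extractor = SOURCE_EXTRACTORS[source_type]
    except KeyError:
        raise ValueError(f"Unknown source type: {source_type!r}") from None
    # 2-3. pull and extract the content using the supplied parameters
    internal = extractor(params)
    # 4-5. map to an external-style record and return it
    return [{"type": "TEXT", "page_content": text} for text in internal]
```

Since credentials and URLs arrive with each call instead of being read from configuration, the same process can serve several sources without a restart.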

## Configuration overview

Two `pydantic-settings` models ship with this package:

- **S3 storage** (`S3Settings`) – configure the built-in file service with `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, and `S3_BUCKET`.
- **PDF extraction** (`PDFExtractorSettings`) – adjust footer trimming or diagram export via `PDF_EXTRACTOR_FOOTER_HEIGHT` and `PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME`.

Other extractors accept their parameters at runtime through the request payload (`ExtractionParameters`). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls `/extract_from_source`. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.

The Helm chart exposes the environment variables mentioned above under `documentExtractor.envs.*` so production deployments remain declarative.

## Typical usage

```python
from extractor_api_lib.main import app as perfect_extractor_app
```

`admin-api-lib` calls `/extract_from_file` and `/extract_from_source` to populate the ingestion pipeline.

## Extending the library

1. Implement `InformationFileExtractor` or `InformationExtractor` for your new format/source.
2. Register it in `dependency_container.py` (append it to the `file_extractors` list or add it to the `source_extractors` dict).
3. Update mapper or metadata handling if additional fields are required.
4. Add unit tests under `libs/extractor-api-lib/tests` using fixtures and fake storage providers.

## Contributing

Ensure new endpoints or adapters remain thin and defer to [`rag-core-lib`](../rag-core-lib/) for shared logic. Run `poetry run pytest` and the configured linters before opening a PR. For further instructions see the [Contributing Guide](https://github.com/stackitcloud/rag-template/blob/main/CONTRIBUTING.md).

## License

Licensed under the project license. See the root [`LICENSE`](https://github.com/stackitcloud/rag-template/blob/main/LICENSE) file.
16 changes: 6 additions & 10 deletions libs/extractor-api-lib/poetry.lock


17 changes: 13 additions & 4 deletions libs/extractor-api-lib/pyproject.toml
@@ -4,10 +4,19 @@ build-backend = "poetry.core.masonry.api"

[tool.poetry]
name = "extractor_api_lib"
version = "1.0.1"
version = "v3.2.1"
description = "Extracts the content of documents, websites, etc and maps it to a common format."
authors = ["STACKIT Data and AI Consulting <data-ai-consulting@stackit.cloud>"]
authors = [
"STACKIT GmbH & Co. KG <data-ai@stackit.cloud>",
]
maintainers = [
"Andreas Klos <andreas.klos@stackit.cloud>",
]
packages = [{ include = "extractor_api_lib", from = "src" }]
readme = "README.md"
license = "Apache-2.0"
repository = "https://github.com/stackitcloud/rag-template"
homepage = "https://pypi.org/project/extractor-api-lib"

[[tool.poetry.source]]
name = "pytorch_cpu"
@@ -70,7 +79,7 @@ max-line-length = 120
python = "^3.13"
wheel = "^0.45.1"
botocore = "^1.38.10"
fasttext = {git = "https://github.com/cfculhane/fastText", rev = "4a4451337ae6b476b9c584b97776c8c3eb4b27c5"}
fasttext = "^0.9.3"
pytesseract = "^0.3.10"
fastapi = "^0.118.0"
uvicorn = "^0.37.0"
@@ -136,7 +145,7 @@ black = "^25.1.0"
httpx = "^0.28.1"

[tool.pytest.ini_options]
log_cli = 1
log_cli = true
log_cli_level = "DEBUG"
pythonpath = "src"
testpaths = "src/tests"