From 77015f9c4855472804a366b795ec2af7f8e71e21 Mon Sep 17 00:00:00 2001
From: Luke Kim <80174+lukekim@users.noreply.github.com>
Date: Mon, 24 Mar 2025 08:42:37 -0700
Subject: [PATCH 1/3] Add memory

---
 website/docs/reference/memory.md              | 86 +++++++++++++++++++
 website/docs/reference/system_requirements.md |  4 +-
 2 files changed, 87 insertions(+), 3 deletions(-)
 create mode 100644 website/docs/reference/memory.md

diff --git a/website/docs/reference/memory.md b/website/docs/reference/memory.md
new file mode 100644
index 000000000..12df53219
--- /dev/null
+++ b/website/docs/reference/memory.md
@@ -0,0 +1,86 @@
+---
+title: 'Managing Memory Usage'
+sidebar_label: 'Memory'
+sidebar_position: 31
+description: 'Guidelines and best practices for managing memory usage and optimizing performance in Spice.ai Open Source deployments.'
+keywords:
+  - memory
+pagination_prev: null
+pagination_next: null
+---
+
+Effective memory management is critical for optimal performance and stability in Spice.ai Open Source deployments. This guide provides clear recommendations and best practices for managing memory usage.
+
+## General Memory Recommendations
+
+Memory requirements depend on workload characteristics, dataset sizes, query complexity, and refresh modes. Recommended allocations:
+
+- Typical workloads: at least 8 GB RAM.
+- Larger datasets:
+  - `refresh_mode: full`: 2.5x dataset size.
+  - `refresh_mode: append`: 1.5x dataset size.
+  - `refresh_mode: changes`: Primarily influenced by CDC event volume and frequency; 1.5x dataset size is a reasonable estimate.
+
+## Refresh Modes and Memory Implications
+
+Refresh modes directly impact memory usage:
+
+- **Full Refresh**: Loads data into a temporary table, then atomically swaps it with the existing table. Requires memory for both tables simultaneously, resulting in higher usage.
+- **Append Refresh**: Incrementally inserts or upserts data, using memory only for incremental data, significantly reducing memory usage.
+- **Changes Refresh**: Applies CDC events incrementally, with memory usage primarily influenced by incoming event volume and frequency, typically resulting in lower and predictable usage.
+
+## DataFusion Memory Management
+
+Spice.ai uses DataFusion as its query execution engine. DataFusion does not enforce strict memory limits by default, potentially causing unbounded memory usage. Spice.ai mitigates this through:
+
+- **Memory Budgeting**: Limits memory per query execution. Queries exceeding this budget return an error. See [Spicepod Configuration](spicepod/index.md).
+- **Spill-to-Disk**: Operators such as Sort, Join, and GroupByHash spill intermediate results to disk when memory limits are exceeded, preventing out-of-memory errors.
+
+## Embedded Data Accelerators
+
+Spice.ai supports embedded accelerators like [SQLite](/website/docs/components/data-accelerators/sqlite.md) and [DuckDB](/website/docs/components/data-accelerators/duckdb.md), each with distinct memory considerations:
+
+- **SQLite**: Lightweight and memory-efficient, suitable for smaller datasets. Does not support intermediate spilling; datasets should fit comfortably in memory or use application-level paging.
+- **DuckDB**: Optimized for larger datasets and complex queries. Manages memory through streaming execution, intermediate spilling, and buffer management. By default, DuckDB instances use up to 80% of available system memory. Consolidate multiple datasets into a single DuckDB instance to avoid excessive cumulative memory usage:
+
+```yaml
+acceleration:
+  engine: duckdb
+  params:
+    duckdb_file: '/data/shared_duckdb_instance.db'
+    duckdb_memory_limit: '4G'
+```
+
+Configure DuckDB temporary directories and limits as follows:
+
+```sql
+SET temp_directory = '/tmp/duckdb_swap';
+SET max_temp_directory_size = '100GB';
+```
+
+For detailed DuckDB memory management, refer to the [DuckDB Memory Management Guide](https://duckdb.org/docs/operations_manual/limits.html).
+
+## Kubernetes Memory Configuration
+
+Set appropriate memory requests and limits in Kubernetes pod specifications:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: spice-ai-pod
+spec:
+  containers:
+    - name: spice-ai-container
+      image: spiceai/spiceai:latest-models
+      resources:
+        requests:
+          memory: '8Gi'
+          cpu: '4'
+```
+
+## Monitoring and Profiling
+
+Regularly monitor and profile memory usage with observability tools to identify and address potential memory bottlenecks promptly.
+
+Following these recommendations helps developers effectively manage memory resources, ensuring Spice.ai deployments remain performant, stable, and reliable.
diff --git a/website/docs/reference/system_requirements.md b/website/docs/reference/system_requirements.md
index 0117fb7f7..b46ded480 100644
--- a/website/docs/reference/system_requirements.md
+++ b/website/docs/reference/system_requirements.md
@@ -80,9 +80,7 @@ Spice resource requirements, particularly memory, are highly dependent on worklo
 | `refresh_mode: full` | 2.5x the dataset size |
 | `refresh_mode: append` | 1.5x the dataset size |

-### DuckDB Memory Usage
-
-When using DuckDB (as an accelerator or connector), it by default uses 80% of available memory. For more details, refer to the [DuckDB operations manual limits](https://duckdb.org/docs/operations_manual/limits.html).
+See [Memory Management and Best Pratices](memory.md) for a detailed guide on memory considerations.

 ## Additional Considerations

From 7af91f53ce78a1b1d6ca3c7b6a5d5e1707715cf3 Mon Sep 17 00:00:00 2001
From: Luke Kim <80174+lukekim@users.noreply.github.com>
Date: Mon, 24 Mar 2025 13:52:53 -0700
Subject: [PATCH 2/3] Fixes for Spicepod

---
 spicepod.yaml | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/spicepod.yaml b/spicepod.yaml
index 0be786a9d..6a28457b6 100755
--- a/spicepod.yaml
+++ b/spicepod.yaml
@@ -65,7 +65,7 @@ datasets:
           overlap_size: 128
           trim_whitespace: true

-  - from: github:github.com/spiceai/blog/files/trunk
+  - from: github:github.com/spiceai/docs/files/trunk
     name: spiceai.blog
     description: Spice.ai OSS blog posts
     metadata:
@@ -77,7 +77,7 @@ datasets:
       github_client_id: ${secrets:GITHUB_CLIENT_ID}
       github_private_key: ${secrets:GITHUB_PRIVATE_KEY}
       github_installation_id: ${secrets:GITHUB_INSTALLATION_ID}
-      include: 'content/posts/**/*.md'
+      include: 'website/blog/**/*.md'
     acceleration:
       enabled: true
       refresh_check_interval: 4h
@@ -159,6 +159,8 @@ models:
         **Write clearly, concisely, and precisely for a developer audience.**
         Use objective third-person tense. Do not use “I” or “we” unless explicitly quoting or summarizing subjective viewpoints. Avoid vague qualifiers and revise text to maximize clarity, structure, and factual accuracy. Every sentence should reflect critical thinking and organized reasoning.

+        Ensure that all documentation adheres to the project's style guide and includes relevant examples where applicable.
+
         **Be specific. Prioritize detail without being verbose.**
         Use plain, direct language. Avoid jargon, exaggeration, and promotional tone.

@@ -174,3 +176,5 @@ models:
         **Always provide citations and references with links when applicable.**

         Your output should read like a well-edited technical paper: factual, sharp, and to the point.
+
+        Always output Markdown in plain text for copy.
From d5ec815db5f6538fbcefeaa0ab1c9162cacf1c9e Mon Sep 17 00:00:00 2001
From: Luke Kim <80174+lukekim@users.noreply.github.com>
Date: Mon, 24 Mar 2025 13:53:27 -0700
Subject: [PATCH 3/3] Revert "Add memory"

This reverts commit 77015f9c4855472804a366b795ec2af7f8e71e21.
---
 website/docs/reference/memory.md              | 86 -------------------
 website/docs/reference/system_requirements.md |  4 +-
 2 files changed, 3 insertions(+), 87 deletions(-)
 delete mode 100644 website/docs/reference/memory.md

diff --git a/website/docs/reference/memory.md b/website/docs/reference/memory.md
deleted file mode 100644
index 12df53219..000000000
--- a/website/docs/reference/memory.md
+++ /dev/null
@@ -1,86 +0,0 @@
----
-title: 'Managing Memory Usage'
-sidebar_label: 'Memory'
-sidebar_position: 31
-description: 'Guidelines and best practices for managing memory usage and optimizing performance in Spice.ai Open Source deployments.'
-keywords:
-  - memory
-pagination_prev: null
-pagination_next: null
----
-
-Effective memory management is critical for optimal performance and stability in Spice.ai Open Source deployments. This guide provides clear recommendations and best practices for managing memory usage.
-
-## General Memory Recommendations
-
-Memory requirements depend on workload characteristics, dataset sizes, query complexity, and refresh modes. Recommended allocations:
-
-- Typical workloads: at least 8 GB RAM.
-- Larger datasets:
-  - `refresh_mode: full`: 2.5x dataset size.
-  - `refresh_mode: append`: 1.5x dataset size.
-  - `refresh_mode: changes`: Primarily influenced by CDC event volume and frequency; 1.5x dataset size is a reasonable estimate.
-
-## Refresh Modes and Memory Implications
-
-Refresh modes directly impact memory usage:
-
-- **Full Refresh**: Loads data into a temporary table, then atomically swaps it with the existing table. Requires memory for both tables simultaneously, resulting in higher usage.
-- **Append Refresh**: Incrementally inserts or upserts data, using memory only for incremental data, significantly reducing memory usage.
-- **Changes Refresh**: Applies CDC events incrementally, with memory usage primarily influenced by incoming event volume and frequency, typically resulting in lower and predictable usage.
-
-## DataFusion Memory Management
-
-Spice.ai uses DataFusion as its query execution engine. DataFusion does not enforce strict memory limits by default, potentially causing unbounded memory usage. Spice.ai mitigates this through:
-
-- **Memory Budgeting**: Limits memory per query execution. Queries exceeding this budget return an error. See [Spicepod Configuration](spicepod/index.md).
-- **Spill-to-Disk**: Operators such as Sort, Join, and GroupByHash spill intermediate results to disk when memory limits are exceeded, preventing out-of-memory errors.
-
-## Embedded Data Accelerators
-
-Spice.ai supports embedded accelerators like [SQLite](/website/docs/components/data-accelerators/sqlite.md) and [DuckDB](/website/docs/components/data-accelerators/duckdb.md), each with distinct memory considerations:
-
-- **SQLite**: Lightweight and memory-efficient, suitable for smaller datasets. Does not support intermediate spilling; datasets should fit comfortably in memory or use application-level paging.
-- **DuckDB**: Optimized for larger datasets and complex queries. Manages memory through streaming execution, intermediate spilling, and buffer management. By default, DuckDB instances use up to 80% of available system memory. Consolidate multiple datasets into a single DuckDB instance to avoid excessive cumulative memory usage:
-
-```yaml
-acceleration:
-  engine: duckdb
-  params:
-    duckdb_file: '/data/shared_duckdb_instance.db'
-    duckdb_memory_limit: '4G'
-```
-
-Configure DuckDB temporary directories and limits as follows:
-
-```sql
-SET temp_directory = '/tmp/duckdb_swap';
-SET max_temp_directory_size = '100GB';
-```
-
-For detailed DuckDB memory management, refer to the [DuckDB Memory Management Guide](https://duckdb.org/docs/operations_manual/limits.html).
-
-## Kubernetes Memory Configuration
-
-Set appropriate memory requests and limits in Kubernetes pod specifications:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: spice-ai-pod
-spec:
-  containers:
-    - name: spice-ai-container
-      image: spiceai/spiceai:latest-models
-      resources:
-        requests:
-          memory: '8Gi'
-          cpu: '4'
-```
-
-## Monitoring and Profiling
-
-Regularly monitor and profile memory usage with observability tools to identify and address potential memory bottlenecks promptly.
-
-Following these recommendations helps developers effectively manage memory resources, ensuring Spice.ai deployments remain performant, stable, and reliable.
diff --git a/website/docs/reference/system_requirements.md b/website/docs/reference/system_requirements.md
index b46ded480..0117fb7f7 100644
--- a/website/docs/reference/system_requirements.md
+++ b/website/docs/reference/system_requirements.md
@@ -80,7 +80,9 @@ Spice resource requirements, particularly memory, are highly dependent on worklo
 | `refresh_mode: full` | 2.5x the dataset size |
 | `refresh_mode: append` | 1.5x the dataset size |

-See [Memory Management and Best Pratices](memory.md) for a detailed guide on memory considerations.
+### DuckDB Memory Usage
+
+When using DuckDB (as an accelerator or connector), it by default uses 80% of available memory. For more details, refer to the [DuckDB operations manual limits](https://duckdb.org/docs/operations_manual/limits.html).

## Additional Considerations