docs: parquet compression options #1725

Merged
merged 2 commits into from
Apr 2, 2025
50 changes: 24 additions & 26 deletions docs/en/sql-reference/00-sql-reference/50-file-format-options.md
@@ -3,7 +3,7 @@ title: Input & Output File Formats
---
import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.530"/>
<FunctionDescription description="Introduced or updated: v1.2.713"/>

Databend accepts a variety of file formats both as a source and as a target for data loading or unloading. This page explains the supported file formats and their available options.

@@ -72,14 +72,10 @@ Separates fields in a record.

**Default**: `,` (comma)

### QUOTE
### QUOTE (Load Only)

Quotes strings in a CSV file. For data loading, quoting is required only when a string contains the [QUOTE](#quote), [ESCAPE](#escape), [RECORD_DELIMITER](#record_delimiter), or [FIELD_DELIMITER](#field_delimiter) character.

:::note
**Used for data loading ONLY**: This option is not available when you unload data from Databend.
:::

**Available Values**: `'`, `"`, or `` ` `` (backtick)

**Default**: `"`
@@ -92,49 +88,43 @@ Escapes a quote in a quoted string.

**Default**: `''`

### SKIP_HEADER
### SKIP_HEADER (Load Only)

Specifies the number of lines to skip at the beginning of the file.

:::note
**Used for data loading ONLY**: This option is not available when you unload data from Databend.
:::

**Default**: `0`

### NAN_DISPLAY
### NAN_DISPLAY (Load Only)

Specifies how "NaN" (Not-a-Number) values are displayed in query results.

**Available Values**: Must be literal `'nan'` or `'null'` (case-insensitive)

**Default**: `'NaN'`

### NULL_DISPLAY
### NULL_DISPLAY (Load Only)

Specifies how NULL values are displayed in query results.

**Default**: `'\N'`

### ERROR_ON_COLUMN_COUNT_MISMATCH
### ERROR_ON_COLUMN_COUNT_MISMATCH (Load Only)

A boolean option. When set to `true`, an error is raised if the number of columns in the data file does not match the number of columns in the destination table, which helps ensure data integrity and consistency during loading.

**Default**: `true`

### EMPTY_FIELD_AS
### EMPTY_FIELD_AS (Load Only)

Specifies the value that should be used when encountering empty fields, including both `,,` and `,"",`, in the CSV data being loaded into the table.

**Available Values**:

| Value | Description |
| Available Values | Description |
|------------------|-----------------------------------------------------------------------------------|
| `null` (Default) | Interprets empty fields as NULL values. Applicable to nullable columns only. |
| `string` | Interprets empty fields as empty strings (''). Applicable to String columns only. |
| `field_default` | Uses the column's default value for empty fields. |
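
As a rough sketch of how the load-only CSV options above might be combined, the statement below loads a file that has one header line and falls back to column defaults for empty fields. The table `my_table`, the stage `@my_stage`, and the file name `data.csv` are hypothetical, and the exact quoting accepted for option values is an assumption that may vary between Databend versions.

```sql
-- Hypothetical names: my_table, @my_stage, data.csv
COPY INTO my_table
FROM @my_stage
FILES = ('data.csv')
FILE_FORMAT = (
    TYPE = CSV,
    SKIP_HEADER = 1,                        -- skip the single header line
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE,  -- fail if the column counts differ
    EMPTY_FIELD_AS = 'field_default'        -- assumed quoting; use column defaults for empty fields
);
```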

### OUTPUT_HEADER
### OUTPUT_HEADER (Unload Only)

Specifies whether to include a header row in the CSV file when exporting data with the `COPY INTO <location>` command. Defaults to `false`.
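
A minimal sketch of an unload that writes a header row, assuming a hypothetical table `my_table` and stage `@my_stage`:

```sql
-- Hypothetical names: my_table, @my_stage
COPY INTO @my_stage
FROM my_table
FILE_FORMAT = (TYPE = CSV, OUTPUT_HEADER = TRUE);  -- first line of each file holds the column names
```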

@@ -146,9 +136,7 @@ Controls the binary encoding format during both data export and import operation

Specifies the compression algorithm.

**Available Values**:

| Value | Description |
| Available Values | Description |
|------------------|-----------------------------------------------------------------|
| `NONE` (Default) | Indicates that the files are not compressed. |
| `AUTO`           | Detects the compression automatically from the file extension.  |
@@ -209,7 +197,7 @@ Same as [the COMPRESSION option for CSV](#compression).

## NDJSON Options

### NULL_FIELD_AS
### NULL_FIELD_AS (Load Only)

Specifies how to handle null values during data loading. Refer to the options in the table below for possible configurations.

@@ -218,7 +206,7 @@ Specifies how to handle null values during data loading. Refer to the options in
| `NULL` (Default) | Interprets null values as NULL for nullable fields. An error will be generated for non-nullable fields. |
| `FIELD_DEFAULT` | Uses the default value of the field for null values. |

### MISSING_FIELD_AS
### MISSING_FIELD_AS (Load Only)

Determines the behavior when encountering missing fields during data loading. Refer to the options in the table below for possible configurations.

@@ -234,7 +222,7 @@ Same as [the COMPRESSION option for CSV](#compression).

## PARQUET Options

### MISSING_FIELD_AS
### MISSING_FIELD_AS (Load Only)

Determines the behavior when encountering missing fields during data loading. Refer to the options in the table below for possible configurations.

@@ -243,9 +231,19 @@ Determines the behavior when encountering missing fields during data loading. Re
| `ERROR` (Default)| Generates an error if a missing field is encountered. |
| `FIELD_DEFAULT` | Uses the default value of the field for missing fields. |

### COMPRESSION (Unload Only)

Specifies the compression algorithm. Compression is applied to the internal blocks of the file rather than to the file as a whole, so the output remains in Parquet format.

| Available Values | Description |
|------------------|-----------------------------------------------------------------------------|
| `ZSTD` (Default) | Zstandard v0.8 (and higher) is supported.                                   |
| `SNAPPY` | Snappy is a popular and fast compression algorithm often used with Parquet. |
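
For illustration, unloading to Parquet with Snappy instead of the default might look like the sketch below. The table `my_table` and stage `@my_stage` are hypothetical, and whether the value is written bare or as a quoted string is an assumption.

```sql
-- Hypothetical names: my_table, @my_stage
COPY INTO @my_stage
FROM my_table
FILE_FORMAT = (TYPE = PARQUET, COMPRESSION = SNAPPY);  -- omit COMPRESSION to keep the ZSTD default
```

Because the compression is internal to the file, the resulting files should still be readable as ordinary Parquet files.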


## ORC Options

### MISSING_FIELD_AS
### MISSING_FIELD_AS (Load Only)

Determines the behavior when encountering missing fields during data loading. Refer to the options in the table below for possible configurations.
