Merge pull request #103 from supabase/feat/s3fdw-parquet-support
feat: add parquet file reading support for s3fdw
burmecia committed Jun 12, 2023
2 parents ebc32eb + ef1420f commit c63e89a
Showing 11 changed files with 1,098 additions and 226 deletions.
63 changes: 59 additions & 4 deletions docs/s3.md
@@ -1,7 +1,8 @@
[AWS S3](https://aws.amazon.com/s3/) is an object storage service offering industry-leading scalability, data availability, security, and performance. The S3 wrapper is under development. It is read-only and supports the following file formats:

1. CSV - with or without header line
2. [JSON Lines](https://jsonlines.org/)
3. [Parquet](https://parquet.apache.org/)

The S3 FDW also supports the following compression algorithms:

@@ -10,7 +11,27 @@ The S3 FDW also supports below compression algorithms:
3. xz
4. zlib

**Note for CSV and JSONL files: currently, all columns in the S3 file must be defined in the foreign table and their types must be `text`**

**Note for Parquet files: a compressed Parquet file is loaded into local memory in its entirety, so keep compressed Parquet files small**

### Supported Data Types For Parquet File

The S3 FDW uses Parquet file data types from [arrow_array::types](https://docs.rs/arrow-array/41.0.0/arrow_array/types/index.html). Their mappings to Postgres data types are listed below.

| Postgres Type | Parquet Type |
| ------------------ | ------------------------ |
| boolean | BooleanType |
| char | Int8Type |
| smallint | Int16Type |
| real | Float32Type |
| integer | Int32Type |
| double precision | Float64Type |
| bigint | Int64Type |
| numeric | Float64Type |
| text | ByteArrayType |
| date | Date64Type |
| timestamp | TimestampNanosecondType |

### Wrapper
To get started with the S3 wrapper, create a foreign data wrapper specifying `handler` and `validator` as below.
@@ -90,14 +111,17 @@ create server s3_server

The S3 wrapper is implemented with the [ELT](https://hevodata.com/learn/etl-vs-elt/) approach, so data transformation should be performed locally after the data is extracted from the remote data source.

One file in S3 corresponds to one foreign table in Postgres. For CSV and JSONL files, all columns must be present in the foreign table and their types must be `text`. You can do custom transformations, like type conversion, by creating a view on top of the foreign table or using a subquery.

For Parquet files, not all columns need to be defined in the foreign table, but the column names must match between the Parquet file and its foreign table.
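The view-based transformation mentioned above can be sketched as follows, assuming a hypothetical `s3_table_csv` foreign table with `text` columns `id`, `name`, and `amount` (all names here are illustrative, not part of the wrapper's API):

```sql
-- hypothetical view casting the all-text CSV columns to concrete types
create view s3_table_csv_typed as
select
  id::bigint as id,
  name,
  amount::numeric as amount
from s3_table_csv;
```

Queries against `s3_table_csv_typed` then see properly typed columns while the underlying foreign table stays all-`text`.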


#### Foreign Table Options

The full list of foreign table options are below:

- `uri` - S3 URI, required. For example, `s3://bucket/s3_table.csv`
- `format` - File format, required. `csv`, `jsonl`, or `parquet`
- `has_header` - If the CSV file has header, optional. `true` or `false`, default is `false`
- `compress` - Compression algorithm, optional. One of `gzip`, `bzip2`, `xz`, `zlib`, default is no compression

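The options above can be combined. For instance, a sketch of a foreign table over a gzip-compressed JSON Lines file (the URI and column names are assumptions for illustration):

```sql
-- hypothetical example: gzip-compressed JSON Lines file
create foreign table s3_table_jsonl_gz (
  id text,
  name text
)
  server s3_server
  options (
    uri 's3://bucket/s3_table.jsonl.gz',
    format 'jsonl',
    compress 'gzip'
  );
```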
@@ -148,5 +172,36 @@ create foreign table s3_table_csv_gzip (
has_header 'true',
compress 'gzip'
);

-- Parquet file, no compression
create foreign table s3_table_parquet (
id integer,
bool_col boolean,
bigint_col bigint,
float_col real,
date_string_col text,
timestamp_col timestamp
)
server s3_server
options (
uri 's3://bucket/s3_table.parquet',
format 'parquet'
);

-- GZIP compressed Parquet file
create foreign table s3_table_parquet_gz (
id integer,
bool_col boolean,
bigint_col bigint,
float_col real,
date_string_col text,
timestamp_col timestamp
)
server s3_server
options (
uri 's3://bucket/s3_table.parquet.gz',
format 'parquet',
compress 'gzip'
);
```
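Once defined, the foreign tables above can be queried like regular tables. A minimal sketch against `s3_table_parquet` (the filter value is illustrative):

```sql
-- query the Parquet-backed foreign table like a regular table
select id, bool_col, timestamp_col
from s3_table_parquet
where bigint_col > 100;
```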

