Merge pull request #103 from supabase/feat/s3fdw-parquet-support
feat: add parquet file reading support for s3fdw
burmecia committed Jun 12, 2023
2 parents ebc32eb + ef1420f commit c63e89a
Showing 11 changed files with 1,098 additions and 226 deletions.
63 changes: 59 additions & 4 deletions docs/s3.md
@@ -1,7 +1,8 @@
[AWS S3](https://aws.amazon.com/s3/) is an object storage service offering industry-leading scalability, data availability, security, and performance. The S3 wrapper is under development. It is read-only and supports the following file formats:

1. CSV - with or without header line
2. [JSON Lines](https://jsonlines.org/)
3. [Parquet](https://parquet.apache.org/)

The S3 FDW also supports the following compression algorithms:

@@ -10,7 +11,27 @@ The S3 FDW also supports below compression algorithms:
3. xz
4. zlib

**Note for CSV and JSONL files: currently, all columns in the S3 file must be defined in the foreign table and their types must be `text`**

**Note for Parquet files: a compressed Parquet file is loaded into local memory in its entirety, so keep compressed Parquet files small**

### Supported Data Types For Parquet File

The S3 FDW uses Parquet file data types from [arrow_array::types](https://docs.rs/arrow-array/41.0.0/arrow_array/types/index.html). Their mappings to Postgres data types are listed below.

| Postgres Type | Parquet Type |
| ------------------ | ------------------------ |
| boolean | BooleanType |
| char | Int8Type |
| smallint | Int16Type |
| real | Float32Type |
| integer | Int32Type |
| double precision | Float64Type |
| bigint | Int64Type |
| numeric | Float64Type |
| text | ByteArrayType |
| date | Date64Type |
| timestamp | TimestampNanosecondType |

### Wrapper
To get started with the S3 wrapper, create a foreign data wrapper specifying `handler` and `validator` as below.
@@ -90,14 +111,17 @@ create server s3_server

The S3 wrapper is implemented with the [ELT](https://hevodata.com/learn/etl-vs-elt/) approach, so data transformation should be performed locally after the data is extracted from the remote data source.

One file in S3 corresponds to one foreign table in Postgres. For CSV and JSONL files, all columns must be present in the foreign table and their types must be `text`. You can do custom transformations, like type conversion, by creating a view on top of the foreign table or using a subquery.

For Parquet files, not all columns need to be defined in the foreign table, but the column names must match between the Parquet file and its foreign table.
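The view-based transformation mentioned above can be sketched as follows, assuming a hypothetical `s3_table_csv` foreign table with `text` columns `id`, `name`, and `amount` (all names here are illustrative, not part of the wrapper's API):

```sql
-- hypothetical view casting the all-text CSV columns to concrete types
create view s3_table_csv_typed as
select
  id::bigint as id,
  name,
  amount::numeric as amount
from s3_table_csv;
```

Queries against `s3_table_csv_typed` then see properly typed columns while the underlying foreign table stays all-`text`.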


#### Foreign Table Options

The full list of foreign table options are below:

- `uri` - S3 URI, required. For example, `s3://bucket/s3_table.csv`
- `format` - File format, required. `csv`, `jsonl`, or `parquet`
- `has_header` - If the CSV file has header, optional. `true` or `false`, default is `false`
- `compress` - Compression algorithm, optional. One of `gzip`, `bzip2`, `xz`, `zlib`, default is no compression

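The options above can be combined. For instance, a sketch of a foreign table over a gzip-compressed JSON Lines file (the URI and column names are assumptions for illustration):

```sql
-- hypothetical example: gzip-compressed JSON Lines file
create foreign table s3_table_jsonl_gz (
  id text,
  name text
)
  server s3_server
  options (
    uri 's3://bucket/s3_table.jsonl.gz',
    format 'jsonl',
    compress 'gzip'
  );
```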
@@ -148,5 +172,36 @@ create foreign table s3_table_csv_gzip (
has_header 'true',
compress 'gzip'
);

-- Parquet file, no compression
create foreign table s3_table_parquet (
id integer,
bool_col boolean,
bigint_col bigint,
float_col real,
date_string_col text,
timestamp_col timestamp
)
server s3_server
options (
uri 's3://bucket/s3_table.parquet',
format 'parquet'
);

-- GZIP compressed Parquet file
create foreign table s3_table_parquet_gz (
id integer,
bool_col boolean,
bigint_col bigint,
float_col real,
date_string_col text,
timestamp_col timestamp
)
server s3_server
options (
uri 's3://bucket/s3_table.parquet.gz',
format 'parquet',
compress 'gzip'
);
```
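Once defined, the foreign tables above can be queried like regular tables. A minimal sketch against `s3_table_parquet` (the filter value is illustrative):

```sql
-- query the Parquet-backed foreign table like a regular table
select id, bool_col, timestamp_col
from s3_table_parquet
where bigint_col > 100;
```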

