S3 doc change to reflect the csv file change to #247

Merged
merged 13 commits into from
May 20, 2024
89 changes: 39 additions & 50 deletions spiceaidocs/docs/data-connectors/ftp.md
@@ -7,72 +7,61 @@ description: 'FTP/SFTP Data Connector Documentation'
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The FTP/SFTP Data Connector enables federated SQL query across files stored in FTP/SFTP servers.
The FTP/SFTP Data Connector enables federated SQL query across Parquet/CSV files stored in FTP/SFTP servers.

Supports Parquet and CSV file formats.
If a folder is provided, all child Parquet/CSV files will be loaded.

If a folder is provided, all child files will be loaded.

To connect to any FTP/SFTP server, specify `ftp` or `sftp` as a selector in the `from` value for the dataset.
## Configuration

<Tabs>
<TabItem value="ftp" label="FTP" default>
```yaml
datasets:
  - from: ftp://<host>/path/to/folder/
    name: my_dataset
```
</TabItem>
<TabItem value="sftp" label="SFTP">
```yaml
datasets:
  - from: sftp://<host>/path/to/folder/
    name: my_dataset
```
</TabItem>
</Tabs>
### Parameters

## Configuration
The connection to FTP can be configured by providing the following params:

<Tabs>
<TabItem value="ftp" label="FTP" default>
- `file_format`: Optional parameter, specifies the requested file format.
- `file_format`: Optional, specifies the requested file format.
  - `parquet`: (default) Parquet file format.
  - `csv`: CSV file format.
- `ftp_port`: Optional parameter, specifies the port of the FTP server. Default is 21. E.g. `ftp_port: 21`
- `ftp_port`: Optional, specifies the port of the FTP server. Default is 21. E.g. `ftp_port: 21`
- `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
- `ftp_pass`: The password for the FTP server. E.g. `ftp_pass: my-ftp-password`
- `ftp_pass_key`: The secret key containing the password to connect with. E.g. `ftp_pass_key: my-ftp-password-key`

More CSV-related parameters can be configured; see [CSV Parameters](../reference/file_format.md#CSV).

### Examples
```yaml
- from: ftp://remote-ftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    ftp_user: my-ftp-user
    ftp_pass_key: my-ftp-password
```
</TabItem>
<TabItem value="sftp" label="SFTP">
- `file_format`: Optional parameter, specifies the requested file format.
### Parameters

The connection to SFTP can be configured by providing the following params:

- `file_format`: Optional, specifies the requested file format.
  - `parquet`: (default) Parquet file format.
  - `csv`: CSV file format.
- `sftp_port`: Optional parameter, specifies the port of the SFTP server. Default is 22. E.g. `sftp_port: 22`
- `sftp_user`: The username for the FTP server. E.g. `sftp_user: my-sftp-user`
- `sftp_pass`: The password for the FTP server. E.g. `sftp_pass: my-sftp-password`
- `sftp_port`: Optional, specifies the port of the SFTP server. Default is 22. E.g. `sftp_port: 22`
- `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
- `sftp_pass`: The password for the SFTP server. E.g. `sftp_pass: my-sftp-password`
- `sftp_pass_key`: The secret key containing the password to connect with. E.g. `sftp_pass_key: my-sftp-password-key`
</TabItem>
</Tabs>

Configuration `params` are provided at the top level of the `dataset`, for both a dataset source and federated SQL query.

```yaml
- from: ftp://remote-ftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    ftp_user: my-ftp-user
    ftp_pass_key: my-ftp-password
```

```yaml
- from: sftp://remote-ftp-server.com/path/to/folder/
  name: my_dataset
  params:
    sftp_port: 20
    sftp_user: my-ftp-user
    sftp_pass_key: my-ftp-password
```

More CSV-related parameters can be configured; see [CSV Parameters](../reference/file_format.md#CSV).

### Examples
```yaml
- from: sftp://remote-sftp-server.com/path/to/folder/
  name: my_dataset
  params:
    sftp_port: 20
    sftp_user: my-sftp-user
    sftp_pass_key: my-sftp-password
```
</TabItem>
</Tabs>
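
The folder behaviour described above suggests that `from` can also point at a single file rather than a folder. A minimal sketch, assuming single-file paths are accepted and using a hypothetical host, file path, and secret key name:

```yaml
datasets:
  # Hypothetical host and file path; only this one CSV file would be loaded.
  - from: sftp://remote-sftp-server.com/path/to/folder/data.csv
    name: my_single_file_dataset
    params:
      file_format: csv # required here because the default is parquet
      sftp_user: my-sftp-user
      sftp_pass_key: my-sftp-password-key
      # sftp_port is omitted, so the documented default of 22 applies
```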
12 changes: 6 additions & 6 deletions spiceaidocs/docs/data-connectors/s3.md
@@ -7,11 +7,11 @@ description: 'S3 Data Connector Documentation'
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The S3 Data Connector enables federated SQL query across Parquet files stored in S3, or S3-compatible storage solutions (e.g. MinIO, Cloudflare R2).
The S3 Data Connector enables federated SQL query across Parquet/CSV files stored in S3, or S3-compatible storage solutions (e.g. MinIO, Cloudflare R2).

Support for Iceberg, CSV, and other file formats is on the roadmap.
Support for Iceberg and other file formats is on the roadmap.

If a folder is provided, all child Parquet files will be loaded.
If a folder is provided, all child Parquet/CSV files will be loaded.

## Dataset Schema Reference

@@ -29,12 +29,12 @@ Example: `name: cool_dataset`

### `params` (optional)

- `file_format`: Specifies the requested file format. Default is `parquet`.
  - `parquet`: (default) Parquet file format.
  - `csv`: CSV file format.
- `endpoint`: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to region endpoint. E.g. `endpoint: https://my.minio.server`
- `region`: Region of the S3 bucket, if region specific. Default value is `us-east-1`. E.g. `region: us-east-1`
- `timeout`: Specifies the timeout for S3 operations. Default value is `30s`. E.g. `timeout: 60s`
- `file_format`: Optional. The file format to query against, either `csv` or `parquet`. Defaults to `parquet`.

More CSV-related parameters can be configured; see [CSV Parameters](../reference/file_format.md#CSV).
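
As an illustration of how the `params` above might fit together, here is a minimal sketch of an S3 dataset that reads CSV files. The bucket name, path, and dataset name are hypothetical, and the `s3://` form of the `from` value is an assumption, since that part of the page is not shown in this diff:

```yaml
datasets:
  # Hypothetical bucket and folder; the s3:// selector is assumed.
  - from: s3://my-bucket/path/to/csv-folder/
    name: my_s3_dataset
    params:
      file_format: csv # defaults to parquet when omitted
      region: us-east-1 # the documented default, shown for clarity
      timeout: 60s # raises the documented 30s default
```

For S3-compatible storage such as MinIO, an `endpoint` entry (e.g. `endpoint: https://my.minio.server`) would presumably be added alongside these values.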

## Auth

20 changes: 20 additions & 0 deletions spiceaidocs/docs/reference/file_format.md
@@ -0,0 +1,20 @@
---
title: "File Formats"
sidebar_label: "File Formats"
pagination_prev: 'reference/index'
pagination_next: null
---

Spice currently supports the CSV and Parquet data file formats. Support for Iceberg and other file formats is on the roadmap.

The parameters supported for each file format are detailed on this page.

## CSV

### Parameters

- `has_header`: Optional. Indicates whether the CSV file has a header row. Defaults to `true`
- `quote`: Optional. A one-character string used to quote fields containing special characters. Defaults to `"`
- `escape`: Optional. A one-character string used to represent special characters or to include characters that would normally be interpreted as delimiters or new line characters within a field value. Defaults to `null`
- `schema_infer_max_records`: Optional. A number used to set the limit in terms of records to scan to infer the schema. Defaults to `1000`
- `delimiter`: Optional. A one-character string used to separate individual fields. Defaults to `,`
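
The parameters above can be sketched in a dataset definition as follows. This assumes they are set in the dataset's `params` block next to the connector's own parameters, which the connector pages imply but do not show. The host, dataset name, and chosen values are illustrative only:

```yaml
datasets:
  - from: ftp://remote-ftp-server.com/path/to/folder/
    name: my_csv_dataset
    params:
      file_format: csv
      ftp_user: my-ftp-user
      ftp_pass_key: my-ftp-password-key
      # CSV parameters (assumed to sit alongside the connector parameters)
      has_header: true # the documented default
      delimiter: ';' # semicolon-separated fields instead of the default comma
      quote: '"' # the documented default
      schema_infer_max_records: 500 # scan fewer records than the default 1000
```

Lowering `schema_infer_max_records` shortens schema inference, with the trade-off that a column whose type only becomes apparent in later records may be inferred incorrectly.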