Allow setting the header for csv, tsv, and ssv manually #3778

Merged
merged 2 commits into from Jan 9, 2024
2 changes: 2 additions & 0 deletions changelog/next/features/3778--xsv-header.md
@@ -0,0 +1,2 @@
+The `csv`, `tsv`, `ssv` and `xsv` parsers now support setting the header line
+manually with the `--header` option.
50 changes: 30 additions & 20 deletions libtenzir/builtins/formats/xsv.cpp
@@ -38,6 +38,7 @@ struct xsv_options {
   char list_sep = {};
   std::string null_value = {};
   bool allow_comments = {};
+  std::optional<std::string> header = {};

   static auto try_parse(parser_interface& p, std::string name, bool is_parser)
     -> xsv_options {
@@ -47,8 +48,10 @@ struct xsv_options {
     auto field_sep_str = located<std::string>{};
     auto list_sep_str = located<std::string>{};
     auto null_value = located<std::string>{};
+    auto header = std::optional<std::string>{};
     if (is_parser) {
       parser.add("--allow-comments", allow_comments);
+      parser.add("--header", header, "<header>");
     }
     parser.add(field_sep_str, "<field-sep>");
     parser.add(list_sep_str, "<list-sep>");
@@ -93,15 +96,15 @@ struct xsv_options {
       .list_sep = *list_sep,
       .null_value = std::move(null_value.inner),
       .allow_comments = allow_comments,
+      .header = std::move(header),
     };
   }

   friend auto inspect(auto& f, xsv_options& x) -> bool {
-    return f.object(x).fields(f.field("name", x.name),
-                              f.field("field_sep", x.field_sep),
-                              f.field("list_sep", x.list_sep),
-                              f.field("null_value", x.null_value),
-                              f.field("allow_comments", x.allow_comments));
+    return f.object(x).fields(
+      f.field("name", x.name), f.field("field_sep", x.field_sep),
+      f.field("list_sep", x.list_sep), f.field("null_value", x.null_value),
+      f.field("allow_comments", x.allow_comments), f.field("header", x.header));
   }
 };

@@ -245,22 +248,25 @@ auto parse_impl(generator<std::optional<std::string_view>> lines,
   // Parse header.
   auto it = lines.begin();
   auto header = std::optional<std::string_view>{};
-  while (it != lines.end()) {
-    auto line = *it;
-    ++it;
-    if (not line) {
-      co_yield {};
-      continue;
+  if (args.header) {
+    header = *args.header;
+  } else
+    while (it != lines.end()) {
+      auto line = *it;
+      ++it;
+      if (not line) {
+        co_yield {};
+        continue;
+      }
+      if (line->empty())
+        continue;
+      if (args.allow_comments && line->front() == '#')
+        continue;
+      header = line;
+      break;
+      if (not header)
+        co_return;
     }
-    if (line->empty())
-      continue;
-    if (args.allow_comments && line->front() == '#')
-      continue;
-    header = line;
-    break;
-  }
-  if (not header)
-    co_return;
   const auto qqstring_value_parser = parsers::qqstr.then([](std::string in) {
     static auto unescaper = [](auto& f, auto l, auto out) {
       if (*f != '\\') { // Skip every non-escape character.
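
The header-selection step in the hunk above can be sketched in isolation. This is a minimal illustration and not the code from this PR: `header_options`, `header_selection` and `select_header` are hypothetical names, and the input is a plain `std::vector<std::string>` instead of the generator of optional string views that `parse_impl` consumes. In this sketch the "no header found" case is handled after the loop has finished, so it is reached when the input ends without a usable line; the hunk as rendered places that check after `break;` inside the loop.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the two parser options involved.
struct header_options {
  bool allow_comments = false;
  std::optional<std::string> header = {};
};

// The chosen header (if any) and the index of the first line to treat as data.
struct header_selection {
  std::optional<std::string> header;
  std::size_t first_data_line = 0;
};

// A manually provided header wins and consumes no input. Otherwise the first
// non-empty line that is not a comment becomes the header.
inline auto select_header(const std::vector<std::string>& lines,
                          const header_options& opts) -> header_selection {
  if (opts.header) {
    return {opts.header, 0};
  }
  for (std::size_t i = 0; i < lines.size(); ++i) {
    const auto& line = lines[i];
    if (line.empty())
      continue;
    if (opts.allow_comments && line.front() == '#')
      continue;
    return {line, i + 1};
  }
  // The input ended before any header line was seen.
  return {std::nullopt, lines.size()};
}
```

With a manual header every input line remains available as data, which is the behavior the `--header` option is meant to provide.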
@@ -493,14 +499,17 @@ class configured_xsv_plugin final : public virtual parser_parser_plugin,
     -> std::unique_ptr<plugin_parser> override {
     auto parser = argument_parser{name()};
     bool allow_comments = {};
+    std::optional<std::string> header = {};
     parser.add("--allow-comments", allow_comments);
+    parser.add("--header", header, "<header>");
     parser.parse(p);
     return std::make_unique<xsv_parser>(xsv_options{
       .name = std::string{Name.str()},
       .field_sep = Sep,
       .list_sep = ListSep,
       .null_value = std::string{Null.str()},
       .allow_comments = allow_comments,
+      .header = std::move(header),
     });
   }

@@ -513,6 +522,7 @@ class configured_xsv_plugin final : public virtual parser_parser_plugin,
       .list_sep = ListSep,
       .null_value = std::string{Null.str()},
       .allow_comments = false,
+      .header = {},
     });
   }

2 changes: 1 addition & 1 deletion tenzir/integration/tests.yaml
@@ -109,7 +109,7 @@
         expected_result: error
       - command: exec "apply does_not_exist.tql"
         expected_result: error
       # This test works, but it includes the path to the file in the error
       # message, which makes the output change.
       # - command: exec "apply @./queries/from_unknown_file.tql"
       # expected_result: error
@@ -933,7 +933,7 @@
       - command: exec 'read zeek-tsv | head 1 | write json -c | shell rev'
         input: data/zeek/conn.log.gz
       - command: exec 'shell "echo foo"'
-      - command: exec 'shell "{ echo \"#\"; seq 1 2 10; }" | read csv | write json -c'
+      - command: exec 'shell "{ seq 1 2 10; }" | read csv --header "#" | write json -c'

   Top and Rare Operators:
     fixture: ServerTester
@@ -1134,10 +1134,10 @@
   # TODO: Set up Tenzir S3 stuff for Tenzir-internal read/write tests?
   # TODO: Re-enable this test once Arrow updated their bundled AWS SDK from version 1.10.55,
   # see: https://github.com/apache/arrow/issues/37721
   #S3 Connector:
     #tags: [pipelines]
     #steps:
       #- command: exec 'from s3 s3://sentinel-cogs/sentinel-s2-l2a-cogs/1/C/CV/2023/1/S2B_1CCV_20230101_0_L2A/tileinfo_metadata.json | write json'

   Blob Type:
     fixture: ServerTester
15 changes: 10 additions & 5 deletions web/docs/formats/xsv.md
@@ -12,10 +12,10 @@ Reads and writes lines with separated values.
 ## Synopsis

 ```
-csv [--allow-comments]
-ssv [--allow-comments]
-tsv [--allow-comments]
-xsv <field-sep> <list-sep> <null-value> [--allow-comments]
+csv [--allow-comments] [--header <header>]
+ssv [--allow-comments] [--header <header>]
+tsv [--allow-comments] [--header <header>]
+xsv <field-sep> <list-sep> <null-value> [--allow-comments] [--header <header>]
 ```

 ## Description
@@ -70,10 +70,15 @@ Specifies the string that separates list elements *within* a field.

 Specifies the string that denotes an absent value.

-### `--allow-comments`
+### `--allow-comments` (Parser)

 Treat lines beginning with `'#'` as comments.

+### `--header <header>` (Parser)
+
+Use the manually provided header line instead of treating the first line as the
+header.
+
 ## Examples

 Read CSV from stdin:
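
To illustrate what a manually provided header means for the data lines, here is a small standalone sketch. It is not Tenzir code: `split`, `record` and `to_record` are hypothetical helpers, every value stays a string, and quoting, escaping, list separators and null values are ignored. It mirrors the updated integration test, where `seq 1 2 10` produces the lines `1`, `3`, `5`, `7`, `9` and `--header "#"` names the single column.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits a line on a single-character separator. Unlike a real xsv parser,
// this sketch does not handle quoting or escaping.
inline auto split(const std::string& line, char sep)
  -> std::vector<std::string> {
  auto fields = std::vector<std::string>{};
  auto start = std::size_t{0};
  while (true) {
    const auto end = line.find(sep, start);
    if (end == std::string::npos) {
      fields.push_back(line.substr(start));
      return fields;
    }
    fields.push_back(line.substr(start, end - start));
    start = end + 1;
  }
}

// One parsed line: (column name, value) pairs in column order.
using record = std::vector<std::pair<std::string, std::string>>;

// Pairs each value of a data line with the column name from the header.
// Values without a matching column name are dropped in this sketch.
inline auto to_record(const std::string& header, const std::string& line,
                      char sep) -> record {
  const auto names = split(header, sep);
  const auto values = split(line, sep);
  auto result = record{};
  for (std::size_t i = 0; i < names.size() && i < values.size(); ++i)
    result.emplace_back(names[i], values[i]);
  return result;
}
```

Calling `to_record("#", "1", ',')` yields a single pair with the name `#` and the value `1`. Because the header comes from the option, the first input line is treated as data and not consumed as column names.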