Support an arbitrary CSV delimiter #2263

Merged: 10 commits, Apr 27, 2018


zhkvia
Contributor

@zhkvia zhkvia commented Apr 22, 2018

Added a format_csv_delimiter setting for specifying an arbitrary CSV delimiter.
How it can be used:

  • As a client argument:
    $ clickhouse-client --format_csv_delimiter=";" --query="INSERT INTO table FORMAT CSV"
  • As a session setting:
    :) SET format_csv_delimiter=';'

    SET format_csv_delimiter = ';'
    
    Ok.
    
    :) SELECT * FROM table FORMAT CSV

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

private:
    void checkStringIsACharacter(const String & x) const {
        if (x.size() != 1)
            throw Exception(std::string("A setting's value string has to be an exactly one character long"));
Member

It's not necessary to construct std::string explicitly.

private:
    void checkStringIsACharacter(const String & x) const {
        if (x.size() != 1)
            throw Exception(std::string("A setting's value string has to be an exactly one character long"));
Member

Missing ErrorCodes.

Contributor

Style. Braces {} should be in a separate new line.
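
For reference, a minimal standalone sketch of how the check might look once the three notes above are addressed. The `Exception` class, the `ErrorCodes::BAD_ARGUMENTS` constant and its numeric value are stand-ins defined here purely for illustration; they are assumptions, not ClickHouse's actual classes or error codes.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; not ClickHouse's real types.
namespace ErrorCodes
{
    constexpr int BAD_ARGUMENTS = 1;  // placeholder value, assumed
}

class Exception : public std::runtime_error
{
public:
    Exception(const std::string & message, int code_)
        : std::runtime_error(message), code(code_)
    {
    }

    int code;
};

// Throws unless the setting value is exactly one character long.
// The string literal converts to std::string implicitly, the exception
// carries an error code, and braces sit on their own lines.
void checkStringIsACharacter(const std::string & x)
{
    if (x.size() != 1)
        throw Exception("A setting's value string has to be exactly one character long", ErrorCodes::BAD_ARGUMENTS);
}
```

Calling `checkStringIsACharacter(";")` returns normally, while an empty or multi-character value throws with the assumed error code attached.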


void set(const Field & x)
{
    String s = safeGet<const String &>(x);
Member

We can use reference here to avoid copying.


When parsing, all values can be parsed either with or without quotes. Both double and single quotes are supported. Rows can also be arranged without quotes. In this case, they are parsed up to a comma or line feed (CR or LF). In violation of the RFC, when parsing rows without quotes, the leading and trailing spaces and tabs are ignored. For the line feed, Unix (LF), Windows (CR LF) and Mac OS Classic (CR LF) are all supported.
&ast;By default — `,`. See a [format_csv_delimiter](/docs/en/operations/settings/settings/#format_csv_delimiter) setting for additional info.
Member

Are you sure that Markdown supports HTML entities?

Contributor Author

It seems logical since markdown is compiled into HTML. Also, mkdocs in the docs/ directory compiles it correctly.

@alexey-milovidov
Member

Almost everything is OK, but tests are missing.
You may use simple functional tests (see the dbms/tests/queries directory).
Both input and output should be tested.
Please add test cases where an unquoted string contains a comma and the delimiter is a semicolon, something like this:
abc,def;hello
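
To make the expected behaviour of that test row concrete, here is a rough standalone sketch of splitting a row on a configurable single-character delimiter. The function `splitRow` is a hypothetical helper written for this illustration and is not the parser used in ClickHouse; it handles only unquoted and double-quoted fields and does no whitespace trimming.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one row into fields on a single-character delimiter.
// Fields may be double-quoted; inside quotes the delimiter is literal and
// "" stands for one quote character. Illustrative only, not ClickHouse code.
std::vector<std::string> splitRow(const std::string & row, char delimiter)
{
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < row.size(); ++i)
    {
        const char c = row[i];
        if (in_quotes)
        {
            if (c == '"' && i + 1 < row.size() && row[i + 1] == '"')
            {
                current += '"';  // doubled quote inside a quoted field
                ++i;
            }
            else if (c == '"')
                in_quotes = false;  // closing quote
            else
                current += c;
        }
        else if (c == '"' && current.empty())
            in_quotes = true;  // opening quote at the start of a field
        else if (c == delimiter)
        {
            fields.push_back(current);
            current.clear();
        }
        else
            current += c;
    }

    fields.push_back(current);  // last field has no trailing delimiter
    return fields;
}
```

With `;` as the delimiter, the row `abc,def;hello` splits into `abc,def` and `hello`; with `,` the same row splits into `abc` and `def;hello`. A functional test would want to cover both directions (input and output) for each case.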

@amosbird
Collaborator

amosbird commented Apr 24, 2018

Hmm, can we still call that a CSV format without a comma? I assume a new format with some delimiter argument would be a better choice. And there are some use cases of multi-byte delimiters such as data exported from Netezza.

@alexey-milovidov alexey-milovidov merged commit 093c054 into ClickHouse:master Apr 27, 2018
@alexey-milovidov
Member

> Hmm, can we still call that a CSV format without a comma? I assume a new format with some delimiter argument would be a better choice.

Yes, it is controversial. In fact, many "CSV readers" are configurable in this way.
We may add a format named 'DSV', but CSV is also OK.

> And there are some use cases of multi-byte delimiters such as data exported from Netezza.

Do they use a multibyte delimiter by default?

@amosbird
Collaborator

amosbird commented Apr 28, 2018

> Do they use a multibyte delimiter by default?

Nope. The default is |.

@hereTac

hereTac commented Jun 22, 2018

Is --format_csv_delimiter= supported in v1.1.54385-stable?
Error:
Bad arguments: unrecognised option '--format_csv_delimiter= '

@AntonSaykovsky

> Is --format_csv_delimiter= supported in v1.1.54385-stable?
> Error:
> Bad arguments: unrecognised option '--format_csv_delimiter= '

Version 1.1.54390 has the same error. =(((
