
Empty values in comma-delimited data cause parsing issues #469

Open
marklit opened this Issue Feb 8, 2017 · 6 comments


marklit commented Feb 8, 2017

I have data files where each line contains a comma-delimited record. There are no commas within any of the field contents, so there shouldn't be any escaping issues. The first 22 columns of the table look as follows:

CREATE TABLE trips (
    trip_id                 UInt32,
    vendor_id               String,
    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    ...
) ENGINE = Log;

When I attempt to import the data with the following command:

time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \
        clickhouse-client \
        --query="INSERT INTO trips FORMAT CSV")

I'm getting an error message on one of the records:

Code: 27. DB::Exception: Cannot parse input: expected , before: -6,52,,,,,,,,,,,,,,,,,,,,\n504769614,CMT,2011-11-05 06:07:07,2011-11-05 06:12:53,N,1,-73.990061999999995,40.756630000000001,-73.996763999999999,40.73723799999999: (at row 73452711)

Row 73452710:
Column 0,   name: trip_id,               type: UInt32,                   parsed text: "504769613"
Column 1,   name: vendor_id,             type: String,                   parsed text: "VTS"
Column 2,   name: pickup_datetime,       type: DateTime,                 parsed text: "2011-11-04 13:21:00"
Column 3,   name: dropoff_datetime,      type: Nullable(DateTime),       parsed text: "2011-11-04 13:31:00"
Column 4,   name: store_and_fwd_flag,    type: Nullable(FixedString(1)), parsed text: <EMPTY>
Column 5,   name: rate_code_id,          type: Nullable(UInt8),          parsed text: "1"
Column 6,   name: pickup_longitude,      type: Nullable(Float64),        parsed text: "-74.005681999999993"
Column 7,   name: pickup_latitude,       type: Nullable(Float64),        parsed text: "40.745690000000003"
Column 8,   name: dropoff_longitude,     type: Nullable(Float64),        parsed text: "-73.996515000000002"
Column 9,   name: dropoff_latitude,      type: Nullable(Float64),        parsed text: "40.732435000000002"
Column 10,  name: passenger_count,       type: Nullable(UInt8),          parsed text: "1"
Column 11,  name: trip_distance,         type: Nullable(Float64),        parsed text: "1.3799999999999999"
Column 12,  name: fare_amount,           type: Nullable(Float32),        parsed text: "7.2999999999999998"
Column 13,  name: extra,                 type: Nullable(Float32),        parsed text: "0"
Column 14,  name: mta_tax,               type: Nullable(Float32),        parsed text: "0.5"
Column 15,  name: tip_amount,            type: Nullable(Float32),        parsed text: "2"
Column 16,  name: tolls_amount,          type: Nullable(Float32),        parsed text: "0"
Column 17,  name: ehail_fee,             type: Nullable(Float32),        ERROR: text ",,9.800000" is not like Nullable(Float32)

Is there a better way to define the table schema and/or the format parameters in the INSERT INTO statement so that the missing values don't cause such an issue when importing?

alexey-milovidov (Member) commented Feb 8, 2017

The CSV format doesn't treat empty values as NULL, to avoid ambiguity with empty strings.
We should add an option to treat empty as NULL.

Please note that NULL support is still shallow. We use NULLs only to import data, and convert to non-Nullable types before any further data analysis.

I suggest using the TSV (TabSeparated) format. Postgres, MySQL, Hive, etc. print NULLs as \N in this format, and ClickHouse understands it without ambiguity.
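One way to sketch that conversion (mine, not from the thread): a small stdin-to-stdout filter using Python's stdlib csv module, which would also handle any quoted commas that a plain character substitution would split incorrectly. The filter name and pipeline are hypothetical.

```python
import csv
import sys


def csv_to_tsv(lines):
    """Yield TabSeparated lines for the given CSV lines, emitting the
    literal \\N for empty fields so ClickHouse's TSV parser reads NULL."""
    for row in csv.reader(lines):
        yield '\t'.join(field if field.strip() else r'\N' for field in row)


if __name__ == '__main__':
    # Hypothetical usage as a pipeline filter, e.g.:
    #   gunzip -c trips.csv.gz | python csv2tsv.py | \
    #       clickhouse-client --query="INSERT INTO trips FORMAT TSV"
    for out_line in csv_to_tsv(sys.stdin):
        print(out_line)
```

Note this sketch does not escape tabs or backslashes that might appear inside field values; the trips data above contains neither.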


marklit commented Feb 8, 2017

Thanks for the explanation. I'll use sed to convert the data to TSV format.

Just in case it's helpful for anyone looking at this issue in the future, this is the command I'm running:

time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \
        sed 's/,/\t/g' | \
        clickhouse-client \
        --query="INSERT INTO trips FORMAT TSV")


marklit closed this Feb 8, 2017

marklit reopened this Feb 8, 2017

marklit commented Feb 8, 2017

Looks like I'll need to transform the blanks to \N. Switching to tabs and using the TSV format for importing raises the same issues.


alexey-milovidov (Member) commented Feb 8, 2017

Yes.

Although we could introduce an option to treat empty values as NULL fairly quickly (a few days).


marklit commented Feb 8, 2017

If you could, that would be great. In the meantime I've put together a Python script to transform the data:

import sys

# Replace empty comma-separated fields with the literal \N so that
# ClickHouse parses them as NULL.
for line in sys.stdin:
    print(','.join(item if item.strip() else r'\N'
                   for item in line.strip().split(',')))

I'm running the import now and it seems to be running well.

for filename in /home/mark/trips/trips_x*.csv.gz; do
    gunzip -c "$filename" | \
        python trans.py | \
        clickhouse-client \
            --query="INSERT INTO trips FORMAT CSV"
done
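As a sanity check (mine, not from the thread), the transform leaves populated fields untouched and maps each empty field to the literal \N; here it is applied to an invented sample line resembling the trips data:

```python
def fill_nulls(line):
    # Same logic as trans.py above: empty comma-separated fields
    # become the literal \N.
    return ','.join(item if item.strip() else r'\N'
                    for item in line.strip().split(','))


# Invented sample values, shaped like the trips rows shown earlier:
sample = '504769614,CMT,2011-11-05 06:07:07,,,1,-73.99,40.75\n'
print(fill_nulls(sample))
# 504769614,CMT,2011-11-05 06:07:07,\N,\N,1,-73.99,40.75
```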

Imaskar commented Feb 28, 2018

This problem occurs specifically for floating-point types (Nullable(Float64), Nullable(Float32)). I don't think there is any ambiguity about what an empty floating-point value means. Also, this problem occurs in version 1.1.54343, but not in 1.1.54310.

