A filter plugin for Embulk to filter out rows
Requirement: version >= 0.3.0
- where: Select only rows which match with conditions written in SQL-like syntax. See SQL-like Syntax
filters:
- type: row
where: column1 = 'str'
filters:
- type: row
where: |-
(
string_column START_WITH 'str' AND
number_column > 1.0
)
OR
(
time_column = TIMESTAMP '2016-01-01 +0900' AND
"true_column" = true
)
See SQL-like Syntax for more details
This syntax must be similar with a standard SQL syntax.
where: column1 = 'str'
where: |-
(
string_column START_WITH 'str' AND
number_column > 1.0
)
OR
(
time_column = TIMESTAMP '2016-01-01 +0900' AND
"true_column" = true AND
string_column REGEXP '^reg'
)
true
or TRUE
or false
or FALSE
are considered as a boolean literal
Characters matching with a regular expression -?[0-9]+(\.[0-9]+)?
is considered as a number literal
Characters surrounded by '
such as 'foo'
is considered as a string literal
NOTE: It became possible to omit TIMESTAMP
keyword on comparing with timestamp
identifier (column) from version >= 0.3.3.
TIMESTAMP ( NumberLiteral | StringLiteral )
such as TIMESTAMP 1470433087.747123
or TIMESTAMP '2016-08-06 06:38:07.747123 +0900'
is considered as a timestamp literal
Number is a epoch time since 1970-01-01 UTC with nano time resolution.
String is a timestamp string which matches with one of following format:
%Y-%m-%d %H:%M:%S.%N %z
%Y-%m-%d %H:%M:%S.%N
%Y-%m-%d %H:%M:%S %z
%Y-%m-%d %H:%M:%S
%Y-%m-%d %z
%Y-%m-%d
The time zone for formats without %z
is UTC, and the time resolution is micro second (caused by limitation of Embulk TimestampParser).
Not supported yet
Characters matching with a regular expression [a-zA-Z_][a-zA-z0-9_]*
such as foobar
, and characters surrounded by "
such as "foo-bar"
, "foo.bar"
, and "foo\"bar"
are considred as an identifier literal, that is, embulk's column name.
=
!=
=
!=
>
>=
<=
<
=
!=
START_WITH
END_WITH
INCLUDE
REGEXP
=
!=
>
>=
<=
<
Not supported yet
- "xxx IS NULL"
- "xxx IS NOT NULL"
- "NOT xxx"
Versions >= 0.3.0 has where
option to supports SQL-like syntax. I recommend to use it.
Following options are deprecated, and will be removed someday.
- condition: AND or OR (string, default: AND).
- conditions: select only rows which matches with conditions.
- column: column name (string, required)
- operator operator (string, optional, default: ==)
- boolean operator
==
!=
- numeric operator (long, double, Timestamp)
==
!=
>
>=
<=
<
- string operator
==
!=
start_with
(orstartsWith
)end_with
(orendsWith
)include
(orcontains
)
- unary operator
IS NULL
IS NOT NULL
- boolean operator
- argument: argument for the operation (string, required for non-unary operators)
- not: not (boolean, optional, default: false)
- format: special option for timestamp column, specify the format of timestamp argument, parsed argument is compared with the column value as Timestamp object (string, default is
%Y-%m-%d %H:%M:%S.%N %z
) - timezone: special option for timestamp column, specify the timezone of timestamp argument (string, default is
UTC
)
NOTE: column type is automatically retrieved from input data (inputSchema)
Deprecated
filters:
- type: row
condition: AND
conditions:
- {column: foo, operator: "IS NOT NULL"}
- {column: id, operator: ">=", argument: 10}
- {column: id, operator: "<", argument: 20}
- {column: name, opeartor: "include", argument: foo, not: true}
- {column: time, operator: "==", argument: "2015-07-13", format: "%Y-%m-%d"}
Deprecated
filters:
- type: row
condition: OR
conditions:
- {column: a, operator: "IS NOT NULL"}
- {column: b, operator: "IS NOT NULL"}
Deprecated
You can express a condition such as (A OR B) AND (C OR D)
by combining multiple filters like
filters:
- type: row
condition: OR
conditions:
- {column: a, operator: "IS NOT NULL"}
- {column: b, operator: "IS NOT NULL"}
- type: row
condition: OR
conditions:
- {column: c, operator: "IS NOT NULL"}
- {column: d, operator: "IS NOT NULL"}
- embulk-filter-calcite
- embulk-filter-calcite is a pretty nice plugin which enables us to write SQL query to filter embulk records, not only
WHERE
but alsoSELECT
. - Based on my benchmark (Japanese), embulk-filter-row was faster than embulk-filter-calcite.
- Choose which to use as your demand.
- embulk-filter-calcite is a pretty nice plugin which enables us to write SQL query to filter embulk records, not only
- Support filtering by values of
type: json
with JSONPath - Support IN operator
Run example:
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
Run test:
$ ./gradlew test
Run checkstyle:
$ ./gradlew check
Release gem:
$ ./gradlew gemPush
Read the article Supported SQL-like Syntax with embulk-filter-row using BYACC/J and JFlex.
To download BYACC/J and JFlex and run them, you can use:
$ script/byaccj.sh
or
$ ./gradlew byaccj # this runs script/byaccj.sh internally
This generates src/main/java/org/embulk/filter/row/where/{Parser,ParserVal,Yylex}.java
.
The byaccj
task of gradle is ran before compileJava
task (which means to be ran before classpath
or test
task also) automatically.