timezone of DateTime are not correctly processed #27

jinmingjian · 2021-04-21T23:10:01Z

the logic in arrow/DF ignores timezone, but CH re-interpret the presentation according to the server's timezone.

frank-king · 2021-06-29T07:28:04Z

Have read the documentation of CH [1], I found that the default timezone of a CH server is configurable. Has TB supported this yet? If yes, we might need to consider how to pass this parameter to DF.

Another question, from [2] we know that DateTime in CH also supports an optional timezone. The next step might be to enable timezone support BqlType. There will also be some refactors of the CH date/time functions in DF to support timezone.

[1] https://clickhouse.tech/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-timezone
[2] https://clickhouse.tech/docs/en/sql-reference/data-types/datetime/

jinmingjian · 2021-06-29T08:29:33Z

ok, nice summary! All we want to have:)

We have auto-discovered default tz in mgmt.rs which is easy to extend a configurable option in conf. But this is not high priority. I list the tasks in a priority (welcome to add or recorrect):

priority high:

correct process date/time functions with default tz (in this, we figure out a good way to handle the tz in TB)
DateTime with tz in meta
date/time functions with tz

priority low(not necessarily solved in near future):

support DateTime with tz
configurable tz in conf

frank-king · 2021-06-30T06:11:57Z

To support the default tz detected by TB, we should correctly pass the tz configuration to DF right, I think. And that might be the first thing to do. Then in the future, when we are planning to support configurable tz in conf, its easy to modify only the tz detecting process mgmt.rs to achieve this.

jinmingjian · 2021-06-30T10:18:20Z

@whjpji right. The main thing is how to make DF operations tz aware.

frank-king · 2021-06-30T15:13:30Z

@jinmingjian I got stuck, could you please give me some advises about how to represent the default tz in DF?

Should I pass the default tz via a default catalog or schema (or maybe information schema? I'm not very familiar with these concepts) into DF? i.e. In fn engine::datafusion::run [1], pass by adding an ExecutionConfig like

    let config = ExecutionConfig::new()
        .create_default_catalog_and_schema(true)
        .with_default_catalog_and_schema(/* ... */);
    let mut ctx = ExecutionContext::with_config(config);

I'm afraid this might bring extra cost if the engine rapidly access this parameter when doing query.

Another way might be using a SyncOnceCell in DF and passing the tz into DF at the initialization of TB, but I'm afraid it is not easy to implement the static value passing across crates.

Which way do you prefer? Or do you have any better ideas?

[1]

tensorbase/crates/engine/src/datafusions.rs

Line 68 in 21a4401

let mut ctx = ExecutionContext::new();

jinmingjian · 2021-07-01T00:59:57Z

@whjpji thanks for your exploring. For your passing problem, this is unimportant for whole big task.

What we should figure out this how or does A-DF correctly process the tz info? This answer is important, in that I see rare tz handling logic in A-DF. Another talking, for example, is this one recent issue in Arrow. If A-DF does not respect the tz well, then we need to find a way to solve this problem by ourself. That's the biggest problem.

I'm afraid this might bring extra cost if the engine rapidly access this parameter when doing query.

We first figure out how "the engine access this parameter", then we can understand the cost of this accessing.

jinmingjian · 2021-07-01T01:13:58Z

@whjpji more thoughts:
to correctly support the tz is not hard. A tz offset is enough. The question is just mentioned above: do we need an extra tz offset parameter everywhere? For our TB own builitins, we can support an offset parameter. But the question is just mentioned above as well: If A-DF can handling this tz correctly, we just follow the A-DF convention.

frank-king · 2021-07-01T12:08:56Z

@jinmingjian I agree with you. My mind graph is "pass tz to DF, and then make DF handle tz right", so the first thing came up with my mind was "pass tz to DF". However the latter thing is more important, so my thoughts were too straight.

I will read codes of A-DF more carefully and do some experiments on it, to find out whether it can handle the tz right, and if yes, then how.

frank-king · 2021-07-02T06:33:00Z

@jinmingjian Unfortunately, arrow does not support tz yet. The codes show that:

from [1] and [2], all date/time types only support to be accessed as a naive date/time type, where tz is ignored;
from [3], when extracting hours/minutes/seconds etc., all tz is ignored;
from [4], when taking values from date/time-typed arrays, tz is ignored too. And here is an open issue of arrow-rs take drops timezone from an input array with type Timestamp apache/arrow-rs#393.

We have to support tz ourselves, by making changes in A-DF. However, unlike #154, we cannot make all the changes in a dedicated mod, and these changes might be strongly coupled with all parts of A-DF.

Should we directly make PRs on the arrow-rs repo and backport them to TB? I think It is better if these PRs can keep track with and be reviewed by the arrow's community.

What do you think of this question?

[1]

tensorbase/crates/arrow/src/array/array_primitive.rs

Line 171 in a1ec8e8

    
           DataType::Timestamp32(_) => Some(temporal_conversions::timestamp_s_to_datetime(v)),

[2]

tensorbase/crates/arrow/src/array/array_primitive.rs

Line 245 in a1ec8e8

pub fn value_as_datetime(&self, i: usize) -> Option<NaiveDateTime> {

[3]

tensorbase/crates/arrow/src/compute/kernels/temporal.rs

Line 48 in a1ec8e8

| &DataType::Timestamp(_, _) => {

[4]

tensorbase/crates/arrow/src/compute/kernels/take.rs

Line 157 in a1ec8e8

DataType::Timestamp32(_) => {

frank-king · 2021-07-02T07:03:07Z

Oh, my last comment seemed to talk about another question: arrow does not support tz in date/time types.

Return back to our original question: how to make arrow support a global tz offset based on the server's tz or in the future the tz conf from the configuration file.

To make TB compatible with CH, not only the tz in date/time types should not be ignored, but also there should be a global tz offset stored somewhere. In which way do you think is better to introduce this global tz offset? Via the system schema or a global static variable somewhere?

jinmingjian · 2021-07-02T12:18:54Z

@whjpji If you confirm that A-DF does not support tz, I suggest we firstly try to patch at the planner side.

Patch the TB own Timestamp32 firstly: if we meet a Timestamp32, we patch that column with +/- global tz offset from mgnt as an expression in planning. How do you think so?:)

jinmingjian · 2021-07-02T13:16:24Z

@whjpji correct: you can not use tz from mgnt, otherwise there is an inter-dependence. You setup a global tz in DF, and sync mgnt's tz to that in DF at boot or some time.

frank-king · 2021-07-02T14:39:54Z

@jinmingjian I have confirmed that A-DF does not support tz. First by default, my timezone is Etc/GMT+8 (Asia/Shanghai), and I create a table and insert a timestamp into it:

TensorBase :)  create table test_datetime (d DateTime) engine = BaseStorage;

CREATE TABLE test_datetime
(
    d DateTime
)
ENGINE = BaseStorage

Ok.

0 rows in set. Elapsed: 0.020 sec. 

TensorBase :) insert into test_datetime values (cast('2021-07-02 22:11:56' as DateTime))

INSERT INTO test_datetime VALUES

Ok.

1 rows in set. Elapsed: 0.172 sec. 

TensorBase :) select * from test_datetime

SELECT *
FROM test_datetime 

┌───────────────────d─┐
│ 2021-07-02 22:11:56 │
└─────────────────────┘

1 rows in set. Elapsed: 1.586 sec.

Then I set my timezone to Etc/GMT-5 (America/Los_Angeles), and requery the timestamp

TensorBase :) select * from test_datetime

SELECT *
FROM test_datetime 

┌───────────────────d─┐
│ 2021-07-02 22:11:56 │
└─────────────────────┘

1 rows in set. Elapsed: 0.020 sec.

I got the same answer. However, the timestamp displayed should vary with different server timezones, which is not expected CH's behavior.

I will try fixing it in this weekend.

jinmingjian · 2021-07-02T15:00:14Z

@whjpji your this method does not confirm the A-DF behavior. Because the back type of DateTime is Timestamp32, which is TB new-added type. The good way is, you use/change a unit test in DF to confirm.

A-DF may ignore tz now in computation as I and you see, but the true concern is if there are some preset slot in the temporal fn to enable the tz without patching. For a idea:

Timestamp32(_) => to_some_part(array)

this ignore the tz, but you can tweak the logic into

Timestamp32(tz) => to_some_part(array, tz)

jinmingjian · 2021-07-02T15:04:49Z

@whjpji if my idea is established, you can tweak the Timestamp32(Option< String >) -> Timestamp32(Option< Int32 >) something like for performance, because it is not necessary to parse the offset from string for every function call.

frank-king · 2021-07-02T15:36:49Z

@jinmingjian Thanks for your correction, I didn't know that Timestamp32 is new added by TB.

for the first question,

A-DF may ignore tz now in computation as I and you see, but the true concern is if there are some preset slot in the temporal fn to enable the tz without patching.

I agree with you, and we can tweek the Timestamp32(None) to be Timestamp32(Some(tz)) in the column type's metadata during the preparation of the schemes before query, where tz is the default tz provided by TB.

for the second question,

you can tweak the Timestamp32(Option< String >) -> Timestamp32(Option< Int32 >) something like for performance, because it is not necessary to parse the offset from string for every function call.

I think it is not enough to store only an offset in the data type, because the tz name is also needed in some functions. Here is an example from the CH's document:

SELECT toDateTime('2019-01-01 00:00:00', 'UTC') AS time_utc,
    toTypeName(time_utc) AS type_utc,
    toInt32(time_utc) AS int32utc,
    toTimeZone(time_utc, 'Asia/Yekaterinburg') AS time_yekat,
    toTypeName(time_yekat) AS type_yekat,
    toInt32(time_yekat) AS int32yekat,
    toTimeZone(time_utc, 'US/Samoa') AS time_samoa,
    toTypeName(time_samoa) AS type_samoa,
    toInt32(time_samoa) AS int32samoa
FORMAT Vertical;

And the result will be:

Row 1:
──────
time_utc:   2019-01-01 00:00:00
type_utc:   DateTime('UTC')
int32utc:   1546300800
time_yekat: 2019-01-01 05:00:00
type_yekat: DateTime('Asia/Yekaterinburg')
int32yekat: 1546300800
time_samoa: 2018-12-31 13:00:00
type_samoa: DateTime('US/Samoa')
int32samoa: 1546300800

If only the tz offset is stored in the metadata, the column type_yekat will be DateTime(18000) instead. Even if we reconstruct the timezone name from the offset 18000, we probably get something like DateTime('+05:00') or DateTime('Etc/GMT-5'), which dose not meet the user's expectation.

What if we make the Timestamp32(Option<String>) to Timestamp32(Option<(String, i32)>)? I think the extra memory cost will not be so large since the type is stored only in the metadata.

BTW, all the changes made to Timestamp32 might also be applied on Timestamp of A-DF, since CH also support DateTime64.

frank-king · 2021-07-02T15:47:24Z

But there is still a problem. If no tz specified when inserting into a table (named A) and the column's type also has no tz, CH/TB should re-interpret the time with a default tz (saying GMT-8) of the server.

But if the server restarted with the tz changed (saying GMT+5), should the timestamp stored in A be interpreted as time with tz of GMT+5 or GMT-8?

If the tz interpretation changed as the server's tz changed, the above strategy will be incorrect to handle this. And the tz might be handled in such way:

Timestamp32(None)      => to_some_part(array)      // parse the array with default tz provided by TB,
Timestamp32(Some(Utc)) => to_some_part(array, Utc) // parse the array with Utc (which is the current behavior of A-DF),
Timestamp32(Some(tz))  => to_some_part(array, tz)  // for other tz

I have no experience on using CH. What do you think of this problem?

jinmingjian · 2021-07-03T00:41:22Z

@whjpji all considerations are great! It is not necessary to keep 100% same to CH, but keep possible compatibility still wanted.

I think the extra memory cost will not be so large since the type is stored only in the metadata.

yeah, try it.

if the server restarted with the tz changed (saying GMT+5), should the timestamp stored in A be interpreted as time with tz of GMT+5 or GMT-8?

if the column's (A's) type also has no tz, the timestamp stored in A be interpreted as time with tz of GMT+5 in kinds of datetime calculations, but the timestamp itself is always unix timestamp/epoch in CH.

Timezone agnostic unix timestamp is stored in tables, and the timezone is used to transform it to text format or back during data import/export or to make calendar calculations on the values (example: toDate, toHour functions et cetera).

Timestamp32(Some(Utc)) => to_some_part(array, Utc) // parse the array with Utc (which is the current behavior of A-DF),

this treatment is unnecessary as above mentioned.

jinmingjian · 2021-07-05T02:15:18Z

We have committed a primary solution in #166. This makes the timezone behavior of TB compatible with that of CH. This behavior may be improved in the future. For now, I just close this issue.

jinmingjian added type/bug bug good first issue labels Apr 21, 2021

jinmingjian added this to the 2021.05 tpch game milestone Apr 21, 2021

jinmingjian added the component/engine label Apr 21, 2021

jinmingjian removed this from the 2021.05 tpch game milestone May 10, 2021

jinmingjian mentioned this issue Jun 28, 2021

support CH functions for dates and times #131

Open

51 tasks

frank-king mentioned this issue Jul 3, 2021

basic support of default timezone #166

Merged

jinmingjian closed this as completed Jul 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

timezone of DateTime are not correctly processed #27

timezone of DateTime are not correctly processed #27

jinmingjian commented Apr 21, 2021

frank-king commented Jun 29, 2021

jinmingjian commented Jun 29, 2021

frank-king commented Jun 30, 2021

jinmingjian commented Jun 30, 2021

frank-king commented Jun 30, 2021 •

edited

jinmingjian commented Jul 1, 2021

jinmingjian commented Jul 1, 2021

frank-king commented Jul 1, 2021

frank-king commented Jul 2, 2021 •

edited

frank-king commented Jul 2, 2021

jinmingjian commented Jul 2, 2021 •

edited

jinmingjian commented Jul 2, 2021

frank-king commented Jul 2, 2021

jinmingjian commented Jul 2, 2021

jinmingjian commented Jul 2, 2021 •

edited

frank-king commented Jul 2, 2021

frank-king commented Jul 2, 2021 •

edited

jinmingjian commented Jul 3, 2021 •

edited

jinmingjian commented Jul 5, 2021

timezone of DateTime are not correctly processed #27

timezone of DateTime are not correctly processed #27

Comments

jinmingjian commented Apr 21, 2021

frank-king commented Jun 29, 2021

jinmingjian commented Jun 29, 2021

frank-king commented Jun 30, 2021

jinmingjian commented Jun 30, 2021

frank-king commented Jun 30, 2021 • edited

jinmingjian commented Jul 1, 2021

jinmingjian commented Jul 1, 2021

frank-king commented Jul 1, 2021

frank-king commented Jul 2, 2021 • edited

frank-king commented Jul 2, 2021

jinmingjian commented Jul 2, 2021 • edited

jinmingjian commented Jul 2, 2021

frank-king commented Jul 2, 2021

jinmingjian commented Jul 2, 2021

jinmingjian commented Jul 2, 2021 • edited

frank-king commented Jul 2, 2021

frank-king commented Jul 2, 2021 • edited

jinmingjian commented Jul 3, 2021 • edited

jinmingjian commented Jul 5, 2021

frank-king commented Jun 30, 2021 •

edited

frank-king commented Jul 2, 2021 •

edited

jinmingjian commented Jul 2, 2021 •

edited

jinmingjian commented Jul 2, 2021 •

edited

frank-king commented Jul 2, 2021 •

edited

jinmingjian commented Jul 3, 2021 •

edited