Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timezone of DateTime are not correctly processed #27

Closed
jinmingjian opened this issue Apr 21, 2021 · 19 comments
Closed

timezone of DateTime are not correctly processed #27

jinmingjian opened this issue Apr 21, 2021 · 19 comments

Comments

@jinmingjian
Copy link
Contributor

the logic in arrow/DF ignores timezone, but CH re-interpret the presentation according to the server's timezone.

@frank-king
Copy link
Contributor

Have read the documentation of CH [1], I found that the default timezone of a CH server is configurable. Has TB supported this yet? If yes, we might need to consider how to pass this parameter to DF.

Another question, from [2] we know that DateTime in CH also supports an optional timezone. The next step might be to enable timezone support BqlType. There will also be some refactors of the CH date/time functions in DF to support timezone.

[1] https://clickhouse.tech/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-timezone
[2] https://clickhouse.tech/docs/en/sql-reference/data-types/datetime/

@jinmingjian
Copy link
Contributor Author

ok, nice summary! All we want to have:)

We have auto-discovered default tz in mgmt.rs which is easy to extend a configurable option in conf. But this is not high priority. I list the tasks in a priority (welcome to add or recorrect):

priority high:

  • correct process date/time functions with default tz (in this, we figure out a good way to handle the tz in TB)
  • DateTime with tz in meta
  • date/time functions with tz

priority low(not necessarily solved in near future):

  • support DateTime with tz
  • configurable tz in conf

@frank-king
Copy link
Contributor

To support the default tz detected by TB, we should correctly pass the tz configuration to DF right, I think. And that might be the first thing to do. Then in the future, when we are planning to support configurable tz in conf, its easy to modify only the tz detecting process mgmt.rs to achieve this.

@jinmingjian
Copy link
Contributor Author

@whjpji right. The main thing is how to make DF operations tz aware.

@frank-king
Copy link
Contributor

frank-king commented Jun 30, 2021

@jinmingjian I got stuck, could you please give me some advises about how to represent the default tz in DF?

Should I pass the default tz via a default catalog or schema (or maybe information schema? I'm not very familiar with these concepts) into DF? i.e. In fn engine::datafusion::run [1], pass by adding an ExecutionConfig like

    let config = ExecutionConfig::new()
        .create_default_catalog_and_schema(true)
        .with_default_catalog_and_schema(/* ... */);
    let mut ctx = ExecutionContext::with_config(config);

I'm afraid this might bring extra cost if the engine rapidly access this parameter when doing query.

Another way might be using a SyncOnceCell in DF and passing the tz into DF at the initialization of TB, but I'm afraid it is not easy to implement the static value passing across crates.

Which way do you prefer? Or do you have any better ideas?

[1]

let mut ctx = ExecutionContext::new();

@jinmingjian
Copy link
Contributor Author

@whjpji thanks for your exploring. For your passing problem, this is unimportant for whole big task.

What we should figure out this how or does A-DF correctly process the tz info? This answer is important, in that I see rare tz handling logic in A-DF. Another talking, for example, is this one recent issue in Arrow. If A-DF does not respect the tz well, then we need to find a way to solve this problem by ourself. That's the biggest problem.

I'm afraid this might bring extra cost if the engine rapidly access this parameter when doing query.

We first figure out how "the engine access this parameter", then we can understand the cost of this accessing.

@jinmingjian
Copy link
Contributor Author

@whjpji more thoughts:
to correctly support the tz is not hard. A tz offset is enough. The question is just mentioned above: do we need an extra tz offset parameter everywhere? For our TB own builitins, we can support an offset parameter. But the question is just mentioned above as well: If A-DF can handling this tz correctly, we just follow the A-DF convention.

@frank-king
Copy link
Contributor

@jinmingjian I agree with you. My mind graph is "pass tz to DF, and then make DF handle tz right", so the first thing came up with my mind was "pass tz to DF". However the latter thing is more important, so my thoughts were too straight.

I will read codes of A-DF more carefully and do some experiments on it, to find out whether it can handle the tz right, and if yes, then how.

@frank-king
Copy link
Contributor

frank-king commented Jul 2, 2021

@jinmingjian Unfortunately, arrow does not support tz yet. The codes show that:

We have to support tz ourselves, by making changes in A-DF. However, unlike #154, we cannot make all the changes in a dedicated mod, and these changes might be strongly coupled with all parts of A-DF.

Should we directly make PRs on the arrow-rs repo and backport them to TB? I think It is better if these PRs can keep track with and be reviewed by the arrow's community.

What do you think of this question?

[1]

DataType::Timestamp32(_) => Some(temporal_conversions::timestamp_s_to_datetime(v)),

[2]

pub fn value_as_datetime(&self, i: usize) -> Option<NaiveDateTime> {

[3]

| &DataType::Timestamp(_, _) => {

[4]

DataType::Timestamp32(_) => {

@frank-king
Copy link
Contributor

Oh, my last comment seemed to talk about another question: arrow does not support tz in date/time types.

Return back to our original question: how to make arrow support a global tz offset based on the server's tz or in the future the tz conf from the configuration file.

To make TB compatible with CH, not only the tz in date/time types should not be ignored, but also there should be a global tz offset stored somewhere. In which way do you think is better to introduce this global tz offset? Via the system schema or a global static variable somewhere?

@jinmingjian
Copy link
Contributor Author

jinmingjian commented Jul 2, 2021

@whjpji If you confirm that A-DF does not support tz, I suggest we firstly try to patch at the planner side.

Patch the TB own Timestamp32 firstly: if we meet a Timestamp32, we patch that column with +/- global tz offset from mgnt as an expression in planning. How do you think so?:)

@jinmingjian
Copy link
Contributor Author

@whjpji correct: you can not use tz from mgnt, otherwise there is an inter-dependence. You setup a global tz in DF, and sync mgnt's tz to that in DF at boot or some time.

@frank-king
Copy link
Contributor

@jinmingjian I have confirmed that A-DF does not support tz. First by default, my timezone is Etc/GMT+8 (Asia/Shanghai), and I create a table and insert a timestamp into it:

TensorBase :)  create table test_datetime (d DateTime) engine = BaseStorage;

CREATE TABLE test_datetime
(
    d DateTime
)
ENGINE = BaseStorage

Ok.

0 rows in set. Elapsed: 0.020 sec. 

TensorBase :) insert into test_datetime values (cast('2021-07-02 22:11:56' as DateTime))

INSERT INTO test_datetime VALUES

Ok.

1 rows in set. Elapsed: 0.172 sec. 

TensorBase :) select * from test_datetime

SELECT *
FROM test_datetime 

┌───────────────────d─┐
│ 2021-07-02 22:11:56 │
└─────────────────────┘

1 rows in set. Elapsed: 1.586 sec. 

Then I set my timezone to Etc/GMT-5 (America/Los_Angeles), and requery the timestamp

TensorBase :) select * from test_datetime

SELECT *
FROM test_datetime 

┌───────────────────d─┐
│ 2021-07-02 22:11:56 │
└─────────────────────┘

1 rows in set. Elapsed: 0.020 sec. 

I got the same answer. However, the timestamp displayed should vary with different server timezones, which is not expected CH's behavior.

I will try fixing it in this weekend.

@jinmingjian
Copy link
Contributor Author

@whjpji your this method does not confirm the A-DF behavior. Because the back type of DateTime is Timestamp32, which is TB new-added type. The good way is, you use/change a unit test in DF to confirm.

A-DF may ignore tz now in computation as I and you see, but the true concern is if there are some preset slot in the temporal fn to enable the tz without patching. For a idea:

Timestamp32(_) => to_some_part(array)

this ignore the tz, but you can tweak the logic into

Timestamp32(tz) => to_some_part(array, tz)

@jinmingjian
Copy link
Contributor Author

jinmingjian commented Jul 2, 2021

@whjpji if my idea is established, you can tweak the Timestamp32(Option< String >) -> Timestamp32(Option< Int32 >) something like for performance, because it is not necessary to parse the offset from string for every function call.

@frank-king
Copy link
Contributor

@jinmingjian Thanks for your correction, I didn't know that Timestamp32 is new added by TB.

  1. for the first question,

A-DF may ignore tz now in computation as I and you see, but the true concern is if there are some preset slot in the temporal fn to enable the tz without patching.

I agree with you, and we can tweek the Timestamp32(None) to be Timestamp32(Some(tz)) in the column type's metadata during the preparation of the schemes before query, where tz is the default tz provided by TB.

  1. for the second question,

you can tweak the Timestamp32(Option< String >) -> Timestamp32(Option< Int32 >) something like for performance, because it is not necessary to parse the offset from string for every function call.

I think it is not enough to store only an offset in the data type, because the tz name is also needed in some functions. Here is an example from the CH's document:

SELECT toDateTime('2019-01-01 00:00:00', 'UTC') AS time_utc,
    toTypeName(time_utc) AS type_utc,
    toInt32(time_utc) AS int32utc,
    toTimeZone(time_utc, 'Asia/Yekaterinburg') AS time_yekat,
    toTypeName(time_yekat) AS type_yekat,
    toInt32(time_yekat) AS int32yekat,
    toTimeZone(time_utc, 'US/Samoa') AS time_samoa,
    toTypeName(time_samoa) AS type_samoa,
    toInt32(time_samoa) AS int32samoa
FORMAT Vertical;

And the result will be:

Row 1:
──────
time_utc:   2019-01-01 00:00:00
type_utc:   DateTime('UTC')
int32utc:   1546300800
time_yekat: 2019-01-01 05:00:00
type_yekat: DateTime('Asia/Yekaterinburg')
int32yekat: 1546300800
time_samoa: 2018-12-31 13:00:00
type_samoa: DateTime('US/Samoa')
int32samoa: 1546300800

If only the tz offset is stored in the metadata, the column type_yekat will be DateTime(18000) instead. Even if we reconstruct the timezone name from the offset 18000, we probably get something like DateTime('+05:00') or DateTime('Etc/GMT-5'), which dose not meet the user's expectation.

What if we make the Timestamp32(Option<String>) to Timestamp32(Option<(String, i32)>)? I think the extra memory cost will not be so large since the type is stored only in the metadata.


BTW, all the changes made to Timestamp32 might also be applied on Timestamp of A-DF, since CH also support DateTime64.

@frank-king
Copy link
Contributor

frank-king commented Jul 2, 2021

But there is still a problem. If no tz specified when inserting into a table (named A) and the column's type also has no tz, CH/TB should re-interpret the time with a default tz (saying GMT-8) of the server.

But if the server restarted with the tz changed (saying GMT+5), should the timestamp stored in A be interpreted as time with tz of GMT+5 or GMT-8?

If the tz interpretation changed as the server's tz changed, the above strategy will be incorrect to handle this. And the tz might be handled in such way:

Timestamp32(None)      => to_some_part(array)      // parse the array with default tz provided by TB,
Timestamp32(Some(Utc)) => to_some_part(array, Utc) // parse the array with Utc (which is the current behavior of A-DF),
Timestamp32(Some(tz))  => to_some_part(array, tz)  // for other tz

I have no experience on using CH. What do you think of this problem?

@jinmingjian
Copy link
Contributor Author

jinmingjian commented Jul 3, 2021

@whjpji all considerations are great! It is not necessary to keep 100% same to CH, but keep possible compatibility still wanted.

I think the extra memory cost will not be so large since the type is stored only in the metadata.

yeah, try it.

if the server restarted with the tz changed (saying GMT+5), should the timestamp stored in A be interpreted as time with tz of GMT+5 or GMT-8?

if the column's (A's) type also has no tz, the timestamp stored in A be interpreted as time with tz of GMT+5 in kinds of datetime calculations, but the timestamp itself is always unix timestamp/epoch in CH.

Timezone agnostic unix timestamp is stored in tables, and the timezone is used to transform it to text format or back during data import/export or to make calendar calculations on the values (example: toDate, toHour functions et cetera).

Timestamp32(Some(Utc)) => to_some_part(array, Utc) // parse the array with Utc (which is the current behavior of A-DF),

this treatment is unnecessary as above mentioned.

@jinmingjian
Copy link
Contributor Author

We have committed a primary solution in #166. This makes the timezone behavior of TB compatible with that of CH. This behavior may be improved in the future. For now, I just close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants