interpolated_average can't pull values through empty buckets #548
So what you'd like to do, I suppose, is be able to do something like:

```sql
SELECT
    signal_id,
    time_value,
    toolkit_experimental.interpolated_average(
        tws,
        time_value,
        '3600 s',
        LAG_IGNORE_NULLS(tws) OVER (PARTITION BY signal_id ORDER BY time_value),
        LEAD_IGNORE_NULLS(tws) OVER (PARTITION BY signal_id ORDER BY time_value)
    )
FROM (
    SELECT
        signal_id,
        time_bucket_gapfill('3600 s', ts) AS time_value,
        time_weight('LOCF', ts, measurement_value) AS tws
    FROM measurements
    WHERE ts BETWEEN '2020-01-01T00:00:00' AND '2020-01-01T05:00:00'
    AND signal_id = 9
    GROUP BY signal_id, time_value
) t
```

In the gapfill query, you'd end up with NULL values where there was no data, and then when you use the window functions, they'd go back to the last populated value to calculate the result and roll it forward. So this way works for now, but this would definitely be easier and should be something we work on at some point, as the 7-step approach feels a bit harder 😂. (But I'm glad it worked as a workaround!) Did I capture that correctly?

Another, possibly better option could be to work on getting a new interpolation function into gapfill so you could do it directly there; that might be the simplest way from a UX perspective, i.e.:

```sql
SELECT
    signal_id,
    time_bucket_gapfill('3600 s', ts) AS time_value,
    interpolated_average(time_weight('LOCF', ts, measurement_value), '3600 s')
FROM measurements
WHERE ts BETWEEN '2020-01-01T00:00:00' AND '2020-01-01T05:00:00'
AND signal_id = 9
GROUP BY signal_id, time_value
```

However, that will be more difficult to implement, and, at least as of now, would only work with
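The `LAG_IGNORE_NULLS` / `LEAD_IGNORE_NULLS` functions proposed above don't exist yet, but their intended semantics can be sketched in a few lines of Python (function names and behavior are assumptions based on the discussion, not an existing API): instead of returning the immediately preceding or following row, they would return the nearest non-NULL value, which is what lets the summary roll forward through empty gapfilled buckets.

```python
# Hypothetical sketch of the proposed LAG_IGNORE_NULLS / LEAD_IGNORE_NULLS
# semantics over a gapfilled column where empty buckets are None.

def lag_ignore_nulls(values):
    """For each row, return the last non-None value before it, else None."""
    out, last = [], None
    for v in values:
        out.append(last)
        if v is not None:
            last = v
    return out

def lead_ignore_nulls(values):
    """For each row, return the next non-None value after it, else None."""
    return lag_ignore_nulls(values[::-1])[::-1]

# Gapfilled hourly buckets: two empty buckets between readings.
tws = [10.0, None, None, 16.0, 20.0]
print(lag_ignore_nulls(tws))   # [None, 10.0, 10.0, 10.0, 16.0]
print(lead_ignore_nulls(tws))  # [16.0, 16.0, 16.0, 20.0, None]
```

A plain `LAG`/`LEAD` would return `None` next to an empty bucket, which is exactly why the workarounds below have to fill the empty buckets first.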
Thank you for your answer. Do I understand you correctly that the functions LAG_IGNORE_NULLS and LEAD_IGNORE_NULLS don't exist yet? Yes, so far it looks like our approach can capture the desired behavior. And yes, something like your last query would be very useful :-).
I have the same problem. Example:

```sql
WITH t1 AS (
    SELECT
        time_bucket_gapfill('60 s', time) AS dt,
        locf(last(value, time)) AS value
    FROM {storename}.sensor_data
    WHERE master_key = {sensor_id}
    AND time >= '{first_day}'
    AND time <= '{today}'
    GROUP BY dt
    ORDER BY dt
), t2 AS (
    SELECT
        time_bucket('15 minutes'::INTERVAL, dt, '{today}'::TIMESTAMPTZ) AS ts,
        time_weight('locf', dt, value) AS tw
    FROM t1
    GROUP BY ts
)
SELECT ts, average(tw)
FROM t2 ORDER BY ts
```
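For anyone unfamiliar with what `time_weight('locf', ...)` followed by `average()` computes, here is a minimal numeric sketch (not the toolkit's implementation): each observed value is held flat until the next observation, and the average is the integral of that step function divided by the covered duration.

```python
# Minimal sketch of an LOCF time-weighted average: value is carried
# forward unchanged until the next sample, then the mean is taken
# weighted by how long each value was in effect.

def locf_time_weighted_average(points):
    """points: list of (timestamp_seconds, value), sorted by time."""
    if len(points) < 2:
        raise ValueError("need at least two points to span an interval")
    total = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        total += v0 * (t1 - t0)  # value held flat (LOCF) until next point
    return total / (points[-1][0] - points[0][0])

# Value is 10 for 60 s, then 20 for 30 s -> (10*60 + 20*30) / 90 = 13.33...
pts = [(0, 10.0), (60, 20.0), (90, 30.0)]
print(locf_time_weighted_average(pts))
```

This also makes the bug visible: if a bucket contains no points at all, there is nothing to weight, so the bucket silently disappears unless a value is pulled in from a neighboring bucket.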
Yes, those don't exist; they're something we'd think about introducing in order to do this. Or the other thing; we'll have to see where the tradeoff falls for how we'd do that.
Unfortunately, this does lose some information because you're downsampling with last first. It's also probably a bit less efficient, because to compensate for that you're doing a smaller bucket with gapfill, which means you may need to process more data. But it could very well work for some folks! We'll see what we can do so that neither of these workarounds is necessary!
@davidkohn88
Looking at this again, I think I found a more efficient workaround (note: the original comment mixed several names for the gapfilled bucket column; it is unified to `bucket` here, and a `GROUP BY` is added to `t2` so the `time_weight` aggregate is valid):

```sql
WITH
t1 AS (
    SELECT
        time_bucket_gapfill('{resampled_interval} s', ts) AS bucket,
        signal_id,
        locf(last(measurement_value, ts)) AS locf_value,
        time_weight('LOCF', ts, measurement_value) AS tws
    FROM measurements
    WHERE ts BETWEEN '{fall_back_time}' AND '{datetime_to}'
    AND signal_id IN ({signals})
    GROUP BY signal_id, bucket
),
t2 AS (
    SELECT signal_id, bucket, time_weight('LOCF', bucket, locf_value) AS tws
    FROM t1
    WHERE tws IS NULL
    AND bucket >= '{datetime_from-datetime.timedelta(seconds=resampled_interval)}'
    GROUP BY signal_id, bucket
),
t3 AS (
    SELECT signal_id, bucket, tws FROM t1 WHERE tws IS NOT NULL
    UNION ALL
    SELECT signal_id, bucket, tws FROM t2
),
t4 AS (
    SELECT bucket, signal_id,
        toolkit_experimental.interpolated_average(
            tws,
            bucket,
            '{resampled_interval} s',
            LAG(tws) OVER (PARTITION BY signal_id ORDER BY bucket),
            LEAD(tws) OVER (PARTITION BY signal_id ORDER BY bucket)
        ) AS time_weighted_average
    FROM t3
)
SELECT bucket + '{resampled_interval} s',
    signal_id,
    time_weighted_average
FROM t4
WHERE bucket < '{datetime_to}'
AND bucket > '{datetime_from-datetime.timedelta(seconds=resampled_interval)}';
```

Was working on this with Tara; I think she's going to drop a slightly simpler version here as well, but I wanted to have this and see if it worked for you @emanuel-joos
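The core trick in the workaround above is splitting the gapfilled buckets into two sets: buckets whose `time_weight` summary is NULL (empty buckets) get a substitute summary built from the LOCF'd last value, and the two sets are then unioned before interpolating. A toy sketch of that idea, with all names hypothetical:

```python
# Hypothetical sketch of the t1/t2/t3 strategy: empty buckets (summary is
# None) are backfilled with the carried-forward LOCF value, then unioned
# with the populated buckets so no bucket is missing from the output.

def fill_empty_buckets(buckets):
    """buckets: list of (bucket_ts, summary_or_None, locf_value)."""
    populated = [(t, s) for t, s, _ in buckets if s is not None]
    # Empty buckets get the carried-forward value as a flat stand-in summary.
    filled = [(t, locf) for t, s, locf in buckets if s is None]
    return sorted(populated + filled)

rows = [(0, 10.0, 10.0), (1, None, 10.0), (2, None, 10.0), (3, 16.0, 16.0)]
print(fill_empty_buckets(rows))
# [(0, 10.0), (1, 10.0), (2, 10.0), (3, 16.0)]
```

After this step, every bucket has a value, so plain `LAG`/`LEAD` suffice and the `_IGNORE_NULLS` variants are no longer needed.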
I had the same issue. I have customer cellular data session info that streams in as irregularly sampled byte rates, and I often want to pull down the total amount of bytes used within a regular interval. I was able to write a super inefficient query using just the following:

```sql
WITH
-- transform to rate data uniform at 1-second intervals without gaps
rate_seconds AS (
    SELECT
        device_id,
        time_bucket_gapfill('1 s', time) AS time,
        locf(last(bytes_per_second, time)) AS bytes
    FROM public.sample_usage
    GROUP BY 1, 2
    ORDER BY 1, 2
)
-- downsample to intervals over 1 second & sum per-second rates
SELECT
    device_id,
    time_bucket('10 minute', time) AS time,
    SUM(bytes) AS bytes
FROM rate_seconds
GROUP BY 1, 2
ORDER BY 1, 2;
```

After working with @davidkohn88, he came up with the following workaround (the `time_weight` in `t2` needed its time argument added):

```sql
WITH t1 AS (
    SELECT
        device_id,
        time_bucket_gapfill('10 minute', time) AS time,
        locf(last(bytes_per_second, time)) AS locf_value,
        last(bytes_per_second, time) AS indicator_empty_bucket,
        time_weight('locf', time, bytes_per_second) AS bps
    FROM public.sample_usage
    GROUP BY 1, 2
),
t2 AS (
    SELECT
        device_id,
        time,
        time_weight('locf', time, locf_value) AS bps
    FROM t1
    WHERE indicator_empty_bucket IS NULL
    GROUP BY 1, 2
),
t3 AS (
    SELECT device_id, time, bps
    FROM t1
    WHERE indicator_empty_bucket IS NOT NULL
    UNION ALL
    SELECT device_id, time, bps FROM t2
)
SELECT
    device_id,
    time,
    toolkit_experimental.interpolated_integral(
        bps,
        time,
        '10 minute',
        LAG(bps) OVER (PARTITION BY device_id ORDER BY time),
        LEAD(bps) OVER (PARTITION BY device_id ORDER BY time),
        'seconds'
    )
FROM t3;
```
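The `interpolated_integral` call above converts a byte rate into total bytes per bucket. A minimal sketch of the underlying arithmetic for LOCF rate data (this is illustrative, not the toolkit's implementation): within a bucket, each rate is held flat until the next sample, and the last rate observed before the bucket (the `LAG` argument) is carried across the bucket's left edge.

```python
# Sketch of integrating an LOCF rate over one bucket, carrying the
# previous bucket's last rate across the bucket boundary.

def locf_integral(samples, bucket_start, bucket_end, prev_rate=None):
    """samples: sorted (timestamp_seconds, bytes_per_second) inside the bucket.
    prev_rate: last rate observed before bucket_start (the LAG in the query)."""
    total, t, rate = 0.0, bucket_start, prev_rate
    for ts, r in samples:
        if rate is not None:
            total += rate * (ts - t)  # carry previous rate up to this sample
        t, rate = ts, r
    if rate is not None:
        total += rate * (bucket_end - t)  # carry last rate to bucket end
    return total

# 10-minute bucket, one sample at t=120 s of 100 B/s, previous rate 50 B/s:
# 50 B/s for 120 s + 100 B/s for 480 s = 6000 + 48000 = 54000 bytes.
print(locf_integral([(120, 100.0)], 0, 600, prev_rate=50.0))  # 54000.0
```

Compared with the per-second gapfill query, this touches only the actual samples instead of materializing one row per second.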
Relevant system information:
Describe the bug
Time-weighted average fails if buckets without values exist. It simply ignores buckets without data (the output has fewer rows than expected), and for buckets with data that are adjacent to empty buckets it computes wrong averages.
To Reproduce
Workaround
with:
This should work out of the box. The workaround query also works for multiple signals.