Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chronograf query does not return all records from per-program-timing table #19789

Closed
tao-stones opened this issue Sep 11, 2021 · 7 comments · Fixed by #20623
Closed

Chronograf query does not return all records from per-program-timing table #19789

tao-stones opened this issue Sep 11, 2021 · 7 comments · Fixed by #20623
Assignees

Comments

@tao-stones
Copy link
Contributor

Problem

Validator calls datapoint_info() to insert multiple records to same table within a millisecond, when querying InfluxDn, only the last insertion in query result.

For example, from the log of validator that reports metrics to tds:

[2021-09-10T17:40:23.785238238Z INFO  solana_metrics::metrics] datapoint: per_program_timings slot=93422612i pubkey="Vote111111111111111111111111111111111111111" execute_us=171398i accumulated_units=0i count=1102i
[2021-09-10T17:40:23.785260354Z INFO  solana_metrics::metrics] datapoint: per_program_timings slot=93422612i pubkey="8tfDNiaEyrV6Q1U4DEXrEigs9DoDtkugzFbybENEbCDz" execute_us=4153i accumulated_units=5023i count=2i
[2021-09-10T17:40:23.785265142Z INFO  solana_metrics::metrics] datapoint: per_program_timings slot=93422612i pubkey="all" execute_us=175551i accumulated_units=5023i count=1104i

within same millisecond, it added 3 records to pre_program_timings for slot 93422612 with different values.

when run query in chrnograf:

SELECT * FROM "tds"."autogen"."per_program_timings" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='C655ertkV4BUt9Cso2HVc5vQ2qoV4MKdktXZSVUPyK5Y' AND slot = 93422612;

It only returns the last insertion.

The question is if all three records were added to InfluxDb but query only returns last one, or the database is gated somehow only the last insertion added to DB.

Proposed Solution

@tao-stones
Copy link
Contributor Author

@joeaba some background, I noticed this when I am querying InfluxDB to get per_program_timing records for BPF programs (as part of issue #19627 ), noticed the query results not exactly match with log. If you need reproduce it, I think it is not hard, let me know. Thanks!

@joeaba
Copy link
Contributor

joeaba commented Sep 14, 2021

@axleiro can you please help me take a look? The data points exist so we should be able to query them

@ryoqun
Copy link
Member

ryoqun commented Sep 19, 2021

@taozhu-chicago this is due to our ./metrics code is truncating to ms granularity.... maybe we should fix this. I stumbled on this as well for other metrics. Firstly, we have to know why the code isn't doing ns granularity (influxdb's default behavior).

@denispalab
Copy link
Contributor

Ryo is right here, after some R&D I found that the actual data inside the database regarding that particular “slot no.” is not 3 but only a single entry.. so the query is actually correctly fetching all the data that is present in database. We might have several cases like this

@tao-stones
Copy link
Contributor Author

Thanks @ryoqun @axleiro for looking into it. Would be nice to keep Influx's default ns granularity, or at least at us is storage is an issue.

behzadnouri added a commit to behzadnouri/solana that referenced this issue Oct 12, 2021
Current datapoint_info! is apparently overwriting itself when run inside
a loop. For example in
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107
only one of the slots will show up in influxdb.

This is apparently because of metrics code using milliseconds as the
timestamp, as mentioned here:
solana-labs#19789 (comment)
behzadnouri added a commit to behzadnouri/solana that referenced this issue Oct 12, 2021
Current datapoint_info! is apparently overwriting itself when run inside
a loop. For example in
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107
only one of the slots will show up in influxdb.

This is apparently because of metrics code using milliseconds as the
timestamp, as mentioned here:
solana-labs#19789 (comment)
@behzadnouri
Copy link
Contributor

I sent out #20623 to switch to nanoseconds, as this problem is impacting this metric as well:
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107

behzadnouri added a commit that referenced this issue Oct 13, 2021
#20623)

Current datapoint_info! is apparently overwriting itself when run inside
a loop. For example in
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107
only one of the slots will show up in influxdb.

This is apparently because of metrics code using milliseconds as the
timestamp, as mentioned here:
#19789 (comment)
mergify bot pushed a commit that referenced this issue Oct 13, 2021
#20623)

Current datapoint_info! is apparently overwriting itself when run inside
a loop. For example in
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107
only one of the slots will show up in influxdb.

This is apparently because of metrics code using milliseconds as the
timestamp, as mentioned here:
#19789 (comment)

(cherry picked from commit cd87525)
mergify bot added a commit that referenced this issue Oct 13, 2021
#20623) (#20659)

Current datapoint_info! is apparently overwriting itself when run inside
a loop. For example in
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107
only one of the slots will show up in influxdb.

This is apparently because of metrics code using milliseconds as the
timestamp, as mentioned here:
#19789 (comment)

(cherry picked from commit cd87525)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
dankelleher pushed a commit to identity-com/solana that referenced this issue Nov 24, 2021
solana-labs#20623)

Current datapoint_info! is apparently overwriting itself when run inside
a loop. For example in
https://github.com/solana-labs/solana/blob/005d6863f/core/src/window_service.rs#L101-L107
only one of the slots will show up in influxdb.

This is apparently because of metrics code using milliseconds as the
timestamp, as mentioned here:
solana-labs#19789 (comment)
@github-actions
Copy link
Contributor

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants