New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: limit queries to ds #2446
Conversation
34ce6ab
to
2b690e8
Compare
Codecov ReportBase: 38.56% // Head: 38.67% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #2446 +/- ##
==========================================
+ Coverage 38.56% 38.67% +0.11%
==========================================
Files 166 166
Lines 36636 36674 +38
==========================================
+ Hits 14127 14182 +55
+ Misses 21595 21576 -19
- Partials 914 916 +2
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
947650e
to
81306fd
Compare
for _, ds := range dsList { | ||
unprocessedJobs, err := jd.getUnprocessedJobsDS(ctx, ds, true, params) | ||
if dsLimit > 0 && dsQueryCount >= dsLimit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be useful to have another counter stat here to know how often we are reaching limits our
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the default max dsLimit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default is no limit
@BonapartePC let's add a test verifying this new jobsDB configuration option |
7350900
to
9b13f5f
Compare
return res.Jobs | ||
} | ||
|
||
t.Run("Test cache with ds limit as one", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
bb118cb
to
4cf6599
Compare
4cf6599
to
3647c0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @BonapartePC for addressing all comments!
dsQueryCount := 0 | ||
var dsLimit int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] for consistency and staying idiomatic
dsQueryCount := 0 | |
var dsLimit int | |
var dsQueryCount int | |
var dsLimit int |
if dsHit { | ||
dsQueryCount++ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only increase dsQueryCount
when a dsHit takes place, and the decision to stop is based on that.
What if dsQueryCount:
- was measured regardless of queries hit ?
- we stopped early once the dsQueryCount was reached and we had non-zero data.
Letting
The current approach makes sense, but I feel it doesn't penalise queries early enough when we have sparse entries.
Assume we have 200 tables and data are only on in table 3:
- first one
- one in the middle
- one in the end
Now, this algorithm will scan all 200 tables twice until the cache kicks in and avoid the lookup in some tables.
With the approaches I am suggesting, it will require 5 queries for the tables to be scanned, but those queries will be bounded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lvrach a dsHit
happens whenever we query the dataset, regardless if the query returns results or not.
The only occasions where dsHit
will be returned as false
are:
- If we have a cache hit (no jobs & no query)
- An error is returned for whatever reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, my bad it wasn't clear from the diff.
A small comment about what getProcessedJobsDS
returns would be useful here.
Description
Two changes are made in this PR:
Notion Ticket
https://www.notion.so/rudderstacks/Add-stats-for-the-number-of-DS-per-query-41c957dab7eb414397cd7653a7c65701
https://www.notion.so/rudderstacks/Limit-the-number-of-DS-queries-per-jobsdb-Get-4db41289310940a8b49ae2af7b36f6f3
Security