-
-
Notifications
You must be signed in to change notification settings - Fork 626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 1787 extended stats #2247
Issue 1787 extended stats #2247
Conversation
@PSeitz can you review? |
src/aggregation/metric/stats.rs
Outdated
pub struct IntermediateStats { | ||
stats: IntermediateExtendedStats, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the memory overhead for e.g. terms -> avg aggregation is too high
Performance may also be impacted
cargo +nightly bench --features unstable aggregation::agg_bench
This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same. |
Just as a heads up, please no macros. Generics would be fine, if the complexity stays low. Otherwise we just can keep it seperate |
Some notes:
|
Hello, I'd like to know how to proceed with this pull request, it has been pending for a while. |
Sorry I forgot about this PR. It's fine to ping occasionally. Can you resolve the merge conflicts? |
Merge is done, do you have an example for benchmark that also reports memory consumption? I ran |
Thanks! memory reporting is coming with this PR: |
Hi, quick update: I added an extended stats bench in the corresponding file, my fork is merged with main; I'm waiting for #2378 to be merged into main before aligning with with it and then pushing it to this pr. |
new version merged with main plus an update bench for extendedstats |
The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using it
|
Hi, I dug into the issue and there 2 reason for the additional memory consumption:
Should I revert even the modification on stats in order to restore the previous memory usage? Thanks |
src/aggregation/metric/stats.rs
Outdated
/// they extends stats adding variance, standard deviation | ||
/// and bound informations | ||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] | ||
pub struct ExtendedStats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the extended stats stuff to extended_stats.rs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I need to change IntermediateStats
visibility members and his collect
method. if it's ok, I can refactor, do you prefer me adding accessor methods o declaring the fields as pub
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add pub(crate)
src/aggregation/metric/stats.rs
Outdated
} | ||
/// Merges the other stats intermediate result into self. | ||
pub fn merge_fruits(&mut self, other: IntermediateExtendedStats) { | ||
if other.intermediate_stats.count != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be an early exit not nesting everything
src/aggregation/metric/stats.rs
Outdated
/// Merges the other stats intermediate result into self. | ||
pub fn merge_fruits(&mut self, other: IntermediateExtendedStats) { | ||
if other.intermediate_stats.count != 0 { | ||
if self.intermediate_stats.count == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be an mem::replace + early exit and not nesting everything
src/aggregation/metric/stats.rs
Outdated
if other.intermediate_stats.count == 1 { | ||
self.collect(other.intermediate_stats.sum); | ||
} else if self.intermediate_stats.count == 1 { | ||
let sum = self.intermediate_stats.sum; | ||
self.intermediate_stats = other.intermediate_stats; | ||
self.sum_of_squares = other.sum_of_squares; | ||
self.sum_of_squares_elastic = other.sum_of_squares_elastic; | ||
self.mean = other.mean; | ||
self.delta_sum_for_squares_elastic = other.delta_sum_for_squares_elastic; | ||
self.collect(sum); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these cases are way to special and should be handled by the generic version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I don't understand it, I try to better clarify the current situation.
I created a version with a new segment collector for extended stats which reduces the extra memory increase, but I did not pushed yet, because I'm waiting for a response about kahan summation. In the new version the generic component is gone, so I cannot apply the change.
If you were referring to some other modification I ask you for some more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we can use kahan summation.
What I mean is that these branches, they complexity they add, instead using the version that can handle all counts, needs to be justified.
if other.intermediate_stats.count == 1 {
self.collect(other.intermediate_stats.sum);
} else if self.intermediate_stats.count == 1 {
let sum = self.intermediate_stats.sum;
self.intermediate_stats = other.intermediate_stats;
self.sum_of_squares = other.sum_of_squares;
self.sum_of_squares_elastic = other.sum_of_squares_elastic;
self.mean = other.mean;
self.delta_sum_for_squares_elastic = other.delta_sum_for_squares_elastic;
self.collect(sum);
src/aggregation/metric/stats.rs
Outdated
let max = if self.intermediate_stats.count == 0 { | ||
None | ||
} else { | ||
Some(self.intermediate_stats.max) | ||
}; | ||
let avg = if self.intermediate_stats.count == 0 { | ||
None | ||
} else { | ||
Some(self.mean) | ||
}; | ||
let sum_of_squares = if self.intermediate_stats.count == 0 { | ||
None | ||
} else { | ||
Some(self.sum_of_squares_elastic) | ||
}; | ||
let variance = if self.intermediate_stats.count <= 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these different but same ifs are error prone and should be collapsed into one condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, my code was inspired by the IntermediateStats one
tantivy/src/aggregation/metric/stats.rs
Lines 119 to 133 in aa26ff5
let min = if self.count == 0 { | |
None | |
} else { | |
Some(self.min) | |
}; | |
let max = if self.count == 0 { | |
None | |
} else { | |
Some(self.max) | |
}; | |
let avg = if self.count == 0 { | |
None | |
} else { | |
Some(self.sum / (self.count as f64)) | |
}; |
should I refactor that one as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also not nice, but there are all the same check so it's easy to proofread
Hi, thanks a lot for your review, this submission contains the code modifications. |
(left_val, right_val, epsilon_val) => { | ||
let diff = (left_val - right_val).abs(); | ||
|
||
if diff > *epsilon_val { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the code above call this part with epsilon 0.0005
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I got it. Are asking for a change of epsilon in extended_stats tests or for a change in the macro using a fixed value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the macro has two code paths, one with a parameter and one with epsilon 0.0005
. The 0.0005
path can call the variant with the parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done, thank you.
sum_of_squares_elastic: 0.0, | ||
delta_sum_for_squares_elastic: 0.0, | ||
mean: 0.0, | ||
sigma: 2.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document why this is 2.0 ?
sum_of_squares_elastic: f64, | ||
/// delta for sum of squares as computed by elastic search needed for the Kahan algorithm | ||
delta_sum_for_squares_elastic: f64, | ||
// The mean an intermediate value need for calculating the variance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment seems broken
/// The variance of the fast field values. `None` if count is less then 2. | ||
pub variance: Option<f64>, | ||
/// The variance population of the fast field values, always equal to variance. `None` if count | ||
/// is less then 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
less than
if count is 0 or 1
is maybe easier to read
"std_deviation" => Ok(self.std_deviation), | ||
"std_deviation_sampling" => Ok(self.std_deviation_sampling), | ||
"std_deviation_population" => Ok(self.std_deviation_population), | ||
"std_deviation_bounds.lower" => Ok(self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should match the elastic search name, not the internal name of the property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I'm missing something, but the names seem to match the ElasticSearch ones https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html
variance, | ||
variance_population: variance, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are there two fields with the same value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically because of ElasticSearch implementation, the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html) states: The std_deviation and variance are calculated as population metrics so they are always the same as std_deviation_population and variance_population respectively.
}; | ||
let std_deviation = variance.map(|v| v.sqrt()); | ||
let std_deviation_sampling = variance_sampling.map(|v| v.sqrt()); | ||
let std_deviation_bounds = if std_deviation.is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use if let Some(std_deviation, std_deviation_sampling) =
instead of is_some
+ unwrap?
This version should reflect the changes you requested with the exception of the stat labels; they seem to me the same as ElasticSearch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! nice job
Here is a pull request for #1787
Some notes:
assert_nearly_equals
which accepts an extra parameter specifying the max allowed difference between the two values