Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New metrics #3

Merged
merged 10 commits into from
Dec 15, 2016
Merged

New metrics #3

merged 10 commits into from
Dec 15, 2016

Conversation

mattevans
Copy link
Collaborator

This change removes avg as a metric and splits it out into three new metrics (mean, median and mode). Also implemented is sum, cardinality and stdev

@mattevans mattevans self-assigned this Dec 13, 2016
Copy link
Owner

@snikch snikch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes to be made.

I have to apologise for not leading by example with the comments. I dropped the ball in the measurers. It'd be good to add some comments in the appropriate format for each of the measurers.

return nil
}

numbers := []float64{}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a couple of questions about the mode.

  1. Do we need to do two loops? Looks like you could cover this all in a single loop.
  2. Why do we return an empty loop if the modes are all of the numbers / one? I haven't dealt with modes too much but wouldn't it still be relevant to have the only possible value returned regardless?

If you could add some comments re: mode that'd be good too - in fact we should quickly explain what each of the averages are - I always get them mixed up, so it'd be nice not to have to think too hard.

@@ -91,3 +183,76 @@ func (a *max) Result() interface{} {
result, _ := a.amount.Float64()
return result
}

// Cadinality
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling.

}

func (a *cardinality) AddDatum(datum interface{}) {
a.size++
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cardinality should be the total unique elements added, so this should really be a map[interface{}]bool where the values are comparable, and the Result should call len(that_map).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can't just use interface{} because it doesn't consider a == b, then perhaps make an actual interface

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could just do a known type switch, i.e

switch t := datum.(type) {
  case string:
    a.values[t]++
…
}

// Even slice? Get the mean of the middle values.
if len(a.list)%2 == 0 {
prev, _ := a.list[middle-1].Float64()
median = (median + prev) / 2
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth doing decimal math here.

sort.Sort(decimalSortNumerical(a.list))

// Determine median value.
middle := len(a.list) / 2
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably do this in an else below, rather than double working.

@@ -245,8 +246,22 @@ func (p *queryProcessor) measure() {
// Now add all of the data to the measurer.
for j := range bucket.sourceRows {
row := bucket.sourceRows[j]

// If measurer is cardinality, we can +1 not worrying about field value.
if (reflect.TypeOf(m) == reflect.TypeOf(&cardinality{})) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should do this for a type check instead:if _, ok := m.(*cardinality); ok {

But.... this doesn't really apply, since cardinality should be unique, I'm not sure this shortcut works any more.

@mattevans
Copy link
Collaborator Author

mattevans commented Dec 14, 2016

Thanks @snikch - Really appreciate it.

  • Added commenting
  • Fixed smaller issues (decimal math, removed extra loop, etc)
  • cardinality metric now actually calculates the cardinality. 👍
  • Added valueCount metric, which is exactly that.

Questions on mode metric:

Why do we return an empty loop if the modes are all of the numbers / one? I haven't dealt with modes too much but wouldn't it still be relevant to have the only possible value returned regardless?

I was always under the impression mode was the value(s) that occur/repeat most often within a set. If all values have the same number of occurrences, then they aren't technically a mode? After a quick google, some say they should be considered mode, others say they shouldn't. Now I'm not so sure. hahaha. Thoughts? I'll ask Lance if he has a preference for either option.

Examples:

[1,1,2,3,4,5,5,5]
Mode = 5
[1,1,1,2,3,3,3]
Mode = [1,3]
[1,1,1,2,2,2,3,3,3]
Mode = nil
[1,2,3,4,5]
Mode = nil

If the dataset is one (referenced as tip in the code), then it means no repeating values were found. Now that I've added comments (and cleaned it up), it should make a bit more sense. Sorry about that.

In fact we should quickly explain what each of the averages are - I always get them mixed up, so it'd be nice not to have to think too hard.

Added comments to each metric.


Metricable string values

With cardinality (and now valueCount) measurers, these needed to be run against StringCells. You pointed out the shortcut wasn't going to work once cardinality was implemented correctly. That shortcut is gone!

To resolve, I'm passing the measurer{} to IsMetricable() (see here).

Previously, the type of cell determined if the field was metricable. Passing the measurer, allows us to run certain metrics against certain cells. As it stands, only needed for StringCells, but could be handy for different metrics in the future.

Do you think this is an alright solution? Maybe I should define what measurers can be used on what cells outside of each XCell implementation?

@mattevans
Copy link
Collaborator Author

mattevans commented Dec 15, 2016

Going to merge this after our discussion/look over last night.
Can make further adjustments if needed.

Hope your Xmas party is in full swing!

3880216

@mattevans mattevans merged commit 1f5dc61 into master Dec 15, 2016
@mattevans mattevans deleted the new-metrics branch December 15, 2016 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants