Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram with two values causes failure #24

Closed
schneems opened this issue Jan 6, 2020 · 2 comments · Fixed by #28
Closed

Histogram with two values causes failure #24

schneems opened this issue Jan 6, 2020 · 2 comments · Fixed by #28

Comments

@schneems
Copy link
Contributor

schneems commented Jan 6, 2020

# fails
require 'stringio'
require 'unicode_plot'

values = [1, 2]

plot = UnicodePlot.histogram(values, title: "Histogram")
plot.render($stdout)

Gives me:

Traceback (most recent call last):
        7: from /Users/rschneeman/.rubies/ruby-2.7.0/bin/irb:23:in `<main>'
        6: from /Users/rschneeman/.rubies/ruby-2.7.0/bin/irb:23:in `load'
        5: from /Users/rschneeman/.rubies/ruby-2.7.0/lib/ruby/gems/2.7.0/gems/irb-1.2.1/exe/irb:11:in `<top (required)>'
        4: from (irb):5
        3: from /Users/rschneeman/.gem/ruby/2.7.0/gems/unicode_plot-0.0.3/lib/unicode_plot/histogram.rb:42:in `histogram'
        2: from /Users/rschneeman/.gem/ruby/2.7.0/gems/unicode_plot-0.0.3/lib/unicode_plot/barplot.rb:103:in `barplot'
        1: from /Users/rschneeman/.gem/ruby/2.7.0/gems/unicode_plot-0.0.3/lib/unicode_plot/barplot.rb:103:in `min'
ArgumentError (comparison of NilClass with 1 failed)

Ruby 2.7.0

It looks like when values get passed to BarChart the arguments include a nil:

[["\e[90m[\e[0m1.0\e[90m, \e[0m2.0\e[90m)\e[0m", "\e[90m[\e[0m2.0\e[90m, \e[0m3.0\e[90m)\e[0m", "\e[90m[\e[0m3.0\e[90m, \e[0m4.0\e[90m)\e[0m"], [1, nil, 1]]

Notice the nil in the debug output.

This doesn't appear to happen when rendering a histogram with 1 or 3 items.

Update: It looks like this only happens on specific input numbers, for example this works with no error:

# works 
require 'stringio'
require 'unicode_plot'

values = [0.017029, 0.018725]

plot = UnicodePlot.histogram(values, title: "Histogram")
plot.render($stdout)

I'm guessing the edge condition has something to do with the calculated bins.

schneems added a commit to zombocom/derailed_benchmarks that referenced this issue Jan 6, 2020
Turns out that reducing a whole bunch of numbers to a single value (average for example, or median) means that we're getting rid of a huge amount of information. One way to add context back in without drowning users in raw data is to include a histogram in the output. 

With histograms in the output the user can see the distributions of their two runs to make better informed decisions about the validity of the data.

Here's an example histogram:

```
                              Histogram
              ┌                                        ┐
   [3.1, 3.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23
   [3.2, 3.3) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
   [3.3, 3.4) ┤▇▇▇▇▇▇▇▇ 5
   [3.4, 3.5) ┤▇▇▇▇▇ 3
   [3.5, 3.6) ┤▇▇ 1
   [3.6, 3.7) ┤▇▇▇▇▇▇▇▇ 5
   [3.7, 3.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [3.8, 3.9) ┤▇▇▇ 2
              └                                        ┘
                              Frequency
```

Here's how the output looks in a test run:

```
👎👎👎(NOT Statistically Significant) 👎👎👎

[3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute" - (0.0201745 seconds)
  SLOWER 🐢🐢🐢 by:
      0.8833x [older/newer]
    -13.2063% [(older - newer) / older * 100]
[80f989aece] "Remove duplicated attribute alias resolution in `_select!`" - (0.017821 seconds)

Iterations per sample: 10
Samples: 2
Test type: Kolmogorov Smirnov
Confidence level: 95.0 %
Is significant? (max > critical): false
D critical: 1.7308183826022854
D max: 1.0

Histogram - [3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute"
                  ┌                                        ┐
   [0.015, 0.02 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.02 , 0.025) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency

Histogram - [80f989aece] "Remove duplicated attribute alias resolution in `_select!`"
                  ┌                                        ┐
   [0.017, 0.018) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.018, 0.019) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency


Output: tmp/compare_branches/2020-01-06-12-23-1578335006-352084000/results.txt
```

> Note it's not that interesting with only two samples



## Caveats

The histograms that are presented do not have identical/correct axis. So while the shape of the histograms are correct in comparison to each other, you cannot place them side by side for an accurate representation because they have different bin sizes and start from a different  value. 

In the short term I do think that the visual data being present with this caveat is better than nothing. In the long term support would need to be added to `unicode_plot` to support this behavior.

## Blockers

There is a bug that raises an error when 2 values are given to generate a histogram if the values have some special undetermined relationship. This is tracked here:

red-data-tools/unicode_plot.rb#24
schneems added a commit to zombocom/derailed_benchmarks that referenced this issue Jan 6, 2020
Turns out that reducing a whole bunch of numbers to a single value (average for example, or median) means that we're getting rid of a huge amount of information. One way to add context back in without drowning users in raw data is to include a histogram in the output. 

With histograms in the output the user can see the distributions of their two runs to make better informed decisions about the validity of the data.

Here's an example histogram:

```
                              Histogram
              ┌                                        ┐
   [3.1, 3.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23
   [3.2, 3.3) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
   [3.3, 3.4) ┤▇▇▇▇▇▇▇▇ 5
   [3.4, 3.5) ┤▇▇▇▇▇ 3
   [3.5, 3.6) ┤▇▇ 1
   [3.6, 3.7) ┤▇▇▇▇▇▇▇▇ 5
   [3.7, 3.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [3.8, 3.9) ┤▇▇▇ 2
              └                                        ┘
                              Frequency
```

Here's how the output looks in a test run:

```
👎👎👎(NOT Statistically Significant) 👎👎👎

[3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute" - (0.0201745 seconds)
  SLOWER 🐢🐢🐢 by:
      0.8833x [older/newer]
    -13.2063% [(older - newer) / older * 100]
[80f989aece] "Remove duplicated attribute alias resolution in `_select!`" - (0.017821 seconds)

Iterations per sample: 10
Samples: 2
Test type: Kolmogorov Smirnov
Confidence level: 95.0 %
Is significant? (max > critical): false
D critical: 1.7308183826022854
D max: 1.0

Histogram - [3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute"
                  ┌                                        ┐
   [0.015, 0.02 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.02 , 0.025) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency

Histogram - [80f989aece] "Remove duplicated attribute alias resolution in `_select!`"
                  ┌                                        ┐
   [0.017, 0.018) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.018, 0.019) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency


Output: tmp/compare_branches/2020-01-06-12-23-1578335006-352084000/results.txt
```

> Note it's not that interesting with only two samples



## Caveats

The histograms that are presented do not have identical/correct axis. So while the shape of the histograms are correct in comparison to each other, you cannot place them side by side for an accurate representation because they have different bin sizes and start from a different  value. 

In the short term I do think that the visual data being present with this caveat is better than nothing. In the long term support would need to be added to `unicode_plot` to support this behavior.

## Blockers

There is a bug that raises an error when 2 values are given to generate a histogram if the values have some special undetermined relationship. This is tracked here:

red-data-tools/unicode_plot.rb#24
mrkn added a commit that referenced this issue Jan 11, 2020
mrkn added a commit that referenced this issue Jan 11, 2020
enumerable-statistics 2.0.0 has a bug of hitogram which is the cause
of the issue #24.
@mrkn mrkn closed this as completed in #28 Jan 11, 2020
@mrkn
Copy link
Member

mrkn commented Jan 11, 2020

@schneems I released the version 0.0.4. Please use it.

$ irb -Ilib -runicode_plot --simple-prompt
>> UnicodePlot.histogram([1, 2], title: "Histogram").render
                              Histogram
              ┌                                        ┐
   [1.0, 1.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [1.5, 2.0) ┤ 0
   [2.0, 2.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
              └                                        ┘
                              Frequency
=> nil
>>

@schneems
Copy link
Contributor Author

schneems commented Jan 13, 2020 via email

schneems added a commit to zombocom/derailed_benchmarks that referenced this issue Jan 14, 2020
Turns out that reducing a whole bunch of numbers to a single value (average for example, or median) means that we're getting rid of a huge amount of information. One way to add context back in without drowning users in raw data is to include a histogram in the output. 

With histograms in the output the user can see the distributions of their two runs to make better informed decisions about the validity of the data.

Here's an example histogram:

```
                              Histogram
              ┌                                        ┐
   [3.1, 3.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23
   [3.2, 3.3) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
   [3.3, 3.4) ┤▇▇▇▇▇▇▇▇ 5
   [3.4, 3.5) ┤▇▇▇▇▇ 3
   [3.5, 3.6) ┤▇▇ 1
   [3.6, 3.7) ┤▇▇▇▇▇▇▇▇ 5
   [3.7, 3.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [3.8, 3.9) ┤▇▇▇ 2
              └                                        ┘
                              Frequency
```

Here's how the output looks in a test run:

```
👎👎👎(NOT Statistically Significant) 👎👎👎

[3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute" - (0.0201745 seconds)
  SLOWER 🐢🐢🐢 by:
      0.8833x [older/newer]
    -13.2063% [(older - newer) / older * 100]
[80f989aece] "Remove duplicated attribute alias resolution in `_select!`" - (0.017821 seconds)

Iterations per sample: 10
Samples: 2
Test type: Kolmogorov Smirnov
Confidence level: 95.0 %
Is significant? (max > critical): false
D critical: 1.7308183826022854
D max: 1.0

Histogram - [3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute"
                  ┌                                        ┐
   [0.015, 0.02 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.02 , 0.025) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency

Histogram - [80f989aece] "Remove duplicated attribute alias resolution in `_select!`"
                  ┌                                        ┐
   [0.017, 0.018) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.018, 0.019) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency


Output: tmp/compare_branches/2020-01-06-12-23-1578335006-352084000/results.txt
```

> Note it's not that interesting with only two samples



## Caveats

The histograms that are presented do not have identical/correct axis. So while the shape of the histograms are correct in comparison to each other, you cannot place them side by side for an accurate representation because they have different bin sizes and start from a different  value. 

In the short term I do think that the visual data being present with this caveat is better than nothing. In the long term support would need to be added to `unicode_plot` to support this behavior.

## Blockers

There is a bug that raises an error when 2 values are given to generate a histogram if the values have some special undetermined relationship. This is tracked here:

red-data-tools/unicode_plot.rb#24
schneems added a commit to zombocom/derailed_benchmarks that referenced this issue Jan 14, 2020
Turns out that reducing a whole bunch of numbers to a single value (average for example, or median) means that we're getting rid of a huge amount of information. One way to add context back in without drowning users in raw data is to include a histogram in the output. 

With histograms in the output the user can see the distributions of their two runs to make better informed decisions about the validity of the data.

Here's an example histogram:

```
                              Histogram
              ┌                                        ┐
   [3.1, 3.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23
   [3.2, 3.3) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
   [3.3, 3.4) ┤▇▇▇▇▇▇▇▇ 5
   [3.4, 3.5) ┤▇▇▇▇▇ 3
   [3.5, 3.6) ┤▇▇ 1
   [3.6, 3.7) ┤▇▇▇▇▇▇▇▇ 5
   [3.7, 3.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [3.8, 3.9) ┤▇▇▇ 2
              └                                        ┘
                              Frequency
```

Here's how the output looks in a test run:

```
👎👎👎(NOT Statistically Significant) 👎👎👎

[3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute" - (0.0201745 seconds)
  SLOWER 🐢🐢🐢 by:
      0.8833x [older/newer]
    -13.2063% [(older - newer) / older * 100]
[80f989aece] "Remove duplicated attribute alias resolution in `_select!`" - (0.017821 seconds)

Iterations per sample: 10
Samples: 2
Test type: Kolmogorov Smirnov
Confidence level: 95.0 %
Is significant? (max > critical): false
D critical: 1.7308183826022854
D max: 1.0

Histogram - [3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute"
                  ┌                                        ┐
   [0.015, 0.02 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.02 , 0.025) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency

Histogram - [80f989aece] "Remove duplicated attribute alias resolution in `_select!`"
                  ┌                                        ┐
   [0.017, 0.018) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.018, 0.019) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency


Output: tmp/compare_branches/2020-01-06-12-23-1578335006-352084000/results.txt
```

> Note it's not that interesting with only two samples



## Caveats

The histograms that are presented do not have identical/correct axis. So while the shape of the histograms are correct in comparison to each other, you cannot place them side by side for an accurate representation because they have different bin sizes and start from a different  value. 

In the short term I do think that the visual data being present with this caveat is better than nothing. In the long term support would need to be added to `unicode_plot` to support this behavior.

## Blockers

There is a bug that raises an error when 2 values are given to generate a histogram if the values have some special undetermined relationship. This is tracked here:

red-data-tools/unicode_plot.rb#24
schneems added a commit to zombocom/derailed_benchmarks that referenced this issue Mar 24, 2020
Turns out that reducing a whole bunch of numbers to a single value (average for example, or median) means that we're getting rid of a huge amount of information. One way to add context back in without drowning users in raw data is to include a histogram in the output. 

With histograms in the output the user can see the distributions of their two runs to make better informed decisions about the validity of the data.

Here's an example histogram:

```
                              Histogram
              ┌                                        ┐
   [3.1, 3.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23
   [3.2, 3.3) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
   [3.3, 3.4) ┤▇▇▇▇▇▇▇▇ 5
   [3.4, 3.5) ┤▇▇▇▇▇ 3
   [3.5, 3.6) ┤▇▇ 1
   [3.6, 3.7) ┤▇▇▇▇▇▇▇▇ 5
   [3.7, 3.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [3.8, 3.9) ┤▇▇▇ 2
              └                                        ┘
                              Frequency
```

Here's how the output looks in a test run:

```
👎👎👎(NOT Statistically Significant) 👎👎👎

[3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute" - (0.0201745 seconds)
  SLOWER 🐢🐢🐢 by:
      0.8833x [older/newer]
    -13.2063% [(older - newer) / older * 100]
[80f989aece] "Remove duplicated attribute alias resolution in `_select!`" - (0.017821 seconds)

Iterations per sample: 10
Samples: 2
Test type: Kolmogorov Smirnov
Confidence level: 95.0 %
Is significant? (max > critical): false
D critical: 1.7308183826022854
D max: 1.0

Histogram - [3054e1d584] "Merge pull request #36506 from kamipo/group_by_with_order_by_virtual_count_attribute"
                  ┌                                        ┐
   [0.015, 0.02 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.02 , 0.025) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency

Histogram - [80f989aece] "Remove duplicated attribute alias resolution in `_select!`"
                  ┌                                        ┐
   [0.017, 0.018) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [0.018, 0.019) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency


Output: tmp/compare_branches/2020-01-06-12-23-1578335006-352084000/results.txt
```

> Note it's not that interesting with only two samples



## Caveats

The histograms that are presented do not have identical/correct axis. So while the shape of the histograms are correct in comparison to each other, you cannot place them side by side for an accurate representation because they have different bin sizes and start from a different  value. 

In the short term I do think that the visual data being present with this caveat is better than nothing. In the long term support would need to be added to `unicode_plot` to support this behavior.

## Blockers

There is a bug that raises an error when 2 values are given to generate a histogram if the values have some special undetermined relationship. This is tracked here:

red-data-tools/unicode_plot.rb#24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants