Histogram for numeric columns #244

Closed
paulklemm opened this issue Jan 17, 2019 · 5 comments


@paulklemm

commented Jan 17, 2019

Is there any way of displaying a histogram for numeric values with visidata? I typically plot the histograms of numeric attributes when I get a new data set to see what the distribution is and if there are any outliers.

Ideally the histogram would have a parameter defining the width of each bin.

@paulklemm changed the title from "Histogram of numeric columns" to "Histogram for numeric columns" on Jan 17, 2019

@saulpw

Owner

commented Jan 18, 2019

Yes! I've been wanting this too. But I got stuck on the numeric binning code and never got back to it. Maybe it's time to clean that up.

> Ideally the histogram would have a parameter defining the width of each bin.

The width of each bin, or the number of bins? Does each bin always have the same width, or would you want to specify ranges (like age 18-34, 35-44, 45-54)?

@paulklemm

Author

commented Jan 18, 2019

> The width of each bin, or the number of bins?

Hm, ideally I think it would be either one. (1) Setting the number of bins partitions the range into that many equally sized bins. (2) Setting the width of the bins creates enough bins to cover the whole range. My use case is typically (1), setting the number of bins.
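To make the two options concrete, here is a minimal Python sketch of equal-width binning. It is only an illustration of the idea, not VisiData's implementation; the function names `equal_width_edges` and `histogram` and the sample values are hypothetical.

```python
import math

def equal_width_edges(values, num_bins=None, bin_width=None):
    """Return equally spaced bin edges covering min(values)..max(values).

    Exactly one of num_bins or bin_width must be given.
    """
    if (num_bins is None) == (bin_width is None):
        raise ValueError("specify exactly one of num_bins or bin_width")
    lo, hi = min(values), max(values)
    if num_bins is None:
        # Option (2): fixed width, create enough bins to cover the range.
        num_bins = max(1, math.ceil((hi - lo) / bin_width))
    else:
        # Option (1): fixed count, split the range into equal bins.
        bin_width = (hi - lo) / num_bins
    return [lo + i * bin_width for i in range(num_bins + 1)]

def histogram(values, edges):
    """Count values per bin; the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        # Guard against a zero-width range (all values identical).
        i = 0 if width == 0 else min(int((v - edges[0]) / width), len(counts) - 1)
        counts[i] += 1
    return counts

values = [1, 2, 2, 3, 9, 10]  # made-up sample data
by_count = histogram(values, equal_width_edges(values, num_bins=3))
by_width = histogram(values, equal_width_edges(values, bin_width=4))
```

With a fixed width, the last bin may extend past the maximum value, since the range is rarely an exact multiple of the width.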

> Does each bin always have the same width, or would you want to specify ranges (like age 18-34, 35-44, 45-54)?

That is very interesting, especially for data that is not normally distributed. I work a lot with genetic data, where you typically have a large number of low-abundance genes plus a few highly expressed genes that stretch the distribution. It would be cool to have rather small bins for the low-abundance genes and large bins for the highly expressed genes.

But for now equally sized bins would already help me a lot.
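For the variable-width idea (narrow bins at the low end, wide bins at the high end), a rough sketch with explicitly specified edges could look like the following. It uses Python's standard `bisect` module; the function name `histogram_with_edges`, the edge values, and the expression numbers are made up for illustration and are not part of VisiData.

```python
from bisect import bisect_right

def histogram_with_edges(values, edges):
    """Count values into bins [edges[i], edges[i+1]).

    The last bin also includes its upper edge. Values outside
    edges[0]..edges[-1] are ignored. `edges` must be sorted ascending.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        # bisect_right finds the first edge greater than v.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts

# Hypothetical expression values: many small ones, a few very large ones.
expression = [0.5, 1, 2, 3, 8, 40, 900]
# Narrow bins at the low end, wide bins at the high end.
edges = [0, 1, 10, 100, 1000]
counts = histogram_with_edges(expression, edges)
```

The same function covers the age-range example from the question if the edges are set to the range boundaries.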

@anjakefala

Collaborator

commented Jan 30, 2019

Hi @paulklemm!

@saulpw made a first pass at implementing numeric binning (including date ranges!) in this commit e114f60. Please note that columns have to be typed as numeric (with #, %, $ or @) in order for them to be numerically binned.

By default, a heuristic is used to calculate the width of each equally sized bin. Alternatively, you can set the histogram_bins option either in the OptionsSheet (press O) or in your ~/.visidatarc.
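As a sketch of the ~/.visidatarc route: the file is evaluated as Python by VisiData, and options are assigned on the `options` object it provides. The value 10 below is an arbitrary example, not a recommended setting.

```python
# ~/.visidatarc (fragment; `options` is provided by VisiData at load time)
# Assumed example value: use 10 bins for numeric histograms
# instead of the default heuristic.
options.histogram_bins = 10
```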

I have found a few hiccups when playing vd scripts with it as it is, which we will need to fix before the feature can be shipped. Otherwise, it should be ready for you to start playing with on the develop branch. =) If you could, please give it a go and let us know how it feels.

@paulklemm

Author

commented Jan 31, 2019

Will check it out asap 👍

@paulklemm

Author

commented Feb 1, 2019

I tried it and it works like a charm! Thank you so much, this will help me a lot!

@saulpw closed this on Mar 13, 2019
