Dataframe#sort speed up #67
This uses There's one pitfall though. Whereas earlier it used to be like:
df.sort [:a, :b], by: {a: lambda { |a,b| a.abs <=> b.abs }, b: lambda { |a,b| a.to_i <=> b.to_i }}
Now one is allowed to do it this way only:
df.sort [:a, :b], by: {a: lambda { |a| a.abs }, b: lambda { |a| a.to_i }}
This is because The speed up is massive. The time it takes to sort the example given in sorting benchmark with vector of size 1000 (not 10000) is
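The difference between the two block styles can be illustrated in plain Ruby, outside daru. This is only a sketch: the sample values and variable names are made up, and it is not a claim about how the PR implements sorting internally. A two-argument lambda is a comparator in the style of Array#sort, while a one-argument lambda is a key function in the style of Array#sort_by.

```ruby
# Plain-Ruby illustration; values and names are hypothetical, not from the PR.
values = [3, -7, 1, -2]

# Old style: a comparator receives two elements and returns -1, 0, or 1.
comparator = lambda { |a, b| a.abs <=> b.abs }
sorted_with_comparator = values.sort(&comparator)

# New style: a key function receives one element and returns the value to sort on.
key = lambda { |a| a.abs }
sorted_with_key = values.sort_by(&key)
```

Both calls order the elements by absolute value. Array#sort_by computes each key once per element, whereas a comparator block runs on every comparison.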
self.index = Daru::Index.new(idx)
# Following three lines are very slow
The following lines find the sorted vector given the sorted index. I haven't figured out a way to do this efficiently.
Is this some sort of mistake? How can sorting two columns take so much less time than sorting a single column?
Don't think it's a mistake. Did you try running that script on your computer?
I ran it on a vector size of 1000 rather than 10000 and it took 53 seconds.
This is a good fix that can have far-reaching implications for speed and usability, but I would like to confirm that it's alright to change the way the blocks are accessed. @gnilrets @MohawkJohn @dansbits could use your advice here. Also, @lokeshh, will it be possible to maintain backwards compatibility with the original interface to Also, will this be extensible enough to allow sorting of missing data too?
I might have gone wrong in the benchmarks while copy-pasting and editing or something.
@@ -3,7 +3,7 @@
 require 'benchmark'
 require 'daru'

-vector = Daru::Vector.new(10000.times.map.to_a.shuffle)
+vector = Daru::Vector.new(1000.times.map.to_a.shuffle)
 df = Daru::DataFrame.new({
Let the benchmarking code be the same in the final commit. Also, just append your results at the end of the file without deleting anything so we can keep track of speed over time.
Ok, will do that once everything is done.
Wow... there's been a very massive increase. I ran the
Sort a Vector without any args                                            0.080000 0.010000 0.090000 ( 0.087908)
Sort vector in descending order with custom <=> operator                  0.130000 0.000000 0.130000 ( 0.182502)
Sort single column of DataFrame                                           2.190000 0.120000 2.310000 ( 3.241833)
Sort two columns of DataFrame                                             2.390000 0.060000 2.450000 ( 3.500302)
Sort two columns with custom operators in different orders of DataFrame   2.400000 0.060000 2.460000 ( 3.473831)
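For readers who want to produce timings in the same four-column format, here is a minimal sketch using Ruby's standard Benchmark module. It times plain Array sorting only, not daru; the data, labels, and variable names are assumptions chosen for illustration, and the numbers it prints are not comparable to the figures above.

```ruby
require 'benchmark'

# Hypothetical data: 1000 shuffled integers, mirroring the smaller size used in this thread.
values = 1000.times.to_a.shuffle

by_comparator = nil
by_key = nil

# Benchmark.bm prints user, system, total, and real time for each report.
Benchmark.bm(12) do |x|
  x.report("sort")    { by_comparator = values.sort { |a, b| a <=> b } }
  x.report("sort_by") { by_key = values.sort_by { |a| a } }
end
```

Appending each run's output to a results file, rather than overwriting it, keeps a record of speed over time.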
Beautiful!
I've avoided sorting dataframes and vectors due to the very poor performance, so anything that makes a dramatic improvement is a great step in the right direction.
@lokeshh you just need to make the tests pass now :)
I've added support for I think it's also possible to mimic the old functionality, but it would require a little more work by the user.
def create_logic_blocks vector_order, by={}, ascending
  # Display nils at top
  universal_block_ascending = lambda { |a| a or -Float::INFINITY }
  universal_block_decending = lambda { |a| a or Float::INFINITY }
clever :)
Will this work when the vector contains strings and nils or other objects and nils?
Good catch. @lokeshh in the examples, the vectors being sorted are strictly numerical. Could you try with another example that contains objects other than just numbers and nils?
It's failing with non-numeric data. I have a way to solve this but it's not very clean. It would involve creating a new class (see here).
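The failure with non-numeric data can be reproduced in plain Ruby. The sketch below also shows one class-free workaround that sorts on a [flag, value] pair. The data and names are hypothetical, and the workaround is an assumption for illustration, not necessarily the solution adopted in this PR.

```ruby
# Hypothetical data, not from the PR.
mixed = ["pear", nil, "apple", nil, "fig"]

# Mapping nil to -Float::INFINITY breaks here: String and Float are not comparable.
numeric_nil_key = lambda { |a| a or -Float::INFINITY }
begin
  mixed.sort_by(&numeric_nil_key)
  float_key_failed = false
rescue ArgumentError
  float_key_failed = true
end

# Workaround (an assumption): sort on a [flag, value] pair so that nils
# group first and are never compared against a String.
pair_key = lambda { |a| a.nil? ? [0, ""] : [1, a] }
nils_first = mixed.sort_by(&pair_key)
```

Arrays compare element by element, so the flag decides the order between nil and non-nil entries, and the value is only compared among entries of the same kind.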
No problem. I have got one more solution.
Done. Implemented!
Test failures are similar to those in #63.
Should I solve them the same way by sorting through index?
Yep. Consistency is important. Also check my comment on the code; I have a feeling it's an edge case.
@@ -59,6 +59,8 @@ def map!(&block)
 # Store a hash of labels for values. Supplementary only. Recommend using index
 # for proper usage.
 attr_accessor :labels
+# Store vector data in an array
+attr_accessor :data
Sorry for seeing this so late, but this should be attr_reader. You don't want to allow users to manipulate the @data variable directly.
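A small sketch of what attr_reader changes, using a hypothetical stand-in class rather than Daru::Vector itself:

```ruby
# Hypothetical stand-in class; not Daru::Vector.
class ReadOnlyData
  # Defines only the getter, so callers cannot reassign @data from outside.
  attr_reader :data

  def initialize(data)
    @data = data
  end
end

holder = ReadOnlyData.new([3, 1, 2])
current = holder.data                    # reading works
can_assign = holder.respond_to?(:data=)  # no writer method is defined
```

Note that the reader still returns the same array object, so in-place mutation through it remains possible. Returning a copy or a frozen array would be a stricter option if that matters.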
Just this one change and we're done.
Oh! I see what you meant earlier now. I don't think this is an edge case. I
simply haven't introduced automatic handling of nils for the case where the
user provides a block. I assumed the user would handle nils in the block
they provide.
Then please document that in the method docs. This will be a great daru-specific feature; I couldn't see the other dataframe libraries use it. Also change the
What will be a great feature? Are you in favor of leaving the
responsibility for handling nils with the user when they provide the block?
Sorting with the block function even when nils are present. If handling nils in daru is going to slow down the sorting significantly, then let the user take care of it. But we need to make sure that it's well documented.
Ok, I will document it. I was thinking that we can also have an option like
Can we confirm how much of a slowdown handling nils is going to cost? Nils are so common in the data I've had to deal with on a daily basis for the last 8 years and 3 jobs. In order to use Daru consistently, I'd probably have to monkey patch Daru or add some other Daru wrapper objects in my projects to handle nils. I'm not convinced the GC involved with creating Sortable objects every time one has to sort is really going to be an issue. I'm not sure of the best way to test this either.
For this benchmark, handling nils takes
@gnilrets As far as this PR goes, Daru is going to handle nils when no block is given. @v0dro and I were discussing whether nils should be handled automatically when the user provides a block. It's not difficult for the user to handle nils. For example, if the user wants to handle nils and has a block like So, I was proposing that the user can pass an option to specify whether nils should be handled automatically when a block is given. Nils will be handled automatically when no block is given. What do you say?
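To make the user-side option concrete, here is a plain-Ruby sketch of a key block that handles nils itself. The column values and lambda names are hypothetical and shown with Array#sort_by rather than daru's own API; the numeric sentinel only works for numeric columns.

```ruby
# Hypothetical numeric column with missing values.
column = [4, nil, -9, 2, nil]

# A user-supplied key block that sorts by absolute value; it raises on nil.
abs_key = lambda { |a| a.abs }

# A variant that handles nils itself by mapping them below every number.
abs_key_nil_safe = lambda { |a| a.nil? ? -Float::INFINITY : a.abs }

nils_first = column.sort_by(&abs_key_nil_safe)
```

Using Float::INFINITY instead of -Float::INFINITY would place the nils last.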
@lokeshh - I misunderstood. Thanks for the clarification!
That works. It satisfies needs of both novice and advanced users. Having an option to choose if nils should be handled automatically or not would be great. Make sure you document your effort well. Once you're done, you can update the daru notebooks at sciruby-notebooks to provide further clarification where necessary.
Replace handmade sort with Array#sort
Add support for sorting with nils
Add test for sorting with non-numeric types
Resolve conflict by Index when all attributes are same
Update benchmarks
Update docs
Add support for handling nils when block given
Done. Do I need iRuby to run the sciruby-notebooks?
Yes.
In reference to #39