Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataframe#sort speed up #67

Merged
merged 1 commit into from
Mar 9, 2016
Merged

Dataframe#sort speed up #67

merged 1 commit into from
Mar 9, 2016

Conversation

lokeshh
Copy link
Member

@lokeshh lokeshh commented Mar 1, 2016

In reference to #39

@lokeshh
Copy link
Member Author

lokeshh commented Mar 1, 2016

This uses Array#sort_by to replace handmade sort. This has been made possible because Array#sort_by can take multiple attributes to sort the array by.

There's one pitfall though. Whereas earlier it used to be like:

df.sort [:a, :b], by: {a: lambda { |a,b| a.abs <=> b.abs }, b: lambda { |a,b| a.to_i <=> b.to_i }}

Now one is allowed to do it this way only

df.sort [:a, :b], by: {a: lambda { |a| a.abs }, b: lambda { |a| a.to_i }}

This is because Array#sort_by can only accept these sort of blocks. The earlier format provided more flexibility to the user. This is one reason this is not a good fix but I don't think user would ever want to do something which he can't do with the other format. I am not sure. What do you say?

The speed up is massive. The time it takes to sort the example given in sorting benchmark with vector of size 1000 (not 10000) is 12 from 54. It could be even brought down because sorting the index just takes .03 (see here).

self.index = Daru::Index.new(idx)
# Following three lines are very slow
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following lines find the sorted vector given the sorted index. I haven't figure out a way to do this effectively.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 1, 2016

Is this some sort of mistake. How can sorting two columns take so much less time than sorting single column?

@v0dro
Copy link
Member

v0dro commented Mar 1, 2016

Don't think it's a mistake. Did you try running that script on your computer?

@lokeshh
Copy link
Member Author

lokeshh commented Mar 1, 2016

I ran it on vector size of 1000 rather than 10000 and it took 53 seconds.

@v0dro
Copy link
Member

v0dro commented Mar 1, 2016

This is a good fix that can have far reaching implications in speed and usability, but I would like to confirm if it's alright to change the way the blocks are accessed.

@gnilrets @MohawkJohn @dansbits could use your advice here.

Also, @lokeshh, will it be possible maintain backwards compatibility with the original interface to :by? Do research that option too. I think it can be achieved by assuming that the methods users will call on a and b in the block ({|a,b| a.method <=> b.method }) would be the same, so maybe we just call method inside your block on the single block, and thereby not break the interface. An error will be raised if the two methods are different.

Also, will this be extensible enough to allow sorting of missing data too?

@v0dro
Copy link
Member

v0dro commented Mar 1, 2016

I might have gone wrong in the benchmarks while copy-pasting and editing or something.

@@ -3,7 +3,7 @@
require 'benchmark'
require 'daru'

vector = Daru::Vector.new(10000.times.map.to_a.shuffle)
vector = Daru::Vector.new(1000.times.map.to_a.shuffle)
df = Daru::DataFrame.new({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let the benchmarking code be the same in the final commit. Also, just append your results at the end of the file without deleting anything so we can keep track of speed over time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do that once everything is done.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 1, 2016

Wow... there's been some very massive increase. I ran the benchmarks/sorting.rb for original 10,000 vectors and here's the result:

Sort a Vector without any args  0.080000   0.010000   0.090000 (  0.087908)
Sort vector in descending order with custom <=> operator  0.130000   0.000000   0.130000 (  0.182502)
Sort single column of DataFrame  2.190000   0.120000   2.310000 (  3.241833)
Sort two columns of DataFrame  2.390000   0.060000   2.450000 (  3.500302)
Sort two columns with custom operators in different orders of DataFrame  2.400000   0.060000   2.460000 (  3.473831)

@v0dro
Copy link
Member

v0dro commented Mar 1, 2016

Beautiful!

@lokeshh lokeshh changed the title Dataframe#sort speed up [WIP]Dataframe#sort speed up Mar 1, 2016
@gnilrets
Copy link
Contributor

gnilrets commented Mar 2, 2016

I've avoided sorting dataframes and vectors due to the very poor performance, so anything that makes a dramatic improvement is a great step in the right direction.

@v0dro
Copy link
Member

v0dro commented Mar 2, 2016

@lokeshh you just need to make the tests pass now :)

@lokeshh lokeshh changed the title [WIP]Dataframe#sort speed up Dataframe#sort speed up Mar 2, 2016
@lokeshh
Copy link
Member Author

lokeshh commented Mar 3, 2016

I've added support for nil's and because of one new performance upgrade it can sort dataframes consisting 10^5 rows in 3 seconds.

I think it's also possible to mimic old functionality but would require little more work by the user.

def create_logic_blocks vector_order, by={}, ascending
# Display nils at top
universal_block_ascending = lambda { |a| a or -Float::INFINITY }
universal_block_decending = lambda { |a| a or Float::INFINITY }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clever :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work when the vector contains strings and nils or other objects and nils?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. @lokeshh in the examples, the vectors being sorted are strictly numerical. Could you try with another example that contains objects other than just numbers and nils?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's failing with non-numeric data. I have a way to solve this but it's not very clean. It would involve creating a new class (see here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnilrets suggested something like this in #62

I have a concern that creating so many objects would create GC bottlenecks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I have got one more solution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Implemented!

@v0dro
Copy link
Member

v0dro commented Mar 5, 2016

Tests failures are similar to that in #63

@lokeshh
Copy link
Member Author

lokeshh commented Mar 5, 2016

Should I solve them the same way by sorting through index?
On Mar 5, 2016 11:58 AM, "Sameer Deshmukh" notifications@github.com wrote:

Tests failures are similar to that in #63
#63


Reply to this email directly or view it on GitHub
#67 (comment).

@v0dro
Copy link
Member

v0dro commented Mar 5, 2016

Yep. Consistency is important.

Also check my comment on the code, I have a feeling it's an edge case.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 5, 2016

Done. I need to arrange the code better so it looks good and understandable. @gnilrets @v0dro If you find some discrepancies or have any concerns please let me know.

@@ -59,6 +59,8 @@ def map!(&block)
# Store a hash of labels for values. Supplementary only. Recommend using index
# for proper usage.
attr_accessor :labels
# Store vector data in an array
attr_accessor :data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for seeing this so late but this should be attr_reader. You don't want to allow users to manipulate the @data variable directly.

Just this one change and we're done.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 6, 2016 via email

@v0dro
Copy link
Member

v0dro commented Mar 6, 2016

Then please document that in the method docs. This will be a great daru specific feature, I could't see the other dataframe libraries use it.

Also change the attr_accessor to attr_reader.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 6, 2016 via email

@v0dro
Copy link
Member

v0dro commented Mar 6, 2016

Sorting with the block function even when nils are present.

If handling nils in daru is going to slow down the sorting significantly, then let the user take care of it. But we need to make sure that it's well documented.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 7, 2016

If handling nils in daru is going to slow down the sorting significantly, then let the user take care of it. But we need to make sure that it's well documented.

Ok, I will document it. I was thinking that we can also have an option like handle_nils: [true, false] that user can provide to tell whether he wants the nils to be automatically managed when he passes a block or not. Sometimes the user might want to handle the nils in his own way. For example, he might want them to be displayed in the end. In that case the user can handle the nils himself. What do you say?

@gnilrets
Copy link
Contributor

gnilrets commented Mar 7, 2016

Can we confirm how much of a slowdown handling nils is going to cost? Nils are so common in the data I've had to deal with on a daily basis for last 8 years and 3 jobs. In order to use Daru consistently, I'd probably have to monkey patch Daru or add some other Daru wrapper objects in my projects to handle nils.

I'm not convinced the GC involved with creating Sortable objects every time one has to sort is really going to be an issue. I'm not sure the best way to test this either.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 7, 2016

For this benchmark, handling nils take 1.72 seconds (for two columns) and sorting without handling nils take 1.24 seconds.

@lokeshh
Copy link
Member Author

lokeshh commented Mar 7, 2016

I'd probably have to monkey patch Daru or add some other Daru wrapper objects in my projects to handle nils.

@gnilrets As far as this PR goes, Daru is going to handle nils given no block. @v0dro and me were discussing about whether nils should be handled automatically when user provides a block or not.

It's not difficult to handle nils for the user. For example, if user want to handle nils and have a block like lambda { |a| a.abs }, he just needs to change this block to lambda { |a| [(a.nil?)?0:1, a.abs] }, this will sort with nils at top. On the other hand one can do lambda { |a| [(a.nil?)?1:0, a.abs] } if he wants to have nils at the bottom.

So, I was proposing a solution that user can pass an option to specify whether he wants the nils to be automatically handled given a block or not. Nils will be automatically handled given no block. What do you say?

@gnilrets
Copy link
Contributor

gnilrets commented Mar 7, 2016

@lokeshh - I misunderstood. Thanks for the clarification!

@v0dro
Copy link
Member

v0dro commented Mar 7, 2016

That works. It satisfies needs of both novice and advanced users. Having an option to choose if nils should be handled automatically or not would be great. Make sure you document your effort well. Once you're done, you can update the daru notebooks at sciruby-notebooks to provide further clarification where necessary.

Replace handmade sort with Array#sort

Add support for sorting with nils

Add test for sorting with non-numeric types

Resolve conflict by Index when all attributes are same

Update benchmarks

Update docs

Add support for handling nils when block given
@lokeshh
Copy link
Member Author

lokeshh commented Mar 8, 2016

Done. Do I need iRuby to run the sciruby-notebooks?

@v0dro
Copy link
Member

v0dro commented Mar 8, 2016

Yes.

  • Clone the repo and modify according to what you need.
  • Run the notebook with a fresh iRuby kernel and make sure the whole notebook works without any errors.
  • Push changes.

v0dro added a commit that referenced this pull request Mar 9, 2016
@v0dro v0dro merged commit 11d9a5f into SciRuby:master Mar 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants