Dataframe#sort speed up #67

lokeshh · 2016-03-01T17:01:27Z

In reference to #39

lokeshh · 2016-03-01T17:14:53Z

This uses Array#sort_by to replace handmade sort. This has been made possible because Array#sort_by can take multiple attributes to sort the array by.

There's one pitfall though. Whereas earlier it used to be like:

df.sort [:a, :b], by: {a: lambda { |a,b| a.abs <=> b.abs }, b: lambda { |a,b| a.to_i <=> b.to_i }}

Now one is allowed to do it this way only

df.sort [:a, :b], by: {a: lambda { |a| a.abs }, b: lambda { |a| a.to_i }}

This is because Array#sort_by can only accept these sort of blocks. The earlier format provided more flexibility to the user. This is one reason this is not a good fix but I don't think user would ever want to do something which he can't do with the other format. I am not sure. What do you say?

The speed up is massive. The time it takes to sort the example given in sorting benchmark with vector of size 1000 (not 10000) is 12 from 54. It could be even brought down because sorting the index just takes .03 (see here).

lokeshh · 2016-03-01T17:26:32Z

lib/daru/dataframe.rb

      self.index = Daru::Index.new(idx)
+      # Following three lines are very slow 


Following lines find the sorted vector given the sorted index. I haven't figure out a way to do this effectively.

lokeshh · 2016-03-01T17:32:53Z

Is this some sort of mistake. How can sorting two columns take so much less time than sorting single column?

v0dro · 2016-03-01T17:38:31Z

Don't think it's a mistake. Did you try running that script on your computer?

lokeshh · 2016-03-01T17:41:37Z

I ran it on vector size of 1000 rather than 10000 and it took 53 seconds.

v0dro · 2016-03-01T17:48:04Z

This is a good fix that can have far reaching implications in speed and usability, but I would like to confirm if it's alright to change the way the blocks are accessed.

@gnilrets @MohawkJohn @dansbits could use your advice here.

Also, @lokeshh, will it be possible maintain backwards compatibility with the original interface to :by? Do research that option too. I think it can be achieved by assuming that the methods users will call on a and b in the block ({|a,b| a.method <=> b.method }) would be the same, so maybe we just call method inside your block on the single block, and thereby not break the interface. An error will be raised if the two methods are different.

Also, will this be extensible enough to allow sorting of missing data too?

v0dro · 2016-03-01T17:48:34Z

I might have gone wrong in the benchmarks while copy-pasting and editing or something.

v0dro · 2016-03-01T17:50:28Z

benchmarks/sorting.rb

@@ -3,7 +3,7 @@
 require 'benchmark'
 require 'daru'

-vector = Daru::Vector.new(10000.times.map.to_a.shuffle)
+vector = Daru::Vector.new(1000.times.map.to_a.shuffle)
 df = Daru::DataFrame.new({


Let the benchmarking code be the same in the final commit. Also, just append your results at the end of the file without deleting anything so we can keep track of speed over time.

Ok, will do that once everything is done.

lokeshh · 2016-03-01T18:01:33Z

Wow... there's been some very massive increase. I ran the benchmarks/sorting.rb for original 10,000 vectors and here's the result:

Sort a Vector without any args  0.080000   0.010000   0.090000 (  0.087908)
Sort vector in descending order with custom <=> operator  0.130000   0.000000   0.130000 (  0.182502)
Sort single column of DataFrame  2.190000   0.120000   2.310000 (  3.241833)
Sort two columns of DataFrame  2.390000   0.060000   2.450000 (  3.500302)
Sort two columns with custom operators in different orders of DataFrame  2.400000   0.060000   2.460000 (  3.473831)

v0dro · 2016-03-01T18:02:02Z

Beautiful!

gnilrets · 2016-03-02T03:48:53Z

I've avoided sorting dataframes and vectors due to the very poor performance, so anything that makes a dramatic improvement is a great step in the right direction.

v0dro · 2016-03-02T06:53:46Z

@lokeshh you just need to make the tests pass now :)

lokeshh · 2016-03-03T11:57:39Z

I've added support for nil's and because of one new performance upgrade it can sort dataframes consisting 10^5 rows in 3 seconds.

I think it's also possible to mimic old functionality but would require little more work by the user.

v0dro · 2016-03-03T16:23:28Z

lib/daru/dataframe.rb

+    def create_logic_blocks vector_order, by={}, ascending
+      # Display nils at top
+      universal_block_ascending = lambda { |a| a or -Float::INFINITY }
+      universal_block_decending = lambda { |a| a or Float::INFINITY }


Will this work when the vector contains strings and nils or other objects and nils?

Good catch. @lokeshh in the examples, the vectors being sorted are strictly numerical. Could you try with another example that contains objects other than just numbers and nils?

It's failing with non-numeric data. I have a way to solve this but it's not very clean. It would involve creating a new class (see here).

@gnilrets suggested something like this in #62

I have a concern that creating so many objects would create GC bottlenecks.

No problem. I have got one more solution.

Done. Implemented!

v0dro · 2016-03-05T06:28:13Z

Tests failures are similar to that in #63

lokeshh · 2016-03-05T07:36:42Z

Should I solve them the same way by sorting through index?
On Mar 5, 2016 11:58 AM, "Sameer Deshmukh" notifications@github.com wrote:

Tests failures are similar to that in #63
#63

—
Reply to this email directly or view it on GitHub
#67 (comment).

v0dro · 2016-03-05T08:23:24Z

Yep. Consistency is important.

Also check my comment on the code, I have a feeling it's an edge case.

lokeshh · 2016-03-05T18:24:13Z

Done. I need to arrange the code better so it looks good and understandable. @gnilrets @v0dro If you find some discrepancies or have any concerns please let me know.

v0dro · 2016-03-06T09:42:56Z

lib/daru/vector.rb

@@ -59,6 +59,8 @@ def map!(&block)
    # Store a hash of labels for values. Supplementary only. Recommend using index
    # for proper usage.
    attr_accessor :labels
+    # Store vector data in an array
+    attr_accessor :data


Sorry for seeing this so late but this should be attr_reader. You don't want to allow users to manipulate the @data variable directly.

Just this one change and we're done.

lokeshh · 2016-03-06T11:16:38Z

Oh! I see what you meant earlier now. This is not an edge case I think. I simply haven't introduced the feature of handling nils automatically when user provided a block. I assumed user would do that when he will provide the block.

v0dro · 2016-03-06T17:49:52Z

Then please document that in the method docs. This will be a great daru specific feature, I could't see the other dataframe libraries use it.

Also change the attr_accessor to attr_reader.

lokeshh · 2016-03-06T18:02:24Z

What will be a great feature? Are you in favor of keeping the responsibility of handling nils to the user when he provides the block?

v0dro · 2016-03-06T18:23:50Z

Sorting with the block function even when nils are present.

If handling nils in daru is going to slow down the sorting significantly, then let the user take care of it. But we need to make sure that it's well documented.

lokeshh · 2016-03-07T16:43:04Z

If handling nils in daru is going to slow down the sorting significantly, then let the user take care of it. But we need to make sure that it's well documented.

Ok, I will document it. I was thinking that we can also have an option like handle_nils: [true, false] that user can provide to tell whether he wants the nils to be automatically managed when he passes a block or not. Sometimes the user might want to handle the nils in his own way. For example, he might want them to be displayed in the end. In that case the user can handle the nils himself. What do you say?

gnilrets · 2016-03-07T16:51:53Z

Can we confirm how much of a slowdown handling nils is going to cost? Nils are so common in the data I've had to deal with on a daily basis for last 8 years and 3 jobs. In order to use Daru consistently, I'd probably have to monkey patch Daru or add some other Daru wrapper objects in my projects to handle nils.

I'm not convinced the GC involved with creating Sortable objects every time one has to sort is really going to be an issue. I'm not sure the best way to test this either.

lokeshh · 2016-03-07T17:11:07Z

For this benchmark, handling nils take 1.72 seconds (for two columns) and sorting without handling nils take 1.24 seconds.

lokeshh · 2016-03-07T17:23:11Z

I'd probably have to monkey patch Daru or add some other Daru wrapper objects in my projects to handle nils.

@gnilrets As far as this PR goes, Daru is going to handle nils given no block. @v0dro and me were discussing about whether nils should be handled automatically when user provides a block or not.

It's not difficult to handle nils for the user. For example, if user want to handle nils and have a block like lambda { |a| a.abs }, he just needs to change this block to lambda { |a| [(a.nil?)?0:1, a.abs] }, this will sort with nils at top. On the other hand one can do lambda { |a| [(a.nil?)?1:0, a.abs] } if he wants to have nils at the bottom.

So, I was proposing a solution that user can pass an option to specify whether he wants the nils to be automatically handled given a block or not. Nils will be automatically handled given no block. What do you say?

gnilrets · 2016-03-07T17:24:06Z

@lokeshh - I misunderstood. Thanks for the clarification!

v0dro · 2016-03-07T18:58:43Z

That works. It satisfies needs of both novice and advanced users. Having an option to choose if nils should be handled automatically or not would be great. Make sure you document your effort well. Once you're done, you can update the daru notebooks at sciruby-notebooks to provide further clarification where necessary.

Replace handmade sort with Array#sort Add support for sorting with nils Add test for sorting with non-numeric types Resolve conflict by Index when all attributes are same Update benchmarks Update docs Add support for handling nils when block given

lokeshh · 2016-03-08T17:49:13Z

Done. Do I need iRuby to run the sciruby-notebooks?

v0dro · 2016-03-08T19:09:46Z

Yes.

Clone the repo and modify according to what you need.
Run the notebook with a fresh iRuby kernel and make sure the whole notebook works without any errors.
Push changes.

Dataframe#sort speed up

lokeshh reviewed Mar 1, 2016
View reviewed changes

v0dro reviewed Mar 1, 2016
View reviewed changes

lokeshh changed the title ~~Dataframe#sort speed up~~ [WIP]Dataframe#sort speed up Mar 1, 2016

lokeshh changed the title ~~[WIP]Dataframe#sort speed up~~ Dataframe#sort speed up Mar 2, 2016

v0dro reviewed Mar 3, 2016
View reviewed changes

v0dro reviewed Mar 6, 2016
View reviewed changes

Improve DataFrame#sort

cb9e44a

Replace handmade sort with Array#sort Add support for sorting with nils Add test for sorting with non-numeric types Resolve conflict by Index when all attributes are same Update benchmarks Update docs Add support for handling nils when block given

This was referenced Mar 9, 2016

Generic solution for sorting with nils? #62

Closed

Sorting vectors and dataframes with nils #39

Closed

v0dro added a commit that referenced this pull request Mar 9, 2016

Merge pull request #67 from lokeshh/sort_dataframe

11d9a5f

Dataframe#sort speed up

v0dro merged commit 11d9a5f into SciRuby:master Mar 9, 2016

		self.index = Daru::Index.new(idx)
		# Following three lines are very slow

Dataframe#sort speed up #67

Dataframe#sort speed up #67

Conversation

lokeshh commented Mar 1, 2016

lokeshh commented Mar 1, 2016

Choose a reason for hiding this comment

lokeshh commented Mar 1, 2016

v0dro commented Mar 1, 2016

lokeshh commented Mar 1, 2016

v0dro commented Mar 1, 2016

v0dro commented Mar 1, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lokeshh commented Mar 1, 2016

v0dro commented Mar 1, 2016

gnilrets commented Mar 2, 2016

v0dro commented Mar 2, 2016

lokeshh commented Mar 3, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

v0dro commented Mar 5, 2016

lokeshh commented Mar 5, 2016

v0dro commented Mar 5, 2016

lokeshh commented Mar 5, 2016

Choose a reason for hiding this comment

lokeshh commented Mar 6, 2016 via email

v0dro commented Mar 6, 2016

lokeshh commented Mar 6, 2016 via email

v0dro commented Mar 6, 2016

lokeshh commented Mar 7, 2016

gnilrets commented Mar 7, 2016

lokeshh commented Mar 7, 2016

lokeshh commented Mar 7, 2016

gnilrets commented Mar 7, 2016

v0dro commented Mar 7, 2016

lokeshh commented Mar 8, 2016

v0dro commented Mar 8, 2016