
Conversation

@taufek (Contributor) commented Feb 7, 2018

The pre-push hook receives multiple ref ranges when multiple branches are pushed to the remote.

Below is pushing multiple branches plus setting up the upstream branch on the remote.

```
command:
git push origin -u tj-test tj-test-2

stdin:
refs/heads/tj-test 62738e20036f58060a8f94ff8cd73733acb6561b refs/heads/tj-test 0000000000000000000000000000000000000000
refs/heads/tj-test-2 feb40b515f598723fe043b7cb9510f28c287d642 refs/heads/tj-test-2 0000000000000000000000000000000000000000
```

Below is pushing multiple branches which already have upstream branches on the remote.

```
command:
git push origin tj-test tj-test-2

stdin:
refs/heads/tj-test 61bc13ab2e717633441855d534904c5036f342dd refs/heads/tj-test 62738e20036f58060a8f94ff8cd73733acb6561b
refs/heads/tj-test-2 1e371d2c345e00c34410372dad2d7602aca99908 refs/heads/tj-test-2 feb40b515f598723fe043b7cb9510f28c287d642
```

With these multiple ref ranges, we should read all of them when computing the pre-push `modified_files` and `modified_lines_in_file`.
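Each stdin line carries four whitespace-separated fields, `<local ref> <local sha1> <remote ref> <remote sha1>`. A minimal parsing sketch, using a hypothetical `PushedRef` struct rather than Overcommit's actual class:

```ruby
# Minimal sketch: parse pre-push stdin into one object per pushed ref.
# `PushedRef` here is a hypothetical stand-in, not Overcommit's actual class.
PushedRef = Struct.new(:local_ref, :local_sha1, :remote_ref, :remote_sha1)

def parse_pushed_refs(stdin_lines)
  # Each line: "<local ref> <local sha1> <remote ref> <remote sha1>"
  stdin_lines.map { |line| PushedRef.new(*line.split) }
end

input = <<~STDIN
  refs/heads/tj-test 62738e20036f58060a8f94ff8cd73733acb6561b refs/heads/tj-test 0000000000000000000000000000000000000000
  refs/heads/tj-test-2 feb40b515f598723fe043b7cb9510f28c287d642 refs/heads/tj-test-2 0000000000000000000000000000000000000000
STDIN

pushed_refs = parse_pushed_refs(input.lines)
# pushed_refs.size => 2, one entry per pushed branch
```

An all-zero `remote_sha1` marks a branch that does not yet exist on the remote, which is why the `-u` push above shows `0000…` in the last field.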

@sds (Owner) left a comment

Hey @taufek, thanks for your continued iteration here.

Overall looks fine, but I did have a concern about performance which I'm hoping you can answer. Thanks!

```ruby
"#{pushed_refs[0].remote_sha1}..#{pushed_refs[0].local_sha1}"
@modified_lines[file] = pushed_refs.map do |pushed_ref|
  pushed_ref.modified_lines_in_file(file)
end.to_set.flatten
```
@sds (Owner)

Can you help me understand why you are using to_set.flatten here?

modified_lines_in_file returns a Set, which you map over to produce an array of Sets, which you then convert with to_set followed by flatten. This can be potentially expensive for a large number of files and/or modified lines.

Consider the following benchmark:

```ruby
require 'benchmark'
require 'set'

iterations = 10
sets = 100.times.map do
  Array.new(1_000) { rand(0..1_000) }.to_set
end

Benchmark.bmbm do |x|
  x.report('.to_set.flatten') do
    iterations.times do
      sets.to_set.flatten
    end
  end

  x.report('each_with_object{ merge }') do
    iterations.times do
      sets.each_with_object(Set.new) { |item, set| set.merge(item) }
    end
  end
end

puts 'Are they equivalent?'
puts sets.to_set.flatten == sets.each_with_object(Set.new) { |item, set| set.merge(item) }
```

```
Rehearsal -------------------------------------------------------------
.to_set.flatten             0.340000   0.000000   0.340000 (  0.344824)
each_with_object{ merge }   0.020000   0.000000   0.020000 (  0.016745)
---------------------------------------------------- total: 0.360000sec

                                user     system      total        real
.to_set.flatten             0.360000   0.000000   0.360000 (  0.361935)
each_with_object{ merge }   0.020000   0.000000   0.020000 (  0.016793)
Are they equivalent?
true
```

In a nutshell, it would be significantly faster for large sets of files/lines if we didn't use to_set.flatten, but I want to make sure I understand your reasoning here in case I missed something.

@taufek (Contributor, author) commented Feb 9, 2018

flatten is called because I assume the expected output is a non-nested Set for modified_lines_in_file and a non-nested Array for modified_files.

For example, modified_lines_in_file with 2 Sets from PushedRef (when I pushed 2 branches) might output [<Set(1,2)>, <Set(3,4)>]. Calling to_set.flatten then converts it to <Set(1,2,3,4)>.

I definitely will go with your suggestion if the output is the same since it has better performance.
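For reference, the suggested accumulation can be sketched with literal Sets standing in for the `modified_lines_in_file` results of each pushed ref:

```ruby
require 'set'

# Literal Sets stand in for each PushedRef's modified_lines_in_file output.
per_ref_lines = [Set.new([1, 2]), Set.new([3, 4])]

# Merge every ref's Set into a single accumulator instead of building an
# intermediate Set of Sets and flattening it.
merged = per_ref_lines.each_with_object(Set.new) { |lines, set| set.merge(lines) }
# merged == Set.new([1, 2, 3, 4])
```

Both forms yield the same flattened Set; the merge-based version just avoids constructing the nested collection.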


```ruby
def modified_files
@modified_files ||= Overcommit::GitRepo.modified_files(refs: ref_range)
@modified_files ||= pushed_refs.map(&:modified_files).flatten
```
@sds (Owner) commented Feb 8, 2018

See my comments below for how you may want to speed this up for large ref ranges containing many large file sets.

```ruby
end
end

def modified_files
```
@taufek (Contributor, author) commented Feb 9, 2018

@sds,

I want to ask your opinion on this one. If I push changes to the same file on separate branches/local refs:

1. Should we collapse the output into a unique list of files?
   `['file_one']`

2. Should we distinguish the same file from different local refs by prepending a prefix? For example:
   `['<local_ref_1>:file_one', '<local_ref_2>:file_one']`

3. Or should we do nothing and let the file appear once per local ref?
   `['file_one', 'file_one']`

@sds (Owner)

Good question, @taufek. If I put myself in the shoes of a developer writing a pre-push hook, the least surprising behavior would be the net set of differences between the local and remote branches, so the unique list of files (option 1) seems reasonable to me.

If you have reasons why the other options would be better, let me know!
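Option 1 amounts to deduplicating across refs; a quick sketch with hypothetical per-ref file lists:

```ruby
# Two pushed branches that both modify file_one (hypothetical data).
per_ref_files = [%w[file_one file_two], %w[file_one]]

# Option 1: the unique list of files across all pushed refs.
unique_files = per_ref_files.flatten.uniq
# unique_files == ["file_one", "file_two"]
```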

@taufek (Contributor, author)

Yup, a unique list of files makes sense to me. Thanks.

@taufek force-pushed the tj-multi-branches branch 2 times, most recently from e645f1e to 97fa56c on February 10, 2018 at 10:38
@taufek (Contributor, author) commented Feb 10, 2018

@sds ,

It's ready for re-review.

The `pre-push` hook receives multiple ref ranges when multiple branches are pushed to the remote.

Below is pushing multiple branches plus setting up the upstream branch on remote.
```
command:
git push origin -u tj-test tj-test-2

stdin:
refs/heads/tj-test 62738e2 refs/heads/tj-test 0000000000000000000000000000000000000000
refs/heads/tj-test-2 feb40b5 refs/heads/tj-test-2 0000000000000000000000000000000000000000
```

Below is pushing multiple branches which already have upstream branches on the remote.
```
command:
git push origin tj-test tj-test-2

stdin:
refs/heads/tj-test 61bc13ab2e717633441855d534904c5036f342dd refs/heads/tj-test 62738e2
refs/heads/tj-test-2 1e371d2c345e00c34410372dad2d7602aca99908 refs/heads/tj-test-2 feb40b5
```

With these multiple ref ranges, we should read all of them when computing the pre-push `modified_files` and `modified_lines_in_file`.

```ruby
def modified_files
@modified_files ||= Overcommit::GitRepo.modified_files(refs: ref_range)
@modified_files ||= pushed_refs.map(&:modified_files).flatten.uniq
```
@taufek (Contributor, author)

```ruby
require 'benchmark'
require 'set'

iterations = 10
data = 100.times.map do
  Array.new(1_000) { rand(0..1_000) }
end

Benchmark.bmbm do |x|
  x.report('array.flatten.uniq') do
    iterations.times do
      data.flatten.uniq
    end
  end

  x.report('array.each_with_object{ merge }.to_a') do
    iterations.times do
      data.each_with_object(Set.new) do |datum, set|
        set.merge(datum)
      end.to_a
    end
  end
end

puts 'Are they equivalent?'
puts data.flatten.uniq.sort == data.each_with_object(Set.new) { |datum, set| set.merge(datum) }.sort
```

```
Rehearsal ------------------------------------------------------------------------
array.flatten.uniq                     0.100000   0.010000   0.110000 (  0.097410)
array.each_with_object{ merge }.to_a   0.140000   0.000000   0.140000 (  0.148073)
--------------------------------------------------------------- total: 0.250000sec

                                           user     system      total        real
array.flatten.uniq                     0.090000   0.000000   0.090000 (  0.100320)
array.each_with_object{ merge }.to_a   0.140000   0.000000   0.140000 (  0.151008)
Are they equivalent?
true
```

It seems that for this case (plain Arrays rather than Sets), calling flatten.uniq is actually faster.

```ruby
`git commit -m "Update Branch 2" 2>&1 > #{File::NULL}`

should include(*%w[added-1 update-me added-2].map { |file| File.expand_path(file) })
should == %w[added-1 added-2 update-me].map { |file| File.expand_path(file) }
```
@sds (Owner)

One thing worth noting is that using == here is slightly more brittle, as it prevents us from changing our implementation (which could return the file names in a different order) without breaking tests.

At the end of the day, our goal is to make sure we're returning the correct set of files, not files in a specific order, which is why we used include. However, perhaps you changed this because we weren't checking that no unintended files were included. In that case, you could combine include with a check on the count to make sure no additional files are in the set.

This is not a big deal, so I'm fine merging this as-is, but it's important that we're not overly specific in tests.
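The suggested combination (order-insensitive membership plus a count check) could look like this in plain Ruby; in RSpec it would be `include(*expected)` alongside a `size` expectation. The file names are the ones from the spec above:

```ruby
expected = %w[added-1 added-2 update-me]
actual   = %w[update-me added-1 added-2] # implementation may return any order

# Membership check: every expected file is present...
raise 'missing files' unless (expected - actual).empty?
# ...and the count check rules out unintended extra files.
raise 'unexpected files' unless actual.size == expected.size
```

This keeps the test resilient to ordering changes while still failing if extra files sneak into the result.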

@taufek (Contributor, author)

I’ll take a note for future reference. Thanks.

@sds sds merged commit 5af4021 into sds:master Feb 10, 2018
@sds (Owner) commented Feb 10, 2018

Thanks @taufek!

@taufek taufek deleted the tj-multi-branches branch February 11, 2018 03:35