
Conversation

@taufek (Contributor) commented Feb 7, 2018

The pre-push hook receives multiple ref ranges when multiple branches are pushed to the remote.

Below is pushing multiple branches plus setting up the upstream branch on the remote.

```
command:
git push origin -u tj-test tj-test-2

stdin:
refs/heads/tj-test 62738e20036f58060a8f94ff8cd73733acb6561b refs/heads/tj-test 0000000000000000000000000000000000000000
refs/heads/tj-test-2 feb40b515f598723fe043b7cb9510f28c287d642 refs/heads/tj-test-2 0000000000000000000000000000000000000000
```

Below is pushing multiple branches which already have upstream branches on the remote.

```
command:
git push origin tj-test tj-test-2

stdin:
refs/heads/tj-test 61bc13ab2e717633441855d534904c5036f342dd refs/heads/tj-test 62738e20036f58060a8f94ff8cd73733acb6561b
refs/heads/tj-test-2 1e371d2c345e00c34410372dad2d7602aca99908 refs/heads/tj-test-2 feb40b515f598723fe043b7cb9510f28c287d642
```

With these multiple ref ranges, we should read all of them when computing the pre-push `modified_files` and `modified_lines_in_file`.
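Each stdin line carries four whitespace-separated fields, `<local ref> <local sha1> <remote ref> <remote sha1>`. A minimal parsing sketch, using a hypothetical `PushedRef` struct rather than Overcommit's actual class:

```ruby
# Minimal sketch: parse pre-push stdin into one object per pushed ref.
# `PushedRef` here is a hypothetical stand-in, not Overcommit's actual class.
PushedRef = Struct.new(:local_ref, :local_sha1, :remote_ref, :remote_sha1)

def parse_pushed_refs(stdin_lines)
  # Each line: "<local ref> <local sha1> <remote ref> <remote sha1>"
  stdin_lines.map { |line| PushedRef.new(*line.split) }
end

input = <<~STDIN
  refs/heads/tj-test 62738e20036f58060a8f94ff8cd73733acb6561b refs/heads/tj-test 0000000000000000000000000000000000000000
  refs/heads/tj-test-2 feb40b515f598723fe043b7cb9510f28c287d642 refs/heads/tj-test-2 0000000000000000000000000000000000000000
STDIN

pushed_refs = parse_pushed_refs(input.lines)
# pushed_refs.size => 2, one entry per pushed branch
```

An all-zero `remote_sha1` marks a branch that does not yet exist on the remote, which is why the `-u` push above shows `0000…` in the last field.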

@sds (Owner) left a comment

Hey @taufek, thanks for your continued iteration here.

Overall looks fine, but I did have a concern about performance which I'm hoping you can answer. Thanks!

```ruby
"#{pushed_refs[0].remote_sha1}..#{pushed_refs[0].local_sha1}"
@modified_lines[file] = pushed_refs.map do |pushed_ref|
  pushed_ref.modified_lines_in_file(file)
end.to_set.flatten
```
@sds (Owner)

Can you help me understand why you are using to_set.flatten here?

modified_lines_in_file returns a Set, which you map over to produce an array of Sets, which you then convert with to_set followed by flatten. This can be potentially expensive for a large number of files and/or modified lines.

Consider the following benchmark:

```ruby
require 'benchmark'
require 'set'

iterations = 10
sets = 100.times.map do
  Array.new(1_000) { rand(0..1_000) }.to_set
end

Benchmark.bmbm do |x|
  x.report('.to_set.flatten') do
    iterations.times do
      sets.to_set.flatten
    end
  end

  x.report('each_with_object{ merge }') do
    iterations.times do
      sets.each_with_object(Set.new) { |item, set| set.merge(item) }
    end
  end
end

puts 'Are they equivalent?'
puts sets.to_set.flatten == sets.each_with_object(Set.new) { |item, set| set.merge(item) }
```

```
Rehearsal -------------------------------------------------------------
.to_set.flatten             0.340000   0.000000   0.340000 (  0.344824)
each_with_object{ merge }   0.020000   0.000000   0.020000 (  0.016745)
---------------------------------------------------- total: 0.360000sec

                                user     system      total        real
.to_set.flatten             0.360000   0.000000   0.360000 (  0.361935)
each_with_object{ merge }   0.020000   0.000000   0.020000 (  0.016793)
Are they equivalent?
true
```

In a nutshell, it would be significantly faster for large sets of files/lines if we didn't use to_set.flatten, but I want to make sure I understand your reasoning here in case I missed something.

@taufek (Contributor, author) commented Feb 9, 2018

flatten is called because I assume the expected output is a non-nested Set for modified_lines_in_file and a non-nested Array for modified_files.

For example, modified_lines_in_file with 2 Sets from PushedRef (when I pushed 2 branches) might output [<Set(1,2)>, <Set(3,4)>]. Calling to_set.flatten then converts it to <Set(1,2,3,4)>.

I definitely will go with your suggestion if the output is the same since it has better performance.
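For reference, the suggested accumulation can be sketched with literal Sets standing in for the `modified_lines_in_file` results of each pushed ref:

```ruby
require 'set'

# Literal Sets stand in for each PushedRef's modified_lines_in_file output.
per_ref_lines = [Set.new([1, 2]), Set.new([3, 4])]

# Merge every ref's Set into a single accumulator instead of building an
# intermediate Set of Sets and flattening it.
merged = per_ref_lines.each_with_object(Set.new) { |lines, set| set.merge(lines) }
# merged == Set.new([1, 2, 3, 4])
```

Both forms yield the same flattened Set; the merge-based version just avoids constructing the nested collection.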


```ruby
def modified_files
@modified_files ||= Overcommit::GitRepo.modified_files(refs: ref_range)
@modified_files ||= pushed_refs.map(&:modified_files).flatten
```
@sds (Owner) commented Feb 8, 2018

See my comments below for how you may want to speed this up for large ref ranges containing many large file sets.

```ruby
end
end

def modified_files
```
@taufek (Contributor, author) commented Feb 9, 2018

@sds,

I want to ask your opinion on this one. If I push changes to the same file on separate branches/local refs:

1. Should we collapse the output into a unique list of files?
   `['file_one']`

2. Should we distinguish the same file from different local refs by prepending a prefix? For example:
   `['<local_ref_1>:file_one', '<local_ref_2>:file_one']`

3. Or should we do nothing and let the file appear once per local ref?
   `['file_one', 'file_one']`

@sds (Owner)

Good question, @taufek. If I put myself in the shoes of a developer writing a pre-push hook, the least surprising behavior would be the net set of differences between the local and remote branches, so the unique list of files (option 1) seems reasonable to me.

If you have reasons why the other options would be better, let me know!
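Option 1 amounts to deduplicating across refs; a quick sketch with hypothetical per-ref file lists:

```ruby
# Two pushed branches that both modify file_one (hypothetical data).
per_ref_files = [%w[file_one file_two], %w[file_one]]

# Option 1: the unique list of files across all pushed refs.
unique_files = per_ref_files.flatten.uniq
# unique_files == ["file_one", "file_two"]
```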

@taufek (Contributor, author)

Yup, a unique list of files makes sense to me. Thanks.

@taufek force-pushed the tj-multi-branches branch 2 times, most recently from e645f1e to 97fa56c on February 10, 2018 at 10:38
@taufek (Contributor, author) commented Feb 10, 2018

@sds ,

It's ready for re-review.

The `pre-push` hook receives multiple ref ranges when multiple branches are pushed to the remote.

Below is pushing multiple branches plus setting up the upstream branch on remote.
```
command:
git push origin -u tj-test tj-test-2

stdin:
refs/heads/tj-test 62738e2 refs/heads/tj-test 0000000000000000000000000000000000000000
refs/heads/tj-test-2 feb40b5 refs/heads/tj-test-2 0000000000000000000000000000000000000000
```

Below is pushing multiple branches which already have upstream branches on the remote.
```
command:
git push origin tj-test tj-test-2

stdin:
refs/heads/tj-test 61bc13ab2e717633441855d534904c5036f342dd refs/heads/tj-test 62738e2
refs/heads/tj-test-2 1e371d2c345e00c34410372dad2d7602aca99908 refs/heads/tj-test-2 feb40b5
```

With these multiple ref ranges, we should read all of them when computing the pre-push `modified_files` and `modified_lines_in_file`.

```ruby
def modified_files
@modified_files ||= Overcommit::GitRepo.modified_files(refs: ref_range)
@modified_files ||= pushed_refs.map(&:modified_files).flatten.uniq
```
@taufek (Contributor, author)

```ruby
require 'benchmark'
require 'set'

iterations = 10
data = 100.times.map do
  Array.new(1_000) { rand(0..1_000) }
end

Benchmark.bmbm do |x|
  x.report('array.flatten.uniq') do
    iterations.times do
      data.flatten.uniq
    end
  end

  x.report('array.each_with_object{ merge }.to_a') do
    iterations.times do
      data.each_with_object(Set.new) do |datum, set|
        set.merge(datum)
      end.to_a
    end
  end
end

puts 'Are they equivalent?'
puts data.flatten.uniq.sort == data.each_with_object(Set.new) { |datum, set| set.merge(datum) }.sort
```

```
Rehearsal ------------------------------------------------------------------------
array.flatten.uniq                     0.100000   0.010000   0.110000 (  0.097410)
array.each_with_object{ merge }.to_a   0.140000   0.000000   0.140000 (  0.148073)
--------------------------------------------------------------- total: 0.250000sec

                                           user     system      total        real
array.flatten.uniq                     0.090000   0.000000   0.090000 (  0.100320)
array.each_with_object{ merge }.to_a   0.140000   0.000000   0.140000 (  0.151008)
Are they equivalent?
true
```

It seems that for this case (plain Arrays rather than Sets), calling flatten.uniq is actually faster.

```ruby
`git commit -m "Update Branch 2" 2>&1 > #{File::NULL}`

should include(*%w[added-1 update-me added-2].map { |file| File.expand_path(file) })
should == %w[added-1 added-2 update-me].map { |file| File.expand_path(file) }
```
@sds (Owner)

One thing worth noting is that using == here is slightly more brittle, as it prevents us from changing our implementation (which could return the file names in a different order) without breaking tests.

At the end of the day, our goal is to make sure we're returning the correct set of files, not files in a specific order, which is why we used include. However, perhaps you changed this because we weren't checking that no unintended files were included. In that case, you could combine include with a check on the count to make sure no additional files are in the set.

This is not a big deal, so I'm fine merging this as-is, but it's important that we're not overly specific in tests.
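The suggested combination (order-insensitive membership plus a count check) could look like this in plain Ruby; in RSpec it would be `include(*expected)` alongside a `size` expectation. The file names are the ones from the spec above:

```ruby
expected = %w[added-1 added-2 update-me]
actual   = %w[update-me added-1 added-2] # implementation may return any order

# Membership check: every expected file is present...
raise 'missing files' unless (expected - actual).empty?
# ...and the count check rules out unintended extra files.
raise 'unexpected files' unless actual.size == expected.size
```

This keeps the test resilient to ordering changes while still failing if extra files sneak into the result.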

@taufek (Contributor, author)

I’ll take a note for future reference. Thanks.

@sds sds merged commit 5af4021 into sds:master Feb 10, 2018
@sds (Owner) commented Feb 10, 2018

Thanks @taufek!

@taufek taufek deleted the tj-multi-branches branch February 11, 2018 03:35